Monthly Climatology of Many Variables

import xarray
import pandas
import climtas.nci

climtas.nci.GadiClient()

Client
  • Workers: 4
  • Cores: 4
  • Memory: 17.18 GB
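
climtas.nci.GadiClient() starts a Dask client sized to the resources requested from the batch system. If you're running this somewhere other than Gadi, a plain dask.distributed client fills the same role (a minimal sketch, with the worker count chosen to match the cluster shown above):

import dask.distributed

client = dask.distributed.Client(
    n_workers=4,           # one worker per CPU
    threads_per_worker=1,
    memory_limit='4GB',    # per-worker memory limit
)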

Our source dataset is a collection of model outputs. There are a bit over 500 files in total, and each file has close to 200 variables. We'd like to compute a monthly climatology of every variable in the dataset.

path = '/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.*.nc'
! ls {path} | wc -l
529
ds = xarray.open_dataset('/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.0020-01-01-00000.nc')
ds.data_vars
Data variables: (12/186)
    hyam           (lev) float64 ...
    hybm           (lev) float64 ...
    hyai           (ilev) float64 ...
    hybi           (ilev) float64 ...
    P0             float64 ...
    date           (time) int32 ...
    ...             ...
    rad_temp       (time, lev, lat, lon) float64 ...
    rad_ts         (time, lat, lon) float64 ...
    rad_watice     (time, lev, lat, lon) float64 ...
    rad_watliq     (time, lev, lat, lon) float64 ...
    rad_watvap     (time, lev, lat, lon) float64 ...
    rad_zint       (time, ilev, lat, lon) float64 ...
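
Only some of these 186 variables depend on time, and those are the ones we'll take the climatology of later on. A quick way to count them (a sketch, using the single-file dataset opened above):

n_time_vars = sum(
    1
    for v in ds.data_vars.values()
    if 'time' in v.dims and v.dtype in ['float32', 'float64']
)
print(n_time_vars)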

Since there's a large number of files we'll use all the tricks when opening them. parallel lets multiple files be read at once; data_vars and coords make sure variables are only concatenated along time if they actually have a time dimension; and join and compat take the remaining coordinates and variables from the first file rather than checking every file for consistency.

ds = xarray.open_mfdataset(
    path,
    combine='nested',
    concat_dim='time',
    parallel=True,
    data_vars='minimal',
    join='override',
    coords='minimal',
    compat='override')
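
With these options each input file ends up as a single Dask chunk along time, so before computing anything it's worth a quick sanity check of the chunking (a sketch; rad_ts is one of the time-dependent variables listed above):

print(ds['rad_ts'].data.numblocks)  # number of chunks along each dimension
print(ds['rad_ts'].data.chunksize)  # shape of one chunk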

Trying to compute all the variables at once is certainly possible, but it results in a giant Dask task graph. To help Dask along as much as possible we can instead evaluate the variables one at a time, saving each to its own file, then join them back together afterwards as a post-processing step if needed (a sketch of that step follows the output check below).

# Loop over all the variables
for name, var in ds.data_vars.items():
    # If the variable has a time axis and is a floating point value
    if 'time' in var.dims and var.dtype in ['float32','float64']:
        print(name)
        
        # Do the mean and save to file
        clim = var.groupby('time.month').mean('time')

        # Copy any attributes
        clim.attrs = var.attrs

        clim.to_netcdf(f'clim_{name}.nc')
        
        # Stop after one variable for testing
        break
co2vmr

Checking the output shows the expected climatology.

xarray.open_dataset('clim_co2vmr.nc')
<xarray.Dataset>
Dimensions:  (month: 12)
Coordinates:
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    co2vmr   (month) float64 0.0002848 0.0002852 ... 0.0002849 0.000285
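
If a single combined file is wanted, the per-variable outputs can be merged back together as a post-processing step (a sketch, assuming all the clim_*.nc files are in the working directory):

# Merge the per-variable climatologies on their shared coordinates
clim = xarray.open_mfdataset('clim_*.nc')
clim.to_netcdf('climatology.nc')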

Batch Job

When doing the full analysis we should use a batch job rather than Jupyter. To do this, reuse the calculations from the notebook, wrapping them all in an if __name__ == '__main__' check so that Dask starts up properly.

import xarray
import climtas.nci

if __name__ == '__main__':
    # Start a Dask client using the resources from qsub
    climtas.nci.GadiClient()

    # Open the files
    path = '/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.00*.nc'
    ds = xarray.open_mfdataset(path, combine='nested', concat_dim='time', parallel=True,
            data_vars='minimal', join='override', coords='minimal', compat='override')

    # Loop over all the variables
    for name, var in ds.data_vars.items():
        # If the variable has a time axis and is a floating point value
        if 'time' in var.dims and var.dtype in ['float32','float64']:
            # Do the mean and save to file
            clim = var.groupby('time.month').mean('time')

            # Copy any attributes
            clim.attrs = var.attrs

            clim.to_netcdf(f'clim_{name}.nc')
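
Batch jobs can hit their walltime partway through the loop. One way to make the script resumable is to skip variables whose output file already exists, so a resubmitted job carries on where the previous one stopped (a sketch using os.path.exists; not part of the original script):

import os

for name, var in ds.data_vars.items():
    if 'time' in var.dims and var.dtype in ['float32', 'float64']:
        target = f'clim_{name}.nc'
        if os.path.exists(target):
            # Already written by an earlier job
            continue

        clim = var.groupby('time.month').mean('time')
        clim.attrs = var.attrs
        clim.to_netcdf(target)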