Monthly Climatology of Many Variables
import xarray
import pandas
import climtas.nci
climtas.nci.GadiClient()
Our source dataset is a collection of model outputs. There are a bit over 500 files in total, and each file has close to 200 variables. We'd like to compute a monthly climatology of this dataset for all variables present.
path = '/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.*.nc'
! ls {path} | wc -l
529
ds = xarray.open_dataset('/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.0020-01-01-00000.nc')
ds.data_vars
Data variables: (12/186)
hyam (lev) float64 ...
hybm (lev) float64 ...
hyai (ilev) float64 ...
hybi (ilev) float64 ...
P0 float64 ...
date (time) int32 ...
... ...
rad_temp (time, lev, lat, lon) float64 ...
rad_ts (time, lat, lon) float64 ...
rad_watice (time, lev, lat, lon) float64 ...
rad_watliq (time, lev, lat, lon) float64 ...
rad_watvap (time, lev, lat, lon) float64 ...
rad_zint (time, ilev, lat, lon) float64 ...
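Only some of these 186 variables are time-varying floating point fields, which is what the processing loop further down will filter on. As a quick sketch we can count them using the single file opened above (time_vars is just an illustrative name):
time_vars = [name for name, var in ds.data_vars.items()
             if 'time' in var.dims and var.dtype in ['float32', 'float64']]
# Count the variables with a time dimension and a floating point dtype,
# mirroring the filter used in the processing loop below
print(len(time_vars))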
Since there’s a large number of files we’ll use all the tricks when opening them. parallel=True lets multiple files be read at once, data_vars='minimal' and coords='minimal' make sure variables and coordinates are only concatenated if they have a time dimension, and join='override' and compat='override' take everything else from the first file rather than comparing it across every file.
ds = xarray.open_mfdataset(
    path,
    combine='nested',
    concat_dim='time',
    parallel=True,
    data_vars='minimal',
    join='override',
    coords='minimal',
    compat='override')
Trying to compute all the variables at once is certainly possible, but it will result in a giant Dask task graph. To help Dask along as much as possible we can evaluate the variables one at a time, saving each to a different file, then join them back together afterwards if needed as a post-processing step.
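To get a sense of just how big the combined graph would be, we can count the tasks Dask schedules for a single variable before computing anything. A rough sketch, using rad_ts from the variable listing above; len() of the underlying Dask graph gives the number of tasks:
# Build the lazy climatology for one variable and count the tasks in its
# Dask graph - nothing is computed at this point
clim = ds['rad_ts'].groupby('time.month').mean('time')
print(len(clim.data.dask))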
# Loop over all the variables
for name, var in ds.data_vars.items():
    # If the variable has a time axis and is a floating point value
    if 'time' in var.dims and var.dtype in ['float32', 'float64']:
        print(name)
        # Do the mean and save to file
        clim = var.groupby('time.month').mean('time')
        # Copy any attributes
        clim.attrs = var.attrs
        clim.to_netcdf(f'clim_{name}.nc')
        # Stop after one variable for testing
        break
co2vmr
Checking the output shows the expected climatology.
xarray.open_dataset('clim_co2vmr.nc')
<xarray.Dataset>
Dimensions:  (month: 12)
Coordinates:
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    co2vmr   (month) float64 0.0002848 0.0002852 ... 0.0002849 0.000285
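If a single combined file is needed, the per-variable outputs can be joined back together as the post-processing step mentioned earlier. A minimal sketch, assuming all the clim_*.nc files are in the current directory; climatology.nc is just an example output name:
import glob
import xarray

# Open each per-variable climatology and merge them into a single dataset
datasets = [xarray.open_dataset(f) for f in sorted(glob.glob('clim_*.nc'))]
xarray.merge(datasets).to_netcdf('climatology.nc')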
Batch Job
When doing the full analysis we should use a batch job rather than Jupyter. To do this just reuse the calculations from the notebook, wrapping them all in an if __name__ == '__main__' check so that Dask starts up properly.
import xarray
import climtas.nci

if __name__ == '__main__':
    # Start a Dask client using the resources requested from qsub
    climtas.nci.GadiClient()

    # Open the files
    path = '/scratch/y99/dd7103/PORT_Apr22/Base_for_PORT_conApr21/Base_for_PORT_conApr21.cam.h1.00*.nc'
    ds = xarray.open_mfdataset(path, combine='nested', concat_dim='time', parallel=True,
                               data_vars='minimal', join='override', coords='minimal', compat='override')

    # Loop over all the variables
    for name, var in ds.data_vars.items():
        # If the variable has a time axis and is a floating point value
        if 'time' in var.dims and var.dtype in ['float32', 'float64']:
            # Do the mean and save to file
            clim = var.groupby('time.month').mean('time')
            # Copy any attributes
            clim.attrs = var.attrs
            clim.to_netcdf(f'clim_{name}.nc')
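When the script is submitted with qsub, climtas.nci.GadiClient() picks up the resources requested by the job. A minimal PBS script sketch for Gadi; the queue, project code, and resource requests are placeholders to adjust, and it assumes the Python code above is saved as climatology.py in an environment where xarray and climtas are available:
#!/bin/bash
# Placeholder project, queue and resource sizes - adjust for the real job
#PBS -q normal
#PBS -P your_project
#PBS -l ncpus=8
#PBS -l mem=32GB
#PBS -l walltime=01:00:00
#PBS -l storage=scratch/y99
#PBS -l wd

python3 climatology.py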