Loading Ensemble Members

The C20C dataset contains files split by both year and ensemble member. Let’s load them with Xarray and create an ensemble mean.

ls /g/data/ua8/C20C/v3/member_monthly/TMPS/*/*_mem*.nc | head -4
/g/data/ua8/C20C/v3/member_monthly/TMPS/1850/TMPS.1850.mnmean_mem001.nc
/g/data/ua8/C20C/v3/member_monthly/TMPS/1850/TMPS.1850.mnmean_mem002.nc
/g/data/ua8/C20C/v3/member_monthly/TMPS/1850/TMPS.1850.mnmean_mem003.nc
/g/data/ua8/C20C/v3/member_monthly/TMPS/1850/TMPS.1850.mnmean_mem004.nc
import xarray
from tqdm.auto import tqdm
import climtas.nci

climtas.nci.GadiClient()

Client / Cluster
  • Workers: 2
  • Cores: 2
  • Memory: 8.59 GB

We can use a loop to load each member individually, storing each member's dataset in the list dss. tqdm() here just adds a progress bar, so we can see how far the load has progressed.

One of the files in this dataset has latitude values slightly different from the others; on inspecting the file, they are the same latitudes stored at a higher precision. I've used join='override' to take the lat and lon coordinate values from the first file rather than trying to match them exactly.

dss = []
for mem in tqdm(range(1,81)):

    path = f'/g/data/ua8/C20C/v3/member_monthly/TMPS/*/TMPS.*.mnmean_mem{mem:03d}.nc'
    ds = xarray.open_mfdataset(
        path,
        combine='nested',
        concat_dim='time',
        join='override',
        coords='minimal',
        parallel=True,
    )
    dss.append(ds)
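To illustrate what join='override' does, here is a small sketch with synthetic data (not the C20C files): two datasets whose latitudes differ only by floating-point noise are aligned by copying the coordinate values from the first object.

```python
import xarray

# Two tiny datasets whose 'lat' values differ only by floating-point noise
# (synthetic stand-ins for the C20C files, not real data)
a = xarray.Dataset({'t': ('lat', [1.0, 2.0])}, coords={'lat': [10.0, 20.0]})
b = xarray.Dataset({'t': ('lat', [3.0, 4.0])},
                   coords={'lat': [10.0000001, 20.0000001]})

# join='override' skips comparing the coordinates and copies the values
# from the first object, so both results share a's latitudes
a2, b2 = xarray.align(a, b, join='override')

print(b2.lat.values)  # [10. 20.]
```

The default join='outer' would instead union the two sets of latitudes, leaving NaN gaps where they don't match exactly.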

With all the files read, we can combine them along a new ‘member’ dimension with xarray.concat.

ds = xarray.concat(dss, dim='member')

A quick check of the input chunking: at only 6 MB per chunk this is smaller than we'd like, but that is a limitation of the input data. This is monthly data stored one year per file, so each file contains only 12 timesteps.

ds.TMP
<xarray.DataArray 'TMP' (member: 80, time: 1992, lat: 256, lon: 512)>
dask.array<concatenate, shape=(80, 1992, 256, 512), dtype=float32, chunksize=(1, 12, 256, 512), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T10:30:00 ... 2015-12-16T10:30:00
  * lon      (lon) float32 0.0 0.703 1.406 2.109 ... 357.1 357.8 358.5 359.2
  * lat      (lat) float32 89.46 88.77 88.07 87.37 ... -88.07 -88.77 -89.46
Dimensions without coordinates: member
Attributes:
    standard_name:       air_temperature
    long_name:           Temperature
    units:               K
    param:               0.0.0
    realization:         1
    ensemble_members:    10
    forecast_init_type:  3
    original_name:       t
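
The repr shows 'Dimensions without coordinates: member', because concat created the new dimension without labels. If you want the member numbers attached, assign_coords can add them. A minimal sketch with synthetic data (not the real files):

```python
import numpy
import xarray

# Stand-ins for the per-member datasets (synthetic, not the C20C files)
dss = [xarray.Dataset({'TMP': ('time', numpy.zeros(3))}) for _ in range(4)]

# Concatenate along a new 'member' dimension, then label it 1..N
ds = xarray.concat(dss, dim='member')
ds = ds.assign_coords(member=range(1, len(dss) + 1))

print(ds.member.values)  # [1 2 3 4]
```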

The large number of tasks means that Dask can take a while to set everything up before it starts to process the data.
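A back-of-the-envelope check of those numbers, using only the shapes from the repr above (plain arithmetic, no Dask needed):

```python
# One chunk per file: 12 monthly timesteps of a 256 x 512 float32 grid
chunk_bytes = 1 * 12 * 256 * 512 * 4
print(chunk_bytes / 1e6)  # ~6.3 MB per chunk

# 80 members x 166 years of files gives the number of input chunks,
# each of which needs several Dask tasks to open, read, and concatenate
n_chunks = 80 * (1992 // 12)
print(n_chunks)  # 13280
```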

Another way to do this, which cuts down on the number of files open at once and so reduces the number of tasks, is to work on the files a group at a time — for instance, writing the mean for each decade to a separate file. A function lets us repeat the same operation for each group.

def mean_decade(year):
    dss = []
    for mem in range(1,81):

        # {year//10}? matches every year directory in the decade, e.g. 185? for 1850-1859
        path = f'/g/data/ua8/C20C/v3/member_monthly/TMPS/{year//10}?/TMPS.*.mnmean_mem{mem:03d}.nc'
        ds = xarray.open_mfdataset(
            path,
            combine='nested',
            concat_dim='time',
            join='override',
            coords='minimal',
            parallel=True,
        )
        dss.append(ds)
        
    ds = xarray.concat(dss, dim='member')
    
    ds.TMP.mean('member').to_netcdf(f'/scratch/w35/saw562/C20C_TMP_memmean_{year}.nc')
%%time

for year in tqdm(range(1850, 2020, 10)):
    mean_decade(year)
CPU times: user 3min 12s, sys: 11.4 s, total: 3min 24s
Wall time: 10min 53s
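The decade selection relies on the f'{year//10}?' glob: integer division drops the final digit, so for 1850 the pattern is '185?', which matches every year directory in that decade. A quick check with the standard library's fnmatch (synthetic years, no filesystem needed):

```python
from fnmatch import fnmatch

year = 1850
pattern = f'{year//10}?'
print(pattern)  # 185?

# The pattern matches every year of the decade and nothing outside it
assert all(fnmatch(str(y), pattern) for y in range(1850, 1860))
assert not fnmatch('1860', pattern)
```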