# Preprocessing
NetCDF4 files, commonly used for storing climate and Earth-system data, are not optimized for machine learning applications with heavy I/O requirements, or for datasets that are simply too large to hold in GPU/CPU memory. ClimatExML uses nc2pt to resolve this issue. nc2pt runs a preprocessing flow on climate fields and converts them from NetCDF4 (`.nc`) to an intermediate format, Zarr (`.zarr`), which allows for parallel loading and writing of individual PyTorch files (`.pt`) that can be loaded directly onto GPUs. The remainder of this document describes the nc2pt workflow and installation procedure.
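For context, this is roughly how the resulting `.pt` files are meant to be consumed downstream; a minimal sketch, assuming a hypothetical file name:

```python
import torch

# Load one preprocessed tensor (file name is illustrative).
sample = torch.load("sample_000.pt")

# Move it directly onto the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
sample = sample.to(device)
```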
## What are the intended use cases of nc2pt?
- standardizes and makes metadata uniform between datasets
- aligns different grids by re-projecting them onto one another: nc2pt projects the low-resolution (lr) regular grids onto the high-resolution (hr) curvilinear grids. nc2pt assumes the curvilinear dimensions are named like `rlat` or `rlon`. It was originally designed to support super-resolution problems (see the regridding sketch after this list)
- selects individual years as test years or training years
- organizes the data into input (lr) or output (hr) fields
- is meant for use with large datasets on the order of hundreds of GB
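The grid alignment is performed with xESMF. A minimal sketch of that kind of re-projection, assuming illustrative file and variable names (this is not nc2pt's exact implementation):

```python
import xarray as xr
import xesmf as xe

# Illustrative inputs: a high-resolution curvilinear target grid and a
# low-resolution regular source grid.
hr = xr.open_dataset("hr_ref.nc")      # target grid, e.g. USask WRF reference
lr = xr.open_dataset("era5_tas.nc")    # source grid, e.g. ERA5

# Build a bilinear regridder from the lr grid onto the hr grid and apply it.
regridder = xe.Regridder(lr, hr, method="bilinear")
lr_on_hr = regridder(lr["tas"])
```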
## What preprocessing steps does nc2pt do?
High-level workflow

1. configures metadata between the datasets as defined in the config
2. slices the data to a pre-determined range of dates
3. aligns the grids via interpolation, crops them to the same size, and coarsens the low-resolution fields by the configured scale factor
4. applies user-defined transforms such as unit conversions or log transformations
5. splits the data into a train and a test dataset and standardizes both based on the mean and standard deviation of all grids from the training data only (this information is also written into the Zarr metadata for inference; see the sketch after this list)
6. writes to `.zarr`
7. `nc2pt/tools/zarr_to_torch.py` writes to PyTorch files
8. `nc2pt/tools/single_files_to_batches.py` batches the single PyTorch files
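A minimal sketch of the train-only standardization step described above (the function and attribute names here are illustrative, not nc2pt's API):

```python
import xarray as xr

def standardize(ds: xr.Dataset, train_slice: slice, var: str = "tas") -> xr.Dataset:
    """Standardize a field using statistics from the training period only."""
    train = ds[var].sel(time=train_slice)
    mean, std = float(train.mean()), float(train.std())

    ds[var] = (ds[var] - mean) / std
    # Keep the statistics so they can be inverted at inference time.
    ds[var].attrs.update({"mean": mean, "std": std})
    return ds
```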
## What are the downsides of using PyTorch files for climate data?
The most obvious downside is that you lose the metadata associated with a netCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized I/O and preserves the metadata, which is useful for inference.
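For example, statistics stored in the Zarr metadata can be read back when mapping model output to physical units; a minimal sketch, assuming illustrative store and attribute names:

```python
import xarray as xr

# Re-open the intermediate Zarr store written by the preprocessing step.
ds = xr.open_zarr("/home/nannau/data/proc/hr.zarr")

# Standardization statistics survive in the attributes and can be used
# to un-standardize model output at inference time.
mean, std = ds["tas"].attrs["mean"], ds["tas"].attrs["std"]
```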
## Requirements
## Installing nc2pt
As part of the preprocessing pipeline, xESMF is used for regridding. However, since xESMF is only available through Conda, you need to be able to install Conda on your system. Unfortunately, this is limiting because some HPC systems don't allow Conda.
1. Begin by installing xESMF in a conda environment: xESMF
2. Clone the nc2pt repository
3. Install it into your conda environment:
```bash
conda install -c conda-forge pip
pip install -r requirements.txt
# editable install
pip install -e nc2pt/
```
That's it!
## Configuration
nc2pt is configured with Hydra, which instantiates the structured classes in nc2pt/climatedata.py. This simultaneously defines the workflow as well as the data. Please see nc2pt/conf/config.yml for an example configuration, or the one below:
```yaml
_target_: nc2pt.climatedata.ClimateData # Initializes the ClimateData dataclass object
output_path: /home/nannau/data/proc/
climate_models:
  # This lists the models
  - _target_: nc2pt.climatedata.ClimateModel
    name: hr
    info: "High Resolution USask WRF, Western Canada"
    climate_variables: # Provides a list of ClimateVariable dataclass objects to initialize
      - _target_: nc2pt.climatedata.ClimateVariable
        name: "tas"
        alternative_names: ["T2", "surface temperature"]
        path: /home/nannau/USask-WRF-WCA/fire_vars/T2/*.nc
        is_west_negative: true
        apply_standardize: false
        apply_normalize: true
        invariant: false
        transform: []
  - _target_: nc2pt.climatedata.ClimateModel
    info: "Low resolution ERA5, Western Canada"
    name: lr
    hr_ref: # Reference field to interpolate to. You will need to provide a new file if not using USask WRF
      _target_: nc2pt.climatedata.ClimateVariable
      name: "hr_ref"
      alternative_names: ["T2"]
      path: nc2pt/nc2pt/data/hr_ref.nc
      is_west_negative: true
    climate_variables:
      - _target_: nc2pt.climatedata.ClimateVariable
        name: "tas"
        alternative_names: ["T2", "surface temperature"]
        path: /home/nannau/ERA5_NCAR-RDA_North_America/proc/tas_1hr_ERA5_an_RDA-025_1979010100-2018123123_time_sliced_cropped.nc
        is_west_negative: false
        apply_standardize: false
        apply_normalize: true
        invariant: false
        transform:
          - "x - 273.15"

# Defines the dimensions you might find in your lr or hr dataset and lists them to be
# initialized as ClimateDimension objects. Typically this would match what is in your
# hr dataset. Intended to allow for renaming of dimensions and control of chunking.
dims:
  - _target_: nc2pt.climatedata.ClimateDimension
    name: time
    alternative_names: ["forecast_initial_time", "Time", "Times", "times"]
    chunksize: 100
  - _target_: nc2pt.climatedata.ClimateDimension
    name: rlat
    alternative_names: ["rotated_latitude"]
    hr_only: true
    chunksize: -1
  - _target_: nc2pt.climatedata.ClimateDimension
    name: rlon
    alternative_names: ["rotated_longitude"]
    hr_only: true
    chunksize: -1

# similar to dims, just as coordinates instead. Coordinates might not match dims on curvilinear grids
coords:
  - _target_: nc2pt.climatedata.ClimateDimension
    name: lat
    alternative_names: ["latitude", "Lat", "Latitude"]
    chunksize: -1
  - _target_: nc2pt.climatedata.ClimateDimension
    name: lon
    alternative_names: ["longitude", "Long", "Lon", "Longitude"]
    chunksize: -1

# subsample data temporally or spatially
select:
  # Time indexing for subsets
  time:
    # Crop to the dataset with the shortest run;
    # this defines the full dataset from which to subset
    range:
      start: "20001001T06:00:00"
      end: "20150928T12:00:00"
      # start: "2021-11-01T00:00:00"
      # end: "2021-12-31T22:00:00"
    # use this to select which years to reserve for testing
    # and for validation;
    # the remaining years in full will be used for training
    test_years: [2000, 2009, 2014]
    validation_years: [2015]
    # test_years: [None]
    # validation_years: [None]
  # sets the scale factor and index slices of the rotated coordinates
  spatial:
    scale_factor: 8
    x:
      first_index: 110
      last_index: 622
    y:
      first_index: 20
      last_index: 532

# dask client parameters
compute:
  # xarray netcdf engine
  engine: h5netcdf
  dask_dashboard_address: 8787
  chunks:
    time: auto
    rlat: auto
    rlon: auto

# optional for tools scripts (single_files_to_batches)
loader:
  batch_size: 4
  randomize: true
  seed: 0
```
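A minimal sketch of how a config like this is typically consumed with Hydra (the entry point shown is illustrative and assumes the config paths above, not necessarily nc2pt's actual script):

```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="nc2pt/conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Recursively instantiate the dataclasses named by the _target_ keys
    # (ClimateData, ClimateModel, ClimateVariable, ClimateDimension).
    climate_data = hydra.utils.instantiate(cfg)
    print(type(climate_data))


if __name__ == "__main__":
    main()
```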
## Running
1. Explore the data and ensure compatibility.
2. Configure `nc2pt/conf/config.yaml`.
3. Run the `nc2pt/preprocess.py` script, which runs through your preprocessing steps and creates the Zarr files.
4. Run the `nc2pt/tools/zarr_to_torch.py` script, which serializes each time step in the `.zarr` file to an individual PyTorch `.pt` file.
5. Optional: run `nc2pt/tools/single_files_to_batches.py`, which combines the individual files from the previous step into random batches. This setup means less I/O in your machine learning pipeline (a sketch of consuming these batches follows below).
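For context, a minimal sketch of serving the resulting pre-batched `.pt` files during training; the directory layout and tensor contents are assumptions, not nc2pt's documented output format:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset


class PtBatchDataset(Dataset):
    """Serve pre-batched .pt files such as those written by single_files_to_batches.py."""

    def __init__(self, batch_dir: str):
        self.paths = sorted(Path(batch_dir).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Each file already holds a full batch, so no collation is needed.
        return torch.load(self.paths[idx])
```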
## Testing
Testing is done with pytest. The easiest way to run the tests is to install pytest (plus pytest-cov for the coverage flags) and use the command: `pytest --cov-report term-missing --cov=nc2pt .`
This generates a coverage report and automatically collects the files named `test_*.py` in `nc2pt/tests`.