This repo demonstrates how to build a serverless data pipeline that ingests weather forecast data distributed as GRIB files and transforms it into an analysis-ready, cloud-optimized Zarr data cube.
Two models are currently supported: GFS and HRRR. Extensions to other models should be easy™.
Warning
This repo is for demonstration purposes only. It does not aspire to be a maintained package. If you want to build on top of it, fork this repo and modify it to your needs.
The code runs in one of two modes:
- `backfill`: Initialize a new Zarr store or Arraylake repo. Ingest data for the time period between `since` and (optionally) `till`.
- `update`: Designed to run as a scheduled Cron job, this mode brings a Zarr store up-to-date with the latest available data.
Modal Cron Jobs cannot take arguments, so any configuration must be shipped up when running `modal deploy`. To configure the data pipeline, set up a TOML file in `src/configs/`. The location is arbitrary, but it's nice to keep it all in one place.
- All writes are isolated to a new temporary branch.
- Once all writes succeed, a snapshot is made to that branch.
- Then `main` is reset to this new snapshot.
- Finally the temporary branch is deleted.
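The four steps above amount to an all-or-nothing update of `main`. The following is a minimal sketch of that pattern, not this repo's implementation and not the Arraylake API: `InMemoryRepo` and `ingest_atomically` are hypothetical names, and the in-memory class only stands in for a versioned store with branches and snapshots.

```python
import uuid


class InMemoryRepo:
    """Hypothetical stand-in for a versioned store; NOT the Arraylake API."""

    def __init__(self):
        self.snapshots = {"s0": {}}     # snapshot id -> committed data
        self.branches = {"main": "s0"}  # branch name -> snapshot id
        self.staged = {}                # branch name -> uncommitted writes

    def create_branch(self, name, source="main"):
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch, key, value):
        self.staged[branch][key] = value

    def commit(self, branch):
        # A snapshot is the parent's data plus the staged writes.
        base = self.snapshots[self.branches[branch]]
        snapshot_id = uuid.uuid4().hex
        self.snapshots[snapshot_id] = {**base, **self.staged[branch]}
        self.branches[branch] = snapshot_id
        self.staged[branch] = {}
        return snapshot_id

    def reset_branch(self, name, snapshot_id):
        self.branches[name] = snapshot_id

    def delete_branch(self, name):
        del self.branches[name]
        self.staged.pop(name, None)

    def read(self, branch):
        return self.snapshots[self.branches[branch]]


def ingest_atomically(repo, chunks):
    """Write all chunks on a temporary branch, then move `main` in one step."""
    temp = f"ingest-{uuid.uuid4().hex[:8]}"
    repo.create_branch(temp)                    # 1. isolate writes
    try:
        for key, value in chunks.items():
            repo.write(temp, key, value)
        snapshot_id = repo.commit(temp)         # 2. snapshot once all writes succeed
        repo.reset_branch("main", snapshot_id)  # 3. reset main to the snapshot
    finally:
        repo.delete_branch(temp)                # 4. always clean up the branch


repo = InMemoryRepo()
ingest_atomically(repo, {"t2m/0.0": b"...", "r2/0.0": b"..."})
```

If any write raises, `main` still points at its previous snapshot, so readers never observe a partially written update.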
# Drive from the command line.
modal run modal_hrrr.py --mode "backfill" --since "2024-05-15" --toml-file src/configs/hrrr-demo.toml
# Directly specify parameters `since`, `till`, `toml_file_path` in hrrr_backfill.
modal run modal_hrrr.py::hrrr_backfill
# Set up a repeating cron job to update the store.
# Both hrrr_update_solar and hrrr_verify are deployed.
modal deploy modal_hrrr.py

Details of the pipeline are configured using a TOML file.
# Format
# ------
[arbitrary_job_name]
## Three Herbie Parameters
model = string
product = string
search = string
## Two Zarr store Parameters
store = string e.g. s3://my-bucket/store.zarr
zarr_group = string e.g. "sfc/fcst"
## Schema
chunks = {string: int}, integer chunk size for dimension
renames = {str: str}, mapping from variable name in inventory to actual name when read with cfgrib
Here's a concrete example:
[job1]
model = "hrrr"
product = "sfc"
searches = [
"(?:TMP|RH):2 m above ground|(?:GUST|DSWRF|PRATE):surface|TCDC:entire atmosphere",
]
store = "arraylake://earthmover-demos/hrrr"
zarr_group = "solar/"
chunks = {x = 360, y = 120, time = 1, step = 19}
renames = {TMP="t2m", RH="r2", TCDC="tcc"}

- Set `model` and `product` as necessary to have `herbie` find the right variables.
- Set the output location using `store` and `zarr_group` within the store. For example, `store` can be `s3://my-bucket/forecast-datacube.zarr`.
- `chunks` specifies the chunking for the Zarr arrays.
- To set up the `searches` we recommend iterating interactively with `FastHerbie.inventory(search=searches)` and making sure you see all data that's needed.
- The `renames` field is harder. It is necessary because `searches` (and `herbie`) will only know variable names as written in the `.idx` sidecar files. The variable names are not necessarily preserved when reading those GRIB files with `cfgrib`. Annoyingly, we do not know a priori what names `cfgrib` will choose to assign. Again the best approach is to iterate in a notebook. Alternatively, simply run the pipeline with `renames={}`, and an error will be raised suggesting what to set.
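Before querying a real inventory, it can help to sanity-check the search string offline. The sketch below assumes the search string is applied as a regular expression to inventory text; the sample lines are made up for illustration (written in a `:variable:level:forecast` style) and are not real `.idx` contents. It uses only Python's `re` module, not `herbie`.

```python
import re

# Search string from the concrete example above.
search = (
    r"(?:TMP|RH):2 m above ground"
    r"|(?:GUST|DSWRF|PRATE):surface"
    r"|TCDC:entire atmosphere"
)

# Hypothetical inventory lines, for illustration only.
sample_inventory = [
    ":TMP:2 m above ground:anl",
    ":RH:2 m above ground:anl",
    ":TMP:surface:anl",
    ":GUST:surface:anl",
    ":TCDC:entire atmosphere:anl",
    ":UGRD:10 m above ground:anl",
]

pattern = re.compile(search)
matched = [line for line in sample_inventory if pattern.search(line)]

# Variables we expect the search to return, versus what it actually found.
wanted = {"TMP", "RH", "GUST", "DSWRF", "PRATE", "TCDC"}
found = {line.split(":")[1] for line in matched}
missing = sorted(wanted - found)

print("matched:", matched)
print("missing:", missing)
```

A non-empty `missing` list means either the pattern is too strict or the variable is absent from the inventory, which is the situation the interactive check is meant to catch.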
For more examples see src/configs/.
- `hrrr-cube.ipynb`: Notebook demonstrating analysis with the HRRR data cube.
- `modal_app.py`: Core functions annotated to run with Modal.
- Model-specific functions.
  - `modal_hrrr.py`, `modal_gfs.py`: These are simple specializations for a couple of models: HRRR and GFS. They have been separated out for convenience.
- `src/`:
  - `lib.py`: Core data structures and utilities. The most important data structure is `ForecastModel`. This is the base class that allows specialization to a specific model.
  - `gfs.py`: Contains `GFS`, a subclass of `ForecastModel`, specialized for GFS output.
  - `hrrr.py`: Contains `HRRR`, a subclass of `ForecastModel`, specialized for HRRR output.
- Make sure that the `search` string returns what you want. It is a good idea to use `FastHerbie.inventory(search_string)` to double check.
- `cfgrib` likes to rename variables, so what's in the dataset doesn't match what's in the `search` string. Please specify `renames` as a dictionary that maps each variable name in the GRIB inventory file to the variable name set by `cfgrib`.
- Multiple `searches` are not supported yet.
- The combination of `store` and `zarr_group` cannot be repeated in the TOML file.
- Modal functions with schedules cannot take arguments, so any configuration `toml` files must be uploaded during `modal deploy` by bundling them in `src/configs/`.