Analysis-Ready Weather Forecast Data Cubes in Icechunk

This repository demonstrates how to build a serverless pipeline that ingests weather forecast data distributed as GRIB files and transforms it into an analysis-ready, cloud-optimized Zarr data cube.

Two models are currently supported: GFS and HRRR. Extensions to other models should be easy™.

Warning

This repo is for demonstration purposes only. It does not aspire to be a maintained package. If you want to build on top of it, fork this repo and modify it to your needs.

The code runs in one of two modes:

backfill : Initialize a new Zarr store or Arraylake repo. Ingest data for the time period between since and (optionally) till.
update : Designed to run as a scheduled Cron job, this mode brings a Zarr store up-to-date with the latest available data.
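As a rough illustration of how the two modes differ, the sketch below enumerates the forecast initialization times each mode would need to ingest. The function name `cycles_between`, the hourly cadence, and the inclusive `till` bound are assumptions made for this example, not details taken from the repository's code.

```python
from datetime import datetime, timedelta, timezone


def cycles_between(start, stop, freq):
    """List initialization times from start to stop (both inclusive), spaced by freq."""
    cycles = []
    t = start
    while t <= stop:
        cycles.append(t)
        t += freq
    return cycles


HOURLY = timedelta(hours=1)  # assumed cadence, for illustration only

# backfill: everything between `since` and `till`.
since = datetime(2024, 5, 15, 0, tzinfo=timezone.utc)
till = datetime(2024, 5, 15, 3, tzinfo=timezone.utc)
backfill_cycles = cycles_between(since, till, HOURLY)

# update: everything after the last time already present in the store.
last_in_store = till
latest_available = datetime(2024, 5, 15, 5, tzinfo=timezone.utc)
update_cycles = cycles_between(last_in_store + HOURLY, latest_available, HOURLY)
```

When the store is already current, the update window is empty and there is nothing to ingest.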

Modal Cron Jobs cannot take arguments, so any configuration must be uploaded when running modal deploy. To configure the data pipeline, set up a TOML file in src/configs/. The location is arbitrary, but it's nice to keep it all in one place.

Approach

All writes are isolated to a new temporary branch.
Once all writes succeed, a snapshot is made to that branch.
Then main is reset to this new snapshot.
Finally the temporary branch is deleted.
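The four steps above can be sketched with a toy in-memory repository. `ToyRepo` and `ingest` are hypothetical names invented for this illustration; this is not the Icechunk or Arraylake API, and deleting the temporary branch after a failed write is an assumption of the sketch rather than something the steps above specify.

```python
import uuid


class ToyRepo:
    """Toy stand-in for a versioned store: branches point at immutable snapshots."""

    def __init__(self):
        self.snapshots = {"init": {}}      # snapshot id -> frozen contents
        self.branches = {"main": "init"}   # branch name -> snapshot id

    def create_branch(self, name, snapshot_id):
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        self.branches[name] = snapshot_id

    def commit(self, branch, changes):
        """Create a new snapshot on `branch` containing its parent plus `changes`."""
        parent = self.snapshots[self.branches[branch]]
        snapshot_id = uuid.uuid4().hex
        self.snapshots[snapshot_id] = {**parent, **changes}
        self.branches[branch] = snapshot_id
        return snapshot_id

    def reset_branch(self, name, snapshot_id):
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.branches[name] = snapshot_id

    def delete_branch(self, name):
        if name == "main":
            raise ValueError("refusing to delete main")
        del self.branches[name]

    def read(self, branch):
        return self.snapshots[self.branches[branch]]


def ingest(repo, writes):
    """Apply `writes` (callables returning (key, value)) without exposing partial results."""
    tmp = f"tmp-{uuid.uuid4().hex[:8]}"
    repo.create_branch(tmp, repo.branches["main"])     # 1. isolate writes on a new branch
    try:
        staged = {}
        for write in writes:
            key, value = write()                       # may raise, e.g. a failed download
            staged[key] = value
        snapshot_id = repo.commit(tmp, staged)         # 2. snapshot once all writes succeed
        repo.reset_branch("main", snapshot_id)         # 3. point main at the new snapshot
    finally:
        repo.delete_branch(tmp)                        # 4. drop the temporary branch


repo = ToyRepo()
ingest(repo, [lambda: ("2024-05-15T00", "ok"), lambda: ("2024-05-15T01", "ok")])
print(repo.read("main"))
print(sorted(repo.branches))
```

The point of the pattern is that readers of main see either the old snapshot or the complete new one, never a half-written update.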

Execution

# Drive from the command line.
modal run modal_hrrr.py --mode "backfill" --since "2024-05-15"  --toml-file src/configs/hrrr-demo.toml

# Directly specify parameters `since`, `till`, `toml_file_path` in hrrr_backfill.
modal run modal_hrrr.py::hrrr_backfill

# Set up a repeating cron job to update the store.
# Both hrrr_update_solar and hrrr_verify are deployed.
modal deploy modal_hrrr.py

Configuration

Details of the pipeline are configured using a TOML file.

# Format
# ------
[arbitrary_job_name]
## Three Herbie Parameters
model = string
product = string
search = string
## Two Zarr store Parameters
store = string  e.g. s3://my-bucket/store.zarr
zarr_group = string e.g. "sfc/fcst"
## Schema
chunks = {string: int}, integer chunk size for dimension
renames = {str: str}, mapping from variable name in inventory to actual name when read with cfgrib

Here's a concrete example:

[job1]
model = "hrrr"
product = "sfc"
searches = [
  "(?:TMP|RH):2 m above ground|(?:GUST|DSWRF|PRATE):surface|TCDC:entire atmosphere",
]
store = "arraylake://earthmover-demos/hrrr"
zarr_group =  "solar/"
chunks = {x = 360, y = 120, time = 1, step = 19}
renames = {TMP="t2m", RH="r2", TCDC="tcc"}

Set model and product as necessary to have herbie find the right variables.
Set the output location using store and zarr_group within the store. For example, store can be s3://my-bucket/forecast-datacube.zarr.
chunks specifies the chunk size along each dimension of the Zarr arrays.
To set up the searches we recommend iterating interactively with FastHerbie.inventory(search=searches) and making sure you see all data that's needed.
The renames field is harder. It is necessary because searches (and herbie) only know variable names as written in the .idx sidecar files. Those names are not necessarily preserved when reading the GRIB files with cfgrib. Annoyingly, we do not know a priori what names cfgrib will choose to assign. Again, the best approach is to iterate in a notebook. Alternatively, simply run the pipeline with renames={}, and an error will be raised suggesting what to set.

For more examples see src/configs/.

Organization

hrrr-cube.ipynb : Notebook demonstrating analysis with the HRRR data cube.
modal_app.py : Core functions annotated to run with Modal.
modal_hrrr.py, modal_gfs.py : Model-specific functions. These are simple specializations for two models, HRRR and GFS, separated out for convenience.
src/:
1. lib.py: Core data structures and utilities. The most important data structure is ForecastModel. This is the base class that allows specialization to a specific model.
2. gfs.py : Contains GFS, a subclass of ForecastModel, specialized for GFS output.
3. hrrr.py : Contains HRRR, a subclass of ForecastModel, specialized for HRRR output.

Sharp edges

Make sure that the search string returns what you want. It is a good idea to use FastHerbie.inventory(search_string) to double-check.
- See https://herbie.readthedocs.io/en/stable/user_guide/tutorial/search.html for more.
cfgrib likes to rename variables, so what's in the dataset doesn't match what's in the search string. Specify renames as a dictionary that maps the variable name in the GRIB inventory file to the variable name set by cfgrib.
Multiple searches are not supported yet.
The combination of store and zarr_group cannot be repeated in the TOML file.
Modal functions with schedules cannot take arguments, so any TOML configuration files must be uploaded during modal deploy by bundling them in src/configs/.
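The advice about double-checking the search string can be tried offline with Python's `re` module. The sketch below assumes the search is applied as an unanchored regular expression against inventory lines; the sample lines are made up in the style of .idx entries, and `matching_lines` and `unexpected_variables` are hypothetical helpers, not Herbie functions.

```python
import re

SEARCH = "(?:TMP|RH):2 m above ground|(?:GUST|DSWRF|PRATE):surface|TCDC:entire atmosphere"

# Same search with a leading colon on each alternative, so names must match in full.
STRICT = ":(?:TMP|RH):2 m above ground|:(?:GUST|DSWRF|PRATE):surface|:TCDC:entire atmosphere"

# Made-up inventory lines, for illustration only.
SAMPLE_LINES = [
    ":TMP:2 m above ground:anl",
    ":RH:2 m above ground:anl",
    ":APTMP:2 m above ground:anl",
    ":TMP:surface:anl",
    ":GUST:surface:anl",
    ":TCDC:entire atmosphere:anl",
    ":UGRD:10 m above ground:anl",
]

WANTED = {"TMP", "RH", "GUST", "DSWRF", "PRATE", "TCDC"}


def matching_lines(pattern, lines):
    """Return the lines that an unanchored regex search would select."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def unexpected_variables(pattern, lines, wanted):
    """Return variable names that the pattern selects but that were not intended."""
    found = {line.split(":")[1] for line in matching_lines(pattern, lines)}
    return sorted(found - set(wanted))


print(unexpected_variables(SEARCH, SAMPLE_LINES, WANTED))
print(unexpected_variables(STRICT, SAMPLE_LINES, WANTED))
```

In this made-up sample, the unanchored pattern also picks up APTMP because "TMP:2 m above ground" occurs inside it; adding the leading colon avoids that. Whether a real inventory contains such near-duplicates depends on the model and product, which is why inspecting the actual inventory is still the reliable check.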


earth-mover/forecast-datacube-demo
