Low Latency Icechunk ERA5 Now Available on the Earthmover Data Marketplace

TLDR: We released Icechunk-ERA5, a performance-optimized, daily updating ERA5 data cube available now in the Earthmover data marketplace. Surface + pressure levels, ACID updates, dual chunking schemes, and no limits on consumption make this the best version of ERA5 available in the cloud today. Free and open for science, backed by a commercial SLA for industry.

ERA5: The Most Useful Weather and Climate Dataset

We work with weather, climate, and environmental data users across sectors, from finance and insurance to government and academia. It’s pretty clear to us that there’s a dataset that just about everybody needs: the ECMWF Reanalysis version 5 (ERA5). ERA5 blends historical observations with the data-assimilation technology of numerical weather prediction to answer the following question: what was the weather everywhere on Earth over the past 80 years? While it’s impossible to answer this question with perfect accuracy, ERA5 does basically the best job out there (at least until ERA6 comes out next year!). Moreover, it has become a key training and validation dataset for the latest generation of AI weather models.

Specifically, ERA5 is a gridded dataset with approximately 1/4-degree (25 km) spatial resolution and hourly temporal resolution, going all the way back to 1940. It includes everything from broadly familiar quantities like surface temperature and precipitation to more obscure meteorological fields like stratospheric potential vorticity.

There are myriad applications for ERA5. Here are five high-impact examples:

Renewable energy - wind and solar developers use multi-decade hourly wind, irradiance, and temperature fields to site farms, estimate long-term resource yield, and model grid supply variability.
Insurance & catastrophe risk - (re)insurers and cat-modelers reconstruct historical hazard footprints (wind, precipitation, temperature extremes) to price weather risk, build parametric triggers, and quantify portfolio exposure.
Agriculture & commodities - growers, agribusiness, and traders combine temperature, precipitation, soil moisture, and evaporation to drive crop-yield models, irrigation planning, and weather-driven commodity forecasting.
Climate research & model benchmarking - scientists treat reanalysis as the best gridded estimate of past atmospheric state to study climate variability and trends and to validate and train weather/climate (and ML) models.
Climate adaptation & infrastructure resilience - governments and engineers use the long, consistent record to derive design standards, flood/heat risk baselines, and downscaled scenarios for planning resilient infrastructure.

Since it’s so popular and so useful, it must also be really easy to acquire and use, right?

On one hand, yes: ERA5 is a free and open dataset which is widely interoperable with popular data science and GIS tools. On the other hand, the sheer scale of ERA5 is massive—over 10 PB of raw data! While not every use case needs all that data, many of the applications above require access to subsets ranging in size from 10 to 1000 TB. Furthermore, for high-performance AI and data analytics scenarios, the way the data are formatted and stored makes a big difference.
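To get a feel for those numbers, here is a back-of-the-envelope estimate of the uncompressed size of a single ERA5 variable. It is only a sketch: the grid dimensions (721 x 1440), the float32 storage, the 1940 through 2024 time span, and the 37 pressure levels are assumptions made for illustration, not figures taken from this post.

```python
from datetime import datetime

# Assumptions for illustration only: a regular 0.25-degree global grid
# (721 latitudes x 1440 longitudes), 4-byte float32 values, hourly steps
# from 1940-01-01 up to (not including) 2025-01-01, and 37 pressure levels.
N_LAT, N_LON = 721, 1440
BYTES_PER_VALUE = 4
N_LEVELS = 37


def hours_between(start: datetime, end: datetime) -> int:
    """Number of whole hours in the half-open interval [start, end)."""
    return int((end - start).total_seconds() // 3600)


n_hours = hours_between(datetime(1940, 1, 1), datetime(2025, 1, 1))
surface_bytes = N_LAT * N_LON * n_hours * BYTES_PER_VALUE
pressure_bytes = surface_bytes * N_LEVELS

print(f"hourly steps: {n_hours:,}")
print(f"one surface variable: {surface_bytes / 1e12:.2f} TB")
print(f"one pressure-level variable: {pressure_bytes / 1e12:.1f} TB")
```

Under these assumptions a single surface field comes to roughly 3 TB uncompressed and a single pressure-level field to over 100 TB, so a modest selection of variables quickly lands in the 10 to 1000 TB range.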

Options for Accessing ERA5 Today

Copernicus Climate Data Store

The official way to get ERA5 is from the Copernicus Climate Data Store of the Copernicus Climate Change Service, whose mission statement speaks for itself:

The C3S mission is to support adaptation and mitigation policies of the European Union by providing consistent and authoritative information about climate change. We offer free and open access to climate data and tools based on the best available science. We listen to our users and endeavour to help them meet their goals in dealing with the impacts of climate change.

Via the Copernicus Program, the European Union has supported the creation of many of the world’s most valuable and important environmental datasets, from Sentinel to ERA5, and made them freely available to the world. The economic impact of these efforts can be measured in the trillions, not to mention the lives saved through better environmental awareness.

The Climate Data Store comprises many elements, but most power users will be getting data from the CDS API. The CDS API is an HTTP interface to a file download queue. The user specifies the parameters of their request (which variables, which timesteps, etc), plus the format (either GRIB or NetCDF). The request goes into a queue. The user polls the API to find out when their request is ready and then downloads the file.
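As a rough illustration of that workflow, the sketch below builds a request for four timesteps of one surface variable and hands it to the `cdsapi` Python client. The helper names `build_request` and `download` are ours, and the request keys follow the CDS web form at the time of writing, so treat them as assumptions and check the form for your dataset. The actual submission is left commented out because it needs a CDS account and credentials.

```python
def build_request(variable: str, year: int, month: int, day: int,
                  hours: list[int]) -> dict:
    """Build a CDS request body (hypothetical helper; keys are assumptions)."""
    return {
        "product_type": ["reanalysis"],
        "variable": [variable],
        "year": [f"{year:04d}"],
        "month": [f"{month:02d}"],
        "day": [f"{day:02d}"],
        "time": [f"{h:02d}:00" for h in hours],
        "data_format": "netcdf",  # or "grib"
    }


def download(request: dict, target: str) -> None:
    """Submit the request, wait in the queue, then download the file."""
    import cdsapi  # third-party client; reads credentials from ~/.cdsapirc

    client = cdsapi.Client()
    # retrieve() queues the request and polls until the file is ready.
    client.retrieve("reanalysis-era5-single-levels", request, target)


request = build_request("2m_temperature", 2020, 1, 1, [0, 6, 12, 18])
print(request)
# download(request, "era5_t2m_20200101.nc")  # uncomment to submit for real
```

Every file obtained this way is a separate trip through the queue, which is why this pattern suits small pulls better than bulk access.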

This works great for a single timestep from the archive. It works fine for dozens of timesteps.

But what if you want 👏 every 👏 single 👏 timestep 👏 from the entire 👏 ERA5 👏 archive 👏? And what if you want to run demanding AI and analytics workloads in the cloud against the data all day and all night?

You can’t just download the data on the fly every time you access it. You need to build a mirror of ERA5 by pulling the data out of CDS and moving it into your own storage. This is what most serious teams do today. And maintaining a large replica, whether in cloud object storage or a commercial business lakehouse, represents a major storage, compute, and personnel cost.

ARCO Enters the Chat

Once you’ve committed to building your own archive, you’re no longer constrained to store the data as GRIB or NetCDF files. You might as well transform them to the format and structure that is optimal for your application.

Today, serious weather and climate teams want Analysis-Ready, Cloud-Optimized (ARCO) data. Our team coined the term ARCO in a 2021 paper. Since then, it has really caught on (see e.g. here and here).

Icechunk-ERA5 takes raw NetCDF/GRIB files, ingests them daily into a dual-chunked Icechunk data cube, and serves them to downstream workloads via Xarray and Zarr.

With ARCO data, rather than having a stack of 1000s of GRIB files, you have a single Zarr store representing the entire spatial and temporal extent of the dataset—sometimes called a data cube. A data cube is much more ergonomic for users to slice, dice, analyze and visualize. It’s also way more performant when streaming data from object storage to cloud compute nodes and GPUs. (Check our blog post on I/O-maxing tensors in the cloud for the numbers.)

While many teams build and maintain their own custom ARCO ERA5 stores via bespoke pipelines, there are also several publicly accessible mirrors of ERA5 in different shapes and flavors. Our offering adds to an already crowded list. However, as we explain below, there’s nothing out there quite like Icechunk-ERA5.

Non-CDS ERA5 Mirrors

Beyond the Copernicus CDS, ERA5 is available through a number of other mirrors and platforms:

NSF NCAR RDA / GDEX (d633000) - A complete archive in netCDF and GRIB, hosted by NSF NCAR.
NSF NCAR Curated ERA5 on AWS (nsf-ncar-era5) - The NCAR collection rehosted on S3.
Google ARCO-ERA5 (gs://gcp-public-data-arco-era5) - An analysis-ready Zarr version on Google Cloud.
DestinE Earth Data Hub (ERA5 Zarr mirror) - A cloud-native Zarr mirror served over HTTPS as part of the EU’s Destination Earth platform.
ECMWF ERA5 on AWS (era5-pds) - An early surface-variable subset on S3, now deprecated.
Microsoft Planetary Computer (ERA5-PDS) - A Zarr version of the same subset, accessed via STAC.
Google Earth Engine (era5 datasets) - ERA5 and ERA5-Land subsets available within the Earth Engine platform.

Each makes different tradeoffs across format, coverage, freshness, and access, summarized in the table below. (Spoiler alert: we also added Icechunk-ERA5 to the table.)

Source	Format	Variable coverage	Time range	Update frequency	Access model	Key limitation
Copernicus CDS	GRIB / netCDF	Full (incl. 3D)	1940–present	Daily	Request + retrieve queue	Not analysis-ready; queued, rate-limited
NCAR RDA / GDEX	netCDF-4 / GRIB	Full (incl. 3D)	1940–present	~3–4 mo	HTTPS / THREDDS / Globus	Not cloud-native; file-based
NCAR Curated on AWS	netCDF-4	Full (incl. 3D)	1940–present	~3–4 mo	S3 (open)	File-based, not Zarr
Google ARCO-ERA5	Zarr	Full (incl. 3D)	1940–present	~1 wk	GCS (open)	Monthly batches; ERA5T overwritten in place; no versioning
DestinE Earth Data Hub	Zarr (STAC)	Full (incl. 3D)	1940–present	~1 mo	HTTPS + token	Monthly quota; no versioning; lossy compression; not hosted in commercial cloud
AWS ERA5-PDS	Zarr	~18 surface vars	1979–2020	Monthly (deprecated)	S3 (open)	Small subset; no longer maintained
MS Planetary Computer	Zarr (STAC)	~18 surface vars	1979–2020	Monthly	STAC + token	Same stale subset as ERA5-PDS
Google Earth Engine	Raster tiles (EE)	Surface subset (~9) + ERA5-Land	1940–present	~3 mo behind	EE API / `xee`	No 3D fields; platform dependence; commercial restrictions
Earthmover Icechunk-ERA5	Zarr / Icechunk	35 surface fields, 8 pressure-level fields	1940–present	Daily	Arraylake	Login required

We surveyed existing customers in finance, insurance, and AI research to understand why they were creating their own Zarr-based ERA5 datasets rather than using an existing one from the above list. We also talked to our existing Marketplace data providers to see whether anyone was interested in taking on the task of providing an Icechunk-based ERA5. When we learned this was not on anyone else’s immediate roadmap, we decided to spec and build our own version of this essential dataset. That became Icechunk-ERA5.

Earthmover’s Icechunk ERA5

We’re pretty proud of Icechunk-ERA5. We’ve been working closely with high-powered weather data teams at major hedge funds and commodities traders to understand their usage patterns, operational requirements, and performance expectations. We’ve built the best possible ERA5 data cube for these applications, and we’re delivering it in a way that makes it a viable choice for enterprises: backed by an SLA. (We also have a free version of the dataset available for more casual users.)

The Fastest Queries, No Quotas or Limits

Icechunk-ERA5 is located in AWS us-east-1, the region where many financial and insurance companies operate. There are no artificial quotas or limits on consumption. Want to run AI training pipelines 24/7 against the data? No problem. Want to build a low-latency operational service on top of the timeseries data? Great!

Not in us-east-1? Our forthcoming subscription replication engine can sync the data to your cloud and region of choice.

Daily Updates and Status Metadata

ECMWF publishes a new near-real-time update to ERA5 every day (ERA5T). Our pipeline ingests this update as soon as it becomes available (this is backed by our SLA; see below). Once the preliminary data are updated with the final data (approximately 2-3 months later), we update that too. We track all of these changes via easily accessible status flags, so users know exactly what they are working with.

Normally, mutating Zarr stores in this way would be risky or disruptive. However, here it’s perfectly safe and reliable, thanks to Icechunk’s transactions!

Icechunk for ACID Updates and Versioning

While there are several big Zarr-based ERA5 datasets out there, ours is the only one that uses Icechunk as the storage engine. Icechunk turns Zarr into an ACID-compliant database. All updates are atomic: readers never see incomplete or corrupt data. Version history is serializable and auditable. These things matter for data teams with real skin in the game.

The daily updates described above, together with the temporal rechunking described below, create a potential nightmare scenario for “standard” Zarr. Simply overwriting Zarr datasets (both chunks and metadata) while other users are accessing them can lead to errors or silent corruption of downstream results. Instead, with Icechunk, each daily update is packaged into a single atomic commit.
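To make the idea of an atomic commit concrete, here is a toy model in plain Python. It is not Icechunk’s implementation or API; `ToyRepo` and its methods are hypothetical names invented for this sketch. It shows the general pattern of building a complete new snapshot off to the side and then publishing it by moving a single branch pointer, so a reader sees either the old state or the new one and never a mixture.

```python
import itertools


class ToyRepo:
    """Toy model (not Icechunk): immutable snapshots plus one branch pointer."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._snapshots = {0: {}}  # snapshot id -> {chunk key: value}
        self._head = 0             # the "branch" points at one snapshot

    def read_session(self) -> dict:
        # A reader pins whichever snapshot is current when it opens.
        return self._snapshots[self._head]

    def commit(self, changes: dict) -> int:
        # Build the complete new snapshot off to the side...
        new_snapshot = {**self._snapshots[self._head], **changes}
        new_id = next(self._ids)
        self._snapshots[new_id] = new_snapshot
        # ...then publish it by moving the branch pointer in one step.
        self._head = new_id
        return new_id


repo = ToyRepo()
repo.commit({"t2m/2025-01-01": "preliminary"})
reader = repo.read_session()  # opened before the next daily update

# One daily update: finalize an old day and append a new preliminary day.
repo.commit({"t2m/2025-01-01": "final", "t2m/2025-01-02": "preliminary"})

print(reader)               # still exactly the first commit
print(repo.read_session())  # new readers see the whole second commit
```

Because old snapshots are never modified, the same structure also gives you a version history that can be inspected after the fact.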

As a bonus, Icechunk also brings best-in-class I/O performance, providing network-saturating read throughput to keep your analytics workflows and AI training pipelines humming.

Dual Chunking and Lossless PCodec Compression

Data teams agonize over choices about chunking and compression—and for good reason! These choices really impact the performance and cost of workloads that run on the data. And when you’re talking about petabytes of data, these choices are extremely hard to reverse.

For Icechunk-ERA5, we’ve drawn on a decade of experience in optimizing cloud-native tensor-data workloads. We’ve landed on a dual chunking scheme. The same data is stored in two different ways; users can choose the version that is optimal for their queries:

Spatially optimized, for queries that read the whole globe at once (AI model training, mapping, etc). Load a full variable at 1/4-degree resolution in 80 ms.
Temporally optimized, to make timeseries analysis fly! Load a 25-year subset of hourly data in 250 ms.
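The reason two layouts help can be seen by counting how many chunks each kind of query has to read. The sketch below does that arithmetic for two chunk shapes that we made up for illustration; the actual chunk shapes used in Icechunk-ERA5 are not stated here, and `chunks_touched` is a hypothetical helper.

```python
from math import ceil

# Hypothetical chunk shapes in (time, latitude, longitude) order.
# These are illustrative assumptions, not the real Icechunk-ERA5 layout.
SPATIAL_CHUNKS = (1, 721, 1440)   # one whole global field per chunk
TEMPORAL_CHUNKS = (8760, 8, 8)    # about a year of hours over a small tile


def chunks_touched(query_shape, chunk_shape) -> int:
    """Chunks read by a query whose origin is aligned to the chunk grid."""
    n = 1
    for q, c in zip(query_shape, chunk_shape):
        n *= ceil(q / c)
    return n


global_map = (1, 721, 1440)       # one timestep, whole globe
point_series = (25 * 8760, 1, 1)  # ~25 years of hours at one grid point

for name, chunks in [("spatial", SPATIAL_CHUNKS), ("temporal", TEMPORAL_CHUNKS)]:
    print(name,
          "map:", chunks_touched(global_map, chunks),
          "series:", chunks_touched(point_series, chunks))
```

With these assumed shapes, the point timeseries touches 25 chunks in the temporal layout but 219,000 in the spatial one, while the global map touches 1 chunk in the spatial layout and 16,380 in the temporal one. Neither layout is good at both queries, which is the motivation for keeping two copies.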

One obvious downside of this is that we’ve stored the data twice. To mitigate this, we also use the state-of-the-art PCodec for compression, without applying any lossy compression techniques (e.g. bit rounding, as used by the Earth Data Hub ERA5). As a result, the data in Icechunk-ERA5 is bit-for-bit identical to what you get from the CDS, but occupies ~30% less storage.
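The difference between lossless and lossy compression is easy to demonstrate with the standard library. In the sketch below, `zlib` stands in for a lossless codec (it is not PCodec), and `truncate_mantissa` is a simplified, hypothetical stand-in for bit rounding that zeroes the low mantissa bits of each float32. The sample values are synthetic.

```python
import struct
import zlib


def truncate_mantissa(x: float, keep_bits: int) -> float:
    """Zero all but `keep_bits` of the 23 float32 mantissa bits (lossy)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF ^ ((1 << (23 - keep_bits)) - 1)
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


# Synthetic "temperatures" in kelvin, stored as little-endian float32.
values = [273.15 + 0.01 * i for i in range(1000)]
raw = struct.pack("<1000f", *values)

# Lossless: compress, decompress, compare bytes.
lossless_roundtrip = zlib.decompress(zlib.compress(raw))

# Lossy: discard low mantissa bits before storing.
truncated = struct.pack("<1000f", *(truncate_mantissa(v, 10) for v in values))

print("lossless round trip is bit-identical:", lossless_roundtrip == raw)
print("truncated data is bit-identical:", truncated == raw)
print("compressed sizes (original, truncated):",
      len(zlib.compress(raw)), len(zlib.compress(truncated)))
```

Discarding mantissa bits typically makes the data more compressible, but the original values cannot be recovered afterwards. A lossless codec gives up that extra saving in exchange for an exact round trip.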

But even without this optimization, our model works because we are amortizing the storage costs over many users. Doing this right means more useful data AND lower costs for everyone.

Commercial SLA and Free Options

Having the best and fastest version of ERA5 means nothing if you’re not willing to stand behind it contractually. That’s why we’re offering a commercially licensed version of the daily-updating ERA5 data cube backed by a Service-Level Agreement (SLA). We’re committed to maintaining the data with the reliability that the most serious applications—from finance to emergency management—demand. Yes, ERA5 itself is free from the CDS. But the work that goes into running an operational pipeline to transform petabytes of NetCDF files into an optimized ARCO data cube represents a real burden, which we are happy to shoulder for our customers.

We also recognize the value and impact of free and open data for research and open science. We’re proud to also be releasing a free version of our Icechunk ERA5 data cube! The only difference between the free and paid versions is the latency; the free version is updated every three months, rather than daily. We’ve found latency to be a useful metric to disambiguate between casual research and serious operational use cases; if you need daily ERA5 updates, you probably also need an SLA.

As champions of the open data movement, we’ve thought long and hard about how to make open access to quality analysis-ready, cloud-optimized data sustainable. We’ve tried government grants. We’ve tried philanthropy. We’ve seen some of the most well-meaning efforts to bring free data into the cloud peter out and become abandoned. (Remember the old ERA5 dataset in the AWS open data program?)

Here we’re doing something different. By delivering a best-in-class dataset in a way that is commercially viable for enterprise consumption, we’re building a sustainable business model around ARCO data in the cloud. This allows us to also serve the research community with a higher-latency free version of the same data.

We’re excited about this! We’re also eager for your feedback.

How to Get Started

We recommend starting with the free ERA5 on the Earthmover Data Marketplace. Just log in and start querying! You’ll get the best results running from or near us-east-1, but the data are accessible from anywhere. From the marketplace listing, click on Direct Access to browse and explore the repo directly.

The listing README gives code snippets you can use to get started. Here’s a very quick version of the same:

from arraylake import Client
import xarray as xr

# authenticate against Arraylake
client = Client()
client.login()  # requires free Earthmover account

# open a read-only session on the main branch of the public ERA5 repo
repo = client.get_repo("earthmover-public/era5")
session = repo.readonly_session(branch="main")

# open the "single/temporal" group lazily; chunks=None skips Dask
ds = xr.open_zarr(session.store, group="single/temporal", chunks=None)

# pull a timeseries
ts = ds.t2m.sel(longitude=106, latitude=4, valid_time=slice("2000", "2025")).load()

If you like what you see and are interested in the low-latency commercial version, hit us up at sales@earthmover.io to learn more about our commercial offering.