Latest Posts

This post describes the fundamentals of Earth-Observation datacubes, outlines the basic Python building blocks for creating Zarr-backed datacubes, and presents a scalable serverless approach to building large-scale datacubes that is cost-effective, reliable, and performant.

This is a blog version of a webinar that took place on April 16, 2024. Earth Observation satellites generate massive volumes of data about our planet, and these data are vital for confronting global challenges. Satellite imagery is commonly distributed as individual “scenes”, each a single file consisting of a single image of a tiny part of the Earth. Popular public satellite programs such as NASA / USGS Landsat and Copernicus Sentinel produce millions of such images a year, comprising petabytes of data. Increasingly, we see organizations looking to aggregate raw satellite imagery into more analysis-ready datacubes. In contrast to millions of individual images sampled unevenly in space and time, Earth-system datacubes contain multiple variables, align…

A status update on the development of Zarr-Python 3.

Note: This post was originally published on the Zarr developer blog.

We released Zarr-Python 2.18.0 this week. Although this release was quite light in terms of user-facing changes, it represents the beginning of a new phase for the project. In this post, we’ll walk through our plan for Zarr-Python 3.0 and what users of the library can expect in the coming months.

Zarr-Python 2.18

Before we get into the 3.0 release, we’ll first cover a few details about the 2.18 release series. The first thing to know is that we will continue to support 2.18 with bug fixes up until the release of 3.0. Additionally, we expect to use the 2.18 series to communicate changes in the Zarr-Python API, which will come in 3.0. For example, this week’s release included a number of new deprecation warnings for part…

How Arraylake transformed Sylvera's data system.

Situation Overview

Sylvera rates projects in the voluntary carbon market with the goal of enabling their customers to invest in the most meaningful initiatives. To produce these ratings, Sylvera relies on satellite imagery from providers such as Copernicus, USGS, and NASA. Prior to adopting Arraylake, the engineering team downloaded data as multiple GeoTIFF files stored on individual machines and ran algorithms on the data in a local environment. This process worked for a time, but as they began to scale they realized their workflow was not viable. They needed a modern platform to help them manage data and collaborate more effectively.

Solutions Assessment

As Sylvera looked to improve their data pipeline, they analyzed three solutions. One was building a tool in-house, which, …

We set up a high-performance PyTorch dataloader using data stored as Zarr in the cloud.

Machine learning has become essential to the use of weather, climate, and geospatial data. Sophisticated models such as GraphCast, ClimaX, and Clay are emerging within these domains. The advancement of these models is greatly influenced by the widespread availability of cloud computing resources, particularly GPUs, and the abundance of data stored in cloud repositories. Despite these advancements, there remains a lack of established best practices for efficiently managing machine learning training pipelines, owing to the diverse range of data formats used to store scientific data. In this blog post, we discuss an architecture that we have found highly effective for integrating multidimensional arrays from cloud storage into machine learning frameworks.

The problem

At …