Situation Overview
Sylvera rates projects in the voluntary carbon market so that their customers can invest in the most meaningful initiatives. To produce these ratings, Sylvera relies on satellite imagery from providers such as Copernicus, USGS, and NASA.
Prior to adopting Arraylake, the engineering team downloaded data as many separate GeoTIFF files stored on individual machines and ran algorithms on the data in a local environment. This process worked for a time, but as they began to scale, they realized the workflow was not viable. They needed a modern platform to help them manage data and collaborate more effectively.
Solutions Assessment
As Sylvera looked to improve their data pipeline, they evaluated three options:
- The first was building a tool in house. After scoping, it was clear this would take up developer time that could be better spent serving their core business.
- The second was a generic array database provider that would have required them to build their own layer on top to analyze their geospatial data.
- The third was Arraylake.
They selected Arraylake because it integrated with the open-source Python ecosystem (Xarray, Zarr) and could easily connect with their pipelines, which were written in Python.
Evaluation & Implementation
Because Sylvera’s pipelines were already written in Python, getting up and running only required pointing them at an Arraylake Zarr store. During an evaluation with Earthmover, they connected an existing pipeline to Arraylake, queried the data, and were able to skip the redownloading step.
Testing Arraylake on this small subset of data made it clear that the new pipeline was much quicker than their previous workflow. Rather than storing millions of individual GeoTIFFs, they could address all of their data as a single massive array spanning both space and time dimensions.
The Solution - Arraylake
Following the successful evaluation, Sylvera implemented Arraylake across all of their data. They currently use Arraylake to incrementally populate planetary-scale raster datasets. To do this, they declare a global empty dataset and ingest data on demand using a Dagster pipeline.
The ability to work incrementally has allowed them to reduce the compute power needed to ingest and analyze their data. Sylvera has also found Arraylake’s versioning critical to their business: it gives them a historical log of data updates, which they need for legal and regulatory audits.
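The idea of declaring a global empty dataset and filling it in on demand can be illustrated with a toy sketch in plain Python. This is not Sylvera's pipeline and not the Arraylake, Zarr, or Dagster API. The `LazyGrid` class, its method names, and the grid dimensions are hypothetical, chosen only to show why declaring a very large array can be cheap when only the chunks that have been written take up storage.

```python
from dataclasses import dataclass, field


@dataclass
class LazyGrid:
    """Toy stand-in for a chunked array store (hypothetical, for illustration)."""

    shape: tuple[int, int, int]  # (time steps, rows, cols) of the declared grid
    chunk: int                   # edge length of one square spatial chunk, in cells
    # Maps (time, chunk_row, chunk_col) -> 2D list; absent keys are "not ingested".
    chunks: dict = field(default_factory=dict)

    def write_chunk(self, t, chunk_row, chunk_col, values):
        """Ingest one chunk-sized tile for a single time step."""
        n_times, n_rows, n_cols = self.shape
        if not (0 <= t < n_times
                and 0 <= chunk_row < n_rows // self.chunk
                and 0 <= chunk_col < n_cols // self.chunk):
            raise IndexError("chunk lies outside the declared grid")
        if len(values) != self.chunk or any(len(r) != self.chunk for r in values):
            raise ValueError("tile must be exactly chunk x chunk cells")
        self.chunks[(t, chunk_row, chunk_col)] = values

    def read(self, t, row, col):
        """Return one cell, or None if its chunk has not been ingested yet."""
        n_times, n_rows, n_cols = self.shape
        if not (0 <= t < n_times and 0 <= row < n_rows and 0 <= col < n_cols):
            raise IndexError("cell lies outside the declared grid")
        block = self.chunks.get((t, row // self.chunk, col // self.chunk))
        if block is None:
            return None
        return block[row % self.chunk][col % self.chunk]


# Declaring the full grid allocates no cell data at all.
grid = LazyGrid(shape=(365, 18_000, 36_000), chunk=100)

# Ingest a single tile on demand; only this chunk is stored.
tile = [[0.5] * 100 for _ in range(100)]
grid.write_chunk(t=0, chunk_row=90, chunk_col=180, values=tile)

print(len(grid.chunks))            # number of chunks actually stored
print(grid.read(0, 9_050, 18_050)) # a cell inside the ingested tile
print(grid.read(0, 0, 0))          # a cell that has not been ingested
```

Chunked formats such as Zarr follow a similar principle: chunks that have never been written are not stored, and reads of those regions return a fill value. In a real pipeline the tile would come from a satellite data provider, and the write would go to a cloud-backed store instead of an in-memory dictionary.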
From Sylvera’s team:
“As Sylvera has scaled, our geospatial data has grown more complex. Prior to Arraylake, it was scattered as millions of individual files across multiple machines, making it tricky to manage and access. Arraylake’s focus on cloud object storage has allowed us to centralize our data, to standardize our access patterns, and to make it accessible across all environments from local dev machines to automated pipelines in the cloud. This is a necessary step for the business to be able to grow and leverage its data effectively.”
Freddie Ruxton, Lead Platform Engineer, Earth Data Platform Team
In addition to the product, the Sylvera team has derived tremendous value from the Earthmover team.
“Part of what we were buying into [when purchasing Arraylake] is the team are the experts in the space [who] know what the big problems to be solved are to unlock the power of Zarr and geocloud computing.”
Freddie Ruxton, Lead Platform Engineer, Earth Data Platform Team
Throughout the partnership, Earthmover has been extremely responsive to Sylvera’s needs and continues to release improvements to the product.
“The Earthmover team is working faster than we can diagnose things on our end.”
Daniel Jahn, Platform Engineer II, Earth Data Platform Team