December is here, and that means that thousands of Earth System Scientists are getting ready for the annual pilgrimage to AGU!
At my first AGU in 2011, I was a fresh-faced grad student, excited to present my latest research on modeling Southern Ocean circulation. Since then, my relationship with this massive conference has evolved a lot. First it was an opportunity to reunite with old grad-school friends and catch up on the latest science. Then it became an important chance to help my own students and postdocs build their networks and find their next job opportunities. Now I’m excited to return wearing a new hat: as a vendor in the exhibit hall! 🤠
The Earthmover crew will be on site for the duration of the conference. Come find us at booth 1007!
We’ll be sharing demos of Arraylake, our new data lake platform for scientific data. We’re excited to chat with anyone interested in cloud-native scientific data workflows. If you’re interested in connecting at AGU, feel free to drop me a line: firstname.lastname@example.org.
Earthmover grew out of the Pangeo Community, an international open-science collaboration focused on scalable, reproducible big-data analytics. We’re thrilled that there are over 60 Pangeo-related talks happening at AGU this year! There will also be a Pangeo dinner meetup on Tuesday, Dec. 12. It’s great to see the Pangeo Community growing and thriving; we’re honored to be a part of it.
We’ll also be taking part in the conference itself, where Earthmover team members will be giving three presentations:
- Building blocks for a data informed future: data, tooling, and collaboration (Invited)
- Arraylake: A Cloud Native Data Lake Platform for Earth System Science
- CF-Xarray: Scale your analysis across datasets with less data wrangling and more metadata handling
AGU 2023 Presentation details
Tuesday, 12 December 2023 | 11:43 - 11:53 | 2020 - West
It is widely understood that weather- and climate-related extreme events will increase in number and intensity in the coming decades as climate change worsens. These extreme events will put at risk many of society’s social, financial, and physical assets. The necessity to forecast and adapt to these risks has spawned its own research area, and more recently, an emerging private industry. However, academia and private industry are both faced with the same basic problem: we don’t have sufficiently detailed and accurate predictions of future climate to inform decisions at local scales. As a result, various combinations of data fusion, modeling, and analytics are performed in an attempt to better forecast actionable climate risk.
What are the building blocks that are needed to create the data-informed future that we need? In this presentation, we will highlight the foundational data, software, and methodologies used by many in the climate risk analytics space today. Furthermore, drawing on our experience building open source software, public cloud-hosted climate datasets, and climate risk tools, we will lay out a vision for how the public and private sectors can effectively collaborate in the coming decades and how we can avoid unproductive siloing, particularly as we learn which approaches work well and which do not.
Ryan Abernathey and Joe Hamman
Wednesday, 13 December 2023 | 08:30 - 12:50 PST | Poster Hall A-C - South
The vast amount of earth system data available today is an incredible resource for understanding our planet and confronting the challenge of climate change. Traditionally, users have downloaded data to local computers for processing and analysis, but this way of working is becoming increasingly infeasible as data volumes grow. With essentially infinite compute and storage capacity, cloud computing has the potential to revolutionize our interaction with climate data, allowing everyone to bring their own compute workloads to bear against a single shared copy of the data. Over the past years, via our work in the Pangeo project, we have prototyped a cloud-native approach to climate data in the cloud, combining scalable computing technologies such as Xarray and Dask with analysis-ready, cloud-optimized (ARCO) data in formats like Zarr. While these tools show great potential, they remain difficult to deploy and use in an operational context for many scientists and institutions.
Motivated by this challenge, we founded Earthmover, a company aimed at democratizing access to state-of-the-art cloud-native data analytics, and built Arraylake, a data platform which enables teams of any size to manage and analyze climate data in the cloud. Arraylake users can access high-quality public datasets alongside their own private data, all via the high-performance Zarr data standard. This talk describes Arraylake’s architecture, novel version control system for data, and approach to supporting all common climate data formats (NetCDF, HDF5, GRIB, TIFF, Zarr) via a single, user-friendly interface. Through a realistic end-to-end workflow demo, we illustrate how Arraylake helps overcome common data management challenges that have heretofore limited widespread adoption of cloud computing in earth system science.
IN34A-05 CF-Xarray: Scale your analysis across datasets with less data wrangling and more metadata handling
Deepak Cherian, Mattia Almansi, Kristen M Thyng, and Pascal Bourgault
Wednesday, 13 December 2023 | 19:40 - 19:50 | 2010 - West
There has been an explosion in the availability of terabyte- to petabyte-scale geoscience datasets, particularly on the cloud, prompting communities such as Pangeo to develop scalable tools and workflows for handling such big datasets. There is a parallel need for tools that enable the analysis of datasets from a wide variety of sources that each have their own nomenclature. Xarray is a Python package that enables easy and convenient labelled data analytics by allowing users to leverage metadata such as dimension names and coordinate labels. cf_xarray is an open-source, Apache-licensed Xarray extension that decodes the Climate and Forecast (CF) Metadata conventions adopted by the geoscience community, allowing users to extensively use standardized metadata such as “standard names” in their analysis pipelines. For example, the zonal average of an Xarray dataset ds is seamlessly calculated as ds.cf.mean("longitude") on a wide variety of CF-compliant datasets, regardless of the actual name of the "longitude" variable (e.g. "lon", "lon_rho", "long").
In this way, cf_xarray allows users to leverage the self-describing nature of CF-compliant datasets to ease the common task of “data wrangling” that precedes data analysis. Increasingly, cf-xarray also allows users to use metadata to write analysis pipelines that are agnostic to the specific nomenclature used in the datasets at hand.
cf_xarray also provides tools and heuristics to optionally guess absent attributes, allowing usage on incompletely tagged datasets. cf_xarray is now seeing adoption in other packages such as xESMF, a package for regridding of Xarray datasets; and NOAA’s Model Diagnostic Task Force (MDTF) diagnostic workflow for validating model simulations.
Our presentation will demonstrate the use of cf_xarray to build analysis pipelines that work on a wide variety of datasets, and describe successes and challenges with this approach.