For brevity's sake, we'll extract a shorter time series and only a few data variables.
import xarray as xr
uri = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_ds_sub = (
# Open the dataset
xr.open_zarr(uri, chunks={"time": 48}, consolidated=True)
# Select the near-surface level
.isel(level=0, drop=True)
# subset in time
.sel(time=slice("2017-01", "2018-01"))
# reduce to two arrays
[["2m_temperature", "u_component_of_wind"]]
)
era5_ds_sub
<xarray.Dataset> Size: 13GB
Dimensions:              (time: 1584, latitude: 721, longitude: 1440)
Coordinates:
  * latitude             (latitude) float32 3kB 90.0 89.75 89.5 ... -89.75 -90.0
  * longitude            (longitude) float32 6kB 0.0 0.25 0.5 ... 359.5 359.8
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
Data variables:
    2m_temperature       (time, latitude, longitude) float32 7GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
    u_component_of_wind  (time, latitude, longitude) float32 7GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
Here's an image of the "2m_temperature" variable at a single time step.
(era5_ds_sub.isel(time=-2)["2m_temperature"].plot(robust=True, figsize=(10, 6)))
Read vector data
ERA5 outputs are gridded datasets covering the entire globe, but we only need data for cities in Europe, so keeping the full global dataset around is unnecessary. We read a vector dataset of world cities from Hugging Face and convert it to a gpd.GeoDataFrame covering the European continent.
import geopandas as gpd
import pandas as pd
cities_df = pd.read_json(
"hf://datasets/jamescalam/world-cities-geo/train.jsonl", lines=True
)
cities_eur = cities_df.loc[cities_df["continent"] == "Europe"]
cities_eur = gpd.GeoDataFrame(
cities_eur,
geometry=gpd.points_from_xy(cities_eur.longitude, cities_eur.latitude),
crs="EPSG:4326",
).drop(["latitude", "longitude", "x", "y", "z"], axis=1)
cities_eur.head()
|    | city    | country | region          | continent | geometry                  |
|----|---------|---------|-----------------|-----------|---------------------------|
| 84 | Tirana  | Albania | Southern Europe | Europe    | POINT (19.81889 41.32750) |
| 85 | Durres  | Albania | Southern Europe | Europe    | POINT (19.44139 41.32306) |
| 86 | Elbasan | Albania | Southern Europe | Europe    | POINT (20.08222 41.11250) |
| 87 | Vlore   | Albania | Southern Europe | Europe    | POINT (19.48972 40.46667) |
| 88 | Shkoder | Albania | Southern Europe | Europe    | POINT (19.51258 42.06828) |
Sampling a raster data cube using vector geometries
We've created a vector dataframe (cities_eur) storing data about European cities. This object holds the geographic information needed to subset the ERA5 dataset era5_ds_sub: we want to extract the points from era5_ds_sub that align with the city geometries in cities_eur. The Xvec method xvec.extract_points() performs exactly this operation.
import xvec
era5_europe_cities = era5_ds_sub.xvec.extract_points(
cities_eur.geometry, x_coords="longitude", y_coords="latitude"
).drop_vars("index")
era5_europe_cities
<xarray.Dataset> Size: 36MB
Dimensions:              (time: 1584, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 18MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
    u_component_of_wind  (time, geometry) float32 18MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
Cool, now we have a 2-dimensional vector data cube! We went from a 3-d raster data cube with [time, latitude, longitude] dimensions to a 2-d data cube with [time, geometry] dimensions, where the only spatial dimension is geometry.
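Under the hood, point extraction amounts to a vectorized nearest-neighbour lookup along the spatial dimensions. Here is a minimal sketch of the same idea with plain Xarray on a small synthetic grid (the variable names and values are illustrative, not part of the ERA5 dataset):

```python
import numpy as np
import xarray as xr

# Synthetic 3-d raster cube: (time, latitude, longitude)
ds = xr.Dataset(
    {
        "temperature": (
            ("time", "latitude", "longitude"),
            np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4),
        )
    },
    coords={
        "time": np.array(["2017-01-01", "2017-01-02"], dtype="datetime64[ns]"),
        "latitude": [40.0, 41.0, 42.0],
        "longitude": [19.0, 20.0, 21.0, 22.0],
    },
)

# Two query points; indexers sharing dims="points" collapse the two
# spatial dimensions into a single "points" dimension.
lons = xr.DataArray([19.4, 20.1], dims="points")
lats = xr.DataArray([41.3, 41.1], dims="points")
sampled = ds.sel(longitude=lons, latitude=lats, method="nearest")

print(sampled["temperature"].dims)  # ('time', 'points')
print(sampled.sizes["points"])      # 2
```

Passing DataArray indexers with a shared dimension is what turns two spatial dimensions into one; Xvec does this for you and additionally attaches a GeometryIndex holding the CRS.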
Up until now, we have been performing lazy operations, meaning that we haven't actually loaded the ERA5 data we accessed from Google Cloud Storage into memory on our local machine. We'll do this now so that we can perform computations that require the data in memory.
from dask.diagnostics import ProgressBar
with ProgressBar():
era5_europe_cities = era5_europe_cities.load()
era5_europe_cities
[########################################] | 100% Completed | 223.03 s
<xarray.Dataset> Size: 36MB
Dimensions:              (time: 1584, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 18MB 271.9 278.4 ... 274.6
    u_component_of_wind  (time, geometry) float32 18MB 112.1 112.7 ... nan nan
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
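With the data in memory, each city occupies one position along the geometry dimension, so a single city's full time series is just a positional selection. A sketch on a tiny synthetic stand-in for era5_europe_cities (the points and values here are made up for illustration):

```python
import numpy as np
import xarray as xr
from shapely.geometry import Point

# Tiny stand-in vector data cube: 4 time steps, 2 city geometries
cube = xr.Dataset(
    {
        "2m_temperature": (
            ("time", "geometry"),
            np.array([[271.9, 278.4], [272.1, 278.0],
                      [270.5, 277.2], [271.0, 274.6]]),
        )
    },
    coords={
        "time": np.array(["2017-01-01", "2017-01-02",
                          "2017-01-03", "2017-01-04"], dtype="datetime64[ns]"),
        "geometry": [Point(19.82, 41.33), Point(19.44, 41.32)],
    },
)

# Positional selection drops the geometry dimension, leaving a 1-d time series
first_city = cube["2m_temperature"].isel(geometry=0)
print(first_city.dims)  # ('time',)
```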
Using a vector data cube
We've shown how to combine raster and vector data to create a vector data cube with two indexes (geometry and time). Now, let's use it!
1. Spatial indexing
We can select points based on geometry and proximity to other points using the xvec.query() method. As an example, let's find all cities within 0.5 degrees of the city with the highest single temperature estimate:
max_temp_city = era5_europe_cities.isel(
**era5_europe_cities["2m_temperature"].argmax(dim=["time", "geometry"])
)
max_temp_buffer = era5_europe_cities.xvec.query(
"geometry", max_temp_city["geometry"].data.item().buffer(0.5)
)
max_temp_buffer
<xarray.Dataset> Size: 38kB
Dimensions:              (time: 1584, geometry: 2)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 16B POINT (-16.2546158 28.4682386)...
Data variables:
    2m_temperature       (time, geometry) float32 13kB 280.3 280.3 ... 289.5
    u_component_of_wind  (time, geometry) float32 13kB 20.11 20.11 ... nan nan
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
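One caveat: .buffer(0.5) is a planar Shapely operation, so with EPSG:4326 coordinates the radius is measured in degrees rather than metres. A quick sketch of the buffer geometry on its own (the point below is rounded from the output above):

```python
from shapely.geometry import Point

city = Point(-16.25, 28.47)  # rounded (lon, lat) of the hottest city
zone = city.buffer(0.5)      # planar buffer: radius of 0.5 degrees

print(zone.geom_type)                             # Polygon
print(round(zone.bounds[2] - zone.bounds[0], 6))  # 1.0 (width in degrees)
```

For distances in metres, reproject the geometries to a projected CRS before buffering.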
2. Raster analysis
Now that we have a multidimensional vector data cube, we can leverage Xarray's functionality for computations and reductions along labeled dimensions. For example, let us compute seasonal means of the data:
era5_europe_cities_seasons = era5_europe_cities.groupby("time.season").mean()
era5_europe_cities_seasons
<xarray.Dataset> Size: 114kB
Dimensions:              (season: 4, geometry: 2844)
Coordinates:
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
  * season               (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    2m_temperature       (season, geometry) float32 46kB 279.9 282.4 ... 282.2
    u_component_of_wind  (season, geometry) float32 46kB 55.08 55.18 ... 48.57
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
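The "time.season" accessor maps each timestamp to one of the four meteorological seasons (DJF, MAM, JJA, SON), so the reduction runs purely along time and is unaffected by the geometry dimension. A minimal sketch on a synthetic daily series:

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of daily values, all equal to 1.0
times = pd.date_range("2017-01-01", "2017-12-31", freq="D")
da = xr.DataArray(np.ones(times.size), coords={"time": times}, dims="time")

# Group by meteorological season and average within each group
seasonal = da.groupby("time.season").mean()
print(sorted(seasonal["season"].values.tolist()))  # ['DJF', 'JJA', 'MAM', 'SON']
```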
Let's visualize a single season:
(
era5_europe_cities_seasons["u_component_of_wind"]
.sel(season="DJF")
.xvec.to_geodataframe(geometry="geometry")
.explore("u_component_of_wind")
)
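.explore() produces an interactive Folium map and needs a live notebook session; in a non-interactive environment, a static Matplotlib map of the same GeoDataFrame is a reasonable fallback. A sketch with a tiny stand-in GeoDataFrame (the points and values are illustrative):

```python
import geopandas as gpd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from shapely.geometry import Point

# Tiny stand-in for the seasonal-mean GeoDataFrame
gdf = gpd.GeoDataFrame(
    {"u_component_of_wind": [55.1, 48.6, 50.3]},
    geometry=[Point(19.8, 41.3), Point(19.4, 41.3), Point(20.1, 41.1)],
    crs="EPSG:4326",
)

# Static equivalent of .explore(): colour points by the data column
ax = gdf.plot(column="u_component_of_wind", legend=True, markersize=40)
ax.set_title("u_component_of_wind (DJF)")
plt.savefig("djf_wind.png", dpi=150)
```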