For brevity, we'll extract a shorter time series and only two data variables.

In [1]:
import xarray as xr

uri = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_ds_sub = (
    # Open the dataset
    xr.open_zarr(uri, chunks={"time": 48}, consolidated=True)
    # Select the near-surface level
    .isel(level=0, drop=True)
    # subset in time
    .sel(time=slice("2017-01", "2018-01"))
    # reduce to two arrays
    [["2m_temperature", "u_component_of_wind"]]
)
era5_ds_sub
Out[1]:
<xarray.Dataset> Size: 13GB
Dimensions:              (time: 1584, latitude: 721, longitude: 1440)
Coordinates:
  * latitude             (latitude) float32 3kB 90.0 89.75 89.5 ... -89.75 -90.0
  * longitude            (longitude) float32 6kB 0.0 0.25 0.5 ... 359.5 359.8
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
Data variables:
    2m_temperature       (time, latitude, longitude) float32 7GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
    u_component_of_wind  (time, latitude, longitude) float32 7GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>

Here's an image of the "2m_temperature" variable at a single time step.

In [2]:
(era5_ds_sub.isel(time=-2)["2m_temperature"].plot(robust=True, figsize=(10, 6)))
Out[2]:
<matplotlib.collections.QuadMesh at 0x7f0a6a52de80>

Read vector data¶

ERA5 outputs are gridded datasets covering the entire globe. We only need data for cities in Europe, so keeping the full global dataset around is wasteful and unnecessary. We load a vector dataset from Hugging Face and format it as a gpd.GeoDataFrame of point geometries covering the European continent.

In [3]:
import geopandas as gpd
import pandas as pd

cities_df = pd.read_json(
    "hf://datasets/jamescalam/world-cities-geo/train.jsonl", lines=True
)
cities_eur = cities_df.loc[cities_df["continent"] == "Europe"]
cities_eur = gpd.GeoDataFrame(
    cities_eur,
    geometry=gpd.points_from_xy(cities_eur.longitude, cities_eur.latitude),
    crs="EPSG:4326",
).drop(["latitude", "longitude", "x", "y", "z"], axis=1)
cities_eur.head()
Out[3]:
       city  country           region continent                   geometry
84   Tirana  Albania  Southern Europe    Europe  POINT (19.81889 41.32750)
85   Durres  Albania  Southern Europe    Europe  POINT (19.44139 41.32306)
86  Elbasan  Albania  Southern Europe    Europe  POINT (20.08222 41.11250)
87    Vlore  Albania  Southern Europe    Europe  POINT (19.48972 40.46667)
88  Shkoder  Albania  Southern Europe    Europe  POINT (19.51258 42.06828)

Sampling a raster data cube using vector geometries¶

We've created a vector dataframe (cities_eur) storing data about European cities. This object holds the geographic information needed to subset the ERA5 dataset era5_ds_sub. We now need to extract the points from era5_ds_sub that align with the city point geometries in cities_eur. We can use the Xvec accessor method xvec.extract_points() for this operation.

In [5]:
import xvec

era5_europe_cities = era5_ds_sub.xvec.extract_points(
    cities_eur.geometry, x_coords="longitude", y_coords="latitude"
).drop_vars("index")
era5_europe_cities
Out[5]:
<xarray.Dataset> Size: 36MB
Dimensions:              (time: 1584, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 18MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
    u_component_of_wind  (time, geometry) float32 18MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

Cool, now we have a two-dimensional vector data cube! We went from a 3-D raster data cube with (time, latitude, longitude) dimensions to a 2-D data cube with (time, geometry) dimensions, where the only spatial dimension is geometry.
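To get a feel for how much smaller the vector cube is, here's a back-of-the-envelope size comparison using the shapes reported in the reprs above (float32 = 4 bytes, 2 data variables):

```python
# Compare the in-memory footprint of the raster cube vs. the vector cube.
n_time, n_lat, n_lon, n_geom, n_vars = 1584, 721, 1440, 2844, 2

raster_bytes = n_time * n_lat * n_lon * 4 * n_vars
vector_bytes = n_time * n_geom * 4 * n_vars

print(f"raster cube: {raster_bytes / 1e9:.1f} GB")  # 13.2 GB
print(f"vector cube: {vector_bytes / 1e6:.1f} MB")  # 36.0 MB
print(f"reduction: {raster_bytes / vector_bytes:.0f}x")  # 365x
```

This matches the 13GB and 36MB sizes in the Dataset reprs: extracting city points shrinks the data by a factor of roughly 365.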

Up until now, every operation has been lazy, meaning we haven't actually loaded the ERA5 data we accessed from Google Cloud Storage into memory on our local machine. We'll do that now so that we can perform computations that require the data in memory.

In [6]:
from dask.diagnostics import ProgressBar

with ProgressBar():
    era5_europe_cities = era5_europe_cities.load()
era5_europe_cities
[########################################] | 100% Completed | 223.03 s
Out[6]:
<xarray.Dataset> Size: 36MB
Dimensions:              (time: 1584, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 18MB 271.9 278.4 ... 274.6
    u_component_of_wind  (time, geometry) float32 18MB 112.1 112.7 ... nan nan
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
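The general pattern at work here, cheaply recording deferred operations and executing them all in one triggered compute, can be mimicked in a few lines of plain Python. This is a toy sketch of the idea, not Dask's actual machinery:

```python
# Toy illustration of lazy evaluation: record a pipeline of deferred
# steps, and execute them only when .load() is called.
class Lazy:
    def __init__(self, source, steps=None):
        self.source = source      # zero-argument callable producing raw data
        self.steps = steps or []  # transformations recorded, not yet run

    def map(self, fn):
        # Building the pipeline is cheap: nothing is computed here.
        return Lazy(self.source, self.steps + [fn])

    def load(self):
        # Only now is the source read and each step applied in order.
        data = self.source()
        for fn in self.steps:
            data = fn(data)
        return data

pipeline = Lazy(lambda: list(range(5))).map(lambda xs: [x * 2 for x in xs])
print(pipeline.load())  # [0, 2, 4, 6, 8]
```

Everything before `.load()` (opening the Zarr store, subsetting, extracting points) was just pipeline building; the 223 seconds above were spent doing the actual reads and computation.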

Using a vector data cube¶

We've shown how to combine raster and vector data to create a vector data cube with two indexes (geometry and time). Now, let's use it!

1. Spatial indexing¶

We can select points based on geometry and proximity to other points using the xvec.query() method. As an example, let's find all cities within 0.5 degrees of the city with the highest recorded temperature:

In [7]:
max_temp_city = era5_europe_cities.isel(
    **era5_europe_cities["2m_temperature"].argmax(dim=["time", "geometry"])
)
max_temp_buffer = era5_europe_cities.xvec.query(
    "geometry", max_temp_city["geometry"].data.item().buffer(0.5)
)
max_temp_buffer
Out[7]:
<xarray.Dataset> Size: 38kB
Dimensions:              (time: 1584, geometry: 2)
Coordinates:
  * time                 (time) datetime64[ns] 13kB 2017-01-01 ... 2018-01-31...
  * geometry             (geometry) object 16B POINT (-16.2546158 28.4682386)...
Data variables:
    2m_temperature       (time, geometry) float32 13kB 280.3 280.3 ... 289.5
    u_component_of_wind  (time, geometry) float32 13kB 20.11 20.11 ... nan nan
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
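Conceptually, querying with a 0.5° buffer keeps every geometry whose distance from the buffered point is at most 0.5 degrees. A minimal sketch of that test in plain Python (Euclidean distance in degrees; the real query uses shapely geometries and a spatial index, and the coordinates below are just the two points from the result above):

```python
import math

def within_degrees(p, q, radius):
    """True if point q falls inside a circular buffer of `radius`
    degrees around point p (points given as (lon, lat) tuples)."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= radius

center = (-16.2546158, 28.4682386)  # the hottest city from the query above
cities = [(-16.3166, 28.4833), (19.8189, 41.3275)]
hits = [c for c in cities if within_degrees(center, c, 0.5)]
print(hits)  # only the nearby point survives the buffer test
```
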

2. Raster analysis¶

Now that we have a multidimensional vector data cube, we can leverage Xarray's functionality for computations and reductions along labeled dimensions. For example, let us compute seasonal means of the data:

In [8]:
era5_europe_cities_seasons = era5_europe_cities.groupby("time.season").mean()
era5_europe_cities_seasons
Out[8]:
<xarray.Dataset> Size: 114kB
Dimensions:              (season: 4, geometry: 2844)
Coordinates:
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
  * season               (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    2m_temperature       (season, geometry) float32 46kB 279.9 282.4 ... 282.2
    u_component_of_wind  (season, geometry) float32 46kB 55.08 55.18 ... 48.57
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
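Under the hood, "time.season" maps each timestamp's month to a meteorological season label (DJF, MAM, JJA, SON) and groups on that label. The same bucketing can be sketched with plain pandas — a toy illustration on a synthetic series, not xarray's implementation:

```python
import pandas as pd

# Map calendar months to meteorological seasons, then average per season,
# mirroring what groupby("time.season").mean() does along the time axis.
season_of = {12: "DJF", 1: "DJF", 2: "DJF",
             3: "MAM", 4: "MAM", 5: "MAM",
             6: "JJA", 7: "JJA", 8: "JJA",
             9: "SON", 10: "SON", 11: "SON"}

# Same 6-hourly time axis as the ERA5 subset (1584 steps).
times = pd.date_range("2017-01-01", "2018-01-31 18:00", freq="6h")
series = pd.Series(range(len(times)), index=times)
seasonal_mean = series.groupby(series.index.month.map(season_of)).mean()
print(seasonal_mean.index.tolist())  # ['DJF', 'JJA', 'MAM', 'SON']
```

Note the alphabetical ordering of the season labels in the output, which matches the season coordinate in Out[8]. Also note that with only Jan 2017 – Jan 2018 of data, "DJF" mixes two different winters.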

Let's visualize a single season.

In [10]:
(
    era5_europe_cities_seasons["u_component_of_wind"]
    .sel(season="DJF")
    .xvec.to_geodataframe(geometry="geometry")
    .explore("u_component_of_wind")
)
Out[10]:
[Interactive map of the DJF-mean u_component_of_wind at each city]