Wicked smaht dynamic map tile rendering of Icechunk/Zarr data with xpublish-tiles

Live dynamic tiling using Earthmover's Flux Tiles service

TL;DR

We are launching a new open-source library xpublish-tiles that powers our new Flux Tiles service, which allows Earthmover Platform users to view their data on a slippy map with dynamically rendered tiles at lower zoom levels than was possible previously.

Why

The premise of Flux, Earthmover’s data delivery service, is extremely simple: click a button, and get your data delivered over standard protocols to your application or framework of choice, without making a second copy of the data. Flux enables Earthmover Platform users to focus on building great data products instead of managing and scaling complex cloud infrastructure. Flux is an open-core product built on the XPublish framework, created by our CTO Joe Hamman. XPublish allows one to quickly and easily serve Xarray datasets over a REST API using FastAPI. XPublish plugins enable great extensibility and the ability to serve your protocol of choice. Flux, our fully managed service, makes it easy to deploy XPublish in a production setting by handling hosting, scaling, authentication / authorization, logging, and by integrating seamlessly with the customer’s data catalog.

With geospatial data, you almost always want to visualize it ASAP. The initial launch of Flux included the ability to run an OGC Web Map Service (WMS) service via the xpublish-wms plugin. This service proved to be immensely popular with our customers but immediately surfaced some challenges:

The data needed to be just right for the service to render it. For example, the data needed to be in lat/lon (EPSG:4326); and needed to be annotated with appropriate CF metadata.
Sometimes renders could be quite slow without much feedback to users.
Test coverage for xpublish-wms was low, and adding new features often surfaced new bugs.

Ultimately, the WMS service worked well for a small subset of the wide variety of customer data we handle. This didn’t sit right with our vision for how magical Arraylake & Flux should feel.

What

Two months ago, we embarked on a rewrite that would deliver faster renders and work with the wide variety of datasets on the Earthmover platform with minimal requirements imposed on the data. Today we are open-sourcing xpublish-tiles,the core library underlying our new Tiles service which serves dynamic tiles through the OGC Tiles API. This blogpost describes some aspects of the library, the thinking behind it, and how we used Claude Code to generate test cases and debugging tools, allowing us to move fast while maintaining correctness.

xpublish-tiles is designed to handle a wide variety n-dimensional array datasets stored in Zarr/Icechunk with support a wide variety of grids including rectilinear grids, raster grids defined by an affine transform, as well as curvilinear and (soon) unstructured grids common in climate modeling. These datasets may be global or regional, have increasing or decreasing latitude coordinates, and/or longitude coordinates in -180°→180° or 0°→360° convention.

A display of different types of geospatial data grids, including rectilinear, curvilinear, HEALPix, and projected — A selection of grids used by datasets in Arraylake. Regularly spaced, Rectilinear, grids defined by an affine transform (common to GeoTIFFs), grids in projected coordinate space, curvilinear meshes, discrete global grid systems, (reduced) gaussian grids used for atmospheric modeling, triangular meshes, and the Pepsi Globe (OK that's a joke 😀).

Fundamentally xpublish-tiles is an open-source GDAL-free dynamic tiling package with support for a wide range of grid types and excellent performance characteristics.

How

Lifecycle of a tile request

The rendering pipeline is roughly:

Receive a user request, translate to a CRS and a bounding box using morecantile for handling tile metadata.
Understand the grid and coordinate system for the requested data using cf-xarray.
Subset the data to that needed to render a single tile. To ensure contiguity of rendering between adjacent tiles with the API offered by datashader, we need to ensure that we have at least 1 data point outside the viewport. Doing this becomes extra fun when you need to handle the anti- and prime meridians with data that may use 0°→360° or a -180°→180° convention for longitude.
Transform the input coordinates to the output coordinates, if necessary, using pyproj. Address any discontinuities at the anti-meridian that arise from this transformation.
Load the data ASAP using xarray/ zarr/ icechunk async data loading.
Render the data using datashader.

Correctness

Before we experimented with ideas for faster renders, we had to be sure that we weren’t compromising render correctness. Pretty early on, we built out a test suite relying on a wide variety of synthetic datasets inspired by actual customer datasets.

Snapshot Testing

We chose to rely on snapshot testing (🥁) of rendered tiles with the pytest extension syrupy, where we generated an initial set of renders, saved them as PNGs, and constantly tested the current render against those saved renders for regressions.

For each dataset we render a set of tiles — initially generated using Claude Code instructed to generate tiles that tested edge cases like the equator, prime meridian, anti-meridian, or tiles that contained the corners of limited-area datasets.

For the most part these images need to look “right” , that is, there cannot be empty pixels within the dataset bounds, and there shouldn’t be obvious discontinuities between neighboring tiles. These properties were checked manually using an example Mapbox app. Once satisfactory, we saved these snapshots for future regression testing. As the code evolved, these snapshots need to be updated, commonly to address small changes at the boundaries. Comparing bytestrings representing images on the command line isn’t the best experience, so we had Claude Code generate us some very useful debugging tools. For example, running a failing test with pytest TEST_NAME --debug-visual pops up this wicked three panel figure comparing the expected saved rendering, the actual rendering, and a binary pixel map of differences with specific emphasis on transparency changes, which generally indicates an error in subsetting, padding, or unwrapping the coordinates after transformation. This code isn’t pretty but it’s cheap and proven very effective at diagnosing bugs! 👌🏾

tile rendering testing, snapshot testing with syrupy, xpublish-tiles debugging, regression testing for map tiles, Claude Code generated tools — Debugging a failing snapshot test with a three panel figure. On the left is the saved snapshot, in the middle is the new render which appears very similar to the expectation; and on the right, pixels with differences are colored in red.

Property testing

To build extra confidence that we could handle any dataset thrown at us, we built out a nascent suite of property tests using the excellent hypothesis library. The properties listed below are tested to hold regardless of whether the data are provided in -180°→180° or 0°→360° longitude convention, whether the latitude values are increasing or decreasing, and for arbitrary data shapes.

Property 1: For a generated dataset with global coverage, there must be no transparent pixel in any requested tile in any requested tiling system. Very simple, and very effective.

Rectilinear vs Curvilinear coordinates, datashader path optimization, anti-meridian seam fix — NOAA Global Forecast System data (courtesy of Dynamical) rendered with xpublish-wms on the left, and xpublish-tiles on the right. Note the vertical seam at the anti-meridian (180°W) in the left panel; this is a column of transparent pixels that should never exist for a dataset with global coverage.

Property 2: Next consider a dataset with rectilinear coordinates i.e. the coordinate system is defined by two 1D arrays along orthogonal axes. This is identical to a dataset with curvilinear coordinates generated by broadcasting the two 1D arrays against each other. Now the same tile rendered from both datasets must be identical. In practice, because of algorithmic differences in datashader, we settle for the images being extremely similar using the Structural Similarity metric from scikit-image.

Identical data on an identical grid represented two ways: rectilinear and curvilinear — Identical data on an identical grid represented two ways: rectilinear (two 1D coordinate vectors), and curvilinear (two 2D coordinate arrays).

These tests were absolutely crucial for ensuring that we can render a wide variety of datasets without unsightly seams or missing pixels resulting from bad subsetting and coordinate transformation logic. These tests also ended up detecting a few upstream bugs in the datashader library that we then fixed.

Performance

We started out with the original xpublish-wms design with

Sync FastAPI endpoint that executed sync but non-blocking data load in the FastAPI threadpool;
Always broadcasting the X and Y coordinate variables against each other; and
Transforming and rendering those broadcasted variables.

After some extensive profiling of our synthetic workloads with py-spy, we discovered that this approach was suboptimal in many cases. A key learning was that transforming coordinate values from the input to output coordinate system was taking ~50-60% of render time. Next we discuss a few tricks we used to speed up performance on the service. While faster renders are nicer, the real change here is our ability to render higher resolution data at lower zoom levels than was possible previously.

Async loading

xpublish-tiles provides an async endpoint, enabled by upstream work to support async loading of data in Xarray, which in turn rests upon extensive work to add async support in Zarr V3. While FastAPI does use worker threads to handle multiple concurrent requests to sync endpoints, control over compute resources is lost by the user. By creating async routes instead, xpublish-tiles can allow compute bound work (reprojection & rendering) to happen concurrently with I/O bound data loading, while also setting specific resource limits for worker threads to carry out compute bound workflows. In the original WMS implementation, Out of Memory (OOM) errors were common because as compute bound image requests piled up, many threads could be doing intensive rendering in parallel.

Rectilinear rendering is fast

datashader has 3 rendering paths: one for uniformly spaced 1D coordinates (“raster”), one for non-uniformly spaced 1D coordinates (“rectilinear”), and one for 2D coordinate variables (“curvilinear”).

+---------------------+---------------------+---------------------+
|   RASTER            |   RECTILINEAR       |   CURVILINEAR       |
+---------------------+---------------------+---------------------+
|                     |                     |                     |
| o---o---o---o---o   | o-o---o----o-o      |    o---o---o---o    |
| |   |   |   |   |   | | |   |    | |      |   /   /   /   /     |
| o---o---o---o---o   | o-o---o----o-o      |  o----o--o---o      |
| |   |   |   |   |   | | |   |    | |      | /   /    /  /       |
| o---o---o---o---o   | o-o---o----o-o      | o--o----o---o       |
| |   |   |   |   |   | | |   |    | |      |  \   \    \  \      |
| o---o---o---o---o   | o-o---o----o-o      |   o---o---o---o     |
|                     |                     |                     |
|                     |                     |                     |
| Regular spacing     | Variable spacing    | Variable 2D spacing |
| dx = dy = constant  | dx & dy vary in 1D  | dx & dy vary in 2D  |
|                     |                     |                     |
| x[i] = i * dx       | x[i], y[j]          | x[i,j], y[i,j]      |
| y[j] = j * dy       |                     |                     |
+---------------------+---------------------+---------------------+

Here’s how long each path takes to render an image in milliseconds for a range of array shapes on a Macbook Air M2 (from datashader’s new benchmark suite that we contributed to). Note that this is after a very large improvement to the curvilinear code path.

Size (NxN)	256	512	1024	2048	4096	8192
raster	1 ms	1 ms	1 ms	5 ms	10 ms	30 ms
rectilinear	5 ms	8 ms	15 ms	30 ms	150 ms	600 ms
curvilinear	5 ms	8 ms	20 ms	100 ms	400 ms	1500 ms

Clearly it’s good to preserve the rectilinear nature of the grid as much as possible. Can that be done? Sometimes, it can!

Faster & smarter coordinate transforms

For most transformations between coordinate systems, an input uniformly spaced or rectilinear grid is transformed to a curvilinear grid. This is not true for the EPSG:4326 lat/lon to WebMercator (EPSG:3857) transform. And WebMercator is by far the most common choice for web map visualization.

Most coordinate reference systems choose to compromise between preserving area, preserving angles, and preserving distance. With WebMercator, Google chose speed 😆. Importantly, with WebMercator, x is a function of longitude only; and y is a function of latitude only. This means the transformation from lat/lon to WebMercator preserves rectilinearity, and scales linearly with number of points (nlon + nlat) instead of quadratically (nlon * nlat)! We reimplemented this transform and gained 3X faster total rendering time for a very common use case on our platform — high resolution O(5km) global data with rectilinear lat/lon coordinates being rendered as WebMercator tiles.

Coarsening

The single largest factor in determining rendering time is the number of data points. This scaling applies to both coordinate transformation and image rendering. We use Xarray’s coarsen or block average in non-overlapping windows to reduce the data down so that we are never rendering data that is more than 16X the output image size i.e. we have at most 4 input pixels per output pixel. In practice, this generally means we reduce the data down to a shape less than (3000, 3000) for rendering. While this is another compute step in a compute-bound pipeline, it ended up speeding up our benchmarks by a factor of 2-6, for cases where coordinate transformation was very expensive.

Pre-computed transformed coordinates

xpublish-tiles can also discover pre-computed transformed coordinates for the data using Climate and Forecast (CF) conventions for multiple grid mappings. In practice, this shifts the burden from the compute cost of computing the transformation to the I/O cost of reading extra data. We are still learning how well this works with real datasets.

Comparison with Alternatives

There are a good number of options out there for serving raster map tiles from original source data, both proprietary and open source. MapBox, Felt, ESRI, Fused, and Earthscale all offer some flavor of customizable raster tiling via a cloud-hosted service. Of these, only Fused and Earthscale support Zarr.

On the open source side, TiTiler is the main game in town. TiTiler originated as a way to dynamically serve tiles and mosaics from cloud-optimized GeoTIFFs. Later it grew Xarray (and thus Zarr) capabilities via the titiler.xarray module. So it’s natural to wonder how xpublish-tiles compares to TiTiler.

First, let’s say very clearly: we love TiTiler and the Development Seed team that maintains it. TiTiler is an amazing and powerful tool; we just had a different set of requirements to satisfy with xpublish-tiles. Note we’ve also used morecantile, an associated project, in xpublish-tiles. We’ve done our best at an honest and fair comparison here, but we welcome any feedback or corrections from the TiTiler team.

First, the features. TiTiler explicitly supports a wide range of input sources, while xpublish-tiles only serves tiles from Xarray datasets. Xarray can read just about any raster formats, so this difference is actually minimal. TiTiler definitely has more bells and whistles, supporting a wide range of expressions and shading / coloring techniques. In this respect, xpublish-tiles is more bare-bones, although such features are definitely on the roadmap. TiTiler’s ability to mosaic level-2 scenes on the fly is powerful and not something we plan to support in xpublish-tiles.

A big goal with xpublish-tiles was to have a package in which we could both understand and continuously improve rendering performance. TiTiler relies on GDAL for the core reprojection and resampling steps. While GDAL is certainly powerful, it’s not something that we can easily tweak or modify. (That’s a good thing! GDAL is super mature and stable.) Instead, as described above, we built a GDAL-free rendering pipeline in Python, leveraging high-performance libraries like datashader for some of the expensive steps, but also building on excellent GDAL-adjacent projects like PROJ.

We were pleasantly surprised to see that, in our benchmarking suite, xpublish-tiles compares very favorably to TiTiler and GDAL. In certain configurations with high-resolution grids, we found it to be 10x faster! We hypothesize that these differences are due to the cleverness around EPSG:4326 to WebMercator transformation (ifs and global_6km); the ability to specify precomputed reprojected coordinates (utm_50s), and because we block average the data on-the-fly while TiTiler expects to be used with pre-computed overviews for high resolution data (global_6km). Indeed, we designed these approaches in order to do well on these specific benchmarks! These approaches could certainly be ported over, following the best traditions of open-source.

Bar chart comparing the performance of xpublish-tiles against TiTiler across six different datasets, showing significantly faster speeds for xpublish-tiles on several. — **Disclaimer:** Benchmarking is hard and almost always biased. The benchmark suite is open source, and we welcome any expert tweaks to the TiTiler side which might impact results.

Our managed service–Flux Tiles–offers a number of conveniences and advantages over any self-hosted tile service, namely:

Fully managed service–enable tile serving with the click of a button with no infrastructure to manage.
Tight integration with the Arraylake data catalog
Dynamic autoscaling to support high load
Authentication and authorization, including fine-grained access control
Logging and metrics
Sophisticated caching layer tightly integrated with Icechunk, enabling further performance optimizations on the server side.

Limitations and Roadmap

As noted above, xpublish-tiles is very good at what it does, but is still lacking some important capabilities. Here are some of the big features on the roadmap:

Support for overviews stored via the GeoZarr multiscale specification. One of the biggest limitations of xpublish-tiles is that every single tile is generated dynamically from the original high-resolution data. This works just fine for typical weather and climate datasets. But for high-resolution EO data, it really bogs down at low zoom levels. One solution is to use pre-computed lower-resolution overviews, similar to how CoGs work. We’re excited to help push forward the standard for overviews in Zarr at the upcoming Zarr Summit!
RGB and false-color image support. Right now, xpublish-tiles only renders scalar fields (using your colormap of choice). We’re planning for flexible RGB support, including dynamic construction of false-color RGB images from hyperspectral bands.
Dynamic “pixel math”. Computing fields like NDVI on the fly is a must.

If you’re interested in getting involved in development, head over to the GitHub repo!

Summary

Earthmover now offers the most flexible and performant way to dynamically turn Zarr data into map tiles. Users can choose from:

A best-in-class open-source, self-hosted option: xpublish-tiles
A turnkey managed service: Flux Tiles

Earthmover Platform users now have access to a more flexible and more performant dynamic map tiler with a click of a button. We are finding that our clients love developing visualizations directly from the source data without having to generate a second copy of their data in other formats (e.g. Cloud-Optimized GeoTIFF).

We learned a lot building this new library, and we hope the blog post has helped pull back the covers a bit on how dynamic tiling works. At Earthmover, we think showing our customers exactly how our platform works, via open source, is a great way to win the trust of developers.

The underlying technology is now available in the open-source xpublish-tiles library. As an xpublish plugin, it can easily be added to any existing xpublish deployment. We also encourage you to try out xpublish-tiles locally. It comes with a CLI that can serve any local Icechunk/zarr dataset —

uv run xpublish-tiles —dataset icechunk://path/to/store —group group

Or if you’re interested in exploring Flux Tiles, our fully managed service, get started with Earthmover today!