Ship fast, break nothing: Engineering rigor in Icechunk with property and stateful testing
Forward Deployed Engineer
Staff Engineer
When we set out to build Icechunk, we faced a hard problem: write a database-inspired array format that slots into user workloads and patterns developed by a large scientific user community over a decade without breaking anything. Icechunk provides a Zarr store enhanced with ACID transactions and version control semantics. Such Zarr stores are sources and sinks in a wide variety of complex scientific workflows spanning weather, climate, geospatial, and other domains, usually in combination with other scientific Python libraries like Xarray and Dask. So to slot into existing workflows, we needed to be compatible with all the ways the combination of Xarray/Zarr/Dask gets used in the wild, under an accelerated timeline.
We were moving fast with Icechunk: the first code was committed on July 29, 2024; the first public alpha was released on October 10, 2024; and the first stable release, v1.0, landed on July 10, 2025. Meanwhile, Zarr V3 was in rapid development. We could write unit tests (which we did), but how would we be confident that Icechunk could handle anything the Xarray & Zarr communities would throw at us? We needed high-leverage ways to gain confidence and build rigor into Icechunk.
Our approach:
1. Wire up Icechunk to existing upstream test suites (Xarray/Zarr) however we could.
2. Use property and stateful testing ideas (generative, randomized testing) with the proptest crate and the Hypothesis Python library.
3. Write a small number of unit tests (primarily in Rust) covering core Icechunk functionality.
This post describes our approach to (1) and (2), designed and implemented in the run-up to the V1 release.
Upstream unit test suites
Our first choice was to wire up to existing test suites in upstream projects where we could. Zarr-Python intentionally makes this easy by providing an integration test suite to be used by libraries that implement the Zarr storage interface. Xarray, not so much.
Zarr provides a unit test suite that any Zarr store can tap into.
Using this test suite involves setting up a custom store, store_kwargs, and optionally overriding methods like set and get.
Roughly, this looks like (code):
from pathlib import Path
from typing import Any

import pytest
from zarr.core.buffer import Buffer
from zarr.testing.store import StoreTests

from icechunk import IcechunkStore, Repository, local_filesystem_storage

class TestIcechunkStore(StoreTests):
    # create appropriate IcechunkStore instances
    @pytest.fixture
    def store_kwargs(self, tmpdir: Path) -> dict[str, Any]:
        return {
            "storage": local_filesystem_storage(f"{tmpdir}/store_test"),
            "read_only": False,
        }

    @pytest.fixture
    async def store(self, store_kwargs: dict[str, Any]) -> IcechunkStore:
        read_only = store_kwargs.pop("read_only")
        repo = Repository.open_or_create(**store_kwargs)
        if read_only:
            session = repo.readonly_session(branch="main")
        else:
            session = repo.writable_session("main")
        return session.store

    # override `set` and `get` if needed
    async def set(self, store: IcechunkStore, key: str, value: Buffer) -> None:
        ...

    async def get(self, store: IcechunkStore, key: str) -> Buffer | None:
        ...

    # inherited tests:
    # - `test_get`
    # - `test_set`
    # - `test_list_dir`
This was a quick and easy way to gain confidence that we were implementing Zarr semantics properly.
Xarray's test suite, on the other hand, is emphatically not designed for external use. However, Xarray's storage backend unit suite has been in development for a decade and is effectively a comprehensive record of all the wild and wonderful ways people like to use Xarray & Zarr ("region writes", "appends", "overwrites", etc.). Let's use that to our advantage!
Luckily, the Xarray backends test suite is well engineered with base classes providing common tests that are executed with a variety of storage backends (Zarr/netCDF/h5netCDF).
With some minor rewiring to override the create and save methods, we could sub in Icechunk as a storage backend (code).
In the wonderfully wonky world of Python, we can check out the Xarray code, subclass the backends test suite base classes, muck around with some pytest config and voilà! xarray’s backend test suite now running with Icechunk on every PR (workflow code)!
import contextlib
from collections.abc import Generator

from xarray.tests.test_backends import ZarrBase

from icechunk import IcechunkStore, Repository

class IcechunkStoreBase(ZarrBase):
    @contextlib.contextmanager
    def create_repo(self) -> Generator[Repository]:
        # overridden by subclasses that create repos backed by S3,
        # local filesystem, or in-memory storage
        raise NotImplementedError

    @contextlib.contextmanager
    def create_zarr_target(self) -> Generator[IcechunkStore]:
        with self.create_repo() as repo:
            session = repo.writable_session("main")
            yield session.store

    # other test functions
This hack immediately paid off, turning up a gnarly prefix-matching bug (PR). Wiring this up gave us great confidence that an Icechunk store could be swapped for a Zarr store in many complex real-world Xarray workflows without breaking anything.
Property testing
The idea behind property testing is deceptively simple: when writing a test, assert properties of the output rather than exact values. Rather than writing complex unit test cases by hand, where one might unknowingly ignore edge-cases, one instead writes generators that produce random and arbitrarily complex inputs to the test. In other words, describe how to generate inputs, rather than describing the inputs themselves. Property tests are more than randomized tests. A real strength of the technique is the idea of shrinking: on encountering a failure, property testing frameworks will attempt to minimize the failing inputs (following certain rules) to generate a minimal failing test case along with reproduction instructions. For an entertaining introduction to property testing, see this talk by Prof. John Hughes (one of the pioneers in this field).
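To make the idea concrete, here is a toy property test (a generic illustration, not from the Icechunk codebase): instead of hand-picking inputs, we describe how to generate them and assert a property that must hold for all of them.

```python
from hypothesis import given, strategies as st

# Property testing in miniature: Hypothesis generates arbitrary lists of
# integers; we assert a property that must hold for every single one.
@given(st.lists(st.integers()))
def test_reverse_roundtrip(xs: list[int]) -> None:
    # reversing twice must return the original list, whatever the input
    assert list(reversed(list(reversed(xs)))) == xs

# Hypothesis runs the test body against many generated inputs; if the
# property ever failed, it would shrink the failing input to a minimal
# counterexample before reporting it.
test_reverse_roundtrip()
```

The test body is trivial; all the power is in the generator and in shrinking.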
This idea lends itself extremely well to data storage systems. Many of these Zarr tests are property tests in all but name, just being executed with specific inputs — “you can’t get what you deleted”, “you get what you set”, “you can’t delete what wasn’t set in the first place”, etc. In Zarr & Icechunk, we assert one simple property: what you set is what you get — a roundtrip test. This property is tested over a vast parameter space: arbitrary data types, shapes (both number of dimensions and dimension sizes), metadata, dimension names, chunk layouts, codecs, fill values — with and without sharding. Regardless of the exact choices, what you set should be what you get. Notice how the assertion is simple:
from typing import Any

import hypothesis.strategies as st
from hypothesis import given
from numpy.testing import assert_array_equal
from zarr.testing.strategies import arrays, numpy_arrays

# `icechunk_stores` is a strategy generating Icechunk-backed Zarr stores,
# defined in Icechunk's test suite

# pass in an arbitrary numpy array (arbitrary shape, dtype, data)
@given(data=st.data(), nparray=numpy_arrays())
def test_roundtrip(data: st.DataObject, nparray: Any) -> None:
    zarray = data.draw(
        # write a zarr array to a store with arbitrary chunking/sharding/fill_value/codecs
        arrays(
            stores=icechunk_stores(),
            arrays=st.just(nparray),
            zarr_formats=st.just(3),
        )
    )
    # we should get what we set earlier
    assert_array_equal(nparray, zarray[:])
while the generator (now upstreamed to Zarr) is complex.
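For a flavor of what such a generator involves, here is a heavily simplified sketch (illustrative only; the real strategies in Zarr are far more thorough): a composite strategy that first draws an array shape, then draws a chunk shape compatible with it.

```python
import hypothesis.strategies as st

# A toy composite strategy: draw an array shape, then a chunk shape where
# every chunk dimension is no larger than the corresponding array dimension.
@st.composite
def chunked_shapes(draw: st.DrawFn) -> tuple[tuple[int, ...], tuple[int, ...]]:
    ndim = draw(st.integers(min_value=1, max_value=4))
    shape = tuple(draw(st.integers(min_value=1, max_value=10)) for _ in range(ndim))
    chunks = tuple(draw(st.integers(min_value=1, max_value=s)) for s in shape)
    return shape, chunks
```

Strategies like this compose: a full `arrays` strategy draws a shape/chunk pair, then a dtype, fill value, codecs, and so on, with each draw constrained by the previous ones.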
Sometimes the easiest property test to write is to assert that a thing behaves like some other known thing (an “oracle”). We use this strategy for testing Zarr’s indexing behaviour. Given an arbitrary numpy array, and a Zarr array constructed from the numpy array, indexing both should yield the same result regardless of how the Zarr array is stored on disk (e.g. arbitrary chunk sizes, arbitrary dimension names, etc). Again while the assertion is simple:
import hypothesis.extra.numpy as npst
import hypothesis.strategies as st
from hypothesis import given
from numpy.testing import assert_array_equal
# strategies upstreamed to Zarr
from zarr.testing.strategies import orthogonal_indices, simple_arrays

@given(data=st.data())
async def test_oindex(data: st.DataObject) -> None:
    # generate a "simple" zarr array backed by arbitrary data
    zarray = data.draw(simple_arrays(shapes=npst.array_shapes(max_dims=4, min_side=1)))
    # read that data
    nparray = zarray[:]
    # generate a numpy indexer, and an equivalent Zarr indexer
    zindexer, npindexer = data.draw(orthogonal_indices(shape=nparray.shape))
    # index the numpy and Zarr arrays with the same indexer
    actual = zarray.oindex[zindexer]
    # returned values must match
    assert_array_equal(nparray[npindexer], actual)
the indexer generators are not! This test hasn’t been ported to Icechunk yet, but should be. (A PR to test out our manifest splitting logic with these generators is very welcome!).
Stateful testing of Zarr operations
Property tests are great for testing storage systems. However, since Zarr is a stateful and mutable storage system, users can type out arbitrary sequences of operations that mutate the store. For example:
# create some arrays & groups
zarr.create_array(store, "foo")
zarr.create_group(store, "bar", mode="w")
zarr.create_array(store, "bar/foo")
# now overwrite
zarr.create_group(store, "bar", mode="w")
# now grabbing `bar/foo` should fail!
Stateful testing is a harness that makes property testing of such systems easy.
The idea here is that we can generate an arbitrary sequence of valid operations, execute them, and after each operation, we can assert that expected invariants hold.
In a sense, we model “infinite monkeys typing out a valid sequence of Zarr operations and assert that the Zarr store maintains expected invariants at every step”.
Such invariants might look like “list_prefix output should include every array I’ve set in this session”.
One high-leverage way to use stateful testing is to build an extremely simple model of the system under test.
For simple get/set style tests, we use a simple Python dictionary as our model of a key-value store that we can then test any Zarr store against.
For more complex ops we simply test that the Icechunk store replicates the behaviour of a Zarr memory store (our oracle).
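A minimal sketch of the dict-as-model idea (illustrative; the real suites check actual Zarr stores): the dict is trivially correct, so any divergence points at a bug in the system under test.

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class KVStateMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.model: dict[str, bytes] = {}  # the trivially-correct model
        self.store: dict[str, bytes] = {}  # stand-in for the system under test

    @rule(key=st.text(min_size=1), value=st.binary())
    def set(self, key: str, value: bytes) -> None:
        self.model[key] = value
        self.store[key] = value  # in reality: call the store's set()

    @rule(key=st.text(min_size=1))
    def delete(self, key: str) -> None:
        self.model.pop(key, None)
        self.store.pop(key, None)  # in reality: call the store's delete()

    @invariant()
    def what_you_set_is_what_you_get(self) -> None:
        # after every operation, model and store must agree on every key
        assert self.store == self.model
```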
Such a stateful test suite looks like this (code):
import zarr
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule
from zarr.abc.store import Store
from zarr.storage import MemoryStore

class ZarrHierarchyStateMachine(RuleBasedStateMachine):
    """
    This state machine models operations that modify a zarr store's
    hierarchy. That is, user actions that modify arrays/groups as well
    as list operations. It is intended to be used by external stores, and
    compares their results to a MemoryStore that is assumed to be perfect.
    """

    def __init__(self, store: Store) -> None:
        super().__init__()
        # the system under test: IcechunkStore
        self.store = store
        # our model: Zarr's built-in MemoryStore
        self.model = MemoryStore()
        zarr.group(store=self.model)
        # track the state of the hierarchy; these contain fully qualified paths
        self.all_groups: set[str] = set()
        self.all_arrays: set[str] = set()

    @rule()
    def add_group(self): pass

    @rule()
    def add_array(self): pass

    @rule()
    def delete_chunk(self): pass

    @rule()
    def delete_group(self): pass

    @rule()
    def delete_array(self): pass

    @rule()
    def overwrite_array(self): pass

    # ... other array operations

    @invariant()
    def check_list_prefix_from_root(self) -> None:
        model_list = self.model.list_prefix("")
        store_list = self.store.list_prefix("")
        assert sorted(model_list) == sorted(store_list), (
            sorted(model_list),
            sorted(store_list),
        )
        # check that our internal state matches that of the store and model
        assert all(f"{path}/zarr.json" in model_list for path in self.all_groups | self.all_arrays)
        assert all(f"{path}/zarr.json" in store_list for path in self.all_groups | self.all_arrays)
On running this, Hypothesis will execute arbitrary sequences of functions decorated with rule, and after each execution run the checks in the functions decorated with invariant.
...
Adding group: path='st'
Checking 2 expected keys vs 2 actual keys
Draw 5 (Group deletion target): 'st'
Deleting group 'group_path='st'', prefix='', group_name='st' using delete
Checking 1 expected keys vs 1 actual keys
Adding group: path='gyr'
Checking 2 expected keys vs 2 actual keys
Draw 6 (Group parent): 'gyr'
Adding group: path='gyr/0mtc2f'
Checking 3 expected keys vs 3 actual keys
Draw 7 (Group deletion target): 'gyr/0mtc2f'
Deleting group 'group_path='gyr/0mtc2f'', prefix='gyr', group_name='0mtc2f' using delete
Checking 2 expected keys vs 2 actual keys
Draw 8 (Group parent): 'gyr'
Adding group: path='gyr/fnjjt62'
Checking 3 expected keys vs 3 actual keys
...
Adding array: path='g9g3' shape=(3,) chunks=RegularChunkGridMetadata(chunk_shape=(3,))
Checking 7 expected keys vs 7 actual keys
...
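For completeness, here is how such a state machine is typically executed (a generic sketch using Hypothesis's standard mechanism, not Icechunk's actual harness):

```python
from hypothesis.stateful import RuleBasedStateMachine, rule, run_state_machine_as_test

# A trivial state machine, just to show the two standard ways to run one.
class Counter(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.n = 0

    @rule()
    def increment(self) -> None:
        self.n += 1

# Hypothesis exposes each state machine as a unittest.TestCase,
# which pytest collects and runs like any other test...
TestCounter = Counter.TestCase
# ...or it can be run directly, e.g. to pass a custom factory/closure.
run_state_machine_as_test(Counter)
```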
On the Icechunk side, we subclass this stateful test suite and add Icechunk-specific operations (code) so that we can interleave Icechunk-specific operations with Zarr operations.
class ModifiedZarrHierarchyStateMachine(ZarrHierarchyStateMachine):
    def __init__(self):
        # suitably initialize with a new IcechunkStore
        pass

    @rule()
    def commit_with_check(self): pass

    @rule()
    def rewrite_manifests(self): pass

    @rule()
    def reopen_repository(self): pass
That commit_with_check is an important step!
It asserts that invariants are satisfied both before and after committing — because of the internal architecture of Icechunk’s core Rust library, this ends up asserting consistency of two mostly independent code-paths.
This invariant has proved extremely effective in practice, catching multiple bugs in early development.
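The shape of that check, sketched with dicts standing in for the model and the store (illustrative only; the real version checks an IcechunkStore session against Zarr's MemoryStore):

```python
# Sketch of the before/after-commit invariant pattern.
def check_invariants(model: dict, store: dict) -> None:
    # a "list_prefix"-style check: both sides must see the same keys
    assert sorted(model) == sorted(store)

def commit_with_check(model: dict, store: dict, commit) -> None:
    check_invariants(model, store)  # uncommitted, in-session write path
    commit()                        # persist through the commit path
    check_invariants(model, store)  # committed state: a mostly independent read path

# trivial stand-ins to exercise the pattern
model = {"a/zarr.json": b"{}"}
store = dict(model)
commit_with_check(model, store, commit=lambda: None)
```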
Stateful testing of Repo operations
The previous section discussed stateful testing of Zarr operations. Icechunk Repositories have their own semantics around version control, e.g. sessions, branches, tags, expiration, garbage collection etc. And again, users are allowed to execute arbitrary sequences of such operations.
We created another stateful test suite around these Repository operations.
Importantly, we deliberately narrow the input space by setting plain JSON metadata documents on IcechunkStore.
This way Hypothesis spends its effort exploring interesting combinations of Repository operations without getting distracted by arbitrary Zarr operations.
Such specialization is key to effective property & stateful testing: because these methods are a random search through input space, it pays to limit that input space to explore interesting regions that may contain bugs.
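For example (an illustrative sketch, not Icechunk's actual strategies), a deliberately narrowed metadata strategy might restrict keys and values to a tiny alphabet:

```python
import hypothesis.strategies as st

# Narrowed input space: short keys from a tiny alphabet, and flat JSON-ish
# values, so Hypothesis spends its budget exploring sequences of Repository
# operations rather than exotic payloads.
keys = st.text(alphabet="abc", min_size=1, max_size=3)
simple_metadata = st.fixed_dictionaries(
    {"attributes": st.dictionaries(keys, st.integers(0, 9), max_size=2)}
)
```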
Here again, we use the model-checking idea: we build a very simple model of an IcechunkStore backed by an in-memory dictionary and any extra metadata needed to reproduce Icechunk semantics. Such a test looks like (code):
import icechunk as ic
from hypothesis.stateful import Bundle, RuleBasedStateMachine, invariant, rule

class VersionControlStateMachine(RuleBasedStateMachine):
    commits = Bundle("commits")
    tags = Bundle("tags")
    branches = Bundle("branches")

    def __init__(self) -> None:
        super().__init__()
        # initialize our bespoke simple model
        self.model = Model()
        # initialize our system under test
        self.storage: ic.Storage = ic.in_memory_storage()
        self.repo = ic.Repository.create(storage=self.storage)

    # only set extremely simple metadata documents on the store
    @rule()
    def set(self): pass

    # next, a set of "version control" operations on Icechunk Repository objects
    @rule()
    def create_branch(self): pass

    @rule()
    def delete_branch(self): pass

    @rule()
    def reset_branch(self): pass

    @rule()
    def checkout_branch(self): pass

    @rule()
    def expire_snapshots(self): pass

    @rule()
    def garbage_collect(self): pass

    # ... more rules

    @invariant()
    def checks(self):
        self.check_list_prefix_from_root()
        self.check_tags()
        self.check_branches()
        self.check_ancestry()
        self.check_ops_log()
        self.check_repo_info()
        self.check_file_invariants()
We found that building a model this comprehensive immensely helped when we built Icechunk V2 (stay tuned for a future blog post on this!).
Reflections
Engineering rigor in the wider open-source ecosystem
At Earthmover, we contribute to maintaining and driving forward a range of community open-source projects, including Xarray and Zarr. Much of the work described here (property test suites, stateful test suites, strategies) was initially built and refined in Icechunk, then upstreamed to the Zarr project, where it has paid great dividends. Similar ideas have also been ported to Xarray: for example, a stateful test suite exercising a wide variety of Xarray operations that tend to be buggy, as well as a property test suite around indexing (code).
What about property tests in Rust?
We did build some nascent property and stateful testing in Rust using the proptest crate.
However, the ergonomics of Hypothesis were too nice to ignore: in particular, the @composite decorator is a standout for building complex strategies.
Using Python also had the nice benefit of meeting our users where they (mostly) are.
We are quite excited about the new Hegel initiative which seeks to bring Hypothesis property testing idioms to many languages.
A cautionary tale
Property testing is absolutely great but can breed overconfidence. Here is one scary example.
In this Zarr PR, I (Deepak) introduced an optimization that skipped an unnecessary read when overwriting an entire chunk.
This is a great idea, but involved propagating information deep through Zarr’s codec pipeline.
This was hairy, and I relied greatly on the existing unit, property, and stateful tests in Zarr.
Very quickly after the next release, an issue was opened describing a workflow that lost data 😱, bisected down to my PR.
Upon studying the tests, I quickly realized that the generators mostly generated arrays of size 0 or 1 and rarely tested Zarr arrays with multiple chunks where the last chunk is smaller than the rest — this is the most common type of Zarr array in the wild!
Instead we had been spending much of our effort testing indexing of arrays with arbitrary attributes, arbitrary dimension names, and other complexities that don’t affect the write operation.
In the followup PR, I worked to change the generator so that we preferentially generated arrays with multiple chunks and smaller last chunks.
We also have a new simple_arrays generator that deliberately de-emphasizes some aspects of the input space (e.g. attributes).
Lesson learned: property testing is only as good as its generators. While Hypothesis encourages us to think about "simplicity", what we need to be thinking of is "what is simple, yet complex enough to exercise potentially buggy code paths". Clearly, what we don't do well yet is "testing our generators": asserting that the generators produce an expected distribution of test cases.
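One lightweight way to start testing a generator (a sketch of the idea, not what Zarr does today) is to sample it and assert on the distribution of cases it produces:

```python
import random

# An illustrative generator biased toward the "interesting" case: arrays
# with multiple chunks where the final chunk is smaller than the rest.
def shape_and_chunks(rng: random.Random) -> tuple[int, int]:
    chunk = rng.randint(2, 10)
    nchunks = rng.randint(2, 5)
    last = rng.randint(1, chunk - 1)  # strictly smaller final chunk
    return chunk * (nchunks - 1) + last, chunk

# "Testing the generator": sample it and assert on the distribution we want.
rng = random.Random(0)
samples = [shape_and_chunks(rng) for _ in range(200)]
assert all(n > c for n, c in samples)       # always more than one chunk
assert all(n % c != 0 for n, c in samples)  # final chunk always ragged
```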
Looking ahead
We’ve done a lot to engineer rigor into Icechunk and Zarr, but there’s more to be done. In our experience, the effort invested here has paid off greatly: it has allowed us to move fast and ship confidently, with very few severe bugs showing up after release. In the next post in this series, we describe our approach to engineering rigor in the push to Icechunk 2, including concurrency permutation testing and fault injection testing. Stay tuned!