Ship fast, break nothing: Engineering rigor in Icechunk with cross-version stateful compatibility testing

This third post in a three-part series describes how we developed a novel approach to rigorous cross-version compatibility testing. You can also read the first post on property and stateful testing and the second post on concurrency and fault injection testing.

In a single Python process, Hypothesis drives a randomized interleaving of Icechunk 1.x and 2.x calls against the same on-disk repo, and shrinks any failure down to its minimal reproducer (jump to the state machine).

We extend the same trick of running two versions in one process to give us a second cross-version test for garbage collection, where we use Icechunk 1.x itself as the reference model (jump to the GC trick).

Suffering from Success

After we released Icechunk 1.0, usage exploded faster than we could have imagined. Serious workflows are being built on top of Icechunk 1.x. For example, just look at the datasets on the marketplace, many of which are backed by continuously updating multi-terabyte distributed write pipelines. We have even started to see government agencies distributing their datasets in our format.

This is fantastic; we are so excited that Icechunk has seen rapid adoption. But this adoption by so many users helped us discover limitations of format v1 that we could not have anticipated. So we started work on an even better on-disk format, format v2, that solved all the pain points we discovered. Icechunk 2.x is a marked improvement across the board compared to Icechunk 1.x.

The amazing adoption of Icechunk 1.x left us with a significant challenge when releasing Icechunk 2.x. Icechunk repos are collaborative and can be accessed by many readers and writers. We also know that not all users will update to Icechunk 2.x at once, and not everyone will do our metadata-only repo migration in a timely manner. So there will be users of Icechunk 1.x and 2.x interacting on the same format v1 repos for their data workloads.

The core promise of Icechunk is that your data is safe and versioned. To uphold that promise, we needed to be confident that:

Icechunk 2.x can read and write to format v1 repos
Icechunk 2.x NEVER breaks a format v1 repo for Icechunk 1.x readers

Achieving this requires both:

Careful engineering
Rigorous testing

So we set out to design testing that would explore all the possible ways we could break cross-version compatibility.

A first attempt: scripts and bash glue

Two scripts, two versions

Our first version of these tests was very simple: write two scripts that interact with the same repo and run them one after the other, checking that nothing breaks along the way.

# write_with_ic2.py
import icechunk as ic
import zarr

repo = ic.Repository.open_or_create(ic.local_filesystem_storage("./repo"))
session = repo.writable_session("main")
root = zarr.open_group(session.store)
root.create_array("temperature", shape=(10,), dtype="f4")[:] = 1.0
session.commit("write from ic2")

# read_with_ic1.py
import icechunk as ic  # v1.x
import zarr

repo = ic.Repository.open(ic.local_filesystem_storage("./repo"))
session = repo.readonly_session(branch="main")
arr = zarr.open_group(session.store)["temperature"]
assert arr[0] == 1.0

uv run --with "icechunk==2.*" python write_with_ic2.py
uv run --with "icechunk==1.*" python read_with_ic1.py

But that is not very ergonomic for testing as you develop, and it is really more of a smoke test, since it barely exercises the Icechunk API.

Stateful tests, one version at a time

As we learned from the Icechunk 1.0 release, users will inevitably find ways to use your software that you didn’t expect. So to be as confident as possible in our cross-version compatibility, we turned to our favorite tool for bug hunting: Hypothesis. As discussed in part 1 of this series we have developed a sophisticated suite of tests that push the limits of the Icechunk API.

However, our Hypothesis test suite was designed for testing only a single version of Icechunk at a time. So we needed to find a way to adapt our rich and detailed testing model for our new problem of cross-version compatibility testing.

The simplest way to do this would be to extend our two scripts working on a shared repo. Instead of simply reading and writing an array, we could have the scripts exercise our full suite of Hypothesis rules to explore the Icechunk API surface.

Then the script and testing workflow would look like this:

# test_stateful.py
from icechunk.testing import VersionControlStateMachine

TestRepo = VersionControlStateMachine("./repo").TestCase

uv run --with "icechunk==2.*" pytest test_stateful.py
uv run --with "icechunk==1.*" pytest test_stateful.py
uv run --with "icechunk==2.*" pytest test_stateful.py

This would go a long way towards confidence in testing, but it still wasn’t very satisfying.

What’s still missing: interleaving and shrinking

This one-script-per-version strategy still has two serious limitations.

No interleaving. It only runs Icechunk 2.x actions in one batch, then Icechunk 1.x actions in another. In real life, with multiple readers and writers, we will see complex chains of actions interleaved between Icechunk 1.x and Icechunk 2.x clients.

No shrinking. The script approach costs us a core Hypothesis feature: shrinking. When you run a Hypothesis stateful test, it may find a bug only after an extremely complicated sequence of steps that build up state, not all of which matter for the bug at hand.

For example:

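from hypothesis.stateful import RuleBasedStateMachine, rule

class StatefulTestClass(RuleBasedStateMachine):
    # Rules that change state but have nothing to do with the bug
    @rule()
    def unrelated_function_1(self): ...
    @rule()
    def unrelated_function_2(self): ...
    @rule()
    def unrelated_function_3(self): ...

    # The bug only fires when bug_thrower runs after bug_setup
    @rule()
    def bug_setup(self): ...
    @rule()
    def bug_thrower(self): ...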

The bug discovery path might look like

unrelated_function_2
unrelated_function_2
unrelated_function_2
unrelated_function_2
bug_setup
unrelated_function_2
unrelated_function_2
unrelated_function_3
bug_thrower

You could take this and start pruning by hand, but in real life the functions don’t have helpful names like bug_setup. Fortunately, Hypothesis shrinking prunes this list for us automatically, giving a much simpler trace:

bug_setup
bug_thrower
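To make the idea concrete, here is a minimal sketch of trace minimization in plain Python. It is a greedy "drop one step at a time" reducer, not Hypothesis's actual shrinker, which is considerably more sophisticated. The names `triggers_bug` and `shrink` are hypothetical and exist only for this illustration.

```python
from typing import Callable, List


def triggers_bug(trace: List[str]) -> bool:
    """Toy system under test: fails when bug_thrower runs after bug_setup."""
    armed = False
    for step in trace:
        if step == "bug_setup":
            armed = True
        elif step == "bug_thrower" and armed:
            return True
    return False


def shrink(trace: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop single steps for as long as the trace still fails."""
    current = list(trace)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            # Every candidate is replayed from scratch by this one process.
            if fails(candidate):
                current = candidate
                changed = True
                break
    return current


trace = [
    "unrelated_function_2", "unrelated_function_2", "unrelated_function_2",
    "unrelated_function_2", "bug_setup", "unrelated_function_2",
    "unrelated_function_2", "unrelated_function_3", "bug_thrower",
]
minimal = shrink(trace, triggers_bug)
print(minimal)  # ['bug_setup', 'bug_thrower']
```

Note that the reducer has to replay each candidate trace in full, which is only possible when one process controls every step of the run.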

With our more complicated state machines for Icechunk, it is even more important to have shrinking. One of our state machines, which exercises Icechunk’s version control machinery, has 18 rules! A few of them:

def commit(self, message: str) -> str: ...
def amend(self, message: str) -> str: ...
def create_branch(self, name: str, commit: str) -> str: ...
def reset_branch(self, branch: str, commit: str) -> None: ...
def create_tag(self, name: str, commit_id: str) -> str: ...
def checkout_commit(self, ref: str) -> None: ...
def reopen_repository(self, data: st.DataObject) -> None: ...
def upgrade_spec_version(self, dry_run: bool, delete_unused_v1_files: bool) -> None: ...
def expire_snapshots(self, ...) -> None: ...
def garbage_collect(self, data: st.DataObject) -> None: ...

Shrinking is only possible when a single Hypothesis engine runs the test case to the point of failure. There is no mechanism for Hypothesis to keep track of the randomized calls across several different processes launched from individual scripts and pare them down to just the combination of moves that caused a bug.

To solve both of those issues, what we really need is a centralized stateful test that knows how to interact with a repository using either Icechunk 1.x or Icechunk 2.x. For this to work, a single Python process needs to be able to use either ic1 or ic2. Of course, you can’t just install two versions of the same package into an environment.

third-wheel: two Icechunks in one process

To solve these problems I created third-wheel (blog post), which does surgery on a wheel file to fully rename a package (including all internal imports and compiled sources). It allows us to do

import icechunk as ic # v2+
import icechunk_v1 as ic1 # v1.x

This means that we can finally have a single Hypothesis test that performs actions using both Icechunk 1.x and Icechunk 2.x!
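The import-name side of this can be illustrated with the standard library alone. The sketch below is not third-wheel and does not use its API. It only shows why two releases under one top-level name collide in a single process, and why a renamed copy does not. `demo_pkg`, `demo_pkg_v1`, and both helper functions are hypothetical names for this example, and the packages are one-file pure-Python stand-ins (a real wheel also needs its internal imports and compiled sources renamed, as described above).

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def write_package(env: Path, name: str, version: str) -> None:
    """Create a one-file package that only records its version."""
    pkg = env / name
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f'__version__ = "{version}"\n')


def import_new_and_old(old_name: str) -> Tuple[str, str]:
    """Install 2.0 as demo_pkg and 1.0 as old_name, then import both."""
    with tempfile.TemporaryDirectory() as tmp:
        new_env, old_env = Path(tmp, "new"), Path(tmp, "old")
        write_package(new_env, "demo_pkg", "2.0")
        write_package(old_env, old_name, "1.0")
        sys.path[:0] = [str(new_env), str(old_env)]
        importlib.invalidate_caches()
        try:
            new = importlib.import_module("demo_pkg")
            old = importlib.import_module(old_name)
            return new.__version__, old.__version__
        finally:
            # Undo the global changes so repeated calls start clean.
            sys.path.remove(str(new_env))
            sys.path.remove(str(old_env))
            for name in {"demo_pkg", old_name}:
                sys.modules.pop(name, None)


# Same name: the second import returns the already-loaded 2.0 module.
print(import_new_and_old("demo_pkg"))     # ('2.0', '2.0')
# Renamed: both versions are importable side by side.
print(import_new_and_old("demo_pkg_v1"))  # ('2.0', '1.0')
```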

Why not subprocesses?

At first glance it might seem simpler to just use subprocesses to solve this problem. In fact, prior to making third-wheel, we considered just that approach. A centralized Hypothesis process could schedule the stateful actions and delegate them to subprocesses. But this adds complexity, and it seemed likely that we would introduce a lot of bugs just in serializing the commands via RPC into the subprocesses. Also, I (Ian) have some hard-won lessons on the difficulty of chasing bugs across process boundaries from my time working on microscope control software (pymmcore-plus #17, pymmcore-plus #21). Given that the purpose of these tests was to find bugs, making them easy to trace was a high priority.

A cross-version state machine

Now that we had the ability to run a test with both versions of Icechunk, we still had work to do. We had to make our StateMachine able to swap which version is used, without breaking our tests for one version at a time.

Swapping versions inside the StateMachine

The fix here is surprisingly simple! We let our base testing class know which version of Icechunk it is using, which we call an actor. Any function that used to use the module-level icechunk import now uses self.actor or self.ic instead.

Then we let Hypothesis pick the active actor at the start, and switch it on the fly via a new rule:

import icechunk as ic           # v2.x
import icechunk_v1 as ic_v1     # v1.x

class CrossVersionVersionControlStateMachine(VersionControlStateMachine):
    """Two-actor version control test: one actor is v2, the other is v1."""

    def __init__(self) -> None:
        self.actors = {"v2": ic.Repository, "v1": ic_v1.Repository}
        self.actor_modules = {"v2": ic, "v1": ic_v1}
        ...
        super().__init__(actor=None)

    @initialize(data=st.data(), target=VersionControlStateMachine.branches)
    def initialize(self, data: st.DataObject) -> str:
        choice = data.draw(st.sampled_from(list(self.actors)))
        ...

    @rule(data=st.data())
    def switch_actor(self, data: st.DataObject) -> None:
        """Pick a new active version and reopen the repo through it."""
        choice = data.draw(st.sampled_from(tuple(self.actors)))
        ...

An additional benefit is that we are now simulating multiple users interacting with the same repository, with their writes randomly interleaved!

What a run looks like

An example run of this might look like this:

initialize → v2
commit("init")
create_branch("dev", <commit>)
switch_actor → v1
commit("update on dev")
create_tag("v0.1", <commit>)
switch_actor → v2
amend("update on dev v2")
reset_branch("dev", <commit>)
switch_actor → v1

After each step we also confirm that our simple model of the system matches what both actors are seeing, which confirms that they stay in sync. These tests are detailed and thorough, but they still don’t do concurrent interleaving: we never have both versions holding an open handle at the same time. This limitation is a conscious choice. We are not modeling two concurrent in-flight writers and conflict resolution, because the bugs we’re hunting here are format compatibility bugs, not in-flight write races.
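The "model matches both actors after every step" check can be sketched without Hypothesis or Icechunk. Everything below is a hypothetical stand-in: a dict plays the on-disk repo, `ActorV1` and `ActorV2` play the two library versions, and `random.Random` plays the role of Hypothesis choosing rules. It is meant to show the shape of the check, not the real test.

```python
import random
from typing import Dict, Type


class ActorV1:
    """Stand-in for the old client: stores refs under 'branch.<name>' keys."""

    PREFIX = "branch."

    def __init__(self, storage: Dict[str, str]) -> None:
        self.storage = storage

    def set_branch(self, name: str, commit: str) -> None:
        self.storage[self.PREFIX + name] = commit

    def branches(self) -> Dict[str, str]:
        return {
            key[len(self.PREFIX):]: commit
            for key, commit in self.storage.items()
            if key.startswith(self.PREFIX)
        }


class ActorV2(ActorV1):
    """Stand-in for the new client, which must keep the old layout intact."""


def run_interleaving(actors: Dict[str, Type[ActorV1]], seed: int, steps: int = 50) -> int:
    """Randomly interleave writes and actor switches; return the number of checks."""
    rng = random.Random(seed)
    storage: Dict[str, str] = {}   # the shared "repo"
    model: Dict[str, str] = {}     # the simple model of expected state
    active = rng.choice(sorted(actors))
    checks = 0
    for step in range(steps):
        if rng.random() < 0.3:
            active = rng.choice(sorted(actors))  # plays the switch_actor rule
            continue
        branch, commit = rng.choice(["main", "dev"]), f"c{step}"
        actors[active](storage).set_branch(branch, commit)  # reopen, then write
        model[branch] = commit
        # Invariant: every version sees exactly what the model predicts.
        for name, actor_cls in actors.items():
            assert actor_cls(storage).branches() == model, (name, step)
            checks += 1
    return checks


print(run_interleaving({"v1": ActorV1, "v2": ActorV2}, seed=0) > 0)
```

Only one actor writes at a time here, which mirrors the sequential (not concurrent) interleaving described above.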

Tradeoffs

Even though this is quite a nice model for running these tests, there are still some interesting tradeoffs we had to make in the design. These tradeoffs only show up when you have two versions of the same library installed.

Test class has to have multiple library handles

Since we have two versions of the Icechunk library present, we need a way to control which one our state machine uses. Before we had cross-version testing, we would do the normal thing: import icechunk as ic, then just call ic.some_method() in our class. Now, to keep track of which version to use, the test class holds a reference self.ic that gets swapped when we switch actors. This is a bit ugly, but ultimately necessary.
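As a rough sketch of that pattern, the example below uses `types.SimpleNamespace` objects as stand-ins for the two imported modules. The class names, `open_repo`, and `describe` are all hypothetical. The only point is that the base class never touches a module-level import, so a subclass can swap the handle.

```python
from types import SimpleNamespace

# Stand-ins for `import icechunk as ic` and `import icechunk_v1 as ic_v1`.
fake_ic = SimpleNamespace(describe=lambda: "opened by 2.x")
fake_ic_v1 = SimpleNamespace(describe=lambda: "opened by 1.x")


class SingleVersionMachine:
    """Base class: every call goes through self.ic, never a global import."""

    def __init__(self, ic_module=fake_ic) -> None:
        self.ic = ic_module

    def open_repo(self) -> str:
        return self.ic.describe()


class CrossVersionMachine(SingleVersionMachine):
    """Subclass: same rules, but the library handle can be swapped."""

    def __init__(self) -> None:
        self.actor_modules = {"v2": fake_ic, "v1": fake_ic_v1}
        super().__init__(self.actor_modules["v2"])

    def switch_actor(self, choice: str) -> None:
        self.ic = self.actor_modules[choice]


machine = CrossVersionMachine()
print(machine.open_repo())  # opened by 2.x
machine.switch_actor("v1")
print(machine.open_repo())  # opened by 1.x
```

The single-version tests keep working because `SingleVersionMachine()` still defaults to the one module it always used.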

Another interesting pitfall is catching exceptions. In the parent class we use pytest.raises(IcechunkError), but ic.IcechunkError and ic_v1.IcechunkError are different classes! So we have to do this slightly horrifying monkey-patch:

import tests.test_stateful_repo_ops as repo_ops_mod

# pytest.raises(IcechunkError) in the parent class catches both versions
repo_ops_mod.IcechunkError = (ic.IcechunkError, ic_v1.IcechunkError)
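The reason a tuple works can be shown with plain `try`/`except`, which follows the same matching rule as `pytest.raises` (both accept a single exception type or a tuple of types). In this sketch `make_error_class` and `caught_by` are hypothetical helpers, and the two generated classes stand in for `ic.IcechunkError` and `ic_v1.IcechunkError`.

```python
def make_error_class() -> type:
    """Simulate one library version defining its own IcechunkError."""

    class IcechunkError(Exception):
        pass

    return IcechunkError


ErrorV2 = make_error_class()  # stand-in for ic.IcechunkError
ErrorV1 = make_error_class()  # stand-in for ic_v1.IcechunkError

# Same name, but two unrelated classes.
print(ErrorV1.__name__ == ErrorV2.__name__)  # True
print(issubclass(ErrorV1, ErrorV2))          # False


def caught_by(expected, raised: type) -> bool:
    """Return True if `except expected` handles an instance of `raised`."""
    try:
        raise raised("boom")
    except expected:
        return True
    except Exception:
        return False


print(caught_by(ErrorV2, ErrorV1))             # False: wrong version's class
print(caught_by((ErrorV2, ErrorV1), ErrorV1))  # True: the tuple matches either
print(caught_by((ErrorV2, ErrorV1), ErrorV2))  # True
```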

Full set of changes: PR #1681

Garbage collection across versions

In Icechunk we have a garbage collection (GC) operation that can clean up and remove old commits that are no longer relevant to keep around (docs). This is surprisingly complex to get right, and as a result Icechunk 1.x has quite a few quirky GC behaviors. For Icechunk 2.x we redesigned GC and now have a much more consistent process. But, as we know from all our other efforts, we need Icechunk 2.x to match Icechunk 1.x behavior quirk-for-quirk.

Testing this with Hypothesis would require modeling all of the idiosyncratic behavior in Icechunk 1.x. This is quite annoying and difficult to get right (we know because we tried!). But it is still very important, so instead of trying to capture all the crazy details, we pull a clever trick. For our cross-version testing we always use local filesystem storage. So to test Icechunk 2.x GC behavior on a format v1 repo we:

Copy our test repo to a temporary directory
Perform GC with icechunk_v1 on the copy
Perform GC with icechunk (2.x) on the original
Check that the two resulting repos are identical

This works because we already have a perfect model of the software we are trying to test: our old Icechunk 1.x is a perfect model of itself!
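The four steps above can be sketched with the standard library. This is an illustration under stated assumptions, not the code from the PR. `snapshot_tree` and `gc_matches_reference` are hypothetical helpers, the two GC callables stand in for the 1.x and 2.x garbage collectors, and comparing the trees file by file via content hashes is an assumption about what "identical" means (the real test may compare at a different level).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def snapshot_tree(root: Path) -> Dict[str, str]:
    """Map each file's relative path to a SHA-256 of its contents.

    Empty directories are ignored; only files are compared.
    """
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def gc_matches_reference(
    repo: Path,
    reference_gc: Callable[[Path], None],
    candidate_gc: Callable[[Path], None],
) -> bool:
    """Run the reference GC on a copy and the candidate GC on the original."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / "repo_copy"
        shutil.copytree(repo, copy)   # step 1: copy the repo
        reference_gc(copy)            # step 2: old version on the copy
        candidate_gc(repo)            # step 3: new version on the original
        return snapshot_tree(copy) == snapshot_tree(repo)  # step 4: compare
```

The appeal of this design is that no hand-written model of the old GC is needed: whatever the reference callable does to the copy is, by definition, the expected result.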

Full set of changes: PR #1979

What it caught

Fortunately, with careful engineering as our first line of defense, there were not many bugs to catch! However, these tests did find several subtle things that we likely would have missed without them:

Issue #1519: delete_unused_v1_files=False causes v1/v2 compat errors
PR #1757: upgrade_icechunk_repository called twice panicked
PR #1932: shared-tip branches wouldn’t expire under 2.x on spec v2

Looking ahead

While at first it might seem like a gimmick to install two versions of a library into one environment, it turns out to be an incredibly powerful testing pattern for any library with an on-disk format. It is part of our continuous effort to ensure that Icechunk is thoughtfully and deeply tested, which we have found necessary in order to build at the pace we have been. We take the responsibility of never corrupting data in our format extremely seriously, so we view maintaining these tests and building new ways of testing Icechunk as a core part of our work.

We are very proud to report that so far we have had NO reports of issues with v1/v2 compatibility.