Ship fast, break nothing: Engineering rigor in Icechunk with concurrency & fault injection testing
Forward Deployed Engineer
Between the release of Icechunk V1 and Icechunk V2, we faced a significant challenge: making sure we didn’t break anything we built the first time :P. We needed to be sure that
- we were shipping correct code; and
- our users could upgrade their Icechunk library and their Repositories without fear.
This second post in a three-part series (first post on property and stateful testing) describes how we added even more testing to engineer greater reliability!
Concurrency permutation testing with shuttle
A core upgrade to the Icechunk format in V2 is the repo info file — think of this as containing a “table of contents” to the Icechunk Repository. This object is a single point of consistency for all operations: it records information about snapshots, branches, tags, recent operations, feature flags, and more. Every Icechunk Repository operation involves an update to this object. Thus updates to that object must be consistent, atomic, and race-condition-free under high concurrency.
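In practice, atomic updates to a single object like this usually boil down to a compare-and-swap loop over conditional writes. The sketch below is purely illustrative: the names are made up and the object store is replaced by an in-memory stand-in, so it is not Icechunk’s actual implementation, just the general shape of a race-free read-modify-write against an object store.

use std::sync::Mutex;

#[derive(Clone)]
struct RepoInfo {
    version: u64,
    branches: Vec<(String, String)>, // (branch name, snapshot id)
}

struct Store {
    info: Mutex<RepoInfo>, // in-memory stand-in for the object store
}

impl Store {
    fn fetch(&self) -> RepoInfo {
        self.info.lock().unwrap().clone()
    }

    // Conditional write: succeeds only if nobody has updated the object since
    // `expected_version` was read (object stores expose this as a conditional
    // PUT keyed on the object's etag or generation).
    fn put_if_matches(&self, updated: RepoInfo, expected_version: u64) -> Result<(), ()> {
        let mut guard = self.info.lock().unwrap();
        if guard.version == expected_version {
            *guard = RepoInfo { version: expected_version + 1, ..updated };
            Ok(())
        } else {
            Err(()) // someone else won the race
        }
    }
}

// Read-modify-write loop: retry until our conditional write lands.
fn update_repo_info(store: &Store, mutate: impl Fn(&mut RepoInfo)) {
    loop {
        let mut current = store.fetch();
        let expected = current.version;
        mutate(&mut current);
        if store.put_if_matches(current, expected).is_ok() {
            return;
        }
        // a concurrent writer got there first; re-read and try again
    }
}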
Race conditions — bugs that only appear when concurrent operations happen to run in a particular order — are notoriously hard to reproduce because there’s no easy way to force tasks into the exact order that triggers the bug.
Here we steal a page from the AWS S3 team’s book and implement concurrency permutation testing using the shuttle crate.
The idea is simple: take control of the task scheduler, then systematically explore different orderings of concurrent operations to find the ones that break (Burckhardt et al., 2010).
shuttle recently added support for tokio, the async runtime used by Icechunk, which made it relatively straightforward to wire up.
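Wiring it up mostly means swapping the async primitives behind a test-only flag, so production builds keep using tokio while shuttle test builds get shuttle’s controllable versions. A minimal sketch of that pattern (the `shuttle` feature name is our assumption, not necessarily Icechunk’s exact setup):

// shuttle test builds use shuttle's controllable spawn; everything else uses tokio's
#[cfg(feature = "shuttle")]
pub use shuttle::future::spawn;

#[cfg(not(feature = "shuttle"))]
pub use tokio::task::spawn;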
This deceptively simple test looks for race conditions when parallel writers are committing to different branches (code):
use futures::future::try_join_all;
use shuttle::future::{block_on, spawn};
// this method gets executed repeatedly by `shuttle` with different orderings
async fn mk_concurrent_commits_same_branch() -> Result<(), Box<dyn Error + Send + Sync>> {
    // initialize repo with a single array
    repo.create_branch("feature", &snaps[0]).await?;
    // create two tasks: one will commit to the 'main' branch
    // and the other will commit to the 'feature' branch
    // both tasks are spawned using a `shuttle` spawn primitive
    let handle1 = spawn(async move { mk_commit(repo1, path1, "main", 2).await });
    let handle2 = spawn(async move { mk_commit(repo2, path2, "feature", 1).await });
    // here shuttle will execute the async methods called in the two tasks
    // in different order for each run
    for r in try_join_all([handle1, handle2]).await? {
        r?;
    }
    // assert invariants on the ops log
    Ok(())
}
That test is a lot more subtle than it looks.
With its support for the tokio runtime used by Icechunk, shuttle can choose to switch to a different task (or not) at every await point in the code.
In other words, there aren’t just 2 orderings of the two tasks.
Because both tasks are async and there are many awaits in each task, there are many possible orderings of the async function calls within the two tasks.
shuttle explores the space of possible orderings using a chosen scheduler (for example, random exploration or PCT), allowing it to discover and then reproduce bad orderings of tasks.
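Driving the test above then looks something like this: hand the body to one of shuttle’s checkers and choose how many schedules to explore (the iteration count here is just an example):

// Explore up to 1,000 randomly chosen interleavings of the spawned tasks.
// If one of them breaks an assertion, shuttle prints a schedule that can be
// replayed to reproduce the exact failing ordering.
#[test]
fn shuttle_concurrent_commits() {
    shuttle::check_random(
        || shuttle::future::block_on(mk_concurrent_commits_same_branch()).unwrap(),
        1_000,
    );
}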
That test caught the following bug.
Icechunk maintains an operations log (“ops log”) that records every repository operation with a timestamp.
It is vital that the entries in the ops log are sorted newest-first; this property is relied upon in other sections of the code.
The bug: each task picked its timestamp in the flush_snapshot method, then later wrote to the ops log in the update_ops_log method.
Each of those methods yields control (via await), so another task can modify the ops log in between.
With concurrent commits to branches main and feature, this buggy logic lets feature write to the ops log between main’s flush and main’s write — so main’s older timestamp lands on top, breaking the sort invariant.
The figure below illustrates one execution that happened to be ordered correctly (“Ordering 1”) and one that triggers the bug (“Ordering 2”).
The fix was simple: pick the timestamp when we write the snapshot file, not when flush_snapshot is called.
Now the write order determines the timestamp order, regardless of execution order.
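The invariant itself is cheap to assert at the end of every shuttle run. Here is a rough sketch of such a check, with placeholder names (`OpsLogEntry` and `written_at` are ours, not Icechunk’s):

use std::time::SystemTime;

// Placeholder entry type: the real ops log records much more than a timestamp.
struct OpsLogEntry {
    written_at: SystemTime,
}

// The ops log must be sorted newest-first: every entry's timestamp is at
// least as recent as the timestamp of the entry that follows it.
fn assert_newest_first(entries: &[OpsLogEntry]) {
    assert!(entries
        .windows(2)
        .all(|pair| pair[0].written_at >= pair[1].written_at));
}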
While we detected this particular bug manually (PR), we verified that shuttle quickly and reproducibly detected the error.
Here is a more complex test (code):
async fn execute_concurrent_actions(
    actions: Vec<Action>,
) -> Result<(), Box<dyn Error + Send + Sync>> {
    // We set up the repo with a single array and make a commit
    // then we take a Vec of actions to execute and pass that on to `shuttle`
    // using their `spawn`
    let handles = actions
        .iter()
        .zip(branches.iter())
        .map(|(action, branch)| {
            spawn(async move { execute_action(repo, action, branch).await })
        })
        .collect::<Vec<_>>();
    // execute actions in a concurrent order set by the shuttle Scheduler
    let results: Vec<_> =
        try_join_all(handles).await?.into_iter().collect::<Result<Vec<_>, _>>()?;
    // assert invariants on the repo info object
    // example: all entries in the ops log are ordered
    // last entry (earliest) should always be RepoInitializedUpdate
    Ok(())
}
We defined a number of actions that could result in ‘interesting’ outcomes when executed concurrently:
enum ActionResult {
    Commit(String, SnapshotId),
    AddBranch(String, SnapshotId),
    DeleteBranch { branch: String, previous_snap: SnapshotId },
    AddTag(String, SnapshotId),
    DeleteTag { tag: String, previous_snap: SnapshotId },
    Amend { branch: String, new_snap: SnapshotId, previous_snap: SnapshotId },
    ResetBranch { branch: String, to_snap: SnapshotId, previous_snap: SnapshotId },
}
But which ones do we choose to find interesting failures?
We could run all of them concurrently, but that is an enormously large space to explore.
Instead, we can use the property testing framework proptest to randomly generate smaller subsets of actions for testing with shuttle!
The following snippet has proptest generate an arbitrary sequence of 3–5 actions from the defined list and execute them using one of shuttle’s schedulers.
proptest! {
    #[test]
    fn concurrent_actions(acts in actions(3..=5)) {
        check_pct(move || {
            block_on(execute_concurrent_actions(acts.clone())).unwrap();
        }, 100, 3);
    }
}
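The `actions(3..=5)` strategy used above can be built from proptest’s collection combinators. A minimal sketch, assuming a hypothetical `arbitrary_action()` strategy that yields one random `Action` (for example via `prop_oneof!` over the enum’s variants):

use proptest::prelude::*;

// Produce a Vec of between 3 and 5 (or whatever range is passed in) randomly
// chosen actions; `arbitrary_action()` is a placeholder for a single-Action strategy.
fn actions(size: std::ops::RangeInclusive<usize>) -> impl Strategy<Value = Vec<Action>> {
    proptest::collection::vec(arbitrary_action(), size)
}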
Fault injection with toxiproxy
Cloud networks can be notoriously flaky, and client libraries need robust fault handling and retry logic. At Earthmover, we run a battery of integration tests against all object stores (S3, GCP, Tigris, R2) every hour. With that frequency, these tests surface all kinds of edge-case behaviour we need to guard against. For example, Icechunk V1 users were plagued by this cryptic “0B/s” or “stalled network stream” error in high concurrency scenarios:
...
|-> error streaming bytes from object store
|-> streaming error
`-> minimum throughput was specified at 1 B/s, but throughput of 0 B/s was observed
The underlying AWS S3 SDK raises this error when it detects that no bytes have been transferred across the network over a set period of time. The expectation is that the client retries the request. We weren’t doing so. Well, we intended to, but it turns out that retry settings in the AWS S3 SDK only apply while the connection is being set up, not while the bytes are being streamed afterwards. It was also hard to understand the failure and build better behaviour because we couldn’t reliably reproduce the error, and it usually occurred in scenarios with high and nested concurrency (e.g. dask multi-threading + Zarr async concurrency). Our response to such errors has usually been a very unsatisfying “turn down concurrency, maybe bump retries” (yuck!).
Not any more!
Icechunk now uses toxiproxy to simulate flaky network connections and help us design better retry and fault handling behaviour.
We sit toxiproxy in front of a local rustfs object store deployment, letting us poison the connection at will — limiting data transfer to N bytes, injecting latency, or killing streams mid-flight — and reproducibly verify retrying behaviour.
Claude helped us code up a reproducer using a specific combination of LimitData and SlowClose toxics provided by toxiproxy (code).
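For a flavour of what that looks like, here is a rough sketch of driving toxiproxy’s REST API (it listens on port 8474 by default) with reqwest; the proxy name, ports, and byte limit are arbitrary choices for illustration:

use serde_json::json;

// Point toxiproxy at a local object store (e.g. rustfs on :9000) and add a
// `limit_data` toxic so every download dies after 4 KiB.
fn poison_connection() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();

    // create a proxy sitting in front of the object store
    client
        .post("http://127.0.0.1:8474/proxies")
        .json(&json!({
            "name": "object-store",
            "listen": "127.0.0.1:9001",
            "upstream": "127.0.0.1:9000",
            "enabled": true,
        }))
        .send()?
        .error_for_status()?;

    // cut every downstream transfer short after 4096 bytes
    client
        .post("http://127.0.0.1:8474/proxies/object-store/toxics")
        .json(&json!({
            "name": "limit",
            "type": "limit_data",
            "stream": "downstream",
            "toxicity": 1.0,
            "attributes": { "bytes": 4096 },
        }))
        .send()?
        .error_for_status()?;

    Ok(())
}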
With that in hand, we coded up better retrying behaviour (PR) which means that our users should never see that error as long as the network heals fast enough (within a minute).
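The shape of the fix is roughly what you’d expect: wrap the entire streaming read in its own retry loop with backoff, instead of relying on the SDK’s connection-level retries. A simplified sketch (not the actual Icechunk code, and with a much cruder error check than the real thing):

use std::future::Future;
use std::time::Duration;

// Retry a full "open the object and drain its bytes" operation when it fails
// mid-stream; `fetch_and_drain` stands in for the real download routine.
async fn get_with_retries<F, Fut>(
    mut fetch_and_drain: F,
    max_attempts: u32,
) -> Result<Vec<u8>, String>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<Vec<u8>, String>>,
{
    let mut attempt = 0;
    loop {
        match fetch_and_drain().await {
            Ok(bytes) => return Ok(bytes),
            // a real implementation would only retry stall/throughput errors
            Err(_) if attempt + 1 < max_attempts => {
                attempt += 1;
                // exponential backoff before re-issuing the whole request
                tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
            }
            Err(err) => return Err(err),
        }
    }
}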
This framework is now helping us reproduce and build better behaviour for more arcane failures that keep showing up. Icechunk 2 is thus a lot more reliable than it used to be.
Looking ahead
We have consistently invested in testing approaches and infrastructure over Icechunk’s lifetime, with a particular focus on high-leverage randomized testing frameworks. We haven’t abandoned our unit testing habits, but this spectrum of approaches has let us move fast, stay (mostly) correct, and have a lot of fun!
If these ideas are interesting to you, you could help us build out a more feature-rich test suite!
There are many more property tests to be written, and the toxiproxy and shuttle tests are very minimal at the moment.
Come help out (shuttle tests, toxiproxy tests)!
The next post in the series will describe how we leverage our investment in testing to build confidence in the backwards compatibility of Icechunk 2.