Icechunk: Efficient storage of versioned array data
We recently got an interesting question in Icechunk’s community Slack channel (thank you Iury Simoes-Sousa for motivating this post):
I’m new to Icechunk. How is the storage managed for redundant information between different versions of a data repository?
Icechunk keeps your data versioned, allowing you to “travel back in time” and look at previous versions of your arrays.
This blog post will show that Icechunk can do this efficiently, without having to store redundant copies or rewrite parts of the data.
Let’s measure
We are going to write some code that will help us measure how much space Icechunk uses for the storage of multiple versions.
For this experiment, we are going to use Xarray, but the same could be achieved (and probably with less code) using Zarr’s API directly.
You can find the whole code for this post here.
Array initialization
We need to create an Icechunk repository with a single array in it. Everything in this post would remain the same with multiple arrays of any size and number of dimensions, but for demonstration purposes we choose a 3D array with dimensions (x, y, t) and the size and chunking as follows:
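The exact snippet isn’t reproduced here, but from the figures quoted later in the post (10 chunks along x and y, 5 along t, 500 chunks, 20M elements) the sizes can be reconstructed; a quick sanity check of the chunk arithmetic:

```python
import math

# Reconstructed from the numbers quoted later in the post:
# a 2000 x 2000 x 5 array in chunks of 200 x 200 x 1.
shape = {"x": 2000, "y": 2000, "t": 5}
chunks = {"x": 200, "y": 200, "t": 1}

# Number of chunks along each dimension: ceil(size / chunk).
grid = {dim: math.ceil(shape[dim] / chunks[dim]) for dim in shape}

n_chunks = grid["x"] * grid["y"] * grid["t"]       # 10 * 10 * 5 = 500
n_elements = shape["x"] * shape["y"] * shape["t"]  # 20,000,000

print(grid, n_chunks, n_elements)
```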
We usually refer to this type of chunking as “pancakes”: each chunk is thin across the time dimension and wide in space.
We’ll initialize an array to those sizes: 10 chunks wide in x and y, and 5 chunks across t, for a total of 500 chunks and 20M floating-point elements in the array.
To write the array to disk we’ll use a local filesystem storage, but any other object store supported by Icechunk would have the same behavior: S3, GCS, R2, Tigris, you name it.
Nothing too interesting here. We initialize the repository with a configuration that disallows inline chunks, so that we can count chunks by looking at the files in the filesystem. We then create a random array with the specified dimensions and store it to disk using Icechunk’s to_icechunk function, which takes an Xarray Dataset. For the coordinate arrays, we choose a large chunk size to make the analysis simpler later.
After running this code, Xarray will write to the Icechunk repo four different arrays at paths
/x
/y
/t
/array
Since we wrote to a local filesystem store, it’s easy to find the chunk files. In the Icechunk on-disk format, the chunks are below a chunks prefix:
```shell
$> ls -lh /dev/shm/versioned-storage/chunks/ | head
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 00N0Y5FGWH4TQAJBZ4EG
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 00ZNABHJK5QY5NK5A8TG
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 0169VJ3JFMV0KMX6K580
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 01XWFM074TADHJBVSMF0
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 0626FVAXCRDG0M14NXG0
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 08P6QKC2GBDASA6ECGB0
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 0EC8KWBXGHVGDX598NYG
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 0H3EV0ZTT6N4NRPB30TG
-rw-rw-r-- 1 seba seba 294K abr 26 17:23 0JF2D09T3W8JGD5EWN90
...
```
With this information, we can write a quick function to count the chunks and the total space used by the repository.
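A minimal version of such a function might look like this (the name print_stats matches how it’s referred to later in the post; the chunks/ layout is the Icechunk on-disk format shown above):

```python
import os
import tempfile

def print_stats(repo_path: str) -> tuple[int, int]:
    """Count chunk files and the total bytes used by the whole repository."""
    n_chunks = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(repo_path):
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
            # Chunk objects live under the "chunks/" prefix of the repo;
            # everything else is Icechunk metadata (snapshots, manifests, refs).
            if os.path.basename(dirpath) == "chunks":
                n_chunks += 1
    print(f"Number of chunks: {n_chunks} - Total size: {total_bytes / 1e6:.2f} MB")
    return n_chunks, total_bytes

# Quick demo on a synthetic directory layout:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "chunks"))
for i in range(3):
    with open(os.path.join(root, "chunks", f"chunk-{i}"), "wb") as f:
        f.write(b"\x00" * 1000)
n, size = print_stats(root)  # Number of chunks: 3 - Total size: 0.00 MB
```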
The output of this function on the previous repository is:
Number of chunks: 503 - Total size: 143.26 MB
The 503 chunks come from:
- 100 chunks per time step (pancake), because each one is 2,000×2,000 elements in chunks of 200×200
- 5 time steps, so 5 pancakes
- 3 extra chunks for the coordinate arrays (x, y, t)
The storage size includes the space for the chunks, plus a tiny overhead that Icechunk uses to keep track of metadata (versions, hierarchy) and enable high-level features (time travel, diff, rebase, etc.).
Small updates
Let’s try to answer the original question: what happens if we make a small update? Will Icechunk make a copy of all the data for the new version, duplicating the number of chunks and storage required?
We are updating a single time slice (remember, we have 5), right in the middle of the array. To do this we use the region argument to to_icechunk. This argument works just like the equivalent one in to_zarr. The new values for that time slice are random numbers in [0, 42) instead of [0, 1).
And now, let’s check the resulting number of chunks and storage space.
Number of chunks: 603 - Total size: 171.92 MB
We have written only 100 new chunks, and we are using 20% more space. This is exactly what we would expect from writing a single pancake, which takes 100 chunks. No chunks were rewritten, so the operation is fast and inexpensive, and we use only the extra space needed for the new chunks.
Without paying a price in extra storage or performance, Icechunk allows us to time travel and read either version of the data:
We first compute the max value for the latest repository version, with the updated slice. As expected, we get something very close to 42. Then we navigate the repo’s history to the previous version and repeat the operation, recovering the value we had before the latest commit.
Trying to do something like this without Icechunk would require copying the full dataset and updating one of the copies. That would more than double the storage requirements on every update.
Of course, creating new versions that only change metadata doesn’t write or rewrite any chunks, so it’s essentially “free”:
This function creates and verifies a new commit that changes only the array metadata. After calling it, we still get:
Number of chunks: 603 - Total size: 171.92 MB
Small appends
The same behavior also holds when, instead of updating data, we add new data to the array. Let’s imagine we got a new slice of data, a new pancake for t=5. We will add it to array. Now what we need is the append_dim argument to to_icechunk. This instructs the function not to override the existing data in the array, but instead to append the new data along a given dimension.
The result of calling print_stats is:
Number of chunks: 703 - Total size: 200.60 MB
Again, we added only 100 new chunks, and no chunks were re-written. Storage size increased only by the size of the new chunks.
Branches and tags
Creating new branches or tags has no impact on storage. They both write a tiny 35-byte object in the object store.
Critically, updates and appends to different branches do not incur additional storage or performance overhead. The prior examples would have yielded identical storage usage regardless of the target branch. Creating exploratory branches to test new possibilities incurs no downsides; discarded branches can later be effortlessly deleted.
The conclusion is: Icechunk enables new repository versions, even across different branches, to transparently reuse and link to previous data without copying or rewriting, all while preserving the ability to time-travel to any previous version.
The power of versioned data
The ability to inexpensively maintain versioned data is transformational. Time travel, plus Icechunk’s branches and tags, completely changes the way teams can interact with their data. Making updates is no longer dangerous (or even worse, inconsistent). Teams can create new branches to try new things, and those writes are isolated from other writers and readers.
Some teams choose to use production and dev branches. Some organizations use per-team branches and tags to mark their production data. Some users have personal branches to explore new algorithms on their own.
As we just saw, all of this can be achieved without expanding the storage beyond the newly written data, and without any slowdowns or extra cloud costs for rewrites. It’s a superpower for data science teams.
Abandoning versions
There is a write pattern we haven’t covered yet: what happens with frequent chunk updates? If an array is frequently updated without modifying its extents, chunks will be rewritten in each update. As time passes, very old versions accumulate, and with them storage grows.
This is, of course, by design: we generally want to maintain all those versions. Icechunk guarantees you can always go back and see what the data looked like at any point in the lifetime of the repository.
But sometimes, storing all past versions is not desired. Some teams want to retain past versions for a while, which allows them to go back and repair issues as they are found, but eventually they no longer need very old versions. For example, a team may decide to keep only the last three months of versioned data. That gives them enough time to find errors.
Icechunk enables this pattern. Using expiration and garbage collection, we can ask Icechunk to “release” older versions and delete any storage they required. The data for the newer versions will still be available, no matter when it was written. Anything readable from the new versions will still be readable after expiring older versions. Only old data that was overwritten by later commits is released and cleaned up, to save on storage.
Let’s try it on our repository:
Remember that before running expiration and garbage collection we had
Number of chunks: 703 - Total size: 200.60 MB
What happened here? We expired all versions but the latest. All the data in the latest version is still accessible, so all original writes are still there. But remember those 100 chunks we updated from [0, 1) to [0, 42)? The old version of those chunks is gone. Icechunk released any space it doesn’t need to represent the latest version. It’s now using exactly the same space it would use if you wrote the whole array in a single version.
What this means for the frequent-update pattern is that you can decide the trade-off between historical data availability and storage usage. What’s more, that decision can change over time, and it can be made after the data was written.
This is not the full story on expiration and garbage collection; there is much more we could say (in a future post, maybe?). We wanted to mention it here to remind users that not only does Icechunk maintain historical versions without overhead and without rewrites, it also provides a knob to tune the storage/versioning balance according to your needs.
Summary
The answer to the question about “redundant information between different versions” is:
- Icechunk never copies or rewrites your data
- The storage used to maintain versions is simply the storage needed for the new data you write in each version
- There is minimal overhead in storage, used by Icechunk to maintain the versioning information. This overhead is negligible compared to the size of the data in real-world usage
- Icechunk doesn’t force users to maintain all past versions of a repository. When storage is a concern, older versions can be expired and their storage released
Icechunk transforms the way in which data teams update data. Teams no longer need to coordinate reads and writes so they don’t conflict. Teams are no longer at risk of “breaking” the dataset. They can always discard a commit, or abandon a branch. Icechunk gives data scientists and engineers the ability to freely experiment with their data, by bringing to large multi-dimensional arrays tools from the database and version control worlds.