Everything you need to know about Icechunk garbage collection

We will talk about two powerful Icechunk operations: expiration and garbage collection. They are related, so we usually refer to both under the name of garbage collection or simply GC. We will explain what each of them does, why you may want to use them, and how to do it safely and effectively.

The two operations are available in the open-source Icechunk library. Hopefully, this blog post will help you understand and use them on your own Icechunk repositories.

In the next few weeks, garbage collection support will become generally available in the Earthmover platform. You will have the option to let Earthmover handle GC, instead of having to execute and understand every detail yourself. Nobody understands Icechunk garbage collection better than the team that built it at Earthmover. It will also be free for users of the platform.

Motivation

A few days ago, a customer showed us the following storage chart for one of their buckets storing Icechunk repositories.

It makes us proud to see our customers using advanced Icechunk features. Those periodic drops in total storage are them running expiration and garbage collection on a repository. But we can’t help noticing that the graph doesn’t show an optimal use of these operations. Expiration and garbage collection are far from trivial: it’s not easy to understand when they are needed or how to run them. Icechunk’s documentation is also quite behind on these aspects, and we need to improve it. Hopefully, this post will clarify things.

What is garbage collection (GC)

Icechunk implements its consistency guarantees using a mechanism called Multiversion Concurrency Control (MVCC). Under MVCC, data is written as the session (in MVCC-speak this is called a “transaction”) makes progress, but it only becomes visible to other sessions when the commit happens.

Imagine you start a session to write 100 GB of array chunks to Icechunk. The process runs for minutes, but it dies before being able to commit. Those 100 GB of chunk data were written to the object store, but since a commit never happened, they are not “linked” to any snapshots, branches, or tags in the repository. Those chunks are simply inaccessible from any Icechunk references (branches or tags). The data is written, but it cannot be retrieved using normal Icechunk features. This is a common and well-understood problem that happens in many other types of databases, too. See, for example, VACUUM in PostgreSQL databases.

Garbage collection is the process that deletes from object storage any objects that are “dangling”, or inaccessible to Icechunk. If you run garbage collection with the proper configuration in the scenario above, you recover the storage space and lower your cloud storage bill. After GC is run, the Icechunk repository uses only the space it needs to represent the data it contains.

GC is an unrecoverable operation. Dangling objects are deleted for good. There is no rollback or undo. Unlike most other Icechunk operations, GC is not part of a transaction:

It cannot be undone by resetting a branch,
Its effects are not observed atomically,
It’s destructive; it “edits” the contents of the repository without creating a new version.

For these reasons, GC should be considered an administrative operation and executed carefully. We’ll explain later in this post how to execute GC safely.

Do you need GC?

We mentioned above that one of the reasons to run garbage collection is to delete data that was written but never committed.

There are a few more reasons. When you delete branches or tags, data that was previously accessible can become inaccessible. Imagine a repository with the following version structure:

If you decide to delete both the tag temp and the branch develop, commits 4, 5, and 6 become dangling: they can no longer be reached from any branches or tags. Notice, though, that if you only delete one of those refs, the branch or the tag, the commits can still be accessed from the other, so they are not considered dangling.

Once you have deleted both refs, you may want to recover the space used by the objects written in commits 4, 5, and 6. If you cannot access those versions, why would you keep paying for that space? If GC is run on this repository, with the proper configuration, the objects written by those three commits will be deleted and the space recovered.
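To make “dangling” concrete, here is a toy reachability model. It is not how Icechunk is implemented; the parent links and the choice that both develop and temp point at commit 6 are assumptions made only for illustration.

```python
# Toy model of reachability; an illustration, not Icechunk's internals.
# Assumed history: main -> 3 -> 2 -> 1, and develop/temp -> 6 -> 5 -> 4 -> 2.
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5}


def reachable(refs: dict[str, int]) -> set[int]:
    """Return every commit reachable by walking parents from any ref."""
    seen: set[int] = set()
    for tip in refs.values():
        commit = tip
        while commit is not None and commit not in seen:
            seen.add(commit)
            commit = PARENT[commit]
    return seen


def dangling(refs: dict[str, int]) -> set[int]:
    """Commits that no branch or tag can reach."""
    return set(PARENT) - reachable(refs)


refs = {"main": 3, "develop": 6, "temp": 6}
print(dangling(refs))          # set(): everything is reachable

del refs["temp"]
print(dangling(refs))          # set(): develop still keeps 4, 5, and 6 alive

del refs["develop"]
print(sorted(dangling(refs)))  # [4, 5, 6]
```

Only after both refs are gone do commits 4, 5, and 6 become candidates for garbage collection.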

A third, and probably most important, situation in which you may want to run GC is if you have run expiration. We’ll learn about this case in the next section.

Garbage collection is about recovering storage that Icechunk no longer needs to represent the repository. Garbage collection has no effect on the latency or scalability of the repository. You should only run GC if you need to lower the amount of space used, and if your repository has space that can be recovered. If your repositories use only 10 GB of storage, you should probably stop reading this post. The time you’ll spend here will cost your company more than storing those extra uncollected objects for a long time.

What is expiration

In each branch, Icechunk maintains the full list of all previous commits. Each commit (a.k.a. snapshot) has a parent snapshot, which has a parent snapshot, all the way back to the first commit in the repository. This first commit is automatically generated without a parent when the repository is created.

This behavior of retaining each and every change to the repository is exhaustive, but it may not be ideal in all situations. Having full version resolution can make it hard to understand which changes are actually important, and in certain situations that we’ll explain, it can require too much storage space.

Expiration deletes old versions of the repository. After running it, data that was visible only in those older versions can no longer be retrieved. It is somewhat similar to Git’s squashing.

Let’s see why and when something like this would be helpful.

Do you need expiration?

Not every repository needs expiration. But there are many situations that can benefit from it. Here we list a few examples:

Bug in array data: You write a big array to your repository. Later, you discover a bug in your code: the values you wrote to the array are wrong. You fix the algorithm and write the array again, overwriting it with the correct values. Everything is fixed for consumers of the repository as long as they read the latest commit. But all the chunks written for the initial buggy array are still in storage. They are there by design: you can always go back and read that version of the array, because Icechunk allows you to time travel in that way. If the size of the array is significant and you never plan to read the buggy version, you don’t want to pay for the storage. Sometimes this can be fixed using Repository.reset_branch, but if you want to fix an old version with many later versions on top, resetting the branch doesn’t work.
Deleted array: You try an experiment that creates a new big array. You don’t like the result, so you decide to delete the array. If you did this in a temporary branch, say test-new-algo, you can just delete the branch, run garbage collection, and all the space used by that array will be gone. But what happens if you forgot to change branches? You can’t delete main. Deleting the array will only generate a new snapshot where the array cannot be accessed. But the space used by the array is still needed. Again, Icechunk has to keep it around to allow you to go back and read that snapshot, or tag it, or create a branch off of it.
Temporal arrays: A common case of the previous pattern happens when a new array is created for each new data ingest. It’s somewhat common for people to append a new array to the repository instead of adding data to the time dimension of an existing array. If the user wants to maintain only the data for a given window of time, say 3 years, they are faced with the same issue. They can delete older arrays, but they cannot free their space.
Chunk-unaligned writes: Temporal data is ingested at high cadence into an array with a large chunk size along the time dimension, say daily ingests into a monthly-chunked array. On every ingest, a full monthly chunk needs to be rewritten, replacing fill values with the daily slice of data. The previous chunks cannot be deleted because, again, Icechunk allows going back to previous versions of the repository. As time passes, there is a storage explosion: we have ~30 copies of each chunk, basically with the same data, modulo fill values.
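A back-of-the-envelope calculation for the last scenario shows how quickly this adds up. The numbers are assumptions chosen for illustration (30-day months, one ingest per day, a hypothetical 100 MB per monthly chunk), and every copy is counted at full size even though chunks that are mostly fill values may compress better.

```python
# Rough storage estimate for daily ingests into a monthly-chunked array.
# All numbers are assumptions for illustration, not measurements.
DAYS_PER_MONTH = 30   # one ingest per day rewrites that month's chunk
CHUNK_SIZE_MB = 100   # hypothetical size of one monthly chunk


def stored_mb(months: int, keep_all_versions: bool) -> int:
    """Total MB stored after `months` complete months of daily ingests."""
    copies_per_chunk = DAYS_PER_MONTH if keep_all_versions else 1
    return months * copies_per_chunk * CHUNK_SIZE_MB


print(stored_mb(12, keep_all_versions=True))   # 36000
print(stored_mb(12, keep_all_versions=False))  # 1200
```

Under these assumptions, a year of full version history costs roughly 30 times the storage of the latest data alone.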

What users in this situation want is a way to “lose” versions older than a certain threshold to be able to recover space. For example, in the last scenario in the list, users may want to keep full version resolution for six months. That gives enough time to find and fix any issues. Versions older than that don’t add much value, and the cost of storing them cannot be justified.

The key point is: we can’t allow data loss in the repositories; we still want to access the data that was written months or years ago. However, to achieve significant storage cost savings, we are willing to sacrifice the ability to retrieve older versions of data that have later been overwritten or deleted.

Expiration doesn’t delete array data; it only rewrites the version history, skipping any versions that are older than the indicated threshold. If you call Repository.ancestry before and after calling expire_snapshots, you will see that the oldest versions of the repository may have disappeared. The invariant that expiration maintains is: arrays and groups visible in versions more recent than the threshold are unchanged, element by element. You don’t lose any data present in the newer versions.
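Before running an operation that cannot be undone, it can help to preview which versions fall behind the threshold. The helper below is a hypothetical, read-only sketch. It assumes repo is an opened Icechunk repository, that Repository.ancestry accepts a branch keyword, and that the objects it yields expose id and written_at attributes; verify these against the Icechunk version you use.

```python
from datetime import datetime


def expiration_candidates(repo, branch: str, older_than: datetime) -> list[str]:
    """IDs of versions in the history of `branch` written before `older_than`.

    Assumes `repo.ancestry(branch=...)` yields objects with `id` and
    `written_at` attributes. Nothing is modified by this function.
    """
    return [s.id for s in repo.ancestry(branch=branch) if s.written_at < older_than]
```

The list is only an approximation of what expiration would remove: versions that are pointed to by a branch or tag are kept regardless of their age.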

Importantly, if you care about a specific version, even if it’s very old, you can simply create a tag pointing to it. In that way, it will never get expired, even if it is older than the expire_snapshots threshold. This is useful for situations where you don’t really care about old versions except for a few that need to be maintained unchanged. Everything that is old and not pointed to by a tag can be expired.

Expiration “generates garbage” by making some old and overwritten data inaccessible or “dangling.” After expire_snapshots is run, all array chunks that were only accessible from versions older than the threshold become dangling. If you remember from the section on garbage collection, deleting dangling objects is exactly what GC is for. With a combination of expiration and GC, repositories like the ones in our list of scenarios can be maintained with a constant overhead over the actual data. Instead of continuously growing, the size of the overhead can be fixed based on how many days, weeks, or months of previous versions you want to keep.
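The interplay between expiration, tags, and garbage can be sketched with a toy model. The versions, ages, and chunk names below are invented, and the logic is a simplification rather than Icechunk’s actual algorithm.

```python
# Toy model: which chunks become garbage after expiration.
# Versions, ages, and chunk names are invented; this is not Icechunk's code.
VERSIONS = {
    # version: (age in days, chunks visible in that version)
    "v1": (400, {"a0", "b0"}),
    "v2": (300, {"a1", "b0"}),  # chunk a0 was overwritten by a1
    "v3": (10, {"a1", "b1"}),   # chunk b0 was overwritten by b1
}


def garbage_after_expiration(max_age_days: int, tagged: set[str]) -> set[str]:
    """Chunks visible only in versions that are both too old and untagged."""
    kept = {v for v, (age, _) in VERSIONS.items() if age <= max_age_days or v in tagged}
    live = set().union(*(VERSIONS[v][1] for v in kept))
    everything = set().union(*(chunks for _, chunks in VERSIONS.values()))
    return everything - live


print(sorted(garbage_after_expiration(180, tagged=set())))   # ['a0', 'b0']
print(sorted(garbage_after_expiration(180, tagged={"v2"})))  # ['a0']
```

Tagging v2 keeps chunk b0 alive, so only a0 is left for GC to delete; chunks still visible in recent versions are never affected.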

With expiration, Icechunk users can tune the tradeoff between storage and time travel as they see fit.

GC and expiration safety

Icechunk is cloud native and decentralized. To function, it only requires the object store; no other services or coordinators are needed. At the same time, it must maintain excellent performance for enormous repositories with hundreds of millions of chunks or more. In this type of scenario, operations need to be parallelized and distributed. There is no central control that can “stop the world” as expiration or GC happens, there is no possibility for checks before writing data, and we cannot afford two trips to the object store. This also means Icechunk has no locks. At any point in time, multiple processes could be reading or writing to a repository, even while GC or expiration runs.

To be able to implement garbage collection and expiration under those conditions, Icechunk makes a few assumptions you need to take into account:

No more than one GC or expiration operation is running at once.
For GC: garbage_collect(delete_object_older_than: datetime)
- delete_object_older_than should be before the start time of the oldest concurrent write session. This ensures the objects that are being written while GC runs don’t get deleted.
For expiration: expire_snapshots(older_than: datetime)
- older_than should be before the timestamp of any versions concurrently being read, listed, or used as a commit parent. This ensures other sessions are not using a version that is about to disappear.
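The two timing conditions above boil down to the same rule: pick a threshold that is earlier than the oldest activity that could still be in flight. A minimal sketch follows; the helper name and the idea of tracking session start times yourself are assumptions, since how you learn when your oldest session started depends on your own bookkeeping.

```python
from datetime import datetime, timedelta, timezone


def safe_cutoff(oldest_activity: datetime, margin: timedelta) -> datetime:
    """Return a cutoff that is `margin` earlier than the oldest concurrent activity.

    For GC, `oldest_activity` is the start time of the oldest write session
    still running. For expiration, it is the timestamp of the oldest version
    still being read, listed, or used as a commit parent.
    """
    if oldest_activity.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid off-by-hours errors")
    return oldest_activity - margin


oldest_session_start = datetime(2025, 1, 10, 8, 0, tzinfo=timezone.utc)
print(safe_cutoff(oldest_session_start, timedelta(days=7)))  # 2025-01-03 08:00:00+00:00
```

The margin is there because clocks drift and sessions run longer than expected; a generous margin costs little storage.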

How to run expiration and GC safely

There is an easy way to make sure you never violate the conditions of the previous section: don’t call GC or expiration with recent dates, and don’t call them more than once at the same time.

There is no good reason to run either of these operations too frequently or with recent timestamps. The storage savings would be marginal in the vast majority of cases.

So, what do we recommend to our friendly user who submitted the storage chart at the beginning of the post?

Run expiration every 1 or 2 months, not every 12 hours.
Select older_than such that you really no longer care about those versions, for example:

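A minimal sketch of such a call, assuming repo is an already opened Icechunk repository. The helper name and the 180-day default are assumptions; choose a window that matches how long you actually need old versions.

```python
from datetime import datetime, timedelta, timezone


def expire_old_versions(repo, keep: timedelta = timedelta(days=180)):
    """Expire versions older than `keep`; the 180-day default is an assumption."""
    cutoff = datetime.now(timezone.utc) - keep
    # `older_than` is the parameter name given in this post for expire_snapshots.
    return repo.expire_snapshots(older_than=cutoff)
```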

Run GC every 15 or 30 days.
Use a value for delete_object_older_than that is much further in the past than the start of your longest runaway session. For example:

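A minimal sketch of such a call, again assuming repo is an already opened Icechunk repository. The helper name and the 30-day default are assumptions; the only requirement is that the margin comfortably exceeds the longest write session you could ever have.

```python
from datetime import datetime, timedelta, timezone


def collect_garbage(repo, min_age: timedelta = timedelta(days=30)):
    """Delete dangling objects older than `min_age`; the default is an assumption."""
    cutoff = datetime.now(timezone.utc) - min_age
    # `delete_object_older_than` is the parameter name given in this post.
    return repo.garbage_collect(delete_object_older_than=cutoff)
```

Objects written more recently than the cutoff are left alone, which is what protects sessions that are still writing while GC runs.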

Summary

Icechunk’s garbage collection frees storage by deleting “leftover” objects that are not accessible from any branches or tags.
These objects can be the byproduct of uncommitted sessions, deleted branches/tags, or expired versions/snapshots.
Icechunk’s expiration deletes old versions that are no longer needed. Data that is only accessible from those versions (because it was deleted or rewritten in later versions) will become inaccessible.
Creating a tag or branch on a snapshot protects it from expiration, so it can be preserved for as long as needed.
Expiration can generate “garbage” that can then be freed by GC.
Never run expiration or GC passing recent dates. These operations cannot be undone and have the potential to destroy your data if not run properly. The thresholds passed to both functions need to be further in the past than the start of the longest possible session, and it’s good to leave a very large safety margin.
Earthmover will soon launch support for automatically running expiration and GC on your repositories, as a free service for users of the platform.
