Exploring Icechunk scalability: untangling S3’s prefix story
We at Earthmover recently released the Icechunk tensor storage engine, a novel cloud-optimized storage format and library for large-scale array data. Built on Rust’s tokio async runtime, Icechunk delivers impressive gains in performance over today’s array storage engines (e.g. Zarr V2, netCDF).
The goal of this post is twofold:
- Explain how S3 scaling and prefixes actually work, a topic that is still poorly documented and obscure to many developers.
- Demonstrate that Icechunk is able to scale S3 I/O horizontally into the hundreds of thousands of requests per second or more.
Motivation
A few days ago, we got the following comment from a friendly Icechunk user (paraphrased):
I noticed Icechunk places all chunks for a repo into a single prefix. This will cause an Icechunk repo to max out at 3,500 req/sec for writes and 5,500 req/sec for reads. We are trying to keep our GPUs filled with work, so I imagine we’ll hit these thresholds.
I’m sure other people share this concern, particularly because it’s not at all obvious how S3 prefixes work, even after reading the documentation.
S3 Prefixes and Automatic Sharding
S3 is basically a massive key-value database that you query via HTTP calls. The internal details of S3 are mostly hidden, but somewhere under the hood there are just regular servers and hard drives. And, like any distributed database, S3 has to make decisions about how to spread data over multiple nodes. This is called “sharding” (or “partitioning”). There are physical limits to how much data a single node can store and, more importantly, how much data it can serve efficiently.
This is what the S3 documentation says about those limits:
your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix.
Here a “partitioned prefix” clearly corresponds to a shard.
Icechunk, like many other cloud-native table formats (e.g. Apache Iceberg, Delta Lake, Lance), stores multiple data files in a single “directory” (or a common prefix) in the object store:
$ aws s3 ls earthmover-test-seba/high-concurrency/chunks/
2025-04-07 12:29:55 8 0000V2SC49BRJ8FDA9PG
2025-04-07 12:31:17 8 00024GNAQNV5CVJC2EYG
2025-04-07 12:27:08 8 0004ACMGGC353RFTRXXG
2025-04-07 12:27:52 8 0004FGY7N48RAS7AN480
2025-04-07 12:28:11 8 00052TC640R8CJXXFS50
2025-04-07 12:32:04 8 0005TJBENRAYZ76VFNW0
2025-04-07 12:30:17 8 00079Z58CX4H5WRYS200
2025-04-07 12:29:21 8 0007GJ1B27BHCM46BMT0
2025-04-07 12:32:18 8 0007MAQWCNWG77N8TEPG
2025-04-07 12:29:46 8 00087E38Q3XKYEKHVJ70
2025-04-07 12:29:30 8 00087HGA2AEC6CX3TBDG
2025-04-07 12:31:00 8 0008FPNT3HMGVCQZ8HCG
2025-04-07 12:32:28 8 000923YK10QNY1QKYRHG
...
So does that mean Icechunk cannot execute more than 5,500 chunk reads per second on S3? You can find plenty of Stack Overflow answers and blog posts that reinforce this fear.
But no, of course not. We designed Icechunk to go far beyond this scale, and we have tested it at hundreds of thousands of reads per second. That’s why we built Icechunk as a fully distributed Rust library: we want users to leverage the full power of the cloud by scaling out I/O horizontally as much as they need.
The key to the solution lies in the definition of what S3 calls a partitioned prefix. The S3 documentation could be a lot clearer about this, but let’s read it in detail. A prefix is defined on this page of the user guide:
A prefix is a string of characters at the beginning of the object key name. A prefix can be any length, subject to the maximum length of the object key name.
Then, in the best practices page, they say:
There are no limits to the number of prefixes in a bucket.
See how there is no mention of directories, special characters, or separators? The usual / character is not special in any way. This means S3 can decide to create a partition, or shard, at any prefix, including in the middle of a string, and not necessarily at a / boundary.
This wasn’t always the behavior of S3. Before 2018, partitioning was based only on the first few characters of the key. Back then, using the same prefix high-concurrency/chunks for all chunks would have limited Icechunk’s ability to scale horizontally.
At least since the 2018 announcement, S3 creates prefixes and “shards” for you, without treating any characters in your keys as special. As load hits the bucket, S3 distributes your objects across multiple servers to sustain it. This doesn’t happen instantaneously, but it’s very effective.
S3 doesn’t document any way to pre-create these shards. Instead, Icechunk’s format follows S3 recommendations by distributing its chunk keys uniformly over the key space after a common prefix. That is enough for S3 to dynamically arrive at an optimal key distribution, at least optimal according to its own algorithms.
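To make that concrete, here is a tiny illustrative sketch in Python of the idea (not Icechunk’s actual ID scheme; random_chunk_key and the my-repo/chunks/ prefix are made up for illustration): chunk keys are derived from random bytes, so they spread uniformly over the key space under the common prefix and no key range becomes a hot spot.
import base64
import secrets

def random_chunk_key(prefix: str = "my-repo/chunks/") -> str:
    # 12 random bytes -> 20 base32 characters (padding stripped).
    # Nothing after the shared prefix is predictable, so S3 is free to
    # split the key space wherever its load-based partitioning decides.
    chunk_id = base64.b32encode(secrets.token_bytes(12)).decode().rstrip("=")
    return prefix + chunk_id

print([random_chunk_key() for _ in range(3)])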
This all means that, within the same Icechunk repository, S3 could decide (and, under enough load, will) that the ideal partitioned prefixes are:
my-repo/chunks/0
my-repo/chunks/1
my-repo/chunks/2
...
my-repo/chunks/A
my-repo/chunks/B
...
For a larger repository, or one with heavier concurrent access, the prefixes could become:
my-repo/chunks/00
my-repo/chunks/01
my-repo/chunks/02
...
my-repo/chunks/A0
my-repo/chunks/A1
...
This “new” S3 behavior is shared by other object store implementations as well. Icechunk uses these object stores as efficiently as possible, taking advantage of the concurrency they can offer. Icechunk and other cloud-native formats can safely store many data files in a single “directory” without limiting concurrency and scale.
Benchmarking S3 Concurrency Using Icechunk
We have written some sample code to demonstrate these concepts. You can use this script to throw as much concurrent load at an Icechunk repository as you want. Don’t forget there is a per-request charge in most object stores, and at this concurrency, those small charges add up to tens of dollars pretty quickly.
With the script, you can create a test repository in S3, R2, or Tigris. You can populate that repository with thousands or millions of small objects, and then you can throw a lot of read load at it.
The script uses Coiled to launch multiple machines and scale out the load, so you will need a Coiled account. Any other mechanism for distribution would work too; currently the script only supports a local cluster and Coiled (PRs welcome).
# Write a million tiny chunks using two workers
# This will take a few minutes, feel free to increase number of workers
python ./high_read_concurrency.py \
--cloud s3 \
--bucket my-bucket \
--prefix my-bucket-prefix \
--bucket-region us-east-1 \
--coiled-region us-east-1 \
--cluster-name high-concurrency-tests \
--workers 2 \
create-repo --chunks 1000000
# Read concurrently from 50 machines for 60 seconds
python ./high_read_concurrency.py \
--cloud s3 \
--bucket my-bucket \
--prefix my-bucket-prefix \
--bucket-region us-east-1 \
--coiled-region us-east-1 \
--cluster-name high-concurrency-tests \
--workers 50 \
read --duration 60
The script will print the number of concurrent writes/reads it achieves.
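For intuition about what this kind of load generator boils down to, here is a bare-bones single-process sketch that throws concurrent GETs at keys under one prefix, using boto3 directly instead of Icechunk (the bucket, prefix, worker count, and request count are placeholders). A single process won’t reach the numbers below, which is exactly why the script fans the work out with Coiled.
import itertools
import time
from concurrent.futures import ThreadPoolExecutor
import boto3
from botocore.config import Config

BUCKET = "my-bucket"
PREFIX = "my-bucket-prefix/chunks/"
s3 = boto3.client("s3", config=Config(max_pool_connections=256))

# List the chunk objects once, then hit them with concurrent GETs.
pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]

def fetch(key: str) -> None:
    s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

n_requests = 100_000
start = time.monotonic()
with ThreadPoolExecutor(max_workers=256) as pool:
    for _ in pool.map(fetch, itertools.islice(itertools.cycle(keys), n_requests)):
        pass
print(f"{n_requests / (time.monotonic() - start):,.0f} reads/sec from one process")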
If you get 503 SlowDown errors, it means S3 is resharding your bucket to provide more concurrency. Give it a few minutes and try again.
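If you are driving S3 with your own boto3-based tooling rather than this script, you can also let the SDK back off for you while that resharding happens; adaptive retries are a standard botocore option.
import boto3
from botocore.config import Config

# Adaptive retry mode backs off automatically on 503 SlowDown responses
# while S3 repartitions the prefix behind the scenes.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "adaptive"}))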
Using the script, it’s easy and relatively inexpensive to reach hundreds of thousands of reads per second. For example, using 200 small machines (--workers 200) we got S3 to sustain 230,000 reads/sec. The absolute number is not the point here: the script is not optimized at all, it doesn’t even try to go fast. The important part is that Icechunk can scale far beyond the single-shard S3 limit. Feel free to play around with the script and try other object stores.
Icechunk scales!
Icechunk is built from the ground up as a cloud-native data format for high-concurrency workloads over massive datasets. You can scale your workloads to the limits of your object store and computational resources. Icechunk won’t get in the way: it won’t introduce extra latency, locks, or bandwidth limitations. For well-designed workloads, the limiting factor of your I/O will probably be your network card.