
Block-based immutable list implementation (GSoC proposal)#809

Open
Zayd-R wants to merge 6 commits into typelevel:master from Zayd-R:immutable-blocked-list-proposal

Conversation


@Zayd-R Zayd-R commented Mar 23, 2026

Summary

This is an early-stage implementation of the block-based immutable list
proposed in #634, submitted as part of GSoC work. The goal of this PR is to share the implementation and benchmark data to explore the design space before committing to a final approach.

Two implementations explored

BlockedList — copy-on-write
Every prepend into dead space copies the valid portion of the block before
writing. Fully persistent and safe for branching use cases.

FastBlockedList — write-direct **(deprecated)**
Prepend writes directly into dead space (offset - 1) without copying,
since that slot has never been pointed at by any existing node and is
invisible to all observers. When the block is full, a fresh block is allocated and
the current node becomes the tail.

Both implementations store BlockSize as a per-node parameter so the benchmark file
can sweep many block sizes without recompilation.
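For concreteness, here is a minimal sketch of the copy-on-write scheme described above. The object and field names are illustrative assumptions, not the PR's actual code, and BlockSize is a constant here rather than a per-node parameter:

```scala
object BlockedListSketch {
  val BlockSize = 32 // compile-time constant in this sketch; per-node in the PR

  sealed trait BlockedList[+T] {
    def prepend[A >: T](a: A): BlockedList[A]
  }

  case object Empty extends BlockedList[Nothing] {
    def prepend[A](a: A): BlockedList[A] = {
      val block = new Array[Any](BlockSize)
      block(BlockSize - 1) = a          // blocks fill from the right; slots
      Node(block, BlockSize - 1, Empty) // below `offset` are dead space
    }
  }

  final case class Node[+T](block: Array[Any], offset: Int, tail: BlockedList[T])
      extends BlockedList[T] {
    def prepend[A >: T](a: A): BlockedList[A] =
      if (offset == 0) {
        // block full: allocate a fresh block, this node becomes the tail
        val fresh = new Array[Any](block.length)
        fresh(block.length - 1) = a
        Node(fresh, block.length - 1, this)
      } else {
        // copy-on-write: duplicate the block before writing into the dead
        // slot, so existing references never observe a mutation
        val copy = block.clone()
        copy(offset - 1) = a
        Node(copy, offset - 1, tail)
      }
  }
}
```

The per-prepend `clone()` in the copy branch is the arraycopy cost that shows up as the linear blockSize scaling in the prepend benchmark.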

Results

All times in ns/op. Lower is better.

prepend (build a list of 10k elements from empty)

| blockSize | BlockedList | FastBlockedList | scala.List |
|----------:|------------:|----------------:|-----------:|
| 4 | 138,526 ± 4,810 | 52,328 ± 218 | 40,419 ± 114 |
| 8 | 167,084 ± 1,601 | 54,333 ± 228 | 40,460 ± 296 |
| 16 | 193,820 ± 6,885 | 56,385 ± 341 | 39,556 ± 100 |
| 32 | 246,475 ± 2,368 | 54,803 ± 337 | 39,600 ± 311 |
| 64 | 378,280 ± 74,722 | 53,403 ± 226 | 39,592 ± 164 |

Copy-on-write prepend scales linearly with blockSize due to the arraycopy cost.
Write-direct prepend is flat across block sizes and only ~30% slower than scala.List.

foreach (visit every element)

| blockSize | copyOnWrite | noCopyWrite | scala.List |
|----------:|------------:|------------:|-----------:|
| 4 | 12,198 ± 55 | 14,418 ± 154 | 21,800 ± 287 |
| 8 | 7,674 ± 34 | 9,884 ± 130 | 21,658 ± 102 |
| 16 | 7,327 ± 142 | 8,259 ± 160 | 21,735 ± 162 |
| 32 | 6,360 ± 14 | 6,533 ± 73 | 21,989 ± 197 |
| 64 | 5,402 ± 11 | 5,640 ± 31 | 21,517 ± 194 |

copyOnWrite at blockSize=64 is ~4× faster than scala.List.


foldLeft (sum all elements)

| blockSize | copyOnWrite | noCopyWrite | scala.List |
|----------:|------------:|------------:|-----------:|
| 4 | 23,763 ± 71 | 27,614 ± 159 | 28,416 ± 268 |
| 8 | 31,633 ± 81 | 26,432 ± 83 | 26,904 ± 365 |
| 16 | 32,133 ± 507 | 25,902 ± 74 | 27,862 ± 404 |
| 32 | 24,442 ± 100 | 25,630 ± 77 | 27,272 ± 325 |
| 64 | 23,888 ± 65 | 24,672 ± 80 | 27,434 ± 392 |

After the @tailrec refactor, the numbers are close, with a win for copy-on-write at blockSize=32 and 64;
the previous stack-recursive implementation was offsetting the cache advantage.


uncons (element-by-element traversal)

| blockSize | BlockedList (cow) | FastBlockedList | scala.List |
|----------:|------------------:|----------------:|-----------:|
| 4 | 82,950 ± 6,042 | 70,404 ± 488 | 16,328 ± 202 |
| 8 | 84,576 ± 8,152 | 76,493 ± 545 | 16,232 ± 97 |
| 16 | 82,985 ± 14,101 | 74,295 ± 313 | 16,284 ± 144 |
| 32 | 83,485 ± 17,600 | 73,708 ± 1,606 | 16,470 ± 171 |
| 64 | 83,291 ± 15,568 | 72,358 ± 409 | 16,255 ± 208 |

uncons is slower than scala.List as expected — each call allocates
one Some and one Tuple2. As noted in the proposal, uncons is not
the intended traversal API. The foreach/foldLeft results above are
the relevant comparison.
This is the primary weakness of the design and does not improve with block size.


map (apply a function to every element)

| blockSize | map (main) | map2 (experiment) | scala.List |
|----------:|-----------:|------------------:|-----------:|
| 4 | 44,270 ± 563 | 52,324 ± 320 | 44,246 ± 592 |
| 8 | 44,266 ± 421 | 40,149 ± 1,073 | 43,917 ± 254 |
| 16 | 39,189 ± 534 | 41,345 ± 190 | 43,800 ± 105 |
| 32 | 35,201 ± 335 | 37,904 ± 852 | 43,889 ± 148 |
| 64 | 33,211 ± 549 | 38,099 ± 4,615 | 44,589 ± 257 |

map beats scala.List by ~25% at blockSize=64. map2 (an alternative accumulation strategy)
is noisier; the error of ±4,615 ns/op at blockSize=64 indicates instability and makes it
unsuitable as the main implementation.


Key findings

  • foreach validates the proposal's cache-locality claim: ~4× faster
    than scala.List at blockSize=64 for both implementations

  • blockSize=32 or 64 appears optimal for bulk traversal operations

  • BlockedList is not a drop-in replacement for scala.List with the current design. It makes a deliberate trade-off:
    traversal-heavy operations (foreach, foldLeft, map) benefit significantly from cache locality,
    while element-at-a-time operations (prepend, uncons) are more expensive due to Array management.

I have also dropped FastBlockedList: it was only an experiment to see how much the write-direct scheme moves the benchmark numbers, but as pointed out it's unsafe under shared references, so it doesn't belong here. The file and numbers are kept for future reference if needed.

Benchmark methodology

Tool: JMH (Java Microbenchmark Harness) via sbt-jmh plugin
Mode: Average time (AverageTime)
Units: nanoseconds per operation (ns/op) — lower is better
Warmup: 5 iterations
Measurement: 10 iterations
Forks: 1
Threads: 1

Environment: JVM openjdk 25.0.2 (2026-01-20 LTS),
CPU Intel Core 5 210H,
RAM 16 GB,
OS Ubuntu 22.04

Lists are pre-built in @Setup(Level.Trial) so construction cost is
excluded from traversal measurements. The benchmark suite is included
in bench/src/main/scala/cats/bench/BlockedListBenchmark.scala and
can be reproduced with:

```
sbt "bench/jmh:run -i 10 -wi 5 -f 1 -t 1 .*BlockedList.*"
```

Transparency note

English is not my first language. I used an LLM to help
with grammar and formatting in this PR description, and to generate the
initial benchmark boilerplate code. All implementation decisions, the
identification of bugs, the analysis of benchmark results, and the core
data structure logic were worked out by me. The AI was used as a writing
and tooling aid, not as a substitute for understanding.

Introduces BlockedList (copy-on-write) and BlockedLostCopy (write-direct)
as proposed in typelevel#634. Includes JMH benchmarks comparing both
implementations against scala.List across prepend, uncons, foldLeft,
and foreach.
@Zayd-R Zayd-R marked this pull request as ready for review March 23, 2026 15:20
@Zayd-R
Author

Zayd-R commented Mar 23, 2026

I just noticed I named the implementation that writes directly with a Copy suffix; the name was only meant to differentiate it from my original copy-on-write implementation. Sorry for the confusion.

@gemelen
Collaborator

gemelen commented Mar 24, 2026

@Zayd-R thank you for working on this.

There are a few things that I'd like you to fix in your changeset:

  • revisit the PR description, fixing the typos and misnames (like BloackedLoistCopy, etc)
  • provide a description of the benchmarks: what tools you used, the methodology, how to repeat the measurements, and the units in the results you provided (time, space, op/s, etc)
  • please fix the issue raised by the CI about the missing headers

@johnynek
Contributor

Question about:

FastBlockedList  — write-direct
Prepend writes directly into dead space (offset - 1) without copying,
since that slot has never been pointed at by any existing node and is
invisible to all observers. When the block is full, a fresh block is allocated and
the current node becomes the tail.

Suppose I have list1 and I append to it to create list2. It uses the fast scheme so it does so without copying. I also make a second append to list1 to create list3. Doesn't this try to write into the same space?

I think that optimization only works in a system like Rust where you can be sure no one still has a reference to the previous value. I don't think that works here without using some mutable ownership tracking (maybe an AtomicBoolean which tracks if you own the value or not, and doing an append gives ownership to the child if you still own it, or something, otherwise clone).
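That ownership-handoff idea could be sketched roughly as follows. This is a hypothetical monomorphic sketch, not code from this PR; it handles the single-threaded sharing scenario above, and safe cross-thread publication would still need a careful memory-model review:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical ownership-tracked prepend: the first prepend on a node wins
// the CAS and may write into its dead space directly; any later prepend on
// the same node loses the CAS and must fall back to copy-on-write.
final class OwnedNode(val block: Array[Any], val offset: Int, val tail: OwnedNode) {
  private val owner = new AtomicBoolean(true)

  def prepend(a: Any): OwnedNode =
    if (offset == 0) {
      // block full: start a fresh block, this node becomes the tail
      val fresh = new Array[Any](block.length)
      fresh(block.length - 1) = a
      new OwnedNode(fresh, block.length - 1, this)
    } else if (owner.compareAndSet(true, false)) {
      // we still owned this node: write into the dead slot in place,
      // ownership transfers to the new child node
      block(offset - 1) = a
      new OwnedNode(block, offset - 1, tail)
    } else {
      // ownership was already given to an earlier child: clone first
      val copy = block.clone()
      copy(offset - 1) = a
      new OwnedNode(copy, offset - 1, tail)
    }
}
```

With this scheme, building list2 and list3 off the same list1 is safe: the second prepend loses the CAS and takes the copy path instead of clobbering the shared slot.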

}

override def prepend[A >: T](a: A): BlockedList[A] = {
val newArray = new Array[Any](BlockSize)
Contributor


It's a hypothesis that always allocating the same size could be more efficient for the GC (since it will have many identical arrays to potentially reuse), but that is just a guess. An alternative would be to allocate only as big an Array as we need, up to a maximum size. That may improve performance in practice.

@Zayd-R
Author

Zayd-R commented Mar 27, 2026

Good catch on the concurrency issue in FastBlockedList; I missed that, and you're right that it isn't safe as written. It was never intended to be the main implementation: it was purely an experiment on my side to measure how much the write-direct method would move the benchmark numbers, so it knowingly breaks the immutability rule.

On BlockSize being a per-node parameter: that's intentional for benchmarking convenience, so I can vary it without recompiling, as I mentioned in the PR description. I know it should be a compile-time constant in a real implementation and will fix that.

For foreach, foldLeft, and map: good catches on all three. I was actually wondering why my foldLeft wasn't beating scala.List; the stack recursion explains it. I'll refactor all of them to be @tailrec and stack-safe, which should then let them beat scala.List in the benchmarks.

On the array allocation strategy: the fixed-size approach was based on the same GC-reuse hypothesis you mentioned. The variable-size-up-to-a-maximum approach is interesting though; I'll benchmark both and see which holds up.

I should also edit my PDF proposal to avoid confusion, either explicitly stating that the mutating implementation was never intended to be the main one, or removing it from the proposal entirely.

@Zayd-R
Author

Zayd-R commented Mar 28, 2026

Refactored foreach, foldLeft, and map to be @tailrec and stack-safe.
map specifically is now a @tailrec accumulation plus a reverse pass at the end, both built on foldLeft, so both passes are safe.

Reran all benchmarks after the fixes: foreach is now ~4× faster than scala.List at blockSize=64, map beats it by ~25% at the same size, and foldLeft is now competitive. Numbers are in the description above.

The optimal block size across all benchmarks seems to be 32–64.
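The shape of that refactor can be sketched on a plain cons list standing in for BlockedList (illustrative only; the real implementation works block-wise):

```scala
// Stand-in cons list to show the map-as-two-foldLefts shape.
sealed trait MyList[+T]
case object MyNil extends MyList[Nothing]
final case class Cons[+T](head: T, tail: MyList[T]) extends MyList[T]

object MyList {
  // the single stack-safe primitive everything else is built on
  @annotation.tailrec
  def foldLeft[T, B](l: MyList[T], acc: B)(f: (B, T) => B): B = l match {
    case MyNil      => acc
    case Cons(h, t) => foldLeft(t, f(acc, h))(f)
  }

  // reverse is one foldLeft pass
  def reverse[T](l: MyList[T]): MyList[T] =
    foldLeft(l, MyNil: MyList[T])((acc, a) => Cons(a, acc))

  // map folds into a reversed accumulator, then reverses once at the end;
  // both passes are foldLeft, so the whole thing is stack-safe
  def map[T, B](l: MyList[T])(f: T => B): MyList[B] =
    reverse(foldLeft(l, MyNil: MyList[B])((acc, a) => Cons(f(a), acc)))
}
```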

Contributor

@johnynek johnynek left a comment


you might also want to try using: https://github.com/async-profiler/async-profiler

to get an idea where we are spending time to see if we can improve those areas first.

override def map[B](f: T => B): BlockedList[B] = {

@tailrec
def helper(current: BlockedList[T], acc: BlockedList[B]): BlockedList[B] = {
Contributor


I think a faster approach would be to build a List[Array[Any]] which is the reversed set of blocks. Then with that pre-reversed set, build back up.

So the recursion would work on:

def helper(blocks: List[Array[Any]], acc:  BlockedList[B]): BlockedList[B]

this way we don't have to reverse the blocks themselves, just the order they came in.
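A sketch of that two-phase shape, using plain Array[Any] blocks listed head-block first as a stand-in for the node chain (helper names are mine, not the PR's):

```scala
// Two-phase map over raw blocks: phase 1 conses the blocks into a List,
// which reverses their order; phase 2 maps each block and prepends the
// result, restoring the original order without ever reversing the
// contents of an individual block.
def mapBlocks(blocks: List[Array[Any]], f: Any => Any): List[Array[Any]] = {
  @annotation.tailrec
  def rev(rest: List[Array[Any]], acc: List[Array[Any]]): List[Array[Any]] =
    rest match {
      case Nil     => acc
      case b :: bs => rev(bs, b :: acc)
    }

  @annotation.tailrec
  def build(rest: List[Array[Any]], acc: List[Array[Any]]): List[Array[Any]] =
    rest match {
      case Nil => acc
      case b :: bs =>
        val mapped = new Array[Any](b.length)
        var i = 0
        while (i < b.length) { mapped(i) = f(b(i)); i += 1 }
        build(bs, mapped :: acc) // prepend: the last block lands deepest
    }

  build(rev(blocks, Nil), Nil)
}
```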

Author


So we would be doing two passes and one build of the Impl, instead of building the Impl twice. I will implement that approach and compare the benchmarks of both implementations.

Contributor


yes, but the current .reverse at the end is also a second pass that also has to reverse the Array.

@johnynek
Contributor

One way to compare the block sizes would be to look at the geometric mean of the speedup relative to List across block sizes. For instance, using a bigger block size really helps foreach, but it also really hurts prepend.
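Sketching that scoring idea in Scala (the geometricMean helper and the choice of which benchmarks to aggregate are my own illustration; the times are the blockSize=64 copy-on-write rows from the tables above):

```scala
// Geometric mean of per-benchmark speedups vs scala.List:
// speedup = listTime / implTime, so > 1.0 means faster than List.
def geometricMean(speedups: Seq[Double]): Double =
  math.exp(speedups.map(math.log).sum / speedups.size)

// blockSize=64 copy-on-write rows from the tables above (ns/op),
// in the order prepend, foreach, foldLeft, uncons, map
val listTimes = Seq(39592.0, 21517.0, 27434.0, 16255.0, 44589.0)
val cowTimes  = Seq(378280.0, 5402.0, 23888.0, 83291.0, 33211.0)
val score = geometricMean(listTimes.zip(cowTimes).map { case (l, c) => l / c })
// score comes out near 0.66: the foreach/foldLeft/map wins are outweighed
// by the prepend and uncons losses in this particular aggregate
```

Repeating this per block size would give a single number to compare block sizes with, instead of eyeballing five tables.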

}

@Benchmark
def scalaListUncons(): Unit = {
Contributor


This isn't uncons, so it isn't a fair comparison: it is just .tail. We can implement .tail on BlockedList as well, and that won't have to allocate the tuple or the Some. We should do that benchmark, but also implement def uncons[A](lst: List[A]): Option[(A, List[A])] and compare that to our implementation.

Both benchmarks are useful.
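The scala.List side of that comparison could look like this (a sketch of the suggested benchmark helper; drain is my own illustrative driver, not part of the PR):

```scala
// A scala.List uncons that pays the same Option + Tuple2 allocations as
// the BlockedList version, using the signature suggested above.
def uncons[A](lst: List[A]): Option[(A, List[A])] = lst match {
  case Nil    => None
  case h :: t => Some((h, t))
}

// Drive a whole traversal through uncons, mirroring the benchmark's
// element-by-element access pattern.
@annotation.tailrec
def drain[A](lst: List[A], count: Int = 0): Int = uncons(lst) match {
  case None         => count
  case Some((_, t)) => drain(t, count + 1)
}
```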

Author


I totally missed that tail skips the Option allocation entirely, so it's not a fair comparison against uncons. I'll add a proper def tail to BlockedList and benchmark that separately, then also add a real uncons benchmark on scala.List returning Option[(A, List[A])] so both sides are compared on equal terms.
