Block-based immutable list implementation (GSoC proposal) #809
Zayd-R wants to merge 6 commits into typelevel:master
Conversation
Introduces BlockedList (copy-on-write) and BlockedLostCopy (write-direct) as proposed in typelevel#634. Includes JMH benchmarks comparing both implementations against scala.List across prepend, uncons, foldLeft, and foreach.
I just noticed I named the implementation that writes directly with
@Zayd-R thank you for working on this. There are a few things that I'd like you to fix in your changeset:
…eaders. Please let me know if anything else needs attention.
Question about this: suppose I have list1 and I append to it to create list2. It uses the fast scheme, so it does so without copying. I then make a second append to list1 to create list3. Doesn't this try to write into the same slot? I think that optimization only works in a system like Rust, where you can be sure no one still has a reference to the previous value. I don't think it works here without some mutable ownership tracking (maybe an AtomicBoolean that tracks whether you own the value; an append then transfers ownership to the child if you still own it, and otherwise clones).
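The hazard described above can be shown with a minimal sketch. The `Block` class below is a hypothetical simplification (not the PR's actual code): two lists share one backing array, and both claim the same free slot, so the second writer silently corrupts the first.

```scala
// Minimal model of the write-direct scheme (hypothetical names).
// `arr` is shared between every list built from the same block.
final class Block(val arr: Array[Any], val used: Int) {
  // Write-direct append: reuse the same array if there is dead space.
  def appendUnsafe(a: Any): Block = {
    arr(used) = a            // mutates the SHARED array
    new Block(arr, used + 1)
  }
}

val list1 = new Block(new Array[Any](4), 1)
list1.arr(0) = "a"

val list2 = list1.appendUnsafe("b") // writes slot 1
val list3 = list1.appendUnsafe("c") // overwrites slot 1: list2 is corrupted

// list2 now observes "c" in its slot 1, not the "b" it wrote.
assert(list2.arr(1) == "c")
```

This is exactly the branching case the reviewer describes: the optimization is only sound when the writer is the unique owner of the block.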
override def prepend[A >: T](a: A): BlockedList[A] = {
  val newArray = new Array[Any](BlockSize)
It's a hypothesis that always allocating the same size could be more efficient for the GC (since it will potentially have many identical arrays to reuse), but that is just a guess. An alternative would be to allocate only as big an Array as we need, up to a maximum size. That may improve performance in practice.
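For context, a simplified copy-on-write prepend along the lines the diff suggests might look like the sketch below. The `Node` type and its layout (elements stored from `offset` to the end of the block) are assumptions for illustration, not the PR's actual representation.

```scala
// Hypothetical simplified copy-on-write prepend with fixed-size blocks.
val BlockSize = 32

final case class Node(block: Array[Any], offset: Int, rest: Option[Node]) {
  // Elements live in block(offset until BlockSize); prepend fills offset - 1.
  def prepend(a: Any): Node =
    if (offset > 0) {
      val newArray = new Array[Any](BlockSize)
      // Copy only the valid portion, leaving dead space untouched.
      System.arraycopy(block, offset, newArray, offset, BlockSize - offset)
      newArray(offset - 1) = a
      Node(newArray, offset - 1, rest)
    } else {
      // Block full at the front: allocate a fresh block, current node
      // becomes the tail.
      val newArray = new Array[Any](BlockSize)
      newArray(BlockSize - 1) = a
      Node(newArray, BlockSize - 1, Some(this))
    }
}
```

The `arraycopy` of the valid portion is what makes this prepend cost scale linearly with block size, which matches the benchmark observation later in this PR.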
Good catch on the concurrency issue. On the array allocation strategy, the fixed-size approach was based on the same GC reuse hypothesis you mentioned. The variable-size-up-to-a-maximum approach is interesting though; I'll try benchmarking both and see which holds up. I should also edit my PDF proposal to avoid confusion and explicitly mention that the mutation implementation was not intended to be the main implementation, or remove it from the proposal entirely.
Refactored and reran all benchmarks after the fixes. The optimal block size across all benchmarks seems to be 32–64.
johnynek left a comment
You might also want to try using https://github.com/async-profiler/async-profiler to get an idea of where we are spending time, so we can see which areas to improve first.
override def map[B](f: T => B): BlockedList[B] = {

@tailrec
def helper(current: BlockedList[T], acc: BlockedList[B]): BlockedList[B] = {
I think a faster approach would be to build a List[Array[Any]] which is the reversed set of blocks. Then with that pre-reversed set, build back up.
So the recursion would work on:
def helper(blocks: List[Array[Any]], acc: BlockedList[B]): BlockedList[B]
this way we don't have to reverse the blocks themselves, just the order they came in.
So we would be doing two passes and one building of Impl, instead of building Impl twice. I will implement that approach and compare the benchmarks of both implementations.
Yes, but the current .reverse at the end is also a second pass, one that additionally has to reverse each Array.
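The suggested two-pass shape can be sketched on a simplified block representation (a plain `List[Array[Any]]` standing in for the real node chain; names and types here are illustrative, not the PR's API). Pass one reverses the order of the blocks without touching their contents; pass two maps each block once while folding back into the original order.

```scala
// Two-pass map over blocks: reverse the block ORDER, not the blocks
// themselves, then map each block exactly once while rebuilding.
def mapBlocks[B](blocks: List[Array[Any]], f: Any => B): List[Array[Any]] = {
  // Pass 1: blocks in reversed order (cheap, no per-element work).
  val reversed = blocks.reverse
  // Pass 2: map each block once; prepending restores the original order.
  reversed.foldLeft(List.empty[Array[Any]]) { (acc, block) =>
    block.map(a => f(a): Any) :: acc
  }
}
```

Each element is visited once by `f`, and no block's contents ever need reversing, which is the saving over mapping forward and calling `.reverse` at the end.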
One way to compare the block sizes would be to look at the geometric mean of the speedup relative to List across benchmarks. For instance, using a bigger block size really helps foreach, but it also really hurts prepend.
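Concretely, the comparison could be computed as below. The speedup figures are made-up illustrative numbers (not this PR's results), chosen only to show how a block size that wins one benchmark can still lose on the geometric mean.

```scala
// Geometric mean of per-benchmark speedups (List time / BlockedList time).
def geoMean(speedups: Seq[Double]): Double =
  math.exp(speedups.map(math.log).sum / speedups.size)

// Hypothetical speedups for (prepend, foreach, foldLeft):
val bs32  = Seq(0.7, 4.0, 1.1)
val bs256 = Seq(0.3, 5.0, 1.2)

// Bigger blocks help foreach (5.0 vs 4.0) but hurt prepend (0.3 vs 0.7),
// so the geometric mean favors the smaller block size here.
println(geoMean(bs32))  // ≈ 1.46
println(geoMean(bs256)) // ≈ 1.22
```

The geometric mean is the standard choice for averaging ratios, since it is symmetric in speedups and slowdowns.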
@Benchmark
def scalaListUncons(): Unit = {
This isn't uncons, so it isn't a fair comparison; it is just .tail. We can implement .tail on BlockedList as well, and that won't have to allocate the tuple or the Some. We should do that benchmark, but also implement def uncons[A](lst: List[A]): Option[(A, List[A])] and compare that to our implementation.
Both benchmarks are useful.
I totally missed that tail skips the Option allocation entirely, so it's not a fair comparison against uncons. I'll add a proper def tail to BlockedList and benchmark that separately, then also add a real uncons benchmark on scala.List using Option[(A, List[A])] so both sides are compared on equal terms.
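For reference, the scala.List side of the fair comparison could look like the sketch below; the `uncons` signature is the one the reviewer gave, and the body is a straightforward pattern match.

```scala
// uncons on scala.List, allocating one Some and one Tuple2 per call,
// matching what the BlockedList uncons pays.
def uncons[A](lst: List[A]): Option[(A, List[A])] = lst match {
  case head :: tail => Some((head, tail))
  case Nil          => None
}

val xs = List(1, 2, 3)
assert(uncons(xs) == Some((1, List(2, 3))))
assert(uncons(Nil) == None)
// By contrast, .tail allocates nothing; it just drops the head:
assert(xs.tail == List(2, 3))
```

Benchmarking both `tail` vs `tail` and `uncons` vs `uncons` then isolates the allocation cost from the traversal cost.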
Summary
This is an early-stage implementation of the block-based immutable list
proposed in #634, submitted as part of GSoC work. The goal of this PR is to share the implementation and benchmark data to explore the design space before committing to a final approach.
Two implementations explored

- BlockedList (copy-on-write): every prepend into dead space copies the valid portion of the block before writing. Fully persistent and safe for branching use cases.
- FastBlockedList (write-direct, deprecated): prepend writes directly into dead space (offset - 1) without copying, since that slot has never been pointed at by any existing node and is invisible to all observers. When the block is full, a fresh block is allocated and the current node becomes the tail.

Both implementations store BlockSize per node so that many sizes can be tested in the benchmark file without recompilation.
Results
All times in ns/op. Lower is better.
prepend (build a list of 10k elements from empty)

Copy-on-write prepend scales linearly with blockSize due to the arraycopy cost. Write-direct prepend is flat across block sizes and only ~30% slower than scala.List.

foreach (visit every element)

copyOnWrite at blockSize=64 is ~4× faster than scala.List.

foldLeft (sum all elements)
After the @tailrec refactor, the numbers are close, with a win for the copy-on-write at blockSize=32 and 64; the previous stack-recursive implementation was offsetting the cache advantage.
uncons (element-by-element traversal)
uncons is slower than scala.List, as expected: each call allocates one Some and one Tuple2. As noted in the proposal, uncons is not the intended traversal API; the foreach/foldLeft results above are the relevant comparison.

This is the primary weakness of the design and does not improve with block size.
map (apply a function to every element)

map beats scala.List by ~25% at blockSize=64. map2 (an alternative accumulation strategy) is noisier; the error of ±4,615 ns/op at bs=64 indicates instability and makes it unsuitable as a main implementation.
Key findings

- foreach validates the proposal's cache-locality claim: ~4× faster than scala.List at blockSize=64 for both implementations.
- blockSize=32 or 64 appears optimal for bulk traversal operations.
- BlockedList is not a replacement for scala.List with this current design. It makes a deliberate trade-off: traversal-heavy operations (foreach, foldLeft, map) benefit significantly from cache locality, while prepend and uncons are more expensive due to the Array management.

Also dropped FastBlockedList; it was just an experiment to see how much the write-direct scheme moves the benchmark numbers, but as pointed out it's unsafe under shared references, so it doesn't belong here. The file and numbers are kept for future reference if needed.
Benchmark methodology
- Tool: JMH (Java Microbenchmark Harness) via the sbt-jmh plugin
- Mode: average time (AverageTime)
- Units: nanoseconds per operation (ns/op); lower is better
- Warmup: 5 iterations
- Measurement: 10 iterations
- Forks: 1
- Threads: 1
- Environment: JVM openjdk 25.0.2 (2026-01-20 LTS), CPU Intel Core 5 210H, 16 GB RAM, Ubuntu 22.04
Lists are pre-built in @Setup(Level.Trial) so construction cost is excluded from traversal measurements. The benchmark suite is included in bench/src/main/scala/cats/bench/BlockedListBenchmark.scala and can be reproduced with the sbt-jmh plugin.
Transparency note
English is not my first language. I used an LLM to help
with grammar and formatting in this PR description, and to generate the
initial benchmark boilerplate code. All implementation decisions, the
identification of bugs, the analysis of benchmark results, and the core
data structure logic were worked out by me. The AI was used as a writing
and tooling aid, not as a substitute for understanding.