
Block-based immutable list implementation (GSoC proposal)#809

Open
Zayd-R wants to merge 6 commits into typelevel:master from Zayd-R:immutable-blocked-list-proposal

Conversation


@Zayd-R Zayd-R commented Mar 23, 2026

Summary

This is an early-stage implementation of the block-based immutable list
proposed in #634, submitted as part of GSoC work. The goal of this PR is to share the implementation and benchmark data to explore the design space before committing to a final approach.

Two implementations explored

BlockedList — copy-on-write
Every prepend into dead space copies the valid portion of the block before
writing. Fully persistent and safe for branching use cases.

FastBlockedList — write-direct **(deprecated)**
Prepend writes directly into dead space (offset - 1) without copying,
since that slot has never been pointed at by any existing node and is
invisible to all observers. When the block is full, a fresh block is allocated and
the current node becomes the tail.

Both implementations store BlockSize as a per-node parameter so the benchmark file
can sweep many block sizes without recompilation.
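For concreteness, here is a minimal sketch of the copy-on-write scheme described above. The object and field names are illustrative assumptions, not the PR's actual code, and BlockSize is a constant here rather than a per-node parameter:

```scala
object BlockedListSketch {
  val BlockSize = 32 // compile-time constant in this sketch; per-node in the PR

  sealed trait BlockedList[+T] {
    def prepend[A >: T](a: A): BlockedList[A]
  }

  case object Empty extends BlockedList[Nothing] {
    def prepend[A](a: A): BlockedList[A] = {
      val block = new Array[Any](BlockSize)
      block(BlockSize - 1) = a          // blocks fill from the right; slots
      Node(block, BlockSize - 1, Empty) // below `offset` are dead space
    }
  }

  final case class Node[+T](block: Array[Any], offset: Int, tail: BlockedList[T])
      extends BlockedList[T] {
    def prepend[A >: T](a: A): BlockedList[A] =
      if (offset == 0) {
        // block full: allocate a fresh block, this node becomes the tail
        val fresh = new Array[Any](block.length)
        fresh(block.length - 1) = a
        Node(fresh, block.length - 1, this)
      } else {
        // copy-on-write: duplicate the block before writing into the dead
        // slot, so existing references never observe a mutation
        val copy = block.clone()
        copy(offset - 1) = a
        Node(copy, offset - 1, tail)
      }
  }
}
```

The per-prepend `clone()` in the copy branch is the arraycopy cost that shows up as the linear blockSize scaling in the prepend benchmark.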

Results

All times in ns/op. Lower is better.

prepend (build a list of 10k elements from empty)

| blockSize | BlockedList | FastBlockedList | scala.List |
|----------:|------------:|----------------:|-----------:|
| 4 | 138,526 ± 4,810 | 52,328 ± 218 | 40,419 ± 114 |
| 8 | 167,084 ± 1,601 | 54,333 ± 228 | 40,460 ± 296 |
| 16 | 193,820 ± 6,885 | 56,385 ± 341 | 39,556 ± 100 |
| 32 | 246,475 ± 2,368 | 54,803 ± 337 | 39,600 ± 311 |
| 64 | 378,280 ± 74,722 | 53,403 ± 226 | 39,592 ± 164 |

Copy-on-write prepend scales linearly with blockSize due to the arraycopy cost.
Write-direct prepend is flat across block sizes and only ~30% slower than scala.List.

foreach (visit every element)

| blockSize | copyOnWrite | noCopyWrite | scala.List |
|----------:|------------:|------------:|-----------:|
| 4 | 12,198 ± 55 | 14,418 ± 154 | 21,800 ± 287 |
| 8 | 7,674 ± 34 | 9,884 ± 130 | 21,658 ± 102 |
| 16 | 7,327 ± 142 | 8,259 ± 160 | 21,735 ± 162 |
| 32 | 6,360 ± 14 | 6,533 ± 73 | 21,989 ± 197 |
| 64 | 5,402 ± 11 | 5,640 ± 31 | 21,517 ± 194 |

copyOnWrite at blockSize=64 is ~4× faster than scala.List.


foldLeft (sum all elements)

| blockSize | copyOnWrite | noCopyWrite | scala.List |
|----------:|------------:|------------:|-----------:|
| 4 | 23,763 ± 71 | 27,614 ± 159 | 28,416 ± 268 |
| 8 | 31,633 ± 81 | 26,432 ± 83 | 26,904 ± 365 |
| 16 | 32,133 ± 507 | 25,902 ± 74 | 27,862 ± 404 |
| 32 | 24,442 ± 100 | 25,630 ± 77 | 27,272 ± 325 |
| 64 | 23,888 ± 65 | 24,672 ± 80 | 27,434 ± 392 |

After the @tailrec refactor, the numbers are close, with a win for copy-on-write at blockSize=32 and 64;
the previous stack-recursive implementation was offsetting the cache advantage.


uncons (element-by-element traversal)

| blockSize | BlockedList (cow) | FastBlockedList | scala.List |
|----------:|------------------:|----------------:|-----------:|
| 4 | 82,950 ± 6,042 | 70,404 ± 488 | 16,328 ± 202 |
| 8 | 84,576 ± 8,152 | 76,493 ± 545 | 16,232 ± 97 |
| 16 | 82,985 ± 14,101 | 74,295 ± 313 | 16,284 ± 144 |
| 32 | 83,485 ± 17,600 | 73,708 ± 1,606 | 16,470 ± 171 |
| 64 | 83,291 ± 15,568 | 72,358 ± 409 | 16,255 ± 208 |

uncons is slower than scala.List as expected — each call allocates
one Some and one Tuple2. As noted in the proposal, uncons is not
the intended traversal API. The foreach/foldLeft results above are
the relevant comparison.
This is the primary weakness of the design and does not improve with block size.


map (apply a function to every element)

| blockSize | map (main) | map2 (experiment) | scala.List |
|----------:|-----------:|------------------:|-----------:|
| 4 | 44,270 ± 563 | 52,324 ± 320 | 44,246 ± 592 |
| 8 | 44,266 ± 421 | 40,149 ± 1,073 | 43,917 ± 254 |
| 16 | 39,189 ± 534 | 41,345 ± 190 | 43,800 ± 105 |
| 32 | 35,201 ± 335 | 37,904 ± 852 | 43,889 ± 148 |
| 64 | 33,211 ± 549 | 38,099 ± 4,615 | 44,589 ± 257 |

map beats scala.List by ~25% at blockSize=64. map2 (an alternative accumulation strategy)
is noisier; the error of ±4,615 ns/op at blockSize=64 indicates instability and makes it
unsuitable as the main implementation.


Key findings

  • foreach validates the proposal's cache-locality claim: ~4× faster
    than scala.List at blockSize=64 for both implementations

  • blockSize=32 or 64 appears optimal for bulk traversal operations

  • BlockedList is not a drop-in replacement for scala.List with the current design. It makes a deliberate trade-off:
    traversal-heavy operations (foreach, foldLeft, map) benefit significantly from cache locality,
    while element-at-a-time operations (prepend, uncons) are more expensive due to Array management.

I have also dropped FastBlockedList: it was only an experiment to see how much the write-direct scheme moves the benchmark numbers, but as pointed out it's unsafe under shared references, so it doesn't belong here. The file and numbers are kept for future reference if needed.

Benchmark methodology

Tool: JMH (Java Microbenchmark Harness) via sbt-jmh plugin
Mode: Average time (AverageTime)
Units: nanoseconds per operation (ns/op) — lower is better
Warmup: 5 iterations
Measurement: 10 iterations
Forks: 1
Threads: 1

Environment: JVM openjdk 25.0.2 (2026-01-20 LTS),
CPU Intel Core 5 210H,
RAM 16 GB,
OS Ubuntu 22.04

Lists are pre-built in @Setup(Level.Trial) so construction cost is
excluded from traversal measurements. The benchmark suite is included
in bench/src/main/scala/cats/bench/BlockedListBenchmark.scala and
can be reproduced with:

```
sbt "bench/jmh:run -i 10 -wi 5 -f 1 -t 1 .*BlockedList.*"
```

Transparency note

English is not my first language. I used an LLM to help
with grammar and formatting in this PR description, and to generate the
initial benchmark boilerplate code. All implementation decisions, the
identification of bugs, the analysis of benchmark results, and the core
data structure logic were worked out by me. The AI was used as a writing
and tooling aid, not as a substitute for understanding.

Introduces BlockedList (copy-on-write) and BlockedLostCopy (write-direct)
as proposed in typelevel#634. Includes JMH benchmarks comparing both
implementations against scala.List across prepend, uncons, foldLeft,
and foreach.
@Zayd-R Zayd-R marked this pull request as ready for review March 23, 2026 15:20
@Zayd-R
Author

Zayd-R commented Mar 23, 2026

I just noticed I named the implementation that writes directly with a Copy suffix; the name was only meant to differentiate it from my original copy-on-write implementation. Sorry for the confusion.

@gemelen
Collaborator

gemelen commented Mar 24, 2026

@Zayd-R thank you for working on this.

There are a few things that I'd like you to fix in your changeset:

  • revisit the PR description, fixing the typos and misnames (like BloackedLoistCopy, etc)
  • provide a description of the benchmarks: what tools you used, the methodology, how to repeat the measurements, and the units in the results you provided (time, space, op/s, etc)
  • please fix the issue raised by the CI about the missing headers

@johnynek
Contributor

Question about:

FastBlockedList  — write-direct
Prepend writes directly into dead space (offset - 1) without copying,
since that slot has never been pointed at by any existing node and is
invisible to all observers. When the block is full, a fresh block is allocated and
the current node becomes the tail.

Suppose I have list1 and I append to it to create list2. It uses the fast scheme so it does so without copying. I also make a second append to list1 to create list3. Doesn't this try to write into the same space?

I think that optimization only works in a system like Rust where you can be sure no one still has a reference to the previous value. I don't think that works here without using some mutable ownership tracking (maybe an AtomicBoolean which tracks if you own the value or not, and doing an append gives ownership to the child if you still own it, or something, otherwise clone).
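That ownership-handoff idea could be sketched roughly as follows. This is a hypothetical monomorphic sketch, not code from this PR; it handles the single-threaded sharing scenario above, and safe cross-thread publication would still need a careful memory-model review:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical ownership-tracked prepend: the first prepend on a node wins
// the CAS and may write into its dead space directly; any later prepend on
// the same node loses the CAS and must fall back to copy-on-write.
final class OwnedNode(val block: Array[Any], val offset: Int, val tail: OwnedNode) {
  private val owner = new AtomicBoolean(true)

  def prepend(a: Any): OwnedNode =
    if (offset == 0) {
      // block full: start a fresh block, this node becomes the tail
      val fresh = new Array[Any](block.length)
      fresh(block.length - 1) = a
      new OwnedNode(fresh, block.length - 1, this)
    } else if (owner.compareAndSet(true, false)) {
      // we still owned this node: write into the dead slot in place,
      // ownership transfers to the new child node
      block(offset - 1) = a
      new OwnedNode(block, offset - 1, tail)
    } else {
      // ownership was already given to an earlier child: clone first
      val copy = block.clone()
      copy(offset - 1) = a
      new OwnedNode(copy, offset - 1, tail)
    }
}
```

With this scheme, building list2 and list3 off the same list1 is safe: the second prepend loses the CAS and takes the copy path instead of clobbering the shared slot.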

}

override def prepend[A >: T](a: A): BlockedList[A] = {
val newArray = new Array[Any](BlockSize)
Contributor


It's a hypothesis that always allocating the same size could be more efficient for the GC (since it will have many identical arrays to potentially reuse), but that is just a guess. An alternative would be to allocate only as big an Array as we need, up to a maximum size. That may improve performance in practice.

@Zayd-R
Author

Zayd-R commented Mar 27, 2026

Good catch on the concurrency issue in FastBlockedList; I missed that, and you're right that it isn't safe as written. It was never intended to be the main implementation: it was purely an experiment on my side to measure how much the write-direct method would move the benchmark numbers, so it knowingly breaks the immutability rule.

On BlockSize being a per-node parameter: that's intentional for benchmarking convenience, so I can vary it without recompiling, as I mentioned in the PR description. I know it should be a compile-time constant in a real implementation and will fix that.

For foreach, foldLeft, and map: good catches on all three. I was actually wondering why my foldLeft wasn't beating scala.List; the stack recursion explains it. I'll refactor all of them to be @tailrec and stack-safe, which should then let them beat scala.List in the benchmarks.

On the array allocation strategy: the fixed-size approach was based on the same GC-reuse hypothesis you mentioned. The variable-size-up-to-a-maximum approach is interesting though; I'll benchmark both and see which holds up.

I should also edit my PDF proposal to avoid confusion, either explicitly stating that the mutating implementation was never intended to be the main one, or removing it from the proposal entirely.

@Zayd-R
Author

Zayd-R commented Mar 28, 2026

Refactored foreach, foldLeft, and map to be @tailrec and stack-safe.
map specifically is now a @tailrec accumulation plus a reverse pass at the end, both built on foldLeft, so both passes are safe.

Reran all benchmarks after the fixes: foreach is now ~4× faster than scala.List at blockSize=64, map beats it by ~25% at the same size, and foldLeft is now competitive. Numbers are in the description above.

The optimal block size across all benchmarks seems to be 32–64.
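The shape of that refactor can be sketched on a plain cons list standing in for BlockedList (illustrative only; the real implementation works block-wise):

```scala
// Stand-in cons list to show the map-as-two-foldLefts shape.
sealed trait MyList[+T]
case object MyNil extends MyList[Nothing]
final case class Cons[+T](head: T, tail: MyList[T]) extends MyList[T]

object MyList {
  // the single stack-safe primitive everything else is built on
  @annotation.tailrec
  def foldLeft[T, B](l: MyList[T], acc: B)(f: (B, T) => B): B = l match {
    case MyNil      => acc
    case Cons(h, t) => foldLeft(t, f(acc, h))(f)
  }

  // reverse is one foldLeft pass
  def reverse[T](l: MyList[T]): MyList[T] =
    foldLeft(l, MyNil: MyList[T])((acc, a) => Cons(a, acc))

  // map folds into a reversed accumulator, then reverses once at the end;
  // both passes are foldLeft, so the whole thing is stack-safe
  def map[T, B](l: MyList[T])(f: T => B): MyList[B] =
    reverse(foldLeft(l, MyNil: MyList[B])((acc, a) => Cons(f(a), acc)))
}
```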

Contributor

@johnynek johnynek left a comment


you might also want to try using: https://github.com/async-profiler/async-profiler

to get an idea where we are spending time to see if we can improve those areas first.

override def map[B](f: T => B): BlockedList[B] = {

@tailrec
def helper(current: BlockedList[T], acc: BlockedList[B]): BlockedList[B] = {
Contributor


I think a faster approach would be to build a List[Array[Any]] which is the reversed set of blocks. Then with that pre-reversed set, build back up.

So the recursion would work on:

def helper(blocks: List[Array[Any]], acc:  BlockedList[B]): BlockedList[B]

this way we don't have to reverse the blocks themselves, just the order they came in.
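A sketch of that two-phase shape, using plain Array[Any] blocks listed head-block first as a stand-in for the node chain (helper names are mine, not the PR's):

```scala
// Two-phase map over raw blocks: phase 1 conses the blocks into a List,
// which reverses their order; phase 2 maps each block and prepends the
// result, restoring the original order without ever reversing the
// contents of an individual block.
def mapBlocks(blocks: List[Array[Any]], f: Any => Any): List[Array[Any]] = {
  @annotation.tailrec
  def rev(rest: List[Array[Any]], acc: List[Array[Any]]): List[Array[Any]] =
    rest match {
      case Nil     => acc
      case b :: bs => rev(bs, b :: acc)
    }

  @annotation.tailrec
  def build(rest: List[Array[Any]], acc: List[Array[Any]]): List[Array[Any]] =
    rest match {
      case Nil => acc
      case b :: bs =>
        val mapped = new Array[Any](b.length)
        var i = 0
        while (i < b.length) { mapped(i) = f(b(i)); i += 1 }
        build(bs, mapped :: acc) // prepend: the last block lands deepest
    }

  build(rev(blocks, Nil), Nil)
}
```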

Author


So we would be doing two passes and one build of the Impl, instead of building the Impl twice. I will implement that approach and compare the benchmarks of both implementations.

Contributor


yes, but the current .reverse at the end is also a second pass that also has to reverse the Array.

@johnynek
Contributor

One way to compare the block sizes would be to look at the geometric mean of the speedup relative to List across block sizes. For instance, using a bigger block size really helps foreach, but it also really hurts prepend.
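Sketching that scoring idea in Scala (the geometricMean helper and the choice of which benchmarks to aggregate are my own illustration; the times are the blockSize=64 copy-on-write rows from the tables above):

```scala
// Geometric mean of per-benchmark speedups vs scala.List:
// speedup = listTime / implTime, so > 1.0 means faster than List.
def geometricMean(speedups: Seq[Double]): Double =
  math.exp(speedups.map(math.log).sum / speedups.size)

// blockSize=64 copy-on-write rows from the tables above (ns/op),
// in the order prepend, foreach, foldLeft, uncons, map
val listTimes = Seq(39592.0, 21517.0, 27434.0, 16255.0, 44589.0)
val cowTimes  = Seq(378280.0, 5402.0, 23888.0, 83291.0, 33211.0)
val score = geometricMean(listTimes.zip(cowTimes).map { case (l, c) => l / c })
// score comes out near 0.66: the foreach/foldLeft/map wins are outweighed
// by the prepend and uncons losses in this particular aggregate
```

Repeating this per block size would give a single number to compare block sizes with, instead of eyeballing five tables.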

}

@Benchmark
def scalaListUncons(): Unit = {
Contributor


This isn't uncons, so it isn't a fair comparison: it is just .tail. We can implement .tail on BlockedList as well, and that won't have to allocate the tuple or the Some. We should do that benchmark, but also implement def uncons[A](lst: List[A]): Option[(A, List[A])] and compare that to our implementation.

Both benchmarks are useful.
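The scala.List side of that comparison could look like this (a sketch of the suggested benchmark helper; drain is my own illustrative driver, not part of the PR):

```scala
// A scala.List uncons that pays the same Option + Tuple2 allocations as
// the BlockedList version, using the signature suggested above.
def uncons[A](lst: List[A]): Option[(A, List[A])] = lst match {
  case Nil    => None
  case h :: t => Some((h, t))
}

// Drive a whole traversal through uncons, mirroring the benchmark's
// element-by-element access pattern.
@annotation.tailrec
def drain[A](lst: List[A], count: Int = 0): Int = uncons(lst) match {
  case None         => count
  case Some((_, t)) => drain(t, count + 1)
}
```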

Author


I totally missed that tail skips the Option allocation entirely, so it's not a fair comparison against uncons. I'll add a proper def tail to BlockedList and benchmark that separately, then also add a real uncons benchmark on scala.List returning Option[(A, List[A])] so both sides are compared on equal terms.
