
Core, Data, Lance: Add Lance file format support via File Format API #15751

Draft

tardunge wants to merge 1 commit into apache:main from tardunge:feat/lance-format-poc


@tardunge

This is a draft/POC — not requesting merge. Opening for visibility and discussion.

Relates to #12225 (File Format API) and #13438 (Lance integration discussion).

Summary

Working proof-of-concept implementing Lance as an Iceberg file format using the
File Format API (#12774). Lance is a columnar format for AI/ML workloads with
100x faster random access than Parquet, native vector search, and an Arrow-native
data model.

This is the first new format added through the pluggable File Format API.

What works

  • Full read/write round-trip via FormatModel / ReadBuilder / WriteBuilder
  • Spark 4.1 SQL: CREATE TABLE, INSERT INTO, SELECT, column projection, WHERE filtering
  • Both vectorized (ColumnarBatch via ArrowColumnVector) and row-based (InternalRow) read paths
  • 19 unit tests (round-trip, schema preservation, nulls, projection, batch sizing, file length)
  • Iceberg schema stored in Lance file metadata via Arrow field ID preservation
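The last bullet, preserving Iceberg field IDs through Arrow field metadata, can be sketched as follows. This is a self-contained illustration of the round-trip, not the actual LanceSchemaUtil code; the metadata key name is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of how Iceberg field IDs can survive a round-trip through
// Arrow-style field metadata (a string->string map attached to each field).
// The key name below is hypothetical; the real LanceSchemaUtil may differ.
public class FieldIdRoundTrip {
  static final String FIELD_ID_KEY = "ICEBERG:field_id"; // assumed key name

  // Iceberg -> Arrow direction: stash the numeric field ID in field metadata.
  static Map<String, String> toArrowMetadata(int icebergFieldId) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put(FIELD_ID_KEY, Integer.toString(icebergFieldId));
    return Collections.unmodifiableMap(metadata);
  }

  // Arrow -> Iceberg direction: recover the ID, or -1 when absent.
  static int fromArrowMetadata(Map<String, String> metadata) {
    String id = metadata.get(FIELD_ID_KEY);
    return id == null ? -1 : Integer.parseInt(id);
  }

  public static void main(String[] args) {
    Map<String, String> meta = toArrowMetadata(3);
    System.out.println(fromArrowMetadata(meta)); // prints 3
  }
}
```

Because the IDs ride along inside the serialized Arrow schema, any format that stores the Arrow schema in its file metadata gets Iceberg schema resolution back for free on read.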

Spark SQL demo

CREATE TABLE t (id INT, name STRING, score DOUBLE)
USING iceberg TBLPROPERTIES ('write.format.default' = 'lance');

INSERT INTO t VALUES (1, 'alice', 95.5), (2, 'bob', 87.3), (3, 'charlie', 92.1);
SELECT * FROM t WHERE score > 90;
-- 1  alice    95.5
-- 3  charlie  92.1

Changes

New module: lance/

| File | Purpose |
| --- | --- |
| LanceFormatModel | FormatModel impl with WriteBuilderWrapper / ReadBuilderWrapper |
| LanceFileAppender | Bridges record-at-a-time add(D) to batch write(VectorSchemaRoot) |
| LanceSchemaUtil | Bidirectional Iceberg ↔ Arrow schema conversion with field ID preservation |
| LanceArrowConverter | Row-level Record ↔ Arrow vector type conversion |
| GenericLanceReader/Writer | ReaderFunction / WriterFunction for generic Record |
| LanceFormatModels | Registration entry point (auto-discovered by FormatModelRegistry) |
| spark/SparkLanceFormatModels | Registers Lance for ColumnarBatch + InternalRow |
| spark/SparkLanceColumnarReader | Zero-copy Arrow → ColumnarBatch via ArrowColumnVector |
| spark/SparkLanceRowReader | Arrow → GenericInternalRow row-by-row conversion |
| spark/SparkLanceWriter | InternalRow → Arrow conversion for Lance writes |
| LANCE_SDK_GAPS.md | Documents 5 gaps in the Lance Java SDK |
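The LanceFileAppender bridge above boils down to buffering record-at-a-time add() calls and flushing whole batches to a batch-oriented sink. A minimal sketch of that pattern (class and method names are illustrative, not the real appender):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the add(D) -> batch-write bridge used by LanceFileAppender:
// rows are buffered and handed to a batch sink (standing in for
// write(VectorSchemaRoot)) whenever the buffer fills or close() is called.
public class BatchingAppender<D> {
  private final int batchSize;
  private final Consumer<List<D>> batchSink; // stand-in for the Arrow batch write
  private final List<D> buffer = new ArrayList<>();

  BatchingAppender(int batchSize, Consumer<List<D>> batchSink) {
    this.batchSize = batchSize;
    this.batchSink = batchSink;
  }

  void add(D row) {
    buffer.add(row);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  void close() {
    flush(); // write any trailing partial batch
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      batchSink.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  public static void main(String[] args) {
    BatchingAppender<String> app =
        new BatchingAppender<>(2, batch -> System.out.println("wrote batch of " + batch.size()));
    app.add("a");
    app.add("b"); // buffer full: flushes a batch of 2
    app.add("c");
    app.close(); // flushes the trailing batch of 1
  }
}
```

The important detail is the final flush in close(): without it, a trailing partial batch would be silently dropped.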

Modified files

| File | Change |
| --- | --- |
| FileFormat.java | Add LANCE("lance", false) |
| FormatModelRegistry.java | Add LanceFormatModels to CLASSES_TO_REGISTER |
| settings.gradle | Register lance module |
| build.gradle | Add project(':iceberg-lance') block |
| spark/v4.1/build.gradle | Exclude Lance from shadow jar |
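For context on the FileFormat.java row, LANCE("lance", false) is an enum constant carrying a file suffix and a splittable flag. A simplified, self-contained version of that enum shape (the real Iceberg enum has more members and methods):

```java
// Simplified sketch of the FileFormat enum pattern: each constant carries a
// file extension and a "splittable" flag. LANCE("lance", false) mirrors the
// change described above; the other details here are illustrative only.
public enum FileFormat {
  PARQUET("parquet", true),
  AVRO("avro", true),
  LANCE("lance", false); // new: Lance files are not splittable yet

  private final String ext;
  private final boolean splittable;

  FileFormat(String ext, boolean splittable) {
    this.ext = ext;
    this.splittable = splittable;
  }

  public boolean isSplittable() {
    return splittable;
  }

  // e.g. LANCE.addExtension("data/f1") -> "data/f1.lance"
  public String addExtension(String filename) {
    return filename.endsWith("." + ext) ? filename : filename + "." + ext;
  }
}
```

The false flag is what feeds the "one task per file" split-planning gap noted later: a non-splittable format can only be scanned one whole file per task.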

Architecture decisions

Arrow relocation: The Spark runtime shadow jar relocates org.apache.arrow.
Lance uses Arrow via JNI (Rust native code with hardcoded Java class names), so
Lance code cannot be relocated. All Lance code (including Spark readers/writers)
lives in the lance/ module outside the shadow jar. The FormatModel interface
boundary has zero Arrow imports, so relocated and unrelocated code never exchange
Arrow objects.

Runtime classpath:

--jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar

Spark registration: LanceFormatModels.register() conditionally registers
Spark format models via Class.forName("...InternalRow") — skipped when Spark
is not on the classpath.
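The conditional registration described above can be sketched like this; the probe class name in main is illustrative of the pattern, not necessarily the one the actual code checks:

```java
// Sketch of classpath-conditional registration via Class.forName, as in
// LanceFormatModels.register(): Spark-specific format models are registered
// only when a Spark class can be loaded, so the generic module works without
// Spark on the classpath.
public class ConditionalRegistration {
  static boolean isOnClasspath(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Always register the generic models; add Spark models only if present.
    System.out.println("generic models registered");
    if (isOnClasspath("org.apache.spark.sql.catalyst.InternalRow")) {
      System.out.println("spark models registered");
    } else {
      System.out.println("spark not on classpath; skipping");
    }
  }
}
```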

Known gaps

| Gap | Impact | Owner |
| --- | --- | --- |
| File length (getBytesWritten) | Uses OutputFile.toInputFile().getLength() workaround | Lance JNI (1-line fix) |
| Column statistics | No file-level pruning | Lance PR lance-format/lance#5639 + JNI |
| Split planning | One task per file | iceberg-lance + Lance JNI |
| Predicate pushdown | No-op (residual filter preserves correctness) | iceberg-lance |
| Name mapping | Column rename not supported | iceberg-lance |
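On the predicate-pushdown row: a no-op pushdown is still correct because the scan may return a superset of matching rows and the engine re-applies the full (residual) predicate above it. A minimal illustration, with invented names (scores standing in for the demo table's score column):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Why a no-op pushdown preserves correctness: pushdown is purely an
// optimization that lets the scan skip non-matching data early. If the scan
// skips nothing, the residual filter applied above it still produces exactly
// the right rows.
public class ResidualFilter {
  // Stand-in for a Lance scan with no predicate pushdown: returns everything.
  static List<Double> scanWithoutPushdown(List<Double> fileRows) {
    return fileRows;
  }

  // The engine applies the residual predicate on top of the scan output.
  static List<Double> applyResidual(List<Double> rows, Predicate<Double> residual) {
    return rows.stream().filter(residual).collect(Collectors.toList());
  }
}
```

The cost of the gap is therefore extra I/O and decode work, never wrong results; the same reasoning applies to the missing column statistics (no file-level pruning, but no correctness impact).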

Details in lance/LANCE_SDK_GAPS.md.

References

cc @pvary @westonpace

Implements Lance as a new Iceberg file format using the FormatModel API
(introduced in PR apache#12774). This is the first new format added through
the pluggable File Format API, validating its design for external formats.

Changes:
- Add LANCE to FileFormat enum
- Add iceberg-lance module with LanceFormatModel, readers, writers,
  schema conversion, and Arrow type mapping
- Add Spark 4.1 integration (ColumnarBatch + InternalRow) in the lance
  module to avoid Arrow relocation conflicts with the runtime shadow jar
- Register Lance format models in FormatModelRegistry with auto-discovery
- Exclude Lance from the Spark runtime shadow jar (JNI native code
  requires unrelocated Arrow classes)
- 19 unit tests covering round-trip, schema preservation, projection,
  null handling, batch sizing, and file length

Runtime classpath for Spark:
  --jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,
         jar-jni.jar,arrow-c-data.jar

Known gaps documented in lance/LANCE_SDK_GAPS.md:
  1. getBytesWritten — Lance JNI discards Rust finish() return value
  2. Column statistics — awaiting Lance PR apache#5639 Java SDK exposure
  3. Split planning — needs byte-offset-to-row mapping
  4. Predicate pushdown — no-op currently, correctness preserved
  5. Name mapping — schema evolution support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>