
Core, Data, Lance: Add Lance file format support via File Format API #15751

Draft

tardunge wants to merge 1 commit into apache:main from tardunge:feat/lance-format-poc


@tardunge

This is a draft/POC — not requesting merge. Opening for visibility and discussion.

Relates to #12225 (File Format API) and #13438 (Lance integration discussion).

Summary

Working proof-of-concept implementing Lance as an Iceberg file format using the
File Format API (#12774). Lance is a columnar format for AI/ML workloads with
100x faster random access than Parquet, native vector search, and an Arrow-native
data model.

This is the first new format added through the pluggable File Format API.

What works

  • Full read/write round-trip via FormatModel / ReadBuilder / WriteBuilder
  • Spark 4.1 SQL: CREATE TABLE, INSERT INTO, SELECT, column projection, WHERE filtering
  • Both vectorized (ColumnarBatch via ArrowColumnVector) and row-based (InternalRow) read paths
  • 19 unit tests (round-trip, schema preservation, nulls, projection, batch sizing, file length)
  • Iceberg schema stored in Lance file metadata via Arrow field ID preservation
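The last bullet, preserving Iceberg field IDs through Arrow field metadata, can be sketched as follows. This is a self-contained illustration of the round-trip, not the actual LanceSchemaUtil code; the metadata key name is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of how Iceberg field IDs can survive a round-trip through
// Arrow-style field metadata (a string->string map attached to each field).
// The key name below is hypothetical; the real LanceSchemaUtil may differ.
public class FieldIdRoundTrip {
  static final String FIELD_ID_KEY = "ICEBERG:field_id"; // assumed key name

  // Iceberg -> Arrow direction: stash the numeric field ID in field metadata.
  static Map<String, String> toArrowMetadata(int icebergFieldId) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put(FIELD_ID_KEY, Integer.toString(icebergFieldId));
    return Collections.unmodifiableMap(metadata);
  }

  // Arrow -> Iceberg direction: recover the ID, or -1 when absent.
  static int fromArrowMetadata(Map<String, String> metadata) {
    String id = metadata.get(FIELD_ID_KEY);
    return id == null ? -1 : Integer.parseInt(id);
  }

  public static void main(String[] args) {
    Map<String, String> meta = toArrowMetadata(3);
    System.out.println(fromArrowMetadata(meta)); // prints 3
  }
}
```

Because the IDs ride along inside the serialized Arrow schema, any format that stores the Arrow schema in its file metadata gets Iceberg schema resolution back for free on read.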

Spark SQL demo

CREATE TABLE t (id INT, name STRING, score DOUBLE)
USING iceberg TBLPROPERTIES ('write.format.default' = 'lance');

INSERT INTO t VALUES (1, 'alice', 95.5), (2, 'bob', 87.3), (3, 'charlie', 92.1);
SELECT * FROM t WHERE score > 90;
-- 1  alice    95.5
-- 3  charlie  92.1

Changes

New module: lance/

| File | Purpose |
| --- | --- |
| LanceFormatModel | FormatModel impl with WriteBuilderWrapper / ReadBuilderWrapper |
| LanceFileAppender | Bridges record-at-a-time add(D) to batch write(VectorSchemaRoot) |
| LanceSchemaUtil | Bidirectional Iceberg ↔ Arrow schema conversion with field ID preservation |
| LanceArrowConverter | Row-level Record ↔ Arrow vector type conversion |
| GenericLanceReader/Writer | ReaderFunction / WriterFunction for generic Record |
| LanceFormatModels | Registration entry point (auto-discovered by FormatModelRegistry) |
| spark/SparkLanceFormatModels | Registers Lance for ColumnarBatch + InternalRow |
| spark/SparkLanceColumnarReader | Zero-copy Arrow → ColumnarBatch via ArrowColumnVector |
| spark/SparkLanceRowReader | Arrow → GenericInternalRow row-by-row conversion |
| spark/SparkLanceWriter | InternalRow → Arrow conversion for Lance writes |
| LANCE_SDK_GAPS.md | Documents 5 gaps in the Lance Java SDK |
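The LanceFileAppender bridge above boils down to buffering record-at-a-time add() calls and flushing whole batches to a batch-oriented sink. A minimal sketch of that pattern (class and method names are illustrative, not the real appender):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the add(D) -> batch-write bridge used by LanceFileAppender:
// rows are buffered and handed to a batch sink (standing in for
// write(VectorSchemaRoot)) whenever the buffer fills or close() is called.
public class BatchingAppender<D> {
  private final int batchSize;
  private final Consumer<List<D>> batchSink; // stand-in for the Arrow batch write
  private final List<D> buffer = new ArrayList<>();

  BatchingAppender(int batchSize, Consumer<List<D>> batchSink) {
    this.batchSize = batchSize;
    this.batchSink = batchSink;
  }

  void add(D row) {
    buffer.add(row);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  void close() {
    flush(); // write any trailing partial batch
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      batchSink.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  public static void main(String[] args) {
    BatchingAppender<String> app =
        new BatchingAppender<>(2, batch -> System.out.println("wrote batch of " + batch.size()));
    app.add("a");
    app.add("b"); // buffer full: flushes a batch of 2
    app.add("c");
    app.close(); // flushes the trailing batch of 1
  }
}
```

The important detail is the final flush in close(): without it, a trailing partial batch would be silently dropped.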

Modified files

| File | Change |
| --- | --- |
| FileFormat.java | Add LANCE("lance", false) |
| FormatModelRegistry.java | Add LanceFormatModels to CLASSES_TO_REGISTER |
| settings.gradle | Register lance module |
| build.gradle | Add project(':iceberg-lance') block |
| spark/v4.1/build.gradle | Exclude Lance from shadow jar |
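For context on the FileFormat.java row, LANCE("lance", false) is an enum constant carrying a file suffix and a splittable flag. A simplified, self-contained version of that enum shape (the real Iceberg enum has more members and methods):

```java
// Simplified sketch of the FileFormat enum pattern: each constant carries a
// file extension and a "splittable" flag. LANCE("lance", false) mirrors the
// change described above; the other details here are illustrative only.
public enum FileFormat {
  PARQUET("parquet", true),
  AVRO("avro", true),
  LANCE("lance", false); // new: Lance files are not splittable yet

  private final String ext;
  private final boolean splittable;

  FileFormat(String ext, boolean splittable) {
    this.ext = ext;
    this.splittable = splittable;
  }

  public boolean isSplittable() {
    return splittable;
  }

  // e.g. LANCE.addExtension("data/f1") -> "data/f1.lance"
  public String addExtension(String filename) {
    return filename.endsWith("." + ext) ? filename : filename + "." + ext;
  }
}
```

The false flag is what feeds the "one task per file" split-planning gap noted later: a non-splittable format can only be scanned one whole file per task.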

Architecture decisions

Arrow relocation: The Spark runtime shadow jar relocates org.apache.arrow.
Lance uses Arrow via JNI (Rust native code with hardcoded Java class names), so
Lance code cannot be relocated. All Lance code (including Spark readers/writers)
lives in the lance/ module outside the shadow jar. The FormatModel interface
boundary has zero Arrow imports, so relocated and unrelocated code never exchange
Arrow objects.

Runtime classpath:

--jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar

Spark registration: LanceFormatModels.register() conditionally registers
Spark format models via Class.forName("...InternalRow") — skipped when Spark
is not on the classpath.
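The conditional registration described above can be sketched like this; the probe class name in main is illustrative of the pattern, not necessarily the one the actual code checks:

```java
// Sketch of classpath-conditional registration via Class.forName, as in
// LanceFormatModels.register(): Spark-specific format models are registered
// only when a Spark class can be loaded, so the generic module works without
// Spark on the classpath.
public class ConditionalRegistration {
  static boolean isOnClasspath(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Always register the generic models; add Spark models only if present.
    System.out.println("generic models registered");
    if (isOnClasspath("org.apache.spark.sql.catalyst.InternalRow")) {
      System.out.println("spark models registered");
    } else {
      System.out.println("spark not on classpath; skipping");
    }
  }
}
```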

Known gaps

| Gap | Impact | Owner |
| --- | --- | --- |
| File length (getBytesWritten) | Uses OutputFile.toInputFile().getLength() workaround | Lance JNI (1-line fix) |
| Column statistics | No file-level pruning | Lance PR lance-format/lance#5639 + JNI |
| Split planning | One task per file | iceberg-lance + Lance JNI |
| Predicate pushdown | No-op (residual filter preserves correctness) | iceberg-lance |
| Name mapping | Column rename not supported | iceberg-lance |
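On the predicate-pushdown row: a no-op pushdown is still correct because the scan may return a superset of matching rows and the engine re-applies the full (residual) predicate above it. A minimal illustration, with invented names (scores standing in for the demo table's score column):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Why a no-op pushdown preserves correctness: pushdown is purely an
// optimization that lets the scan skip non-matching data early. If the scan
// skips nothing, the residual filter applied above it still produces exactly
// the right rows.
public class ResidualFilter {
  // Stand-in for a Lance scan with no predicate pushdown: returns everything.
  static List<Double> scanWithoutPushdown(List<Double> fileRows) {
    return fileRows;
  }

  // The engine applies the residual predicate on top of the scan output.
  static List<Double> applyResidual(List<Double> rows, Predicate<Double> residual) {
    return rows.stream().filter(residual).collect(Collectors.toList());
  }
}
```

The cost of the gap is therefore extra I/O and decode work, never wrong results; the same reasoning applies to the missing column statistics (no file-level pruning, but no correctness impact).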

Details in lance/LANCE_SDK_GAPS.md.

References

cc @pvary @westonpace

Implements Lance as a new Iceberg file format using the FormatModel API
(introduced in PR apache#12774). This is the first new format added through
the pluggable File Format API, validating its design for external formats.

Changes:
- Add LANCE to FileFormat enum
- Add iceberg-lance module with LanceFormatModel, readers, writers,
  schema conversion, and Arrow type mapping
- Add Spark 4.1 integration (ColumnarBatch + InternalRow) in the lance
  module to avoid Arrow relocation conflicts with the runtime shadow jar
- Register Lance format models in FormatModelRegistry with auto-discovery
- Exclude Lance from the Spark runtime shadow jar (JNI native code
  requires unrelocated Arrow classes)
- 19 unit tests covering round-trip, schema preservation, projection,
  null handling, batch sizing, and file length

Runtime classpath for Spark:
  --jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,
         jar-jni.jar,arrow-c-data.jar

Known gaps documented in lance/LANCE_SDK_GAPS.md:
  1. getBytesWritten — Lance JNI discards Rust finish() return value
  2. Column statistics — awaiting Lance PR apache#5639 Java SDK exposure
  3. Split planning — needs byte-offset-to-row mapping
  4. Predicate pushdown — no-op currently, correctness preserved
  5. Name mapping — schema evolution support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>