Core, Data, Lance: Add Lance file format support via File Format API #15751
Draft
tardunge wants to merge 1 commit into apache:main from
Implements Lance as a new Iceberg file format using the FormatModel API (introduced in PR apache#12774). This is the first new format added through the pluggable File Format API, validating its design for external formats.

Changes:
- Add LANCE to FileFormat enum
- Add iceberg-lance module with LanceFormatModel, readers, writers, schema conversion, and Arrow type mapping
- Add Spark 4.1 integration (ColumnarBatch + InternalRow) in the lance module to avoid Arrow relocation conflicts with the runtime shadow jar
- Register Lance format models in FormatModelRegistry with auto-discovery
- Exclude Lance from the Spark runtime shadow jar (JNI native code requires unrelocated Arrow classes)
- 19 unit tests covering round-trip, schema preservation, projection, null handling, batch sizing, and file length

Runtime classpath for Spark:
--jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar

Known gaps documented in lance/LANCE_SDK_GAPS.md:
1. getBytesWritten — Lance JNI discards Rust finish() return value
2. Column statistics — awaiting Lance PR apache#5639 Java SDK exposure
3. Split planning — needs byte-offset-to-row mapping
4. Predicate pushdown — no-op currently, correctness preserved
5. Name mapping — schema evolution support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
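The first listed change, adding LANCE to the `FileFormat` enum, appears later in the description as `LANCE("lance", false)`. As a rough, self-contained sketch (this is not Iceberg's actual `FileFormat` class; names and methods here are simplified stand-ins), such an enum can pair a file extension with a splittable flag:

```java
// Minimal sketch, NOT Iceberg's actual FileFormat class: models how a
// format enum can carry a file extension plus a splittable flag,
// mirroring the described LANCE("lance", false) addition.
enum SketchFileFormat {
    PARQUET("parquet", true),
    LANCE("lance", false); // not splittable: split planning is a known gap

    private final String ext;
    private final boolean splittable;

    SketchFileFormat(String ext, boolean splittable) {
        this.ext = ext;
        this.splittable = splittable;
    }

    // Append the extension unless the filename already ends with it.
    String addExtension(String filename) {
        String suffix = "." + ext;
        return filename.endsWith(suffix) ? filename : filename + suffix;
    }

    boolean isSplittable() {
        return splittable;
    }
}
```

The `false` flag records that Lance files cannot yet be split for parallel reads, consistent with the "Split planning" gap above.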
This is a draft/POC — not requesting merge. Opening for visibility and discussion.
Relates to #12225 (File Format API) and #13438 (Lance integration discussion).
Summary
Working proof-of-concept implementing Lance as an Iceberg file format using the
File Format API (#12774). Lance is a columnar format for AI/ML workloads with
100x faster random access than Parquet, native vector search, and an Arrow-native
data model.
This is the first new format added through the pluggable File Format API.
What works
- `FormatModel`/`ReadBuilder`/`WriteBuilder`
- `CREATE TABLE`, `INSERT INTO`, `SELECT`, column projection, `WHERE` filtering
- Columnar (`ColumnarBatch` via `ArrowColumnVector`) and row-based (`InternalRow`) read paths
- Spark SQL demo
Changes
New module: `lance/`

- `LanceFormatModel`: `FormatModel` impl with `WriteBuilderWrapper`/`ReadBuilderWrapper`
- `LanceFileAppender`: `add(D)` batched into `write(VectorSchemaRoot)`
- `LanceSchemaUtil`, `LanceArrowConverter`: schema conversion and Arrow type mapping
- `GenericLanceReader`/`Writer`: `ReaderFunction`/`WriterFunction` for generic `Record`
- `LanceFormatModels`: registered in `FormatModelRegistry`
- `spark/SparkLanceFormatModels`: `ColumnarBatch` + `InternalRow`
- `spark/SparkLanceColumnarReader`: `ColumnarBatch` via `ArrowColumnVector`
- `spark/SparkLanceRowReader`: `GenericInternalRow` row-by-row conversion
- `spark/SparkLanceWriter`: `InternalRow` to Arrow for Lance writes
- `LANCE_SDK_GAPS.md`

Modified files

- `FileFormat.java`: `LANCE("lance", false)`
- `FormatModelRegistry.java`: `LanceFormatModels` added to `CLASSES_TO_REGISTER`
- `settings.gradle`: `lance` module
- `build.gradle`: `project(':iceberg-lance')` block
- `spark/v4.1/build.gradle`

Architecture decisions
Arrow relocation: The Spark runtime shadow jar relocates `org.apache.arrow`. Lance uses Arrow via JNI (Rust native code with hardcoded Java class names), so Lance code cannot be relocated. All Lance code (including Spark readers/writers) lives in the `lance/` module outside the shadow jar. The `FormatModel` interface boundary has zero Arrow imports, so relocated and unrelocated code never exchange Arrow objects.
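To illustrate that boundary, here is a toy sketch (interface and class names are hypothetical, not Iceberg's actual API): rows cross the boundary only as a caller-chosen type `T`, never as Arrow objects such as `VectorSchemaRoot`, so it never matters whether either side was compiled against relocated Arrow.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a read-builder boundary with no Arrow types.
interface SimpleReadBuilder<T> {
    SimpleReadBuilder<T> project(List<String> columns);
    Iterator<T> build();
}

// Toy implementation serving String "rows" from memory. A real
// Lance-backed builder would decode Arrow data internally, but would
// still expose only Iterator<T> across the boundary.
class InMemoryReadBuilder implements SimpleReadBuilder<String> {
    private final List<String> rows;

    InMemoryReadBuilder(List<String> rows) {
        this.rows = rows;
    }

    @Override
    public SimpleReadBuilder<String> project(List<String> columns) {
        return this; // projection ignored in this toy sketch
    }

    @Override
    public Iterator<String> build() {
        return rows.iterator();
    }
}
```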
Runtime classpath: `--jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar`
Spark registration: `LanceFormatModels.register()` conditionally registers Spark format models via `Class.forName("...InternalRow")` — skipped when Spark is not on the classpath.
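The classpath probe described above can be sketched with plain reflection. This is illustrative only: the probed class name and the registration callback are assumptions, not the actual Iceberg/Lance code.

```java
// Sketch of classpath-conditional registration via reflection.
// Probed class name and registration action are illustrative.
class ConditionalRegistration {
    // True if the named class can be loaded from the current classpath.
    static boolean classPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    static void register(Runnable registerSparkModels) {
        // Register Spark-specific format models only when Spark is present;
        // generic (non-Spark) models would be registered unconditionally.
        if (classPresent("org.apache.spark.sql.catalyst.InternalRow")) {
            registerSparkModels.run();
        }
    }
}
```

This keeps the core module free of hard Spark dependencies while still auto-registering the Spark models when they can load.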
Known gaps
- No `getBytesWritten`: `OutputFile.toInputFile().getLength()` workaround

Details in `lance/LANCE_SDK_GAPS.md`.

References
cc @pvary @westonpace