Spark: fix NPE thrown for MAP/LIST columns on DELETE, UPDATE, and MERGE operations #15726
Merged
singhpk234 merged 3 commits into apache:main on Mar 25, 2026
…UsedFieldIds

BaseSparkScanBuilder.allUsedFieldIds() used TypeUtil.getProjectedIds(), which omits MAP and LIST field IDs (it is designed for column projection, not collision avoidance). This caused _partition struct child IDs to be reassigned to the same IDs as MAP/LIST columns, triggering an NPE in PruneColumns.isStruct() during merge-on-read scans when the _partition metadata column is included in the projection.

Fix: use TypeUtil.indexById(), which indexes ALL field IDs recursively, matching the behavior of the pre-1.11 Spark 3.5 code that this replaced.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
szehon-ho
reviewed
Mar 23, 2026
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java
Member
Oh, also can you add a test for List as well?
singhpk234
approved these changes
Mar 24, 2026
Contributor
singhpk234
left a comment
LGTM too
+1 to @szehon-ho's suggestion
Contributor
Author
@szehon-ho added list column unit test here: afb97ac
szehon-ho
reviewed
Mar 24, 2026
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java
szehon-ho
approved these changes
Mar 24, 2026
Contributor
Thanks for the change @antonlin1! Thank you for the review @huaxingao @szehon-ho!
Partitioned tables with MAP or LIST columns throw an NPE on DELETE, UPDATE, and MERGE operations when running Spark 4.1 + Iceberg 1.11 (reproduced with iceberg-spark-runtime-4.1_2.13:1.11.0-20260306.003105-70):

allUsedFieldIds() (introduced in #15297) uses TypeUtil.getProjectedIds() to build the set of field IDs already in use. This omits MAP and LIST container IDs, creating gaps in the ID space. A _partition child field gets reassigned into one of these gaps, landing on the same ID as a MAP or LIST column.

During the Parquet scan, PruneColumns finds this ID in selectedIds (via _partition) and looks it up in the expected Iceberg schema, but the ID belongs to a nested _partition child, not a top-level data column, so expected.field(id) returns null. isStruct() then NPEs on that null.

This regression did not exist in 1.10.x, where the equivalent code used TypeUtil.indexById(), which correctly indexes all field IDs including MAP/LIST containers.

Fix
Replace TypeUtil.getProjectedIds() with TypeUtil.indexById().keySet() in allUsedFieldIds() so that MAP and LIST columns are taken into account.
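
To make the ID gap described above concrete, here is a minimal, self-contained sketch. It is a toy model, not Iceberg code: the Field and Kind types, the sample schema, and the "lowest free ID" allocator are all hypothetical stand-ins. It only assumes what the description states, namely that a projection-style collector skips MAP and LIST container IDs while an index-style collector records every ID.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: Field, Kind, and every method here are hypothetical, not Iceberg APIs.
class FieldIdCollisionSketch {
  enum Kind { PRIMITIVE, LIST, MAP }

  static final class Field {
    final int id;
    final String name;
    final Kind kind;
    final List<Field> children;

    Field(int id, String name, Kind kind, Field... children) {
      this.id = id;
      this.name = name;
      this.kind = kind;
      this.children = Arrays.asList(children);
    }
  }

  // Projection-style collection (assumed behavior): skips MAP and LIST container IDs
  // but still records the IDs nested beneath them.
  static Set<Integer> projectionStyleIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field field : fields) {
      if (field.kind == Kind.PRIMITIVE) {
        ids.add(field.id);
      }
      ids.addAll(projectionStyleIds(field.children));
    }
    return ids;
  }

  // Index-style collection: records every field ID, containers included.
  static Set<Integer> allIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field field : fields) {
      ids.add(field.id);
      ids.addAll(allIds(field.children));
    }
    return ids;
  }

  // Hypothetical allocator: hands out the lowest ID that is not in the "used" set.
  static int nextFreeId(Set<Integer> used) {
    int candidate = 1;
    while (used.contains(candidate)) {
      candidate++;
    }
    return candidate;
  }

  // id: 1 (primitive), tags: 2 (LIST, element 3), props: 4 (MAP, key 5, value 6)
  static List<Field> sampleSchema() {
    return Arrays.asList(
        new Field(1, "id", Kind.PRIMITIVE),
        new Field(2, "tags", Kind.LIST, new Field(3, "element", Kind.PRIMITIVE)),
        new Field(4, "props", Kind.MAP,
            new Field(5, "key", Kind.PRIMITIVE),
            new Field(6, "value", Kind.PRIMITIVE)));
  }

  public static void main(String[] args) {
    List<Field> schema = sampleSchema();
    Set<Integer> everyId = allIds(schema);
    int withGaps = nextFreeId(projectionStyleIds(schema));
    int withoutGaps = nextFreeId(everyId);
    System.out.println("projection-style: next ID " + withGaps
        + ", already taken: " + everyId.contains(withGaps));
    System.out.println("index-style: next ID " + withoutGaps
        + ", already taken: " + everyId.contains(withoutGaps));
  }
}
```

In this toy schema the projection-style set leaves the container IDs 2 and 4 unclaimed, so the allocator reuses an ID that a LIST column already owns. With the index-style set, the next free ID falls past every existing field. The allocation strategy in Iceberg itself may differ; the sketch only shows why an incomplete "used" set permits a collision.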