Commits
26 commits
All 26 commits are by Shekharrajak, Apr 6, 2026:

- `affb425` add V2DeleteHandling enum for iceberg v2 delete file support
- `fee0e4d` add iceberg-data dependency and promote iceberg-parquet to compile scope
- `8cff6ae` implement IcebergRecordConverter for iceberg Record to Druid map conv…
- `e2ff4ca` implement streaming IcebergNativeRecordReader with v2 delete support
- `40955c2` add FileScanResult and extractFileScanTasks to IcebergCatalog for v2 …
- `49d0d1b` wire v2DeleteHandling into IcebergInputSource with native reader routing
- `7a303bb` add unit tests for IcebergRecordConverter and V2DeleteHandling
- `0db7e89` fix compilation: use public Iceberg APIs and correct Druid method sig…
- `25af2a1` fix test compilation and all test failures for v2 delete handling
- `09d1c34` cleanup temporary settings file
- `716014a` remove V2DeleteHandling config and add auto-detect routing in Iceberg…
- `faaed10` add DeleteFileInfo POJO for serializable delete file metadata
- `9346f1b` add IcebergFileTaskInputSource for per-task serializable v2 input source
- `d1537b2` rewrite IcebergNativeRecordReader with manual positional and equality…
- `5c1364b` add CompositeInputSourceReader for multi-task v2 reading in IcebergIn…
- `e21ce61` register IcebergFileTaskInputSource in IcebergDruidModule
- `6464d9f` update all tests for auto-detect v2 architecture without V2DeleteHand…
- `d44dff6` fix missing Iterator import in IcebergInputSource
- `571ff7a` docs: add iceberg v2 delete file support documentation
- `1281e54` fix checkstyle, strict compilation, and double-brace initialization e…
- `38229a5` add v2 table creation overload to IcebergRestCatalogResource
- `cb6af40` add IcebergV2DeleteIngestionTest with equality, positional, mixed, an…
- `a6dd173` fix IT compilation: correct GenericParquetWriter method reference and…
- `15cb70d` fix dependency:analyze by scoping iceberg-data as runtime and adding …
- `6d13e3a` remove stray processing/.factorypath IDE artifact
- `fa6f4b6` remove all stray .factorypath IDE artifacts from tracking
docs/development/extensions-contrib/iceberg.md (56 additions, 0 deletions)

@@ -194,6 +194,62 @@ Example:

When `residualFilterMode` is set to `fail` and a residual filter is detected, the job will fail with an error message indicating which filter expression produced the residual. This helps ensure data quality by preventing unintended rows from being ingested.

## Iceberg v2 delete file support

Iceberg v2 tables support row-level deletes through two types of delete files:

| File type | Content | Purpose |
|-----------|---------|---------|
| Positional delete file | `(file_path, row_position)` pairs | Deletes the row at a specific position in a data file |
| Equality delete file | Column value sets | Deletes any row where the specified column values match |

The Iceberg extension automatically detects v2 delete files during the table scan; no changes to existing ingestion specs are required.

### How it works

When `IcebergInputSource` scans the Iceberg table, it inspects each `FileScanTask` for associated delete files:

- **No delete files (v1 path)**: Data file paths are extracted and delegated to `warehouseSource` for reading. This is the existing behavior and remains unchanged.
- **Delete files detected (v2 path)**: Each task is wrapped in an `IcebergFileTaskInputSource` that carries the data file path, delete file metadata (paths, types, equality field IDs), and the serialized table schema. The `IcebergNativeRecordReader` then applies deletes at read time:
1. Reads positional delete files and builds a set of deleted row positions for the current data file.
2. Reads equality delete files and builds sets of deleted key tuples.
3. Streams the data file and skips any row that is position-deleted or equality-deleted.
4. Converts surviving Iceberg records to Druid `InputRow` objects.
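The filtering in steps 1–3 can be sketched with plain Java collections. This is a minimal illustration, not the extension's actual reader: `V2DeleteSketch` and `applyDeletes` are hypothetical names, rows stand in for Iceberg `Record`s, and a single equality-field list is assumed (real equality delete files each carry their own field IDs).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of read-time v2 delete application; names do not
// match the extension's internals.
public class V2DeleteSketch
{
  public static List<Map<String, Object>> applyDeletes(
      List<Map<String, Object>> dataFileRows,
      Set<Long> deletedPositions,     // step 1: from positional delete files
      Set<List<Object>> deletedKeys,  // step 2: from equality delete files
      List<String> equalityFields
  )
  {
    final List<Map<String, Object>> survivors = new ArrayList<>();
    for (long pos = 0; pos < dataFileRows.size(); pos++) {
      if (deletedPositions.contains(pos)) {
        continue; // step 3: skip position-deleted rows
      }
      final Map<String, Object> row = dataFileRows.get((int) pos);
      final List<Object> key = new ArrayList<>();
      for (String field : equalityFields) {
        key.add(row.get(field));
      }
      if (deletedKeys.contains(key)) {
        continue; // step 3: skip equality-deleted rows
      }
      survivors.add(row); // step 4 would convert survivors to InputRows
    }
    return survivors;
  }
}
```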

### Example

Given an Iceberg v2 table with the following snapshots:

```
Snapshot 1 (append): data-001.parquet -> rows: order_id=1, order_id=2, order_id=3
Snapshot 2 (delete): eq-delete-001.parquet -> "delete where order_id = 2"
```

Druid ingests only `order_id=1` and `order_id=3`. The deleted row (`order_id=2`) is excluded automatically.

The ingestion spec is identical to that of a v1 table; no additional fields are needed:

```json
{
"type": "iceberg",
"tableName": "orders",
"namespace": "analytics",
"icebergCatalog": {
"type": "rest",
"catalogUri": "http://localhost:8181"
},
"warehouseSource": {
"type": "s3"
}
}
```

### Performance considerations

- Positional delete files are read into an in-memory `Set<Long>` per data file. Memory usage is proportional to the number of deleted positions, not the data file size.
- Equality delete files are read into an in-memory `Set` of key tuples. For tables with very large equality delete files, this may increase memory usage on the ingestion worker.
- A v2-format table that has never had any rows deleted (no delete files) automatically goes through the v1 path with no overhead.
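The first bullet can be illustrated with a small sketch of the per-data-file position index. Names here are hypothetical, and delete entries are modeled as plain `(file_path, row_position)` pairs rather than rows read from an actual positional delete file:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: memory scales with the number of delete entries
// targeting this data file, not with the data file's size.
public class PositionalDeleteIndex
{
  public static Set<Long> deletedPositionsFor(
      String dataFilePath,
      List<Map.Entry<String, Long>> positionalDeleteEntries
  )
  {
    final Set<Long> positions = new HashSet<>();
    for (Map.Entry<String, Long> entry : positionalDeleteEntries) {
      // A positional delete file can reference many data files; only
      // entries whose file_path matches this data file apply to it.
      if (entry.getKey().equals(dataFilePath)) {
        positions.add(entry.getValue());
      }
    }
    return positions;
  }
}
```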

## Known limitations

This section lists the known limitations that apply to the Iceberg extension.
IcebergRestCatalogResource.java
@@ -149,6 +149,21 @@ public Table createTable(String namespace, String tableName, Schema schema)
return getClientCatalog().createTable(tableId, schema);
}

/**
* Creates a table with custom properties (e.g., format-version=2 for v2 tables).
*/
public Table createTable(
String namespace,
String tableName,
Schema schema,
org.apache.iceberg.PartitionSpec partitionSpec,
Map<String, String> properties
)
{
final TableIdentifier tableId = TableIdentifier.of(namespace, tableName);
return getClientCatalog().createTable(tableId, schema, partitionSpec, properties);
}

/**
* Drops a table from the REST catalog. Best-effort; ignores errors.
*/
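The new overload might be called as follows to create a v2 table. This is a hedged sketch: `resource`, `schema`, and the namespace/table names are assumptions, and only the property map is exercised standalone; `"format-version"` is the Iceberg table property that selects the v2 format.

```java
import java.util.Map;

// Hypothetical caller of the createTable overload above.
public class CreateV2TableExample
{
  public static Map<String, String> v2TableProperties()
  {
    // "format-version" = "2" enables the v2 table format and thus
    // row-level delete files.
    return Map.of("format-version", "2");
  }

  // Assumed usage (requires an IcebergRestCatalogResource instance, an
  // Iceberg Schema, and PartitionSpec, so it is commented out here):
  //
  //   resource.createTable(
  //       "analytics",
  //       "orders",
  //       schema,
  //       PartitionSpec.unpartitioned(),
  //       CreateV2TableExample.v2TableProperties()
  //   );
}
```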