Implements the streaming part of transferia/transferia#199
Utilizes apache/iceberg-go#339
Iceberg Streaming Sink
Key Components
Data Flow
The workflow follows these steps:
1. A worker receives streaming data and converts it to Arrow format.
2. The worker writes the batch to S3 as a Parquet file.
3. The worker registers the file with the coordinator under a per-table, per-worker key.
4. On each commit interval, one worker fetches all registered files from the coordinator, groups them by table, and commits them to the Iceberg catalog in a single transaction per table.
5. Committed files are cleared from the coordinator and processing continues.
Streaming Process Diagram
Description of the Streaming Upload Process:
The streaming data upload process to an Iceberg table works as follows:
sequenceDiagram
    participant Coordinator
    participant Worker1
    participant Worker2
    participant S3
    participant IcebergCatalog
    note over Worker1,Worker2: Streaming data processing
    par Worker 1 processing
        Worker1->>Worker1: Receive streaming data
        Worker1->>Worker1: Convert to Arrow format
        Worker1->>S3: Write Parquet file 1-1
        Worker1->>Coordinator: Register file (key=streaming_files_tableA_1)
    and Worker 2 processing
        Worker2->>Worker2: Receive streaming data
        Worker2->>Worker2: Convert to Arrow format
        Worker2->>S3: Write Parquet file 2-1
        Worker2->>Coordinator: Register file (key=streaming_files_tableA_2)
    end
    note over Worker1: On schedule (commit interval)
    Worker1->>Coordinator: Fetch all files
    Worker1->>Worker1: Group files by table
    Worker1->>IcebergCatalog: Create transaction
    Worker1->>IcebergCatalog: Add tableA files to transaction
    Worker1->>IcebergCatalog: Commit transaction
    Worker1->>Coordinator: Clear committed files
    note over Worker1,Worker2: Continue processing
    par Worker 1 continuation
        Worker1->>Worker1: Receive new streaming data
        Worker1->>Worker1: Convert to Arrow format
        Worker1->>S3: Write Parquet file 1-2
        Worker1->>Coordinator: Register file (key=streaming_files_tableA_1)
    and Worker 2 continuation
        Worker2->>Worker2: Receive new streaming data
        Worker2->>Worker2: Convert to Arrow format
        Worker2->>S3: Write Parquet file 2-2
        Worker2->>Coordinator: Register file (key=streaming_files_tableA_2)
    end

Implementation Details
Worker Initialization
Each worker is initialized independently: when it starts, it creates a connection to the Iceberg catalog system (either REST-based or Glue-based) and prepares to handle incoming data.
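A minimal sketch of that initialization step, assuming the worker only needs a handle to whichever catalog backend is configured. The `Catalog` interface, `CatalogConfig` fields, and constructors below are illustrative stand-ins, not the actual transferia or iceberg-go API:

```go
package main

import (
	"context"
	"fmt"
)

// Catalog abstracts what the streaming sink needs from a catalog backend.
type Catalog interface {
	Name() string
}

type restCatalog struct{ uri string }

func (c *restCatalog) Name() string { return "rest:" + c.uri }

type glueCatalog struct{ region string }

func (c *glueCatalog) Name() string { return "glue:" + c.region }

// CatalogConfig mirrors the configuration a worker receives on startup
// (field names are hypothetical).
type CatalogConfig struct {
	Type    string // "rest" or "glue"
	RestURI string
	Region  string
}

// newCatalog is called once per worker, before any streaming data is handled.
func newCatalog(_ context.Context, cfg CatalogConfig) (Catalog, error) {
	switch cfg.Type {
	case "rest":
		return &restCatalog{uri: cfg.RestURI}, nil
	case "glue":
		return &glueCatalog{region: cfg.Region}, nil
	default:
		return nil, fmt.Errorf("unsupported catalog type %q", cfg.Type)
	}
}

func main() {
	cat, err := newCatalog(context.Background(), CatalogConfig{Type: "rest", RestURI: "http://localhost:8181"})
	if err != nil {
		panic(err)
	}
	fmt.Println("worker connected to", cat.Name())
}
```

Keeping the backend choice behind a single constructor keeps the rest of the sink independent of whether the catalog is REST- or Glue-based.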
Parquet File Creation
For each batch of data, the worker converts the rows to Arrow format, writes them to S3 as a Parquet file, and records the new file both locally and with the coordinator.
The file naming system ensures uniqueness by incorporating worker-specific and batch-specific components into each file name. This prevents filename collisions even when multiple workers process data simultaneously.
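For illustration, a naming helper along these lines would keep files from different workers and batches distinct; the exact components used here (table ID, worker number, timestamp, random suffix) are assumptions, not a quote of the sink's actual format:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// dataFileName builds a unique object key for one flushed batch of a table.
// Table ID and worker number separate concurrent workers; the timestamp and
// random suffix separate successive batches from the same worker.
func dataFileName(tableID string, workerNum int) string {
	return fmt.Sprintf("data/%s/worker-%d-%d-%08x.parquet",
		tableID, workerNum, time.Now().UnixNano(), rand.Uint32())
}

func main() {
	fmt.Println(dataFileName("tableA", 1))
	fmt.Println(dataFileName("tableA", 2))
}
```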
File Tracking
Each worker maintains an in-memory map where the key is the table identifier and the value is a list of files created for that table. A mutex is used to ensure thread safety when adding to this list. Information about files is also passed to the coordinator with a key format of "streaming_files_{tableID}_{workerNum}".
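A compact sketch of that tracking structure, with the coordinator key built in the format quoted above; the helper names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// fileTracker records, per table, the Parquet files this worker has written
// but that have not yet been committed to the Iceberg table.
type fileTracker struct {
	mu    sync.Mutex
	files map[string][]string // table identifier -> data file paths
}

func newFileTracker() *fileTracker {
	return &fileTracker{files: make(map[string][]string)}
}

// add is safe to call from concurrent upload goroutines.
func (t *fileTracker) add(tableID, path string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.files[tableID] = append(t.files[tableID], path)
}

// coordinatorKey is the key the worker registers its files under, matching the
// "streaming_files_{tableID}_{workerNum}" format described above.
func coordinatorKey(tableID string, workerNum int) string {
	return fmt.Sprintf("streaming_files_%s_%d", tableID, workerNum)
}

func main() {
	t := newFileTracker()
	t.add("tableA", "data/tableA/worker-1-000001.parquet")
	fmt.Println(coordinatorKey("tableA", 1), t.files["tableA"])
}
```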
Periodic Commits
Unlike snapshot mode, where a commit happens only after the entire load is complete, in streaming mode commits happen periodically: on each commit interval a worker fetches the registered files from the coordinator, groups them by table, adds each table's files to an Iceberg transaction, commits it, and then clears the committed entries from the coordinator.
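A sketch of such a commit loop, assuming hypothetical `coordinator` and `icebergTables` interfaces in place of the real coordinator client and the iceberg-go transaction API:

```go
package sink

import (
	"context"
	"fmt"
	"time"
)

// coordinator and icebergTables stand in for the real coordinator client and
// the Iceberg table/transaction API used by the sink.
type coordinator interface {
	FetchFiles(ctx context.Context) (map[string][]string, error) // table -> data files
	ClearFiles(ctx context.Context, table string) error
}

type icebergTables interface {
	// AppendFiles adds the given data files to the table in one transaction.
	AppendFiles(ctx context.Context, table string, files []string) error
}

// runCommitLoop commits accumulated data files on a fixed interval.
func runCommitLoop(ctx context.Context, interval time.Duration, coord coordinator, tables icebergTables) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			byTable, err := coord.FetchFiles(ctx)
			if err != nil {
				return err
			}
			for table, files := range byTable {
				if len(files) == 0 {
					continue
				}
				// Commit all of this table's files in a single transaction,
				// then forget them so they are not committed twice.
				if err := tables.AppendFiles(ctx, table, files); err != nil {
					return fmt.Errorf("commit %s: %w", table, err)
				}
				if err := coord.ClearFiles(ctx, table); err != nil {
					return err
				}
			}
		}
	}
}
```

Because the loop reads only the coordinator's registry, any worker can perform the commit on behalf of all of them, which matches the diagram above where Worker1 commits files written by both workers.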
Table Management