Add Enhanced Ingestion Mode to genddl Tool #85

Open
ron-daniel1 wants to merge 4 commits into prestodb:main from ron-daniel1:rework-perf-wxd
@ron-daniel1

Adds enhanced ingestion mode to cmd/genddl for generating TPC-DS data
ingestion SQL files.

Key Features:

  • Two-stage ingestion: separate source (CSV/TEXTFILE) and target (Parquet) tables
  • Catalog support: multi-catalog environments with source/target catalogs
  • Format handling: CSV with CAST/NULLIF, TEXTFILE with direct SELECT
  • Engine support: Presto (WITH clause) and Spark (USING iceberg + TBLPROPERTIES)
  • Correct schema syntax: CREATE SCHEMA catalog.schema WITH (location = 's3a://...')

Files Changed:

New Templates:

  • create_source_table.sql.tmpl - Source table DDL (CSV/TEXTFILE)
  • create_target_table.sql.tmpl - Target table DDL (Parquet)

Modified: main.go

Schema struct additions:

  • Mode string - Detects "enhanced_ingestion" vs legacy mode
  • SourceFileFormat string - "CSV" or "TEXTFILE"
  • SourceSchema, TargetSchema string - Separate schema names
  • SourceCatalog, TargetCatalog string - Multi-catalog support
  • S3SourceLocation, S3TargetLocation string - Separate S3 paths
  • Engine string - "presto" or "spark" for engine-specific syntax
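For orientation, the added fields might look like this when decoded from a config file. The field names and types come from the list above; the JSON keys, the `parseSchema` helper, and every sample value are assumptions, and the real struct in main.go presumably carries further fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Schema mirrors only the fields listed as additions in this PR.
// The JSON keys are hypothetical; the real config keys may differ.
type Schema struct {
	Mode             string `json:"mode"`
	SourceFileFormat string `json:"source_file_format"`
	SourceSchema     string `json:"source_schema"`
	TargetSchema     string `json:"target_schema"`
	SourceCatalog    string `json:"source_catalog"`
	TargetCatalog    string `json:"target_catalog"`
	S3SourceLocation string `json:"s3_source_location"`
	S3TargetLocation string `json:"s3_target_location"`
	Engine           string `json:"engine"`
}

// parseSchema decodes a JSON config document into a Schema.
func parseSchema(data []byte) (Schema, error) {
	var s Schema
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	// Placeholder config; names and locations are invented for the example.
	cfg := []byte(`{
		"mode": "enhanced_ingestion",
		"source_file_format": "CSV",
		"source_schema": "tpcds_source",
		"target_schema": "tpcds_target",
		"source_catalog": "hive",
		"target_catalog": "iceberg",
		"s3_source_location": "s3a://example-bucket/source",
		"s3_target_location": "s3a://example-bucket/target",
		"engine": "presto"
	}`)
	s, err := parseSchema(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s.%s -> %s.%s (%s, %s)\n", s.Mode,
		s.SourceCatalog, s.SourceSchema, s.TargetCatalog, s.TargetSchema,
		s.SourceFileFormat, s.Engine)
}
```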

New functions:

  • isEnhancedIngestionMode() - Mode detection helper
  • generateSourceTable() - Generates 1a-create-source-*.sql
  • generateTargetTable() - Generates 1b-create-target-*.sql

Modified functions:

  • loadSchemas() - In enhanced mode, generates only the specified catalog type
    (iceberg=true → Iceberg only) instead of all 4 variants
  • generateSchemaFromDef() - Routes to enhanced or legacy generation logic
    based on mode
  • Run() - Orchestrates enhanced vs legacy workflow
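The routing described above can be pictured with a small sketch. It assumes `isEnhancedIngestionMode` is a method on `Schema` that compares `Mode` against "enhanced_ingestion"; `planSteps` is a hypothetical stand-in for the dispatch inside generateSchemaFromDef() and only names the functions it would call. Other steps, such as insert generation, are left out.

```go
package main

import "fmt"

// Schema is trimmed to the single field needed for mode detection.
type Schema struct {
	Mode string
}

// isEnhancedIngestionMode reports whether the config selects the enhanced
// workflow. The exact comparison is an assumption based on the description.
func (s Schema) isEnhancedIngestionMode() bool {
	return s.Mode == "enhanced_ingestion"
}

// planSteps is a hypothetical helper that lists which generators would run.
// It returns names instead of calling the real functions.
func planSteps(s Schema) []string {
	if s.isEnhancedIngestionMode() {
		return []string{"generateSourceTable", "generateTargetTable"}
	}
	return []string{"legacy generation"}
}

func main() {
	for _, s := range []Schema{{Mode: "enhanced_ingestion"}, {}} {
		fmt.Printf("mode=%q steps=%v\n", s.Mode, planSteps(s))
	}
}
```

Treating an empty `Mode` as legacy is what would keep existing config files working without changes.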

Modified: insert_table.sql.tmpl

  • Added conditional CAST/NULLIF for CSV: CAST(NULLIF(column, '') AS type)
  • Added direct SELECT * for TEXTFILE format
  • Added catalog-qualified table references: source_catalog.source_schema.table
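A minimal `text/template` sketch of the conditional insert is shown below. It is not the contents of insert_table.sql.tmpl: the `insertData` and `column` types, the `renderInsert` helper, the column alias, and the catalog-qualified target reference are assumptions, and the table and column names are sample values.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// column and insertData are hypothetical types for this sketch.
type column struct {
	Name string
	Type string
}

type insertData struct {
	SourceCatalog, SourceSchema string
	TargetCatalog, TargetSchema string
	Table, SourceFileFormat     string
	Columns                     []column
}

// insertTmpl wraps each column in CAST(NULLIF(col, '') AS type) for CSV and
// falls back to SELECT * for any other format, such as TEXTFILE.
const insertTmpl = `INSERT INTO {{.TargetCatalog}}.{{.TargetSchema}}.{{.Table}}
SELECT {{if eq .SourceFileFormat "CSV"}}{{range $i, $c := .Columns}}{{if $i}}, {{end}}CAST(NULLIF({{$c.Name}}, '') AS {{$c.Type}}) AS {{$c.Name}}{{end}}{{else}}*{{end}}
FROM {{.SourceCatalog}}.{{.SourceSchema}}.{{.Table}};
`

var insertTemplate = template.Must(template.New("insert").Parse(insertTmpl))

// renderInsert executes the template and returns the generated SQL.
func renderInsert(d insertData) (string, error) {
	var sb strings.Builder
	if err := insertTemplate.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	d := insertData{
		SourceCatalog: "hive", SourceSchema: "tpcds_source",
		TargetCatalog: "iceberg", TargetSchema: "tpcds_target",
		Table:            "store_sales",
		SourceFileFormat: "CSV",
		Columns: []column{
			{Name: "ss_item_sk", Type: "BIGINT"},
			{Name: "ss_quantity", Type: "INTEGER"},
		},
	}
	for _, format := range []string{"CSV", "TEXTFILE"} {
		d.SourceFileFormat = format
		sql, err := renderInsert(d)
		if err != nil {
			panic(err)
		}
		fmt.Print(sql)
	}
}
```

The NULLIF step turns empty CSV strings into NULL before the cast, so an empty field does not fail a numeric conversion.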

Testing:
✅ All 10 tests pass
✅ Backward compatible - legacy mode unchanged
✅ Generated examples match golden files

Usage:
Enhanced: go run main.go genddl config_enhanced_ingestion.json
Legacy: go run main.go genddl config.json (unchanged)

If the source file format is TEXTFILE, the insert can be completed without CAST.

The schema creation statement for the target table had a syntax error, which is now fixed.

The SQL files are generated only for the specified schema and catalog, not for every table combination.
ron-daniel1 requested a review from ethanyzhang as a code owner on April 1, 2026 15:29