[BUG][Dataloader] preserve column casing in DataFusion SQL dialect to fix camelCase column lookups #536

Merged
ShreyeshArangath merged 4 commits into linkedin:main from ShreyeshArangath:bug/lowercase on Apr 8, 2026

Conversation

ShreyeshArangath (Collaborator) commented Apr 8, 2026

Summary

The DataFusion dialect's NORMALIZATION_STRATEGY was set to LOWERCASE, causing sqlglot to lowercase all identifiers during SQL optimization. This broke tables with camelCase columns (e.g. viewerId, feedPosition) because both DataFusion execution and PyIceberg scans are case-sensitive.

Change the strategy to CASE_SENSITIVE, which matches DataFusion's actual behavior and preserves original identifier casing throughout the pipeline.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

…e column lookups

The DataFusion dialect's NORMALIZATION_STRATEGY was set to LOWERCASE,
causing sqlglot to lowercase all identifiers during SQL optimization.
This broke tables with camelCase columns (e.g. viewerId, feedPosition)
because both DataFusion execution and PyIceberg scans are case-sensitive.

Change the strategy to CASE_SENSITIVE, which matches DataFusion's actual
behavior and preserves original identifier casing throughout the pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ShreyeshArangath marked this pull request as ready for review on April 8, 2026, 20:23
ShreyeshArangath and others added 3 commits April 8, 2026 20:25
… tests

Makes the test data truly ambiguous — all three columns lowercase to
"userid", so a lowercasing dialect would collapse them into one column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
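
The collision described in this commit can be shown without any SQL tooling. The sketch below uses only the column names quoted in the commit messages; `distinct_after` is a hypothetical helper that applies a normalization function and counts what survives.

```python
# Column names taken from the commit messages in this PR.
columns = ["userId", "USERID", "UserID"]


def distinct_after(normalize, names):
    """Apply a normalization function and return the distinct results."""
    return {normalize(name) for name in names}


# A lowercasing strategy maps all three names to the same identifier.
lowercased = distinct_after(str.lower, columns)

# A case-sensitive strategy leaves every name as written.
preserved = distinct_after(lambda name: name, columns)

print(len(lowercased), len(preserved))
```

This is why test data with case-colliding names can detect a lowercasing dialect: three distinct columns become indistinguishable once their casing is dropped.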
…-case tests

Renames generic userId/USERID/UserID to purchaseAmount/PURCHASEAMOUNT/
PurchaseAmount for better readability while preserving the case-collision
property that makes the tests meaningful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replaces colliding casing variants (purchaseAmount/PURCHASEAMOUNT/
PurchaseAmount) with distinct descriptive columns (purchaseAmount,
itemCount, discountRate) that better represent a real schema.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ShreyeshArangath merged commit f9fccaa into linkedin:main on Apr 8, 2026
2 checks passed