HIVE-27370: support 4 bytes characters#6340

Open
ryukobayashi wants to merge 16 commits into apache:master from ryukobayashi:HIVE-27370

ryukobayashi (Contributor) commented Feb 27, 2026

What changes were proposed in this pull request?

When an argument to the SUBSTR UDF contains 4-byte characters, the vectorized and non-vectorized code paths behave differently. The vectorized implementation handles 4-byte characters correctly, but the non-vectorized implementation does not, so it needs equivalent logic.
This fix reuses the logic of the vectorized implementation:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/StringSubstrColStartLen.java#L89-L130
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/StringSubstrColStart.java#L78-L109
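
As a rough illustration of the kind of mismatch described above, here is a minimal sketch, not the code from this PR. It assumes the discrepancy comes from counting positions in UTF-16 code units instead of UTF-8 characters: a 4-byte UTF-8 character occupies two code units (a surrogate pair) in a Java `String`. The class `Utf8SubstrSketch` and both methods are hypothetical names, and for simplicity they take a 0-based start and ignore the negative-start handling that SUBSTR supports.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Hive and not the code in this PR.
class Utf8SubstrSketch {

    // A byte starts a UTF-8 character unless it is a continuation byte (10xxxxxx).
    static boolean isLeadByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // Returns up to `len` characters starting at 0-based character index `start`,
    // counting whole UTF-8 characters by walking the encoded bytes.
    static String substrUtf8(String input, int start, int len) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        int charIndex = -1;    // index of the character currently being scanned
        int from = -1;         // byte offset where the result starts
        int to = utf8.length;  // byte offset where the result ends (exclusive)
        for (int i = 0; i < utf8.length; i++) {
            if (isLeadByte(utf8[i])) {
                charIndex++;
                if (charIndex == start) {
                    from = i;
                }
                if (charIndex == start + len) {
                    to = i;
                    break;
                }
            }
        }
        if (from < 0 || len <= 0) {
            return "";
        }
        return new String(utf8, from, to - from, StandardCharsets.UTF_8);
    }

    // Same contract, but counting UTF-16 code units, which splits surrogate pairs.
    static String substrUtf16(String input, int start, int len) {
        if (start < 0 || len <= 0 || start >= input.length()) {
            return "";
        }
        return input.substring(start, Math.min(input.length(), start + len));
    }
}
```

For the input `"a\uD83D\uDE00b"` (the letter a, one 4-byte emoji, the letter b), `substrUtf8(s, 1, 1)` returns the whole emoji, while `substrUtf16(s, 1, 1)` returns only the high surrogate `\uD83D`, which is not a valid character on its own.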

Previous PR: #5624

Why are the changes needed?

The vectorized and non-vectorized execution paths currently return different results for SUBSTR when the input contains 4-byte characters.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added test cases to itest that cover these patterns and verify that they work correctly.

ryukobayashi (Contributor, Author) commented

@okumin, I didn't have time to work on the previous PR (#5624), so it was automatically closed, and I have created this one to replace it. The previous PR required extensive testing; we have since verified the change through our internal simulations and found no issues.
