Add NeMo Conformer RNNT support for character-level tokenizers #9
hlevring wants to merge 2 commits into ysdede:v4-nemo-conformer-tdt-main from
Conversation
Some RNNT models (e.g. the Danish parakeet-rnnt-110m) use character-level SentencePiece tokenizers that lack word-start markers and require int32 encoder length inputs. This commit adds the necessary support.

modeling_nemo_conformer_tdt.js:
- Add configurable `encoder_length_dtype` (int32/int64) to the transducer config, defaulting to int64 for backward compatibility
- Fix the JSDoc type of the `confidenceFromLogits` logits parameter

transducer_text.js:
- Add fallback word-boundary detection for character-level tokenizers that emit tokens without word-start markers, enabling correct word-level timestamps and confidences

package.json:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228 to align with `onnxruntime-web` 1.25.0-dev and resolve `onnxruntime-common` peer dependency conflicts
Reviewer's Guide

Adds configurable encoder length dtype support and improves word-boundary handling for character-level RNNT tokenizers in NeMo Conformer TDT, along with aligning onnxruntime-node to the dev 1.25.0 version used by onnxruntime-web.

Sequence diagram for RNNT transducer text word-boundary fallback

```mermaid
sequenceDiagram
    actor Client
    participant NemoConformerForTDT as NemoConformerForTDT
    participant buildTransducerDetailedOutputs as buildTransducerDetailedOutputs
    participant tokenizer as tokenizer
    Client->>NemoConformerForTDT: generate_transcript()
    NemoConformerForTDT->>buildTransducerDetailedOutputs: buildTransducerDetailedOutputs(tokenizer, token_ids, token_times)
    rect rgb(235, 235, 255)
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: initial pass using token word_start markers
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: words array computed
    end
    alt words.length <= 1 and tokens.length > 1
        buildTransducerDetailedOutputs->>tokenizer: decode(token_ids, skip_special_tokens, clean_up_tokenization_spaces=False)
        tokenizer-->>buildTransducerDetailedOutputs: fullDecoded
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: reset words and current word state
        loop for each token j
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: skip whitespace in fullDecoded
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: determine startsNewWord
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: update tokens[j].is_word_start
            alt startsNewWord
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(previous current)
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: start new current word
            else same word
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: append token to current word
            end
        end
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(last current)
    else sufficient word boundaries from tokens
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: keep initial words
    end
    buildTransducerDetailedOutputs-->>NemoConformerForTDT: word texts, timings, confidences
    NemoConformerForTDT-->>Client: detailed transcript output
```
Class diagram for updated Nemo Conformer TDT transducer components

```mermaid
classDiagram
    class TransducerConfig {
        +number frame_shift_s
        +number blank_token_id
        +string encoder_output_layout
        +string encoder_input_layout
        +string encoder_frame_layout
        +string encoder_length_dtype
        +string decoder_token_dtype
        +string decoder_token_length_dtype
    }
    class NemoConformerTDTPreTrainedModel {
        <<abstract>>
        +any transducer
    }
    class NemoConformerForTDT {
        +any transducer
        +forward(inputFeatures)
        -createEncoderLengthTensor(length)
    }
    NemoConformerTDTPreTrainedModel <|-- NemoConformerForTDT
    TransducerConfig --> NemoConformerForTDT : uses
    class Tokenizer {
        +string decode(number[] token_ids, any options)
    }
    class Token {
        +string token
        +number start_time
        +number end_time
        +number confidence
        +boolean is_word_start
    }
    class Word {
        +string text
        +number start
        +number end
        +number[] confs
        +number confidence
    }
    class TransducerTextUtils {
        +buildTransducerDetailedOutputs(Tokenizer tokenizer, number[] token_ids, number[] token_times)
        -finalizeAndPushWord(Word[] words, Word current)
    }
    TransducerTextUtils --> Tokenizer : uses
    TransducerTextUtils --> Token : aggregates
    TransducerTextUtils --> Word : aggregates
```
Hey - I've left some high level feedback:
- In the character-level tokenizer fallback, consider guarding the `pos += tokens[j].token.length` / `fullDecoded[pos]` logic so you bail out or stop advancing once `pos` reaches `fullDecoded.length`, to avoid relying on implicit string bounds checks when tokenization and decoded text lengths diverge (e.g. due to normalization).
- The fallback currently runs whenever `words.length <= 1 && tokens.length > 1`; you might want to narrow this condition (e.g. to specific tokenizer types or when the initial pass produced a single very long word) to reduce the risk of incorrectly re-segmenting text for non-character-level models that incidentally meet this condition.
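The suggested bounds guard can be sketched as below. This is a hypothetical illustration, not the PR's actual code; `inferWordStarts` is an invented name, and tokens are assumed to have the same shape (`{ token, is_word_start, ... }`) as in the fallback loop:

```javascript
// Sketch of the suggested guard: clamp `pos` so it never exceeds the
// decoded text length, even if token lengths and decoded length diverge.
function inferWordStarts(tokens, fullDecoded) {
    let pos = 0;
    for (let j = 0; j < tokens.length; j++) {
        let foundSpace = false;
        // Skip whitespace between words in the decoded text.
        while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
            foundSpace = true;
            pos++;
        }
        tokens[j].is_word_start = j === 0 || foundSpace;
        // Guard: stop advancing once the decoded text is exhausted,
        // instead of relying on implicit string bounds checks.
        pos = Math.min(pos + tokens[j].token.length, fullDecoded.length);
    }
    return tokens;
}
```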
Code Review
This pull request introduces support for NeMo Conformer RNNT models, focusing on those with character-level tokenizers. The changes are well-structured, adding a configurable encoder_length_dtype and a fallback for word boundary detection. My review identifies one area for improvement in transducer_text.js concerning code duplication, which could be refactored to enhance maintainability. The other changes, including the dependency update and JSDoc correction, are solid.
```js
if (words.length <= 1 && tokens.length > 1) {
    const fullDecoded = tokenizer
        .decode(token_ids, { skip_special_tokens: true, clean_up_tokenization_spaces: false })
        .trimStart();

    words.length = 0;
    current = null;
    let pos = 0;

    for (let j = 0; j < tokens.length; j++) {
        let foundSpace = false;
        while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
            foundSpace = true;
            pos++;
        }

        const startsNewWord = j === 0 || foundSpace;
        tokens[j].is_word_start = startsNewWord;
        pos += tokens[j].token.length;

        if (!current || startsNewWord) {
            finalizeAndPushWord(words, current);
            current = {
                text: tokens[j].token,
                start: tokens[j].start_time,
                end: tokens[j].end_time,
                confs: tokens[j].confidence != null ? [tokens[j].confidence] : [],
            };
        } else {
            current.text += tokens[j].token;
            current.end = tokens[j].end_time;
            if (tokens[j].confidence != null) {
                current.confs.push(tokens[j].confidence);
            }
        }
    }

    finalizeAndPushWord(words, current);
}
```
This new fallback block introduces significant code duplication. The logic for building words by creating or extending the current word object (lines 193-207) is nearly identical to the logic in the preceding loop (lines 151-165).
To improve maintainability and avoid redundancy, I recommend refactoring this duplicated logic into a separate helper function. This function could take the tokens array and be responsible for building the words array. You could then call it once for the initial word construction and again within this fallback block after updating the is_word_start flags.
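One way the suggested helper could look, sketched under the assumption that `is_word_start` flags are already set on each token before the call. The name `buildWordsFromTokens` is illustrative, not from the PR:

```javascript
// Hypothetical refactor: a single helper builds the words array from
// tokens whose is_word_start flags are set, so both the initial pass
// and the character-level fallback can share it.
function buildWordsFromTokens(tokens, finalizeAndPushWord) {
    const words = [];
    let current = null;
    for (const tok of tokens) {
        if (!current || tok.is_word_start) {
            // Close out the previous word and start a new one.
            finalizeAndPushWord(words, current);
            current = {
                text: tok.token,
                start: tok.start_time,
                end: tok.end_time,
                confs: tok.confidence != null ? [tok.confidence] : [],
            };
        } else {
            // Extend the current word with this token.
            current.text += tok.token;
            current.end = tok.end_time;
            if (tok.confidence != null) current.confs.push(tok.confidence);
        }
    }
    finalizeAndPushWord(words, current);
    return words;
}
```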
Guard pos against exceeding decoded text length in word-boundary fallback
```diff
   "@huggingface/jinja": "^0.5.5",
   "@huggingface/tokenizers": "^0.1.2",
-  "onnxruntime-node": "1.24.2",
+  "onnxruntime-node": "1.25.0-dev.20260228-6e72d31970",
```
WARNING: Dependency on dev version
Updating onnxruntime-node from stable 1.24.2 to 1.25.0-dev.20260228-6e72d31970 introduces a pre-release/dev version dependency. This could:
- Introduce instability in production environments
- Cause compatibility issues with existing ONNX models
- Make debugging harder due to non-stable APIs
Consider pinning to a stable release version instead.
Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge
Merge Recommendation

Approve with concerns - the dev version dependency should be addressed before merging to production. Consider pinning to a stable release version.
1. Guard `pos` against going past `fullDecoded.length` - Fixed now.

Comments from cursor:

2. Narrow the `words.length <= 1 && tokens.length > 1` condition - Not fixed, not a real concern.
3. Code duplication - Not worth fixing. The two loops look similar but serve different purposes:
Background
I made some adjustments to your transformers.js branch in order to support RNN-T models. Specifically, I needed support for nvidia/parakeet-rnnt-110m-da-dk, so I converted the model to ONNX (hlevring/parakeet-rnnt-110m-da-dk-onnx), tested with your branch of transformers.js, and made the necessary adjustments.
PS: This model transcribes without punctuation and capitalization, so I had to prepare a separate model for that: hlevring/bert-punct-restoration-da-onnx
Anyway, this PR just includes the basics to support RNNT-type models. It may not be worth merging, but figured I would make the PR anyway.
Changes

modeling_nemo_conformer_tdt.js:
- Add configurable `encoder_length_dtype` (int32/int64) to the transducer config. Some RNNT encoder exports (e.g. the Danish parakeet-rnnt-110m) require int32 length inputs rather than the default int64. Defaults to int64 for backward compatibility.
- Fix the JSDoc type of the `confidenceFromLogits` logits parameter.

transducer_text.js:
- Add fallback word-boundary detection for character-level tokenizers that emit tokens without `▁`/`Ġ` word-start markers. When the initial pass produces only a single "word" despite many tokens, a second pass decodes the full token sequence and uses whitespace to infer word boundaries, enabling correct word-level timestamps and confidences.

package.json:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228 to align with the `onnxruntime-web` 1.25.0-dev version already in use and resolve `onnxruntime-common` peer dependency conflicts.

Summary by Sourcery
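For illustration, a configurable length dtype typically only changes which typed array and tensor type are used when building the encoder length input. This is a hedged sketch, not the PR's actual implementation; the `Tensor` parameter stands in for the ONNX Runtime `Tensor` class, which is constructed as `new Tensor(type, data, dims)`:

```javascript
// Hypothetical sketch: build the encoder length input with a
// configurable dtype, defaulting to int64 for backward compatibility.
function createEncoderLengthTensor(Tensor, length, dtype = "int64") {
    if (dtype === "int32") {
        // Some RNNT encoder exports expect int32 length inputs.
        return new Tensor("int32", Int32Array.from([length]), [1]);
    }
    // Default: int64 lengths, stored as BigInt64Array.
    return new Tensor("int64", BigInt64Array.from([BigInt(length)]), [1]);
}
```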
Add configurable RNNT transducer encoder length dtype and improve word segmentation for character-level tokenizers while updating ONNX runtime dependency.
Enhancements:
- Add a configurable `encoder_length_dtype` (int32/int64) to the transducer config, defaulting to int64
- Add fallback word-boundary detection for character-level tokenizers without word-start markers

Build:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228