Add NeMo Conformer RNNT support for character-level tokenizers #9

Open
hlevring wants to merge 2 commits into ysdede:v4-nemo-conformer-tdt-main from hlevring:v4-nemo-conformer-tdt-main

Conversation


@hlevring hlevring commented Mar 5, 2026

Background

I made some adjustments to your transformers.js branch in order to support RNN-T models. Specifically, I needed support for nvidia/parakeet-rnnt-110m-da-dk, so I converted the model to ONNX (hlevring/parakeet-rnnt-110m-da-dk-onnx), tested with your branch of transformers.js, and made the necessary adjustments.

PS: This model transcribes without punctuation and capitalization, so I had to prepare a separate model for that: hlevring/bert-punct-restoration-da-onnx

Anyway, this PR just includes the basics to support RNNT-type models. It may not be worth merging, but figured I would make the PR anyway.

Changes

modeling_nemo_conformer_tdt.js:

  • Add configurable encoder_length_dtype (int32/int64) to transducer config. Some RNNT encoder exports (e.g. the Danish parakeet-rnnt-110m) require int32 length inputs rather than the default int64. Defaults to int64 for backward compatibility.
  • Fix JSDoc type for confidenceFromLogits logits parameter.
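
  A rough sketch of how such a dtype switch might look (illustrative names, not the PR's actual code; the `{ type, data, dims }` shape mirrors the `(type, data, dims)` Tensor constructor signature used by onnxruntime):

  ```javascript
  // Sketch: pick the typed array for the encoder length input based on the
  // resolved transducer config. "encoder_length_dtype" is the new config key;
  // the function name and return shape are illustrative.
  function createEncoderLengthData(encoderLengthDtype, length) {
      if (encoderLengthDtype === 'int32') {
          return { type: 'int32', data: Int32Array.from([length]), dims: [1] };
      }
      if (encoderLengthDtype === 'int64') {
          return { type: 'int64', data: BigInt64Array.from([BigInt(length)]), dims: [1] };
      }
      throw new Error(`Invalid encoder_length_dtype: ${encoderLengthDtype} (expected "int32" or "int64")`);
  }
  ```

  Defaulting the config key to `int64` keeps existing exports working, while int32-length exports such as the Danish model opt in explicitly.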

transducer_text.js:

  • Add fallback word-boundary detection for character-level tokenizers. SentencePiece tokenizers used by some RNNT models emit single-character tokens without ▁/Ġ word-start markers. When the initial pass produces only a single "word" despite many tokens, a second pass decodes the full token sequence and uses whitespace to infer word boundaries, enabling correct word-level timestamps and confidences.
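
  The fallback idea can be sketched in miniature (simplified from the PR: tokens here are plain strings rather than token objects, and the guard and word-finalization logic are omitted):

  ```javascript
  // Walk the decoded string in parallel with the token list and mark a token
  // as a word start whenever whitespace precedes it in the decoded text.
  function inferWordStarts(tokens, fullDecoded) {
      const starts = [];
      let pos = 0;
      for (let j = 0; j < tokens.length; j++) {
          let foundSpace = false;
          // Consume any whitespace between the previous token and this one.
          while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
              foundSpace = true;
              pos++;
          }
          starts.push(j === 0 || foundSpace);
          pos += tokens[j].length; // advance past this token's text
      }
      return starts;
  }
  ```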

package.json:

  • Bump onnxruntime-node from 1.24.2 to 1.25.0-dev.20260228 to align with the onnxruntime-web 1.25.0-dev version already in use and resolve onnxruntime-common peer dependency conflicts.

Summary by Sourcery

Add configurable RNNT transducer encoder length dtype and improve word segmentation for character-level tokenizers while updating ONNX runtime dependency.

Enhancements:

  • Add support for configurable encoder length dtype in NeMo Conformer RNNT transducer configs and model feeds to handle both int32 and int64 length tensors.
  • Improve transducer text post-processing by adding a fallback word-boundary detection pass for character-level tokenizers to produce accurate word-level outputs.
  • Correct the documented logits parameter type for the confidence computation helper to use the shared tensor data array type.

Build:

  • Update onnxruntime-node dependency to a 1.25.0 dev build to align with the onnxruntime-web version.


sourcery-ai bot commented Mar 5, 2026

Reviewer's Guide

Adds configurable encoder length dtype support and improves word-boundary handling for character-level RNNT tokenizers in NeMo Conformer TDT, along with aligning onnxruntime-node to the dev 1.25.0 version used by onnxruntime-web.

Sequence diagram for RNNT transducer text word-boundary fallback

sequenceDiagram
    actor Client
    participant NemoConformerForTDT as NemoConformerForTDT
    participant buildTransducerDetailedOutputs as buildTransducerDetailedOutputs
    participant tokenizer as tokenizer

    Client->>NemoConformerForTDT: generate_transcript()
    NemoConformerForTDT->>buildTransducerDetailedOutputs: buildTransducerDetailedOutputs(tokenizer, token_ids, token_times)

    rect rgb(235, 235, 255)
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: initial pass using token word_start markers
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: words array computed
    end

    alt words.length <= 1 and tokens.length > 1
        buildTransducerDetailedOutputs->>tokenizer: decode(token_ids, skip_special_tokens, clean_up_tokenization_spaces=False)
        tokenizer-->>buildTransducerDetailedOutputs: fullDecoded

        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: reset words and current word state
        loop for each token j
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: skip whitespace in fullDecoded
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: determine startsNewWord
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: update tokens[j].is_word_start
            alt startsNewWord
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(previous current)
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: start new current word
            else same word
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: append token to current word
            end
        end
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(last current)
    else sufficient word boundaries from tokens
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: keep initial words
    end

    buildTransducerDetailedOutputs-->>NemoConformerForTDT: word texts, timings, confidences
    NemoConformerForTDT-->>Client: detailed transcript output

Class diagram for updated Nemo Conformer TDT transducer components

classDiagram
    class TransducerConfig {
        +number frame_shift_s
        +number blank_token_id
        +string encoder_output_layout
        +string encoder_input_layout
        +string encoder_frame_layout
        +string encoder_length_dtype
        +string decoder_token_dtype
        +string decoder_token_length_dtype
    }

    class NemoConformerTDTPreTrainedModel {
        <<abstract>>
        +any transducer
    }

    class NemoConformerForTDT {
        +any transducer
        +forward(inputFeatures)
        -createEncoderLengthTensor(length)
    }

    NemoConformerTDTPreTrainedModel <|-- NemoConformerForTDT
    TransducerConfig --> NemoConformerForTDT : uses

    class Tokenizer {
        +string decode(number[] token_ids, any options)
    }

    class Token {
        +string token
        +number start_time
        +number end_time
        +number confidence
        +boolean is_word_start
    }

    class Word {
        +string text
        +number start
        +number end
        +number[] confs
        +number confidence
    }

    class TransducerTextUtils {
        +buildTransducerDetailedOutputs(Tokenizer tokenizer, number[] token_ids, number[] token_times)
        -finalizeAndPushWord(Word[] words, Word current)
    }

    TransducerTextUtils --> Tokenizer : uses
    TransducerTextUtils --> Token : aggregates
    TransducerTextUtils --> Word : aggregates

File-Level Changes

Change Details Files
Add configurable encoder length dtype for transducer encoder lengths and use it when constructing ONNX length tensors.
  • Extend transducer config resolution to read encoder_length_dtype with a default of int64.
  • Validate encoder_length_dtype against allowed values int32/int64 and surface a clear error on invalid values.
  • Propagate encoder_length_dtype into the resolved transducer configuration object.
  • Conditionally construct encoder length ONNX tensors as int32 or int64 based on encoder_length_dtype.
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
Improve word-boundary detection for character-level RNNT tokenizers to produce correct word-level timestamps and confidences.
  • Add a fallback path when initial word segmentation yields a single word but multiple tokens are present.
  • Decode the full token sequence without cleanup to recover whitespace, then derive word starts from whitespace positions.
  • Rebuild words and their timestamps/confidence arrays using the inferred word boundaries and update tokens' is_word_start flags accordingly.
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
Align onnxruntime-node dependency with the dev 1.25.0 series already used by onnxruntime-web.
  • Bump onnxruntime-node from 1.24.2 to 1.25.0-dev.20260228-6e72d31970 in the transformers package.json to avoid peer dependency conflicts.
packages/transformers/package.json
Tighten typing for confidenceFromLogits logits argument.
  • Update JSDoc for confidenceFromLogits logits parameter to use the shared DataArray type instead of a narrowed union.
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js

@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • In the character-level tokenizer fallback, consider guarding the pos += tokens[j].token.length / fullDecoded[pos] logic so you bail out or stop advancing once pos reaches fullDecoded.length, to avoid relying on implicit string bounds checks when tokenization and decoded text lengths diverge (e.g. due to normalization).
  • The fallback currently runs whenever words.length <= 1 && tokens.length > 1; you might want to narrow this condition (e.g. to specific tokenizer types or when the initial pass produced a single very long word) to reduce the risk of incorrectly re-segmenting text for non-character-level models that incidentally meet this condition.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the character-level tokenizer fallback, consider guarding the `pos += tokens[j].token.length` / `fullDecoded[pos]` logic so you bail out or stop advancing once `pos` reaches `fullDecoded.length`, to avoid relying on implicit string bounds checks when tokenization and decoded text lengths diverge (e.g. due to normalization).
- The fallback currently runs whenever `words.length <= 1 && tokens.length > 1`; you might want to narrow this condition (e.g. to specific tokenizer types or when the initial pass produced a single very long word) to reduce the risk of incorrectly re-segmenting text for non-character-level models that incidentally meet this condition.
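
One illustrative way to implement that guard (assumed helper name, not code from the PR) is to report when the decoded text is exhausted so the caller can stop inferring boundaries:

```javascript
// Advance past whitespace in the decoded string, reporting whether any was
// found and whether the string has been fully consumed. Once "exhausted" is
// true, later tokens can no longer be aligned against the decoded text.
function advanceThroughWhitespace(fullDecoded, pos) {
    let foundSpace = false;
    while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
        foundSpace = true;
        pos++;
    }
    return { pos, foundSpace, exhausted: pos >= fullDecoded.length };
}
```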



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for NeMo Conformer RNNT models, focusing on those with character-level tokenizers. The changes are well-structured, adding a configurable encoder_length_dtype and a fallback for word boundary detection. My review identifies one area for improvement in transducer_text.js concerning code duplication, which could be refactored to enhance maintainability. The other changes, including the dependency update and JSDoc correction, are solid.

Comment on lines +173 to +211
if (words.length <= 1 && tokens.length > 1) {
    const fullDecoded = tokenizer
        .decode(token_ids, { skip_special_tokens: true, clean_up_tokenization_spaces: false })
        .trimStart();

    words.length = 0;
    current = null;
    let pos = 0;

    for (let j = 0; j < tokens.length; j++) {
        let foundSpace = false;
        while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
            foundSpace = true;
            pos++;
        }

        const startsNewWord = j === 0 || foundSpace;
        tokens[j].is_word_start = startsNewWord;
        pos += tokens[j].token.length;

        if (!current || startsNewWord) {
            finalizeAndPushWord(words, current);
            current = {
                text: tokens[j].token,
                start: tokens[j].start_time,
                end: tokens[j].end_time,
                confs: tokens[j].confidence != null ? [tokens[j].confidence] : [],
            };
        } else {
            current.text += tokens[j].token;
            current.end = tokens[j].end_time;
            if (tokens[j].confidence != null) {
                current.confs.push(tokens[j].confidence);
            }
        }
    }

    finalizeAndPushWord(words, current);
}

Severity: medium

This new fallback block introduces significant code duplication. The logic for building words by creating or extending the current word object (lines 193-207) is nearly identical to the logic in the preceding loop (lines 151-165).

To improve maintainability and avoid redundancy, I recommend refactoring this duplicated logic into a separate helper function. This function could take the tokens array and be responsible for building the words array. You could then call it once for the initial word construction and again within this fallback block after updating the is_word_start flags.
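
A sketch of the helper being proposed (assumed name and shape, not code from the PR): it builds words from tokens whose is_word_start flags have already been set, so the initial pass and the fallback could both call it after deciding the flags their own way:

```javascript
// Group tokens into words using their is_word_start flags. Each word
// accumulates text, a start/end time span, and per-token confidences.
function buildWordsFromFlags(tokens) {
    const words = [];
    let current = null;
    for (const t of tokens) {
        if (!current || t.is_word_start) {
            if (current) words.push(current); // finalize the previous word
            current = {
                text: t.token,
                start: t.start_time,
                end: t.end_time,
                confs: t.confidence != null ? [t.confidence] : [],
            };
        } else {
            current.text += t.token;
            current.end = t.end_time;
            if (t.confidence != null) current.confs.push(t.confidence);
        }
    }
    if (current) words.push(current);
    return words;
}
```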

Repository owner deleted a comment from coderabbitai bot Mar 5, 2026
Repository owner deleted a comment from gemini-code-assist bot Mar 5, 2026
Repository owner deleted a comment from kilo-code-bot bot Mar 5, 2026
Guard pos against exceeding decoded text length in word-boundary fallback

coderabbitai bot commented Mar 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


"@huggingface/jinja": "^0.5.5",
"@huggingface/tokenizers": "^0.1.2",
"onnxruntime-node": "1.24.2",
"onnxruntime-node": "1.25.0-dev.20260228-6e72d31970",

WARNING: Dependency on dev version

Updating onnxruntime-node from stable 1.24.2 to 1.25.0-dev.20260228-6e72d31970 introduces a pre-release/dev version dependency. This could:

  • Introduce instability in production environments
  • Cause compatibility issues with existing ONNX models
  • Make debugging harder due to non-stable APIs

Consider pinning to a stable release version instead.


kilo-code-bot bot commented Mar 6, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity   Count
CRITICAL   0
WARNING    1
SUGGESTION 0

Issue Details

WARNING

File Line Issue
packages/transformers/package.json 60 Dev version dependency for onnxruntime-node
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File Line Issue
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js 193-207 Code duplication in fallback block - Already noted in existing review comment
Files Reviewed (3 files)
  • packages/transformers/package.json - 1 issue
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js - No issues
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js - Covered by existing comment

Analysis Summary

Changes Overview:
This PR introduces:

  1. Updated onnxruntime-node dependency to a dev version
  2. New encoder_length_dtype config option for the Nemo Conformer TDT model
  3. Fallback logic for character-level tokenizers (e.g., Danish RNNT)

Positive Aspects:

  • The encoder_length_dtype feature adds flexibility for different ONNX model configurations
  • Proper validation is implemented for the new config option
  • The fallback logic for character-level tokenizers addresses a real use case

Concerns:

  • The dev version dependency (1.25.0-dev.20260228-6e72d31970) could introduce instability

Performance Review:
No significant performance concerns identified. The changes either add configuration options or implement fallback logic that only triggers under specific conditions (when words.length <= 1 and tokens.length > 1).

Security Review:
No concrete security issues identified in this diff. The changes involve:

  • Dependency version update (no security implications from the version change itself)
  • Config validation with allowlist (safe pattern)
  • Text processing logic with no external input handling

Reliability Review:

  • Config validation is properly implemented with appropriate error messages
  • The fallback logic includes bounds checking with Math.min()
  • No obvious race conditions or resource leaks

Test Review:
No tests were modified or added in this PR. Given the changes involve:

  • New config option
  • Fallback tokenizer logic

Consider adding tests to verify:

  1. The encoder_length_dtype config is properly passed to the ONNX runtime
  2. The character-level tokenizer fallback produces correct word boundaries

Merge Recommendation

Approve with concerns - The dev version dependency should be addressed before merging to production. Consider pinning to a stable release version.

hlevring (Author) commented Mar 6, 2026

1. Guard pos against going past fullDecoded.length

Fixed now.

Comments from Cursor

2. Narrow the words.length <= 1 && tokens.length > 1 condition (Not fixed, not a real concern)
The concern is that a non-character-level model might accidentally trigger this. In practice, this is extremely unlikely. If a model has ▁/Ġ word-start markers, the first pass will produce multiple words, so words.length <= 1 will be false. The only way to trigger it is if every token in a multi-token sequence lacks word-start markers, which is precisely the character-level tokenizer case. Adding a tokenizer type check would be fragile and over-engineered. I'd leave the condition as-is.

3. Code duplication (Not worth fixing)

The two loops look similar but serve different purposes:

  • The first loop (lines 151-165) builds words from is_word_start flags determined by resolveTokenPiece (checking ▁/Ġ markers).
  • The fallback (lines 193-207) builds words from whitespace positions in the decoded string.

A shared helper would need to accept different inputs and contexts, making it harder to follow. The duplication is ~15 lines and both paths are self-contained. I'd leave it.
