Add NeMo Conformer RNNT support for character-level tokenizers #9
hlevring wants to merge 2 commits into ysdede:v4-nemo-conformer-tdt-main from
Conversation
Some RNNT models (e.g. the Danish parakeet-rnnt-110m) use character-level SentencePiece tokenizers that lack word-start markers and require int32 encoder length inputs. This commit adds the necessary support.

modeling_nemo_conformer_tdt.js:
- Add configurable `encoder_length_dtype` (int32/int64) to the transducer config, defaulting to int64 for backward compatibility
- Fix the JSDoc type of the `confidenceFromLogits` logits parameter

transducer_text.js:
- Add fallback word-boundary detection for character-level tokenizers that emit tokens without word-start markers, enabling correct word-level timestamps and confidences

package.json:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228 to align with `onnxruntime-web` 1.25.0-dev and resolve `onnxruntime-common` peer dependency conflicts
Reviewer's Guide

Adds configurable encoder length dtype support and improves word-boundary handling for character-level RNNT tokenizers in NeMo Conformer TDT, along with aligning onnxruntime-node to the dev 1.25.0 version used by onnxruntime-web.

Sequence diagram for RNNT transducer text word-boundary fallback

```mermaid
sequenceDiagram
    actor Client
    participant NemoConformerForTDT as NemoConformerForTDT
    participant buildTransducerDetailedOutputs as buildTransducerDetailedOutputs
    participant tokenizer as tokenizer
    Client->>NemoConformerForTDT: generate_transcript()
    NemoConformerForTDT->>buildTransducerDetailedOutputs: buildTransducerDetailedOutputs(tokenizer, token_ids, token_times)
    rect rgb(235, 235, 255)
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: initial pass using token word_start markers
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: words array computed
    end
    alt words.length <= 1 and tokens.length > 1
        buildTransducerDetailedOutputs->>tokenizer: decode(token_ids, skip_special_tokens, clean_up_tokenization_spaces=False)
        tokenizer-->>buildTransducerDetailedOutputs: fullDecoded
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: reset words and current word state
        loop for each token j
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: skip whitespace in fullDecoded
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: determine startsNewWord
            buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: update tokens[j].is_word_start
            alt startsNewWord
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(previous current)
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: start new current word
            else same word
                buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: append token to current word
            end
        end
        buildTransducerDetailedOutputs->>buildTransducerDetailedOutputs: finalizeAndPushWord(last current)
    else sufficient word boundaries from tokens
        buildTransducerDetailedOutputs-->>buildTransducerDetailedOutputs: keep initial words
    end
    buildTransducerDetailedOutputs-->>NemoConformerForTDT: word texts, timings, confidences
    NemoConformerForTDT-->>Client: detailed transcript output
```
Class diagram for updated Nemo Conformer TDT transducer components

```mermaid
classDiagram
    class TransducerConfig {
        +number frame_shift_s
        +number blank_token_id
        +string encoder_output_layout
        +string encoder_input_layout
        +string encoder_frame_layout
        +string encoder_length_dtype
        +string decoder_token_dtype
        +string decoder_token_length_dtype
    }
    class NemoConformerTDTPreTrainedModel {
        <<abstract>>
        +any transducer
    }
    class NemoConformerForTDT {
        +any transducer
        +forward(inputFeatures)
        -createEncoderLengthTensor(length)
    }
    NemoConformerTDTPreTrainedModel <|-- NemoConformerForTDT
    TransducerConfig --> NemoConformerForTDT : uses
    class Tokenizer {
        +string decode(number[] token_ids, any options)
    }
    class Token {
        +string token
        +number start_time
        +number end_time
        +number confidence
        +boolean is_word_start
    }
    class Word {
        +string text
        +number start
        +number end
        +number[] confs
        +number confidence
    }
    class TransducerTextUtils {
        +buildTransducerDetailedOutputs(Tokenizer tokenizer, number[] token_ids, number[] token_times)
        -finalizeAndPushWord(Word[] words, Word current)
    }
    TransducerTextUtils --> Tokenizer : uses
    TransducerTextUtils --> Token : aggregates
    TransducerTextUtils --> Word : aggregates
```
Hey - I've left some high level feedback:
- In the character-level tokenizer fallback, consider guarding the `pos += tokens[j].token.length` / `fullDecoded[pos]` logic so you bail out or stop advancing once `pos` reaches `fullDecoded.length`, to avoid relying on implicit string bounds checks when tokenization and decoded text lengths diverge (e.g. due to normalization).
- The fallback currently runs whenever `words.length <= 1 && tokens.length > 1`; you might want to narrow this condition (e.g. to specific tokenizer types or when the initial pass produced a single very long word) to reduce the risk of incorrectly re-segmenting text for non-character-level models that incidentally meet this condition.
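The suggested bounds guard can be sketched as below. This is a hypothetical illustration, not the PR's actual code; `inferWordStarts` is an invented name, and tokens are assumed to have the same shape (`{ token, is_word_start, ... }`) as in the fallback loop:

```javascript
// Sketch of the suggested guard: clamp `pos` so it never exceeds the
// decoded text length, even if token lengths and decoded length diverge.
function inferWordStarts(tokens, fullDecoded) {
    let pos = 0;
    for (let j = 0; j < tokens.length; j++) {
        let foundSpace = false;
        // Skip whitespace between words in the decoded text.
        while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
            foundSpace = true;
            pos++;
        }
        tokens[j].is_word_start = j === 0 || foundSpace;
        // Guard: stop advancing once the decoded text is exhausted,
        // instead of relying on implicit string bounds checks.
        pos = Math.min(pos + tokens[j].token.length, fullDecoded.length);
    }
    return tokens;
}
```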
Code Review
This pull request introduces support for NeMo Conformer RNNT models, focusing on those with character-level tokenizers. The changes are well-structured, adding a configurable encoder_length_dtype and a fallback for word boundary detection. My review identifies one area for improvement in transducer_text.js concerning code duplication, which could be refactored to enhance maintainability. The other changes, including the dependency update and JSDoc correction, are solid.
```js
if (words.length <= 1 && tokens.length > 1) {
    const fullDecoded = tokenizer
        .decode(token_ids, { skip_special_tokens: true, clean_up_tokenization_spaces: false })
        .trimStart();

    words.length = 0;
    current = null;
    let pos = 0;

    for (let j = 0; j < tokens.length; j++) {
        let foundSpace = false;
        while (pos < fullDecoded.length && /\s/.test(fullDecoded[pos])) {
            foundSpace = true;
            pos++;
        }

        const startsNewWord = j === 0 || foundSpace;
        tokens[j].is_word_start = startsNewWord;
        pos += tokens[j].token.length;

        if (!current || startsNewWord) {
            finalizeAndPushWord(words, current);
            current = {
                text: tokens[j].token,
                start: tokens[j].start_time,
                end: tokens[j].end_time,
                confs: tokens[j].confidence != null ? [tokens[j].confidence] : [],
            };
        } else {
            current.text += tokens[j].token;
            current.end = tokens[j].end_time;
            if (tokens[j].confidence != null) {
                current.confs.push(tokens[j].confidence);
            }
        }
    }

    finalizeAndPushWord(words, current);
}
```
This new fallback block introduces significant code duplication. The logic for building words by creating or extending the current word object (lines 193-207) is nearly identical to the logic in the preceding loop (lines 151-165).
To improve maintainability and avoid redundancy, I recommend refactoring this duplicated logic into a separate helper function. This function could take the tokens array and be responsible for building the words array. You could then call it once for the initial word construction and again within this fallback block after updating the is_word_start flags.
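One way the suggested helper could look, sketched under the assumption that `is_word_start` flags are already set on each token before the call. The name `buildWordsFromTokens` is illustrative, not from the PR:

```javascript
// Hypothetical refactor: a single helper builds the words array from
// tokens whose is_word_start flags are set, so both the initial pass
// and the character-level fallback can share it.
function buildWordsFromTokens(tokens, finalizeAndPushWord) {
    const words = [];
    let current = null;
    for (const tok of tokens) {
        if (!current || tok.is_word_start) {
            // Close out the previous word and start a new one.
            finalizeAndPushWord(words, current);
            current = {
                text: tok.token,
                start: tok.start_time,
                end: tok.end_time,
                confs: tok.confidence != null ? [tok.confidence] : [],
            };
        } else {
            // Extend the current word with this token.
            current.text += tok.token;
            current.end = tok.end_time;
            if (tok.confidence != null) current.confs.push(tok.confidence);
        }
    }
    finalizeAndPushWord(words, current);
    return words;
}
```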
Guard pos against exceeding decoded text length in word-boundary fallback
```diff
   "@huggingface/jinja": "^0.5.5",
   "@huggingface/tokenizers": "^0.1.2",
-  "onnxruntime-node": "1.24.2",
+  "onnxruntime-node": "1.25.0-dev.20260228-6e72d31970",
```
WARNING: Dependency on dev version
Updating onnxruntime-node from stable 1.24.2 to 1.25.0-dev.20260228-6e72d31970 introduces a pre-release/dev version dependency. This could:
- Introduce instability in production environments
- Cause compatibility issues with existing ONNX models
- Make debugging harder due to non-stable APIs
Consider pinning to a stable release version instead.
Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge
Merge Recommendation

Approve with concerns - the dev version dependency should be addressed before merging to production. Consider pinning to a stable release version.
1. Guard `pos` against going past `fullDecoded.length` - Fixed now.

Comments from cursor:

2. Narrow the `words.length <= 1 && tokens.length > 1` condition - Not fixed, not a real concern.
3. Code duplication - Not worth fixing. The two loops look similar but serve different purposes:
Background
I made some adjustments to your transformers.js branch in order to support RNN-T models. Specifically, I needed support for nvidia/parakeet-rnnt-110m-da-dk, so I converted the model to ONNX (hlevring/parakeet-rnnt-110m-da-dk-onnx), tested with your branch of transformers.js, and made the necessary adjustments.
PS: This model transcribes without punctuation and capitalization, so I had to prepare a separate model for that: hlevring/bert-punct-restoration-da-onnx
Anyway, this PR just includes the basics to support RNNT-type models. It may not be worth merging, but figured I would make the PR anyway.
Changes

modeling_nemo_conformer_tdt.js:
- Add configurable `encoder_length_dtype` (int32/int64) to the transducer config. Some RNNT encoder exports (e.g. the Danish parakeet-rnnt-110m) require int32 length inputs rather than the default int64. Defaults to int64 for backward compatibility.
- Fix the JSDoc type of the `confidenceFromLogits` logits parameter.

transducer_text.js:
- Add fallback word-boundary detection for character-level tokenizers that emit tokens without `▁`/`Ġ` word-start markers. When the initial pass produces only a single "word" despite many tokens, a second pass decodes the full token sequence and uses whitespace to infer word boundaries, enabling correct word-level timestamps and confidences.

package.json:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228 to align with the `onnxruntime-web` 1.25.0-dev version already in use and resolve `onnxruntime-common` peer dependency conflicts.

Summary by Sourcery
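For illustration, a configurable length dtype typically only changes which typed array and tensor type are used when building the encoder length input. This is a hedged sketch, not the PR's actual implementation; the `Tensor` parameter stands in for the ONNX Runtime `Tensor` class, which is constructed as `new Tensor(type, data, dims)`:

```javascript
// Hypothetical sketch: build the encoder length input with a
// configurable dtype, defaulting to int64 for backward compatibility.
function createEncoderLengthTensor(Tensor, length, dtype = "int64") {
    if (dtype === "int32") {
        // Some RNNT encoder exports expect int32 length inputs.
        return new Tensor("int32", Int32Array.from([length]), [1]);
    }
    // Default: int64 lengths, stored as BigInt64Array.
    return new Tensor("int64", BigInt64Array.from([BigInt(length)]), [1]);
}
```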
Add configurable RNNT transducer encoder length dtype and improve word segmentation for character-level tokenizers while updating ONNX runtime dependency.
Enhancements:
- Add a configurable `encoder_length_dtype` (int32/int64) to the transducer config, defaulting to int64
- Add fallback word-boundary detection for character-level tokenizers without word-start markers

Build:
- Bump `onnxruntime-node` from 1.24.2 to 1.25.0-dev.20260228