
Implement dropout for encode_minimal #98

Open
marinegor wants to merge 12 commits into github:main from marinegor:feature/add-dropout

Conversation

@marinegor

Closes #97

It has been shown that randomly discarding merges during the tokenization process can improve tolerance to typos in the input text. This PR implements encode_minimal_dropout for BytePairEncoding by modifying encode_minimal.

@marinegor marinegor requested a review from a team as a code owner February 25, 2026 17:02
use rand::Rng;
assert!(0.0 <= dropout);
assert!(dropout <= 1.0);
let mut rng = rand::rng();
Collaborator


In order to get reproducible results, the random number generator should be passed in as an argument.

Author


Added a seed: Option<u64> argument.
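For illustration, here is one dependency-free way such an optional seed can be resolved to a concrete value (the function name and the `None` fallback are mine, not the PR's code; the PR itself feeds the seed into the rand crate's generator):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative only: resolve an optional user-supplied seed.
// Some(seed) gives reproducible dropout on every run; None varies
// per run by falling back to the current clock's nanoseconds.
fn resolve_seed(seed: Option<u64>) -> u64 {
    seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before UNIX epoch")
            .subsec_nanos() as u64
    })
}
```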

state = s;
let mut best = (0, u32::MAX);
for m in iter {
if m.start() == 0 {
Collaborator


Is there some paper explaining in more detail how the randomization is supposed to work?
It's not quite obvious what properties the implementation actually has.

Also, some documentation would be nice (as part of some readme and/or doc comment).

If this is a one-to-one implementation of some paper, then we can probably just link to that paper.

Author


The paper: https://arxiv.org/abs/1910.13267

We're interested in Algorithm 1 (page 3).

The rationale for the improvements can be seen in Figure 6.

I don't think it's a one-to-one implementation, since encode_minimal does not follow the original BPE, so its modification won't follow BPE-dropout either. But I still see it as a valuable addition (at least I'm planning to use it in my current project).

Author

@marinegor marinegor Feb 25, 2026


...although I admit I don't understand encode_minimal well enough to be sure I'm actually rejecting merges and not doing something different (which I probably am).

My intuition is that the dropout value roughly corresponds to the fraction of rejected merges in the final encoding, e.g. dropout ≈ 1 should result in an almost single-byte encoding. However, I don't see that with dropout=0.99:

 t_1    s_1     t_2    s_2
   1      b       1      b
   2     ab       2     ab
   0      a       0      a
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
  -1      -       0      a   <----
   2     ab       1      b   <----
   1      b       1      b
e1=' b ab  a ab ab ab ab __ ab  b'
e2=' b ab  a ab ab ab ab  a  b  b'

where dictionary is a b ab and string is babaabababababb.

So I'd appreciate any directions if you have any :)

Collaborator


encode_minimal uses dynamic programming, i.e. it models the tokenization as a graph where every position between text bytes is a node and two nodes are connected when the text slice between them matches a token.
It then finds the shortest possible path from the beginning of the text to the end, i.e. the shortest possible encoding.
For this, it processes the nodes from left to right and visits all edges to the left. It then picks the edge which results in the shortest path. The length of the shortest path is stored as the second value, the edge (or rather the token) as the first.
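As a toy illustration of this dynamic program (my sketch, not the crate's actual code, which matches tokens with an Aho-Corasick automaton rather than a linear scan over the vocabulary):

```rust
// Toy shortest-path tokenization. Each position between bytes is a
// node; an edge connects positions i..j when text[i..j] matches a
// token. Scanning left to right, state[j] stores (index of the token
// on the best incoming edge, number of tokens on the shortest path).
fn encode_minimal_toy(text: &[u8], tokens: &[&[u8]]) -> Option<Vec<usize>> {
    let mut state: Vec<(usize, u32)> = vec![(usize::MAX, u32::MAX); text.len() + 1];
    state[0] = (usize::MAX, 0); // empty prefix: zero tokens
    for j in 1..=text.len() {
        let mut best = (usize::MAX, u32::MAX);
        for (t, tok) in tokens.iter().enumerate() {
            // visit every edge ending at node j
            if tok.len() <= j && &text[j - tok.len()..j] == *tok {
                let prev = state[j - tok.len()].1;
                if prev != u32::MAX && prev + 1 < best.1 {
                    best = (t, prev + 1);
                }
            }
        }
        state[j] = best;
    }
    if state[text.len()].1 == u32::MAX {
        return None; // end node unreachable: no tokenization exists
    }
    // Walk back from the end to recover the token sequence.
    let mut out = Vec::new();
    let mut j = text.len();
    while j > 0 {
        let (t, _) = state[j];
        out.push(t);
        j -= tokens[t].len();
    }
    out.reverse();
    Some(out)
}
```

With the dictionary `a b ab` from the example above, `encode_minimal_toy(b"abab", ...)` returns the two-token encoding `ab ab` rather than four single bytes.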

Note: this is very different from how BPE works and cannot produce the same output as the algorithm in the paper.

The only implementation in this crate which follows the "standard" BPE algorithm is encode_into_bitfield, since it uses the "standard" heap approach. But instead of storing some complicated doubly linked list, it uses a compact bitfield to encode the start and end positions of tokens, which probably makes it the fastest standard implementation.
It is still slow compared to the other algorithms in this crate, but those operate VERY differently, and again I'm not sure it's possible to emulate the exact probability distribution suggested in the paper.
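For readers unfamiliar with the "standard" algorithm referred to here, a naive reference version looks roughly like this (my sketch, not encode_into_bitfield, which obtains the same merge order with a heap and a bitfield instead of this quadratic rescan):

```rust
use std::collections::HashMap;

// Naive "standard" BPE: repeatedly merge the adjacent pair with the
// best (lowest) merge rank, leftmost pair winning ties, until no
// adjacent pair has a rank.
fn bpe_toy(text: &str, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    let mut parts: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // (rank, index) tuples compare lexicographically, so `min`
        // picks the lowest rank and, among equal ranks, the leftmost.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => return parts,
        }
    }
}
```

BPE-dropout as described in the paper would discard each candidate merge in this loop with probability p, which is why the original algorithm (and the linked Python implementation) is so slow.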

The problem with the algorithm in the paper is that it is VERY slow.
And the dropout implementation I found here: https://github.com/VProv/BPE-Dropout/blob/master/bpe.py#L98 is just as bad (or maybe even worse)?

So maybe it is good enough to pick a different randomization process which follows the idea of the paper in spirit?
I don't have the time right now to do this research myself, but I'm happy to review any proposals.
It would also make sense IMO to devise some procedure which plots the "quality" of the BPE as a function of the dropout value and compare that graph with the original algorithm.
I mean, you probably want some kind of proof that this actually has the desired properties.

Author

@marinegor marinegor Feb 26, 2026


@aneubeck thanks for the explanation, that's actually very helpful. I guess the only thing that matters is being able to drop some merges before actually building the tokenization.

Could you have a look at the updated approach? I've changed the approach I had before (which I think was very wrong): now a "best" token is only considered if it is not in forbidden_tokens, which is constructed prior to tokenization. My only worry is the single-byte tokens: I'm not sure how they're handled, and I wouldn't want to discard them from the allowed tokens, but I'm unsure how to handle that properly. I'm talking about this line:

...
& (!(forbidden_tokens_set.contains(&m.value())) | ((m.end() - m.start()) == 1))
...

I'm not sure if the second condition should be present or not, basically.

Collaborator


Thanks for the changes!
I made some more improvements to your code here: c215210.
Can you copy them into your branch?
(I changed the way the rng is passed in, since this should make it easier to use IMO).

There was a little bug with how you treated tokens which started at the beginning of the text (you didn't filter larger tokens out there...).
I updated the test to detect this.

I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!
In the process I also found a way to speed things up by another 20-30% by going through the text in reverse order and using the reverse Aho-Corasick lookup table. This way one can avoid the final reversal of the token output, which improves throughput further.

It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).
And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

On my MacBook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.
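The drop rule described here (uniformly drop multi-byte edges, always keep one-byte tokens so the graph stays connected) can be sketched as follows. This is my dependency-free illustration with a tiny splitmix64 generator; the names are not the crate's API, and the PR itself takes the RNG from the caller via the rand crate:

```rust
// Tiny seeded PRNG (splitmix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^= z >> 31;
        // 53 uniform bits mapped into [0, 1)
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Decide whether a matched token spanning text[start..end] survives.
// Edges longer than one byte are dropped with probability `dropout`;
// one-byte edges are exempt so every position stays reachable.
fn keep_edge(rng: &mut SplitMix64, dropout: f64, start: usize, end: usize) -> bool {
    assert!((0.0..=1.0).contains(&dropout));
    end - start == 1 || rng.next_f64() >= dropout
}
```

With dropout = 1.0 only the single-byte edges survive, so the encoding degenerates to one token per byte; with dropout = 0.0 every edge survives and the result equals the minimal encoding.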

Author


thanks for the changes as well!

> Can you copy them into your branch?
> (I changed the way the rng is passed in, since this should make it easier to use IMO).

Done, and thanks; I have practically no experience with RNGs, so I appreciate it.

> I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!

I'm thinking it might be technically different from the paper: from how I read their algorithm, once a token has been dropped it cannot appear in the tokenization again, while in your implementation it may reappear later in the text. But in all fairness, they also split by words first, which at the sentence level makes things equivalent to your implementation.

> It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).

Will do!

> And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

I'll spend some time playing around with a toy example (with an a b ab dictionary), and then update the README / tests with that.

> On my MacBook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.

that's pretty cool :)

Author


@aneubeck I've added some explanation and updated the README slightly.

I'm running benchmarks now -- I guess it's simply cargo criterion and cd scripts && ./copy-results, right?

Also, I'm running them on an M4 -- should I update the description in the README accordingly, or would you prefer to run it on your machine?


Development

Successfully merging this pull request may close these issues.

Help with implementing dropout

3 participants