
Implement dropout for encode_minimal #98

Open
marinegor wants to merge 12 commits into github:main from marinegor:feature/add-dropout

Conversation

@marinegor

Closes #97

It has been shown that randomly discarding merges during the tokenization process can improve tolerance to typos in the input text. This PR implements encode_minimal_dropout for BytePairEncoding by modifying encode_minimal.

@marinegor marinegor requested a review from a team as a code owner February 25, 2026 17:02
use rand::Rng;
assert!(0.0 <= dropout);
assert!(dropout <= 1.0);
let mut rng = rand::rng();
Collaborator


In order to get reproducible results, the random number generator should be passed in as an argument.

Author


Added a seed: Option<u64> argument.
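For illustration, here is one dependency-free way such an optional seed can be resolved to a concrete value (the function name and the `None` fallback are mine, not the PR's code; the PR itself feeds the seed into the rand crate's generator):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative only: resolve an optional user-supplied seed.
// Some(seed) gives reproducible dropout on every run; None varies
// per run by falling back to the current clock's nanoseconds.
fn resolve_seed(seed: Option<u64>) -> u64 {
    seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before UNIX epoch")
            .subsec_nanos() as u64
    })
}
```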

state = s;
let mut best = (0, u32::MAX);
for m in iter {
if m.start() == 0 {
Collaborator


Is there some paper explaining in more detail how the randomization is supposed to work?
It's not quite obvious what properties the implementation actually has.

Also, some documentation would be nice (as part of some readme and/or doc comment).

If this is a one-to-one implementation of some paper, then we can probably just link to that paper.

Author


The paper: https://arxiv.org/abs/1910.13267

We're interested in Algorithm 1 (page 3).

The rationale for the improvements can be seen in Figure 6.

I don't think it's a one-to-one implementation, since encode_minimal does not follow the original BPE, so its modification won't follow BPE-dropout either. But I still see it as a valuable addition (at least I'm planning to use it in my current project).

Author

@marinegor marinegor Feb 25, 2026


...although I admit I don't understand encode_minimal well enough to be sure I'm actually rejecting merges and not doing something different (which I probably am).

My intuition is that the dropout value roughly corresponds to the fraction of rejected merges in the final encoding, e.g. dropout ≈ 1 should result in an almost single-byte encoding. However, I don't see that with dropout=0.99:

 t_1    s_1     t_2    s_2
   1      b       1      b
   2     ab       2     ab
   0      a       0      a
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
  -1      -       0      a   <----
   2     ab       1      b   <----
   1      b       1      b
e1=' b ab  a ab ab ab ab __ ab  b'
e2=' b ab  a ab ab ab ab  a  b  b'

where dictionary is a b ab and string is babaabababababb.

So I'd appreciate any directions if you have any :)

Collaborator


encode_minimal uses dynamic programming, i.e. it models the tokenization as a graph where every position between text bytes is a node and two nodes are connected when the text slice between them matches a token.
It then finds the shortest possible path from the beginning of the text to the end, i.e. the shortest possible encoding.
For this, it processes the nodes from left to right and visits all edges to the left. It then picks the edge which results in the shortest path. The length of the shortest path is stored as the second value, the edge (or rather the token) as the first.
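As a toy illustration of this dynamic program (my sketch, not the crate's actual code, which matches tokens with an Aho-Corasick automaton rather than a linear scan over the vocabulary):

```rust
// Toy shortest-path tokenization. Each position between bytes is a
// node; an edge connects positions i..j when text[i..j] matches a
// token. Scanning left to right, state[j] stores (index of the token
// on the best incoming edge, number of tokens on the shortest path).
fn encode_minimal_toy(text: &[u8], tokens: &[&[u8]]) -> Option<Vec<usize>> {
    let mut state: Vec<(usize, u32)> = vec![(usize::MAX, u32::MAX); text.len() + 1];
    state[0] = (usize::MAX, 0); // empty prefix: zero tokens
    for j in 1..=text.len() {
        let mut best = (usize::MAX, u32::MAX);
        for (t, tok) in tokens.iter().enumerate() {
            // visit every edge ending at node j
            if tok.len() <= j && &text[j - tok.len()..j] == *tok {
                let prev = state[j - tok.len()].1;
                if prev != u32::MAX && prev + 1 < best.1 {
                    best = (t, prev + 1);
                }
            }
        }
        state[j] = best;
    }
    if state[text.len()].1 == u32::MAX {
        return None; // end node unreachable: no tokenization exists
    }
    // Walk back from the end to recover the token sequence.
    let mut out = Vec::new();
    let mut j = text.len();
    while j > 0 {
        let (t, _) = state[j];
        out.push(t);
        j -= tokens[t].len();
    }
    out.reverse();
    Some(out)
}
```

With the dictionary `a b ab` from the example above, `encode_minimal_toy(b"abab", ...)` returns the two-token encoding `ab ab` rather than four single bytes.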

Note: this is very different from how BPE works and cannot produce the same output as the algorithm in the paper.

The only implementation in this crate which follows the "standard" BPE algorithm is encode_into_bitfield, since it uses the "standard" heap approach. But instead of storing some complicated doubly linked list, it uses a compact bitfield to encode the start and end positions of tokens, which probably makes it the fastest standard implementation.
It is still slow compared to the other algorithms in this crate, but those operate VERY differently, and again I'm not sure it's possible to emulate the exact probability distribution suggested in the paper.
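For readers unfamiliar with the "standard" algorithm referred to here, a naive reference version looks roughly like this (my sketch, not encode_into_bitfield, which obtains the same merge order with a heap and a bitfield instead of this quadratic rescan):

```rust
use std::collections::HashMap;

// Naive "standard" BPE: repeatedly merge the adjacent pair with the
// best (lowest) merge rank, leftmost pair winning ties, until no
// adjacent pair has a rank.
fn bpe_toy(text: &str, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    let mut parts: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // (rank, index) tuples compare lexicographically, so `min`
        // picks the lowest rank and, among equal ranks, the leftmost.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => return parts,
        }
    }
}
```

BPE-dropout as described in the paper would discard each candidate merge in this loop with probability p, which is why the original algorithm (and the linked Python implementation) is so slow.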

The problem with the algorithm in the paper is that it is VERY slow.
And the dropout implementation I found here: https://github.com/VProv/BPE-Dropout/blob/master/bpe.py#L98 is just as bad (or maybe even worse)?

So maybe it is good enough to pick a different randomization process which follows the idea of the paper in spirit?
I don't have the time right now to do this research myself, but I'm happy to review any proposals.
It would also make sense IMO to devise some procedure which plots the "quality" of the BPE as a function of the dropout value and compare that graph with the original algorithm.
I mean, you probably want some kind of proof that this actually has the desired properties.

Author

@marinegor marinegor Feb 26, 2026


@aneubeck thanks for the explanation, that's actually very helpful. I guess the only thing that matters is being able to drop some merges before actually building the tokenization.

Could you have a look at the updated approach? I've changed the approach I had before (which I think was very wrong): now a "best" token is only considered if it is not in forbidden_tokens, which is constructed prior to tokenization. My only worry is the single-byte tokens: I'm not sure how they're handled, and I wouldn't want to discard them from the allowed tokens, but I'm unsure how to handle that properly. I'm talking about this line:

...
& (!(forbidden_tokens_set.contains(&m.value())) | ((m.end() - m.start()) == 1))
...

I'm not sure if the second condition should be present or not, basically.

Collaborator


Thanks for the changes!
I made some more improvements to your code here: c215210.
Can you copy them into your branch?
(I changed the way the rng is passed in, since this should make it easier to use IMO).

There was a little bug with how you treated tokens which started at the beginning of the text (you didn't filter larger tokens out there...).
I updated the test to detect this.

I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!
In the process I also found a way to speed things up by another 20-30% by going through the text in reverse order and using the reverse Aho-Corasick lookup table. This way one can avoid the final reversal of the token output, which improves throughput further.

It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).
And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

On my MacBook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.
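The drop rule described here (uniformly drop multi-byte edges, always keep one-byte tokens so the graph stays connected) can be sketched as follows. This is my dependency-free illustration with a tiny splitmix64 generator; the names are not the crate's API, and the PR itself takes the RNG from the caller via the rand crate:

```rust
// Tiny seeded PRNG (splitmix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^= z >> 31;
        // 53 uniform bits mapped into [0, 1)
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Decide whether a matched token spanning text[start..end] survives.
// Edges longer than one byte are dropped with probability `dropout`;
// one-byte edges are exempt so every position stays reachable.
fn keep_edge(rng: &mut SplitMix64, dropout: f64, start: usize, end: usize) -> bool {
    assert!((0.0..=1.0).contains(&dropout));
    end - start == 1 || rng.next_f64() >= dropout
}
```

With dropout = 1.0 only the single-byte edges survive, so the encoding degenerates to one token per byte; with dropout = 0.0 every edge survives and the result equals the minimal encoding.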

Author


thanks for the changes as well!

> Can you copy them into your branch?
> (I changed the way the rng is passed in, since this should make it easier to use IMO).

Done, and thanks; I have practically no experience with RNGs, so I appreciate it.

> I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!

I'm thinking it might be technically different from the paper: from how I read their algorithm, once a token has been dropped it cannot appear in the tokenization again, while in your implementation it may reappear later in the text. But in all fairness, they also split by words first, which at the sentence level makes things equivalent to your implementation.

> It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).

Will do!

> And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

I'll spend some time playing around with a toy example (with an a b ab dictionary), and then update the README / tests with that.

> On my MacBook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.

that's pretty cool :)

Author


@aneubeck I've added some explanation and updated the README slightly.

I'm running benchmarks now -- I guess it's simply cargo criterion and cd scripts && ./copy-results, right?

Also, I'm running them on an M4 -- should I update the description in the README accordingly, or would you prefer to run it on your machine?


Development

Successfully merging this pull request may close these issues.

Help with implementing dropout

3 participants