
⚡️ Speed up function text_to_word_sequence by 45% #21

Open
codeflash-ai[bot] wants to merge 1 commit into master from codeflash/optimize-text_to_word_sequence-maxn1e67


codeflash-ai bot commented on May 21, 2025

📄 45% (0.45x) speedup for text_to_word_sequence in keras/src/legacy/preprocessing/text.py

⏱️ Runtime: 1.16 milliseconds → 805 microseconds (best of 259 runs)
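
The headline figure can be reproduced from the two runtimes. The sketch below assumes the speedup is defined as (original - optimized) / optimized; this report does not state the formula, so treat that as an assumption. With the rounded runtimes shown here the result is about 0.44, which is consistent with the reported 45% once rounding of the displayed values is taken into account.

```python
# Rounded runtimes as displayed in this report.
original_us = 1160.0   # 1.16 milliseconds
optimized_us = 805.0   # 805 microseconds

# Assumed definition: time saved, measured relative to the optimized runtime.
speedup = (original_us - optimized_us) / optimized_us
print(f"{speedup:.2f}x ({speedup:.0%})")
```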

📝 Explanation and details

Here’s an optimized version of your text_to_word_sequence function.
The main improvements are:

  • Eliminate building the translate dict on every call: Instead, cache the translation table (str.maketrans) for the default filters and split values.
  • Use filter to drop empty strings: filter(None, ...) does the emptiness check in C, which is typically faster than the equivalent list comprehension.
  • Short-circuit the split step for a single space: When split is a space (the default), str.split() without an argument splits on runs of any whitespace and never yields empty strings, removing the need for post-filtering.
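
The diff itself is not reproduced in this description, so the following is only a minimal sketch of how the three points above could fit together. The names `reference_impl`, `cached_impl`, `DEFAULT_FILTERS` and `_DEFAULT_TABLE` are hypothetical and written for this example; `reference_impl` is a stand-in for the pre-optimization behavior as described here, not a copy of the Keras source.

```python
# Default filter characters, as used by the generated tests in this PR.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Hypothetical module-level cache: built once at import time, not per call.
_DEFAULT_TABLE = str.maketrans({c: " " for c in DEFAULT_FILTERS})


def reference_impl(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Stand-in for the behavior before the optimization (assumption)."""
    if lower:
        text = text.lower()
    table = str.maketrans({c: split for c in filters})  # rebuilt on every call
    return [w for w in text.translate(table).split(split) if w]


def cached_impl(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Sketch of the described optimizations (not the PR's actual code)."""
    if lower:
        text = text.lower()
    is_default = filters == DEFAULT_FILTERS and split == " "
    if is_default:
        table = _DEFAULT_TABLE  # reuse the cached table
    else:
        table = str.maketrans({c: split for c in filters})
    text = text.translate(table)
    if is_default:
        # Fast path: split() collapses whitespace runs, so no empty strings
        # need filtering. Note that it also splits on other whitespace such
        # as "\r", which split(" ") does not.
        return text.split()
    return list(filter(None, text.split(split)))


print(cached_impl("Hello, world!"))
print(cached_impl("a--b--c", filters="", split="--"))
```

In this sketch the fast path is limited to the all-default case so that custom `filters` values keep the explicit `split(" ")` semantics; whether the PR draws the line in the same place is not stated here.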

Key optimizations:

  • The translation table and comprehension for the default case are only built once.
  • The main work of splitting and filtering is handled by the C-optimized str.split() when the split char is a space.
  • For non-default cases, extra processing is minimized.
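
One detail that may be worth checking in review is how closely `str.split()` matches `str.split(" ")` followed by dropping empty strings. The short example below uses only built-in string methods; the variable names are illustrative. The two approaches agree when spaces are the only whitespace left after translation, but they differ for whitespace that is not in the default filters, such as a carriage return.

```python
plain = "a  b c"
with_cr = "a  b\rc"

# Explicit single-space separator: adjacent spaces yield empty strings,
# which must be filtered out; "\r" is not treated as a separator.
explicit_plain = [w for w in plain.split(" ") if w]
explicit_cr = [w for w in with_cr.split(" ") if w]

# No separator argument: runs of any whitespace act as one separator and
# no empty strings are produced, so "\r" also splits the text.
implicit_plain = plain.split()
implicit_cr = with_cr.split()

print(explicit_plain, implicit_plain)
print(explicit_cr, implicit_cr)
```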

This should cut the runtime and memory allocations, especially for repeated calls using default parameters.
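
To get a rough feel for where the savings come from, a micro-benchmark along the following lines can be run locally. It is a sketch using the standard `timeit` module, with hypothetical helper names; it is not the harness that produced the numbers in this report, and absolute timings will vary by machine.

```python
import timeit

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TEXT = "Hello, world! This is a test... " * 50

# Built once, outside the timed functions.
CACHED_TABLE = str.maketrans({c: " " for c in FILTERS})


def rebuild_each_call():
    # Builds the dict and translation table on every call, then filters.
    table = str.maketrans({c: " " for c in FILTERS})
    return [w for w in TEXT.lower().translate(table).split(" ") if w]


def reuse_cached_table():
    # Reuses the prebuilt table and lets split() skip empty strings.
    return TEXT.lower().translate(CACHED_TABLE).split()


t_rebuild = timeit.timeit(rebuild_each_call, number=2000)
t_cached = timeit.timeit(reuse_cached_table, number=2000)
print(f"rebuild: {t_rebuild:.4f}s  cached: {t_cached:.4f}s")
```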

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 75 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests Details
import pytest  # used for our unit tests
# function to test
from keras.src.api_export import keras_export
from keras.src.legacy.preprocessing.text import text_to_word_sequence

# unit tests

# ----------------------------
# BASIC TEST CASES
# ----------------------------

def test_basic_simple_sentence():
    # Basic sentence with punctuation
    codeflash_output = text_to_word_sequence("Hello, world!")

def test_basic_no_punctuation():
    # Sentence with no punctuation
    codeflash_output = text_to_word_sequence("The quick brown fox")

def test_basic_uppercase():
    # Sentence with all uppercase letters
    codeflash_output = text_to_word_sequence("HELLO WORLD")

def test_basic_mixed_case():
    # Sentence with mixed case
    codeflash_output = text_to_word_sequence("PyThOn Is AwEsOmE")

def test_basic_custom_filters():
    # Use a custom filter
    codeflash_output = text_to_word_sequence("a.b,c", filters=".,")

def test_basic_custom_split():
    # Use a custom split character
    codeflash_output = text_to_word_sequence("a-b-c", filters="", split="-")

def test_basic_lower_false():
    # lower=False should preserve case
    codeflash_output = text_to_word_sequence("Hello World", lower=False)

def test_basic_unicode_letters():
    # Unicode letters should be preserved
    codeflash_output = text_to_word_sequence("café naïve façade")

def test_basic_with_tabs_and_newlines():
    # Tabs and newlines are filtered by default
    codeflash_output = text_to_word_sequence("foo\tbar\nbaz")

# ----------------------------
# EDGE TEST CASES
# ----------------------------

def test_edge_empty_string():
    # Empty string should return empty list
    codeflash_output = text_to_word_sequence("")

def test_edge_only_punctuation():
    # Only punctuation should return empty list
    codeflash_output = text_to_word_sequence("!!!...,,,")

def test_edge_only_spaces():
    # Only spaces should return empty list
    codeflash_output = text_to_word_sequence("     ")

def test_edge_leading_and_trailing_spaces():
    # Leading/trailing spaces should not affect result
    codeflash_output = text_to_word_sequence("   hello world   ")

def test_edge_multiple_consecutive_punctuation():
    # Multiple consecutive punctuation marks
    codeflash_output = text_to_word_sequence("hello,,,world!!")

def test_edge_multiple_consecutive_splits():
    # Multiple consecutive split characters
    codeflash_output = text_to_word_sequence("a  b   c")

def test_edge_split_in_filters():
    # If split char is also in filters, splitting should still work
    codeflash_output = text_to_word_sequence("a,b,c", filters=",", split=",")

def test_edge_custom_filter_removes_letters():
    # Custom filter that removes letters
    codeflash_output = text_to_word_sequence("abc", filters="b")

def test_edge_non_string_input():
    # Non-string input should raise AttributeError
    with pytest.raises(AttributeError):
        text_to_word_sequence(1234)

def test_edge_empty_filters():
    # No filters: punctuation is preserved
    codeflash_output = text_to_word_sequence("hello, world!", filters="")

def test_edge_empty_split():
    # Split cannot be empty string, should raise ValueError
    with pytest.raises(ValueError):
        text_to_word_sequence("hello world", split="")


def test_edge_split_is_tab():
    # Split on tab character
    codeflash_output = text_to_word_sequence("a\tb\tc", filters="", split="\t")

def test_edge_non_ascii_filters():
    # Filters with non-ascii characters
    codeflash_output = text_to_word_sequence("a♥b♦c", filters="♥♦")

def test_edge_all_filters_removed():
    # All characters are filters: should return empty list
    codeflash_output = text_to_word_sequence("abc", filters="abc")

def test_edge_long_split():
    # Split is more than one character (should work, as str.split allows it)
    codeflash_output = text_to_word_sequence("a--b--c", filters="", split="--")

def test_edge_split_is_digit():
    # Split on digit character
    codeflash_output = text_to_word_sequence("a1b1c", filters="", split="1")

def test_edge_word_within_filters():
    # Word contains a filter character
    codeflash_output = text_to_word_sequence("e-mail", filters="-")

def test_edge_word_within_split():
    # Word contains split character
    codeflash_output = text_to_word_sequence("e-mail", filters="", split="-")

def test_edge_filters_overlap_split():
    # Filter and split overlap: should not double-remove
    codeflash_output = text_to_word_sequence("a,b,c", filters=",", split=",")

# ----------------------------
# LARGE SCALE TEST CASES
# ----------------------------

def test_large_scale_long_sentence():
    # Very long sentence of repeated words
    long_text = "word " * 999
    expected = ["word"] * 999
    codeflash_output = text_to_word_sequence(long_text.strip())

def test_large_scale_many_unique_words():
    # Sentence with many unique words
    words = [f"word{i}" for i in range(999)]
    text = " ".join(words)
    codeflash_output = text_to_word_sequence(text)

def test_large_scale_long_word():
    # Very long single word
    long_word = "a" * 1000
    codeflash_output = text_to_word_sequence(long_word)

def test_large_scale_with_punctuation():
    # Long sentence with punctuation between every word
    words = [f"word{i}," for i in range(999)]
    text = " ".join(words)
    expected = [f"word{i}" for i in range(999)]
    codeflash_output = text_to_word_sequence(text)

def test_large_scale_custom_split_and_filters():
    # Large text with custom split and filters
    text = "|".join([f"foo!bar?baz{i}" for i in range(500)])
    expected = []
    for i in range(500):
        expected.extend(["foo", "bar", f"baz{i}"])
    # split="|" is needed for the output to match `expected`; with the default
    # split, "|" would stay inside the tokens
    codeflash_output = text_to_word_sequence(text, filters="!?", split="|")

def test_large_scale_unicode():
    # Large text with unicode characters
    words = [f"café{i} naïve{i} façade{i}" for i in range(300)]
    text = " ".join(words)
    expected = []
    for i in range(300):
        expected.extend([f"café{i}", f"naïve{i}", f"façade{i}"])
    codeflash_output = text_to_word_sequence(text)

def test_large_scale_all_filters():
    # Text of only filter characters, repeated
    filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    text = filters * 20
    codeflash_output = text_to_word_sequence(text)

def test_large_scale_alternating_filters_and_words():
    # Alternating filter and word, repeated
    text = ("!word," * 500)
    expected = ["word"] * 500
    codeflash_output = text_to_word_sequence(text)

def test_large_scale_multiple_spaces():
    # Large text with multiple spaces between words
    text = ("word    " * 500).strip()
    expected = ["word"] * 500
    codeflash_output = text_to_word_sequence(text)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest  # used for our unit tests
# function to test
from keras.src.api_export import keras_export
from keras.src.legacy.preprocessing.text import text_to_word_sequence

# unit tests

# --------- BASIC TEST CASES ---------

def test_basic_simple_sentence():
    # Basic sentence, default filters, lowercasing
    codeflash_output = text_to_word_sequence("The quick brown fox.")

def test_basic_no_punctuation():
    # Sentence with no punctuation
    codeflash_output = text_to_word_sequence("Hello world this is a test")

def test_basic_punctuation_removal():
    # Sentence with multiple punctuation marks
    codeflash_output = text_to_word_sequence("Hello, world! This is a test...")

def test_basic_uppercase():
    # Sentence with uppercase letters
    codeflash_output = text_to_word_sequence("HELLO WORLD")

def test_basic_lower_false():
    # lower=False should preserve original casing
    codeflash_output = text_to_word_sequence("HELLO World", lower=False)

def test_basic_custom_filters():
    # Custom filters: remove only '!'
    codeflash_output = text_to_word_sequence("Hello! World!", filters="!")

def test_basic_custom_split():
    # Custom split: use comma instead of space
    codeflash_output = text_to_word_sequence("apple,banana,carrot", split=",")

def test_basic_multiple_spaces():
    # Input with multiple consecutive spaces
    codeflash_output = text_to_word_sequence("hello   world")

def test_basic_tab_and_newline():
    # Input with tabs and newlines
    codeflash_output = text_to_word_sequence("hello\tworld\nfoo")

# --------- EDGE TEST CASES ---------

def test_edge_empty_string():
    # Empty string should return an empty list
    codeflash_output = text_to_word_sequence("")

def test_edge_only_punctuation():
    # String with only punctuation should return empty list
    codeflash_output = text_to_word_sequence("!!!...,,,;;;")

def test_edge_only_spaces():
    # String with only spaces
    codeflash_output = text_to_word_sequence("     ")

def test_edge_only_split_character():
    # String with only split character (e.g. comma)
    codeflash_output = text_to_word_sequence(",,,", split=",")

def test_edge_leading_trailing_spaces():
    # Leading and trailing spaces should be ignored
    codeflash_output = text_to_word_sequence("   hello world   ")

def test_edge_filters_overlap_with_split():
    # Filters and split character overlap (e.g. split is ',' and ',' is in filters)
    codeflash_output = text_to_word_sequence("a,b,c", filters=",", split=",")

def test_edge_filters_remove_split():
    # Filter characters are translated to the split character before splitting,
    # so the text is still split on ','
    codeflash_output = text_to_word_sequence("a,b,c", filters=",", split=",")

def test_edge_unicode_characters():
    # Unicode characters should be preserved unless in filters
    codeflash_output = text_to_word_sequence("café naïve résumé")

def test_edge_unicode_in_filters():
    # Unicode character in filters should be removed
    codeflash_output = text_to_word_sequence("café naïve résumé", filters="é")

def test_edge_non_string_input():
    # Non-string input should raise AttributeError (since .lower() called)
    with pytest.raises(AttributeError):
        text_to_word_sequence(12345)

def test_edge_empty_filters():
    # No filters: punctuation should remain as part of words
    codeflash_output = text_to_word_sequence("hello, world!", filters="")

def test_edge_empty_split():
    # Empty split should raise ValueError
    with pytest.raises(ValueError):
        text_to_word_sequence("abc", split="")

def test_edge_split_is_filter():
    # Split character is also in filters, should still work
    codeflash_output = text_to_word_sequence("a-b-c", filters="-", split="-")

def test_edge_multiple_consecutive_filters():
    # Multiple consecutive filter characters
    codeflash_output = text_to_word_sequence("hello!!!world??")

def test_edge_mixed_whitespace():
    # Mixed whitespace (spaces, tabs, newlines)
    codeflash_output = text_to_word_sequence("hello \t\n world")

def test_edge_split_on_tab():
    # Split on tab character
    codeflash_output = text_to_word_sequence("a\tb\tc", split="\t", filters="")

def test_edge_split_on_newline():
    # Split on newline character
    codeflash_output = text_to_word_sequence("a\nb\nc", split="\n", filters="")

def test_edge_word_with_filter_inside():
    # Word containing filter character inside (should split)
    codeflash_output = text_to_word_sequence("foo-bar-baz", filters="-")

def test_edge_word_with_split_inside():
    # Word containing split character inside, not in filters
    codeflash_output = text_to_word_sequence("foo-bar-baz", filters="", split="-")

def test_edge_multiple_split_characters():
    # Multiple split characters in a row
    codeflash_output = text_to_word_sequence("a,,b,,,c", split=",", filters="")

def test_edge_split_and_filter_are_space():
    # Both split and filter are space (should not break)
    codeflash_output = text_to_word_sequence("a b  c", filters=" ", split=" ")

# --------- LARGE SCALE TEST CASES ---------

def test_large_scale_long_sentence():
    # Long sentence with 1000 words
    sentence = "word " * 1000
    expected = ["word"] * 1000
    codeflash_output = text_to_word_sequence(sentence.strip())

def test_large_scale_large_unique_words():
    # 1000 unique words separated by spaces
    words = [f"word{i}" for i in range(1000)]
    sentence = " ".join(words)
    codeflash_output = text_to_word_sequence(sentence)

def test_large_scale_long_punctuation():
    # 1000 punctuation marks between words
    sentence = "word" + ("!" * 1000) + "test"
    codeflash_output = text_to_word_sequence(sentence)

def test_large_scale_many_split_chars():
    # 999 split characters between two words
    sentence = "foo" + ("," * 999) + "bar"
    codeflash_output = text_to_word_sequence(sentence, split=",", filters="")

def test_large_scale_mixed_filters_and_words():
    # 500 words, each followed by a punctuation mark
    words = [f"word{i}" for i in range(500)]
    sentence = " ".join(w + "!" for w in words)
    codeflash_output = text_to_word_sequence(sentence)

def test_large_scale_unicode_words():
    # 500 unicode words
    words = [f"naïve{i}" for i in range(500)]
    sentence = " ".join(words)
    codeflash_output = text_to_word_sequence(sentence)

def test_large_scale_large_filters():
    # Use a large filters string (all ASCII printable except letters and digits)
    import string
    filters = ''.join(c for c in string.printable if not c.isalnum() and c not in " ")
    sentence = "word1" + filters + "word2"
    codeflash_output = text_to_word_sequence(sentence, filters=filters)

def test_large_scale_performance():
    # Not a true performance test, but ensures function works with large input
    sentence = "word " * 999 + "end"
    codeflash_output = text_to_word_sequence(sentence)
    result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
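
The generated tests above store each result in `codeflash_output` and leave the comparison between the original and optimized code to the tool, so no expected values appear in the test bodies. For readers who want concrete outputs, the sketch below asserts a few of the same cases against a local stand-in. `legacy_text_to_word_sequence` is a hypothetical helper written for this example; it assumes the legacy behavior is lowercase, translate each filter character to the split string, split, then drop empty strings.

```python
def legacy_text_to_word_sequence(
    input_text,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ",
):
    # Local stand-in (assumption), not imported from Keras.
    if lower:
        input_text = input_text.lower()
    table = str.maketrans({c: split for c in filters})
    return [w for w in input_text.translate(table).split(split) if w]


# (positional args, keyword args, expected output)
cases = [
    (("Hello, world!",), {}, ["hello", "world"]),
    (("foo\tbar\nbaz",), {}, ["foo", "bar", "baz"]),
    (("a,b,c",), {"filters": ",", "split": ","}, ["a", "b", "c"]),
    (("e-mail",), {"filters": "-"}, ["e", "mail"]),
    (("hello, world!",), {"filters": ""}, ["hello,", "world!"]),
    (("!!!...,,,",), {}, []),
]

for args, kwargs, expected in cases:
    assert legacy_text_to_word_sequence(*args, **kwargs) == expected
print(f"{len(cases)} cases passed")
```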

To edit these changes, run git checkout codeflash/optimize-text_to_word_sequence-maxn1e67 and push.

codeflash-ai bot added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) on May 21, 2025
codeflash-ai bot requested a review from HeshamHM28 on May 21, 2025, 07:45
