Feature Request: Improved subtitle segmentation using language-aware post-processing #29

@erklu

Summary

Improve the quality of automatically generated subtitles by introducing a language-aware post-processing step that refines block segmentation and line breaks.

The goal is to move beyond purely rule-based splitting and achieve more natural, readable subtitles that better follow linguistic structure.


Background

Current subtitle generation is primarily driven by deterministic rules such as:

  • character limits
  • reading speed
  • timing and pauses

While this ensures technically valid subtitles, it often leads to:

  • unnatural line breaks
  • splitting of phrases that should stay together
  • suboptimal grouping of text into blocks
  • reduced readability, especially for longer or more complex speech

High-quality subtitle segmentation requires an understanding of language, not just timing and length.


Proposed Functionality

Introduce a refinement step that:

  • reviews an already generated subtitle segmentation
  • adjusts block boundaries where appropriate
  • improves line breaks within blocks
  • takes into account both timing and linguistic structure

This step should work conservatively:

  • existing segmentation should be kept if already acceptable
  • changes should only be made where there is clear improvement
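
One way to read the "conservative" requirement is as an acceptance threshold: a proposed segmentation replaces the existing one only if it scores clearly better. The sketch below assumes a scoring function where higher means more readable; `choose`, `score`, and `min_gain` are hypothetical names for illustration, not part of the existing pipeline.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def choose(original: T, candidate: T,
           score: Callable[[T], float],
           min_gain: float = 0.1) -> T:
    """Keep the original unless the candidate beats it by at least min_gain.

    `score` is an assumed readability metric (higher is better);
    `min_gain` is an arbitrary illustrative margin.
    """
    if score(candidate) - score(original) >= min_gain:
        return candidate
    return original
```

With a margin like this, ties and marginal gains leave the rule-based output untouched, which keeps the refinement step from churning segmentations that are already acceptable.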

Suggested Approach

To achieve language-aware segmentation, this feature may leverage a language model (e.g. an LLM or similar) capable of evaluating linguistic structure in context.

The model would act as a refinement layer on top of the existing rule-based segmentation, focusing on improving readability rather than generating new text.

The exact implementation is intentionally left open.


Important Constraints

  • The original transcription must remain unchanged
  • No words may be added, removed, or altered
  • Only segmentation (blocks and line breaks) may be adjusted
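
These constraints can be enforced mechanically, regardless of how the refinement itself is implemented. A minimal sketch, assuming each subtitle block is represented as a list of line strings (a hypothetical representation; the actual data model may differ): the refined output is accepted only if its word sequence is identical to the original's.

```python
from typing import List

# Assumed representation: one subtitle block is a list of its lines.
Block = List[str]

def words_of(blocks: List[Block]) -> List[str]:
    """Flatten blocks into their sequence of whitespace-separated words."""
    return [w for block in blocks for line in block for w in line.split()]

def preserves_text(original: List[Block], refined: List[Block]) -> bool:
    """True if only block boundaries and line breaks differ."""
    return words_of(original) == words_of(refined)

original = [["Vi ska ta", "upp det"], ["senare."]]
refined = [["Vi ska ta upp", "det senare."]]
print(preserves_text(original, refined))
```

If the check fails, the refinement result can simply be discarded and the original segmentation kept.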

Linguistic Guidelines

The refinement step should aim to follow established subtitle and readability principles, including:

  • Keep syntactic units together where possible
  • Avoid splitting:
    • verb + particle (e.g. "ta upp", "gå igenom")
    • auxiliary + main verb
    • prepositional phrases
    • names and fixed expressions
  • Prefer breaks at natural clause or sentence boundaries
  • Avoid leaving very short trailing lines
  • Aim for balanced line lengths within a block
  • Ensure each subtitle block forms a coherent and readable unit
  • Use pauses in speech as guidance for segmentation, but not as the sole deciding factor
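
To make a few of these guidelines concrete, here is a rough sketch of a cost function for choosing the line break inside a two-line block. It is a plain heuristic, not the language-model approach suggested above, and every name, weight, and threshold in it (`KEEP_TOGETHER`, `break_cost`, `best_break`, the 42-character limit, the penalties) is an assumption made for illustration.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical hand-written list of word pairs that should stay on one line.
KEEP_TOGETHER: Set[Tuple[str, str]] = {("ta", "upp"), ("gå", "igenom")}

def break_cost(words: List[str], i: int, max_len: int = 42) -> Optional[float]:
    """Cost of breaking before words[i]; None if a line would be too long."""
    first, second = " ".join(words[:i]), " ".join(words[i:])
    if len(first) > max_len or len(second) > max_len:
        return None
    cost = float(abs(len(first) - len(second)))   # prefer balanced lines
    if (words[i - 1].lower(), words[i].lower()) in KEEP_TOGETHER:
        cost += 100                               # avoid splitting verb + particle
    if words[i - 1][-1] in ",.;:?!":
        cost -= 10                                # prefer breaks after punctuation
    if len(second) < 5:
        cost += 20                                # avoid very short trailing lines
    return cost

def best_break(words: List[str], max_len: int = 42) -> Optional[int]:
    """Index of the word that should start line two, or None if no break fits."""
    candidates = []
    for i in range(1, len(words)):
        cost = break_cost(words, i, max_len)
        if cost is not None:
            candidates.append((cost, i))
    return min(candidates)[1] if candidates else None

words = "Vi ska ta upp det".split()
i = best_break(words)
print(" ".join(words[:i]), "/", " ".join(words[i:]))
```

In this example the most balanced split would fall between "ta" and "upp"; the keep-together penalty moves the break one word earlier. A language model could take the place of the hand-written pair list and weights, with the same interface.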

Expected Improvements

  • More natural line breaks
  • Better grouping of words into readable units
  • Reduced need for manual editing
  • Subtitles that align better with established captioning practices

Acceptance Criteria

  • Subtitles remain technically valid (timing, line limits, etc.)
  • No changes to the underlying text content
  • Improved readability compared to current output
  • Clear reduction in manual corrections needed in the editor
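
The first criterion can be checked automatically after refinement. The sketch below assumes a simple block structure with start and end times in seconds; the `Block` class, the `violations` function, and the default limits (2 lines, 42 characters per line, 17 characters per second) are illustrative assumptions and should be replaced by whatever limits the existing pipeline enforces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    start: float       # seconds
    end: float         # seconds
    lines: List[str]

def violations(blocks: List[Block], max_lines: int = 2,
               max_chars: int = 42, max_cps: float = 17.0) -> List[str]:
    """Return human-readable descriptions of rule violations (empty if valid)."""
    problems: List[str] = []
    prev_end = 0.0
    for n, b in enumerate(blocks, start=1):
        duration = b.end - b.start
        if duration <= 0:
            problems.append(f"block {n}: non-positive duration")
        elif sum(len(line) for line in b.lines) / duration > max_cps:
            problems.append(f"block {n}: reading speed too high")
        if b.start < prev_end:
            problems.append(f"block {n}: overlaps previous block")
        if len(b.lines) > max_lines:
            problems.append(f"block {n}: too many lines")
        if any(len(line) > max_chars for line in b.lines):
            problems.append(f"block {n}: line too long")
        prev_end = b.end
    return problems

print(violations([Block(0.0, 2.0, ["Vi ska ta upp", "det senare."])]))
```

Running the same check on the output before and after refinement makes it possible to confirm that the refinement step never introduces a violation the rule-based stage had avoided.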

Notes

This feature is intended as a quality improvement layer on top of the existing subtitle generation pipeline, not a replacement of it.

Implementation details (e.g. model choice, architecture, prompt design) are intentionally left open.
