Feature Request: Improved subtitle segmentation using language-aware post-processing

### Summary

Improve the quality of automatically generated subtitles by introducing a language-aware post-processing step that refines block segmentation and line breaks.

The goal is to move beyond purely rule-based splitting and achieve more natural, readable subtitles that better follow linguistic structure.

---

### Background

Current subtitle generation is primarily driven by deterministic rules such as:

- character limits  
- reading speed  
- timing and pauses  

While this ensures technically valid subtitles, it often leads to:

- unnatural line breaks  
- splitting of phrases that should stay together  
- suboptimal grouping of text into blocks  
- reduced readability, especially for longer or more complex speech  

High-quality subtitle segmentation requires an understanding of language, not just timing and length.

---

### Proposed Functionality

Introduce a refinement step that:

- reviews an already generated subtitle segmentation  
- adjusts block boundaries where appropriate  
- improves line breaks within blocks  
- takes into account both timing and linguistic structure  

This step should work conservatively:

- existing segmentation should be kept if already acceptable  
- changes should only be made where there is clear improvement  
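
One way to make "conservative" concrete is to accept a proposed segmentation only when it beats the existing one by a clear margin. The sketch below assumes a readability scoring function exists; `score`, `min_gain`, and `refine_conservatively` are hypothetical names, and how the score is computed is deliberately left open.

```python
from typing import Callable, List

# Assumption: a segmentation is a list of blocks, each block a list of lines.
Segmentation = List[List[str]]


def refine_conservatively(
    current: Segmentation,
    candidate: Segmentation,
    score: Callable[[Segmentation], float],
    min_gain: float = 0.1,
) -> Segmentation:
    """Keep the existing segmentation unless the candidate is clearly better.

    `score` is a hypothetical readability metric (higher is better);
    `min_gain` is the margin a candidate must win by to be accepted.
    """
    if score(candidate) - score(current) >= min_gain:
        return candidate
    return current
```

With a margin like this, near-ties fall back to the existing output, so an already acceptable segmentation is left alone.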

---

### Suggested Approach

To achieve language-aware segmentation, this feature may leverage a language model (e.g. an LLM or similar) capable of evaluating linguistic structure in context.

The model would act as a refinement layer on top of the existing rule-based segmentation, focusing on improving readability rather than generating new text.

The exact implementation is intentionally left open.

---

### Important Constraints

- The original transcription must remain unchanged  
- No words may be added, removed, or altered  
- Only segmentation (blocks and line breaks) may be adjusted  
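
These constraints can be checked mechanically, whatever produces the refinement. As a minimal sketch, assuming a segmentation is represented as a list of blocks with each block a list of lines (the function names are hypothetical), the word sequence before and after refinement must be identical:

```python
from typing import List

# Assumption: a segmentation is a list of blocks, each block a list of lines.
Segmentation = List[List[str]]


def words(blocks: Segmentation) -> List[str]:
    """Flatten a segmentation into its whitespace-separated words, in order."""
    return [word for block in blocks for line in block for word in line.split()]


def text_is_unchanged(original: Segmentation, refined: Segmentation) -> bool:
    """True if only block and line boundaries differ, not the words themselves."""
    return words(original) == words(refined)
```

A refinement that fails this check could simply be discarded in favour of the original segmentation.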

---

### Linguistic Guidelines

The refinement step should aim to follow established subtitle and readability principles, including:

- Keep syntactic units together where possible  
- Avoid splitting:
  - verb + particle (e.g. Swedish "ta upp" (bring up), "gå igenom" (go through))  
  - auxiliary + main verb  
  - prepositional phrases  
  - names and fixed expressions  
- Prefer breaks at natural clause or sentence boundaries  
- Avoid leaving very short trailing lines  
- Aim for balanced line lengths within a block  
- Ensure each subtitle block forms a coherent and readable unit  
- Use pauses in speech as guidance for segmentation, but not as the sole deciding factor  
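
To illustrate how a few of these guidelines interact, here is a deliberately small rule-based sketch that picks a line break inside one block. It is not the proposed language-aware step: `KEEP_TOGETHER` is a hypothetical hand-written list standing in for real linguistic knowledge, punctuation is used as a crude proxy for clause boundaries, and the penalty weights and the 42-character limit are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for linguistic knowledge (verb + particle pairs).
KEEP_TOGETHER = {("ta", "upp"), ("gå", "igenom")}


def best_line_break(words: List[str], max_chars: int = 42) -> Optional[Tuple[str, str]]:
    """Split words into two lines, or return None if no split fits max_chars."""
    best: Optional[Tuple[str, str]] = None
    best_cost: Optional[int] = None
    for i in range(1, len(words)):
        first, second = " ".join(words[:i]), " ".join(words[i:])
        if len(first) > max_chars or len(second) > max_chars:
            continue
        cost = abs(len(first) - len(second))  # prefer balanced lines
        if (words[i - 1].lower(), words[i].lower()) in KEEP_TOGETHER:
            cost += 100  # strongly avoid splitting a protected pair
        if len(second) < 5:
            cost += 20  # avoid a very short trailing line
        if words[i - 1][-1] in ",.;:?!":
            cost -= 10  # prefer breaking after punctuation
        if best_cost is None or cost < best_cost:
            best, best_cost = (first, second), cost
    return best
```

For the words of "Imorgon ska vi gå igenom rapporten", pure length balancing would break between "gå" and "igenom"; the penalty moves the break one word earlier. A language model would replace the hand-written list and weights with contextual judgement.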

---

### Expected Improvements

- More natural line breaks  
- Better grouping of words into readable units  
- Reduced need for manual editing  
- Subtitles that align better with established captioning practices  

---

### Acceptance Criteria

- Subtitles remain technically valid (timing, line limits, etc.)  
- No changes to the underlying text content  
- Improved readability compared to the current output for the same transcript  
- Clear reduction in the manual corrections needed in the editor  
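
The first criterion can be verified automatically after refinement. The following is a minimal sketch under stated assumptions: the `TimedBlock` type is hypothetical, and the limits (2 lines, 42 characters per line, 17 characters per second) are placeholder defaults rather than values taken from the existing pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedBlock:
    start: float  # seconds
    end: float  # seconds
    lines: List[str]


def validity_problems(
    block: TimedBlock,
    max_lines: int = 2,
    max_chars: int = 42,
    max_cps: float = 17.0,
) -> List[str]:
    """Return a list of rule violations; an empty list means the block is valid."""
    problems: List[str] = []
    if len(block.lines) > max_lines:
        problems.append("too many lines")
    if any(len(line) > max_chars for line in block.lines):
        problems.append("line too long")
    duration = block.end - block.start
    if duration <= 0:
        problems.append("non-positive duration")
    elif sum(len(line) for line in block.lines) / duration > max_cps:
        problems.append("reading speed too high")
    return problems
```

Running a check like this on every refined block, with the pipeline's actual limits, would let the refinement step fall back to the original segmentation whenever a change breaks a technical rule.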

---

### Notes

This feature is intended as a quality improvement layer on top of the existing subtitle generation pipeline, not a replacement for it.

Implementation details (e.g. model choice, architecture, prompt design) are intentionally left open.
