Maven plugin for automatically translating Markdown files using LLM (Large Language Model) engines. The plugin recursively traverses directories, identifies Markdown files, and translates them into one or multiple target languages while preserving formatting and structure.
- Automatic Translation: Translates Markdown files using configurable LLM providers (OpenAI, Anthropic)
- Incremental Updates: Only translates files that have changed since the last translation
- Git Integration: Tracks file changes using Git history for intelligent incremental updates
- Link Validation: Validates internal links before translation to prevent broken references
- Parallel Processing: Configurable parallelism for faster batch translations
- Dry Run Mode: Preview what would be translated without making changes
- Custom Instructions: Per-directory translation instructions via `.comenius-instructions` files
- Large Document Splitting: Automatically splits documents exceeding 32kB at heading boundaries for translation
- Section-Based Incremental Updates: Incremental translations compare document sections by hash and only retranslate changed sections
- Heading Structure Validation: Validates that LLM output preserves the source heading structure
- Cross-Document Anchor Correction: Automatically fixes stale anchor references in other translated files after retranslation
- Fuzzy Anchor Matching: Uses Levenshtein distance and token overlap for anchor correction in translated documents
Add the plugin to your pom.xml:
```xml
<plugin>
    <groupId>one.edee.oss</groupId>
    <artifactId>comenius-maven-plugin</artifactId>
    <version>1.0.1-SNAPSHOT</version>
    <configuration>
        <llmProvider>openai</llmProvider>
        <llmUrl>https://api.openai.com/v1</llmUrl>
        <llmToken>${env.OPENAI_API_KEY}</llmToken>
        <llmModel>gpt-4o</llmModel>
        <sourceDir>docs/en</sourceDir>
        <targets>
            <target>
                <locale>de</locale>
                <targetDir>docs/de</targetDir>
            </target>
            <target>
                <locale>fr</locale>
                <targetDir>docs/fr</targetDir>
            </target>
            <target>
                <locale>es</locale>
                <targetDir>docs/es</targetDir>
            </target>
        </targets>
    </configuration>
</plugin>
```

The plugin provides four actions via the `comenius.action` parameter:
| Action | Description |
|---|---|
| `show-config` | Displays current plugin configuration (default) |
| `check` | Validates files - checks Git status and link validity |
| `translate` | Executes the translation workflow |
| `fix-links` | Corrects links in all translated files |
| Parameter | Property | Default | Description |
|---|---|---|---|
| `action` | `comenius.action` | `show-config` | Action to perform |
| `llmProvider` | `comenius.llmProvider` | `openai` | LLM provider: `openai` or `anthropic` |
| `llmUrl` | `comenius.llmUrl` | - | LLM API endpoint URL |
| `llmToken` | `comenius.llmToken` | - | API authentication token |
| `llmModel` | `comenius.llmModel` | `gpt-4o` | Model name to use |
| `sourceDir` | `comenius.sourceDir` | - | Source directory containing files to translate |
| `fileRegex` | `comenius.fileRegex` | `(?i).*\.md` | Regex pattern to match files |
| `targets` | `comenius.targets` | - | List of target languages and directories |
| `limit` | `comenius.limit` | `2147483647` | Maximum number of files to process |
| `dryRun` | `comenius.dryRun` | `false` | When true, simulates without writing |
| `parallelism` | `comenius.parallelism` | `4` | Number of parallel translation threads |
| `excludedFilePatterns` | `comenius.excludedFilePatterns` | - | List of regex patterns to exclude directories/files |
| `translatableFrontMatterFields` | `comenius.translatableFrontMatterFields` | - | Front matter fields to translate (e.g., `title`, `perex`) |
| `customFrontMatter` | `comenius.customFrontMatter` | - | Custom key-value pairs to add to translated files' front matter |
Follow this step-by-step approach when setting up translations for your project:
First, check that your configuration is correct:
```bash
mvn comenius:run -Dcomenius.action=show-config
```

This displays all configured parameters and warns about missing required values.
Before translating, validate that all source files are properly committed and links are valid:
```bash
mvn comenius:run -Dcomenius.action=check
```

The check action verifies:
- All matched files are committed to Git (no uncommitted changes)
- All internal links point to existing files
- No broken references that would cause issues in translations
Fix any reported errors before proceeding.
Preview what would be translated without making any changes:
```bash
mvn comenius:run -Dcomenius.action=translate -Dcomenius.dryRun=true
```

This shows:
- New files: Files that don't exist in the target directory
- Files to update: Files that have changed since last translation
- Skipped files: Files that are already up-to-date
Test the translation with a small number of files first:
```bash
mvn comenius:run -Dcomenius.action=translate -Dcomenius.limit=3
```

Review the translated files to ensure quality meets your standards before proceeding with a full translation.
Once satisfied with the test results, run the full translation:
```bash
mvn comenius:run -Dcomenius.action=translate
```

Integrate the plugin into your CI/CD pipeline for continuous translation of documentation.
GitHub Actions:

```yaml
name: Translate Documentation

on:
  push:
    branches: [ main ]
    paths:
      - 'docs/en/**'

jobs:
  translate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for Git history

      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17'
          distribution: 'temurin'

      - name: Check documentation
        run: |
          mvn comenius:run \
            -Dcomenius.action=check \
            -Dcomenius.sourceDir=docs/en

      - name: Translate documentation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          mvn comenius:run \
            -Dcomenius.action=translate \
            -Dcomenius.sourceDir=docs/en \
            -Dcomenius.llmUrl=https://api.openai.com/v1 \
            -Dcomenius.llmToken=$OPENAI_API_KEY \
            -Dcomenius.dryRun=false

      - name: Commit translations
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add docs/de docs/fr docs/es
          git diff --staged --quiet || git commit -m "chore: update translations"
          git push
```

GitLab CI:

```yaml
translate-docs:
  stage: build
  image: maven:3.9-eclipse-temurin-17
  only:
    changes:
      - docs/en/**
  script:
    - mvn comenius:run -Dcomenius.action=check -Dcomenius.sourceDir=docs/en
    - mvn comenius:run
        -Dcomenius.action=translate
        -Dcomenius.sourceDir=docs/en
        -Dcomenius.llmUrl=https://api.openai.com/v1
        -Dcomenius.llmToken=$OPENAI_API_KEY
        -Dcomenius.dryRun=false
  artifacts:
    paths:
      - docs/de/
      - docs/fr/
      - docs/es/
```

You can provide per-directory translation instructions using special instruction files:
Create a .comenius-instructions file in any directory containing custom instructions for the translation. The file
directly contains the instruction text that will be passed to the LLM.
Instructions accumulate as the traverser descends into subdirectories, allowing you to:
- Define project-wide instructions at the root
- Add topic-specific instructions in subdirectories
Use .comenius-instructions.replace instead to reset instruction accumulation and start fresh with only the
instructions in that file.
```
docs/en/
├── .comenius-instructions              # Contains project-wide glossary and style guide
├── getting-started/
│   ├── .comenius-instructions          # Contains API-specific terminology
│   └── quickstart.md                   # Translated with root + getting-started instructions
└── advanced/
    ├── .comenius-instructions.replace  # Resets and contains advanced-only instructions
    └── architecture.md                 # Translated with only advanced instructions
```
A .comenius-instructions file might contain:
```
Use the following terminology consistently:
- "evitaDB" - never translate, always keep as-is
- "entity" -> "Entität" (German)
- "attribute" -> "Attribut" (German)

Style guidelines:
- Use formal "Sie" form in German translations
- Keep code examples unchanged
- Preserve all markdown formatting
```
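
The accumulation rule described above can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the function name `collect_instructions` is hypothetical, and the choice that a `.replace` file wins when both files exist in the same directory is an assumption.

```python
from pathlib import Path

INSTRUCTIONS = ".comenius-instructions"
REPLACE = ".comenius-instructions.replace"


def collect_instructions(root: Path, directory: Path) -> list[str]:
    """Return the instruction texts that apply to files in `directory`.

    Walks from `root` down to `directory`. A `.replace` file discards
    everything accumulated from the parent directories (assumed behavior).
    """
    accumulated: list[str] = []
    current = root
    # The empty first element stands for the root directory itself.
    for part in ("", *directory.relative_to(root).parts):
        if part:
            current = current / part
        replace_file = current / REPLACE
        append_file = current / INSTRUCTIONS
        if replace_file.is_file():
            accumulated = [replace_file.read_text(encoding="utf-8").strip()]
        elif append_file.is_file():
            accumulated.append(append_file.read_text(encoding="utf-8").strip())
    return accumulated
```

For the directory tree shown earlier, this sketch would return the root and `getting-started` instructions for `quickstart.md`, and only the `advanced` instructions for `architecture.md`.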
By default, YAML front matter at the beginning of Markdown files is not translated. This includes fields like
author, date, motive, and other metadata that should remain unchanged.
However, some front matter fields contain user-facing text that should be translated, such as title, perex, or
description. You can configure which fields should be translated using translatableFrontMatterFields.
```xml
<configuration>
    <translatableFrontMatterFields>
        <field>title</field>
        <field>perex</field>
        <field>description</field>
    </translatableFrontMatterFields>
</configuration>
```

Source file (English):

```markdown
---
title: Getting Started
perex: Learn how to set up and configure your first project
author: John Doe
date: 2024-01-15
---
# Getting Started
...
```

Translated file (German):

```markdown
---
title: Erste Schritte
perex: Erfahren Sie, wie Sie Ihr erstes Projekt einrichten und konfigurieren
author: John Doe
date: 2024-01-15
commit: abc123def456
---
# Erste Schritte
...
```

Note that:
- Only `title` and `perex` are translated (the configured fields present in this file)
- `author` and `date` remain unchanged
- The `commit` field is automatically added to track the source version
| Field | Description |
|---|---|
| `title` | Page or article title |
| `perex` | Short description or lead paragraph |
| `description` | Meta description for SEO |
| `summary` | Brief content summary |
| `keywords` | SEO keywords (if localized) |
When using incremental translation mode, front matter is always fully retranslated for all configured fields. This is because the plugin cannot safely detect whether changes occurred specifically in front matter fields, so all configured fields are sent to the LLM for translation on every incremental update.
You can add custom key-value pairs to the front matter of all translated files. This is useful for distinguishing translated files from their originals - for example, to display an "auto-translated" disclaimer on your website.
```xml
<configuration>
    <customFrontMatter>
        <translated>true</translated>
        <generator>comenius</generator>
    </customFrontMatter>
</configuration>
```

With the configuration above, translated files will include the custom properties:

```markdown
---
title: Erste Schritte
author: John Doe
translated: 'true'
generator: comenius
commit: abc123def456
---
```

Custom properties are applied after source and translated fields but before the system-managed commit field.
The commit field cannot be overridden via custom front matter.
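
The ordering rule can be pictured with a short Python sketch. The function `merge_front_matter` and its signature are hypothetical and only illustrate the precedence described above, using insertion-ordered dictionaries in place of real YAML handling.

```python
def merge_front_matter(
    source: dict[str, str],
    translated: dict[str, str],
    custom: dict[str, str],
    commit: str,
) -> dict[str, str]:
    """Combine front matter in the order: source, translated, custom, commit."""
    merged = dict(source)        # source fields keep their original order
    merged.update(translated)    # translated values replace the source values
    for key, value in custom.items():
        if key != "commit":      # custom entries must not override `commit`
            merged[key] = value
    merged.pop("commit", None)   # re-insert so that `commit` always comes last
    merged["commit"] = commit
    return merged
```

Calling it with the values from the example above yields the keys `title`, `author`, `translated`, `generator`, `commit`, in that order.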
The fix-links action corrects links in all translated files without performing new translations. This is useful for:
- Fixing links after manual edits to translated files
- Re-running link correction after source file structure changes
- Batch-correcting links across all target directories
```bash
mvn comenius:run -Dcomenius.action=fix-links
```

- Asset links - Relative paths to images, PDFs, and other assets are recalculated from the target directory to the source assets
- Anchor links - Internal anchors (e.g., `#section-title`) are corrected using a two-phase fuzzy matching algorithm (see below)
- Front matter links - Links in both translatable and non-translatable front matter fields are corrected. Translatable fields receive full markdown link correction. Non-translatable fields are checked for file path values and corrected if the resolved file exists in the source directory.
Anchor references need correction because heading text gets translated, changing the generated slug. The plugin uses a two-phase strategy:
Phase A - Target Language Matching (compare anchor against the translated document's headings):
- Exact match (case-insensitive)
- Levenshtein distance (threshold: `max(2, anchor.length() / 3)`)
- Token overlap (split on hyphens, strict majority match required)
Phase B - Source Language Matching (if Phase A fails, match against the source document then position-map to translated):
- Find the anchor in the source document's heading index
- Map the matched position to the same position in the translated document's heading index
- Uses the same fuzzy matching strategies (exact, Levenshtein, token overlap) against the source
This two-phase approach handles both minor slug variations and full heading translations.
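
The three Phase A strategies can be sketched in Python as follows. This is an illustration, not the plugin's code: `match_anchor` is a hypothetical name, the threshold is treated as inclusive, and "strict majority" is interpreted as more than half of the anchor's own tokens, both of which are assumptions.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_anchor(anchor: str, headings: list[str]) -> Optional[str]:
    """Return the heading slug that best matches `anchor`, or None."""
    wanted = anchor.lower()
    # 1. Exact match (case-insensitive).
    for slug in headings:
        if slug.lower() == wanted:
            return slug
    # 2. Closest slug by edit distance, accepted within max(2, len / 3).
    threshold = max(2, len(anchor) // 3)
    best = min(headings, key=lambda s: levenshtein(wanted, s.lower()), default=None)
    if best is not None and levenshtein(wanted, best.lower()) <= threshold:
        return best
    # 3. Token overlap: more than half of the anchor's tokens must be shared.
    tokens = set(wanted.split("-"))
    for slug in headings:
        shared = tokens & set(slug.lower().split("-"))
        if len(shared) * 2 > len(tokens):
            return slug
    return None
```

In a two-phase setup, the same function could first be run against the translated headings and then, if it returns `None`, against the source headings before mapping the hit by position.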
The fix-links action requires:
- `sourceDir` - The source directory containing original files (used for anchor mapping)
- `targets` - List of target directories to process
The following parameters are also respected by fix-links:
- `fileRegex` - Pattern to match files in target directories (default: `(?i).*\.md`)
- `excludedFilePatterns` - Patterns to exclude from processing
- `translatableFrontMatterFields` - Determines which front matter fields receive full link correction
- `parallelism` - Number of parallel threads for link correction
Note: dryRun and limit have no effect on the fix-links action. It always processes all
matching files and always writes corrections to disk.
```bash
mvn comenius:run \
    -Dcomenius.action=fix-links \
    -Dcomenius.sourceDir=docs/en
```

The action will process all target directories configured in your pom.xml and:
- Find all markdown files matching the `fileRegex` pattern in each target directory
- For each file, locate the corresponding source file at the same relative path in `sourceDir`
- Correct asset links (recalculate paths from target to source assets)
- Correct anchor links (two-phase fuzzy matching against translated and source headings)
- Correct front matter links (translatable fields: full correction; non-translatable: path correction)
- Write corrected files back to disk
- Validate all links after correction and report any remaining errors
Note: Each file in the target directory must have a corresponding source file at the same
relative path. For example, if processing docs/de/guide/intro.md, the source file
docs/en/guide/intro.md must exist for anchor correction to work correctly.
Note: Unlike the translate action, fix-links does not perform cross-document anchor
correction (scanning other translated files for stale references). It only corrects links within
the files it processes.
Use excludedFilePatterns to skip directories or files from processing. This is useful for excluding
asset directories that contain images or other non-translatable content.
```xml
<excludedFilePatterns>
    <excludedFilePattern>.*/assets/.*</excludedFilePattern>
    <excludedFilePattern>.*/images/.*</excludedFilePattern>
    <excludedFilePattern>(?i).*/node_modules/.*</excludedFilePattern>
</excludedFilePatterns>
```

- Patterns are matched against the full absolute path of files and directories
- Use `(?i)` prefix for case-insensitive matching
- Excluded directories are skipped entirely during traversal (efficient for large asset folders)
| Pattern | Excludes |
|---|---|
| `.*/assets/.*` | All files in any assets directory |
| `.*/images/.*` | All files in any images directory |
| `.*/_.*\.md` | Markdown files starting with underscore |
| `(?i).*/node_modules/.*` | node_modules directories (case-insensitive) |
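
To try a pattern before adding it to the configuration, a small Python check like the one below can help. It assumes the pattern must match the whole absolute path (as Java's `Matcher.matches()` would); whether the plugin uses a full match or a partial one is not stated here, and `is_excluded` is a hypothetical helper.

```python
import re


def is_excluded(absolute_path: str, patterns: list[str]) -> bool:
    """True if any pattern matches the entire path (assumed semantics)."""
    return any(re.fullmatch(pattern, absolute_path) for pattern in patterns)


patterns = [r".*/assets/.*", r".*/_.*\.md", r"(?i).*/node_modules/.*"]
```

With these patterns, a path such as `/repo/docs/en/assets/logo.png` is excluded while `/repo/docs/en/guide/intro.md` is kept. Python and Java regex syntax agree for these simple patterns, but more advanced constructs may differ between the two engines.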
OpenAI:

```xml
<configuration>
    <llmProvider>openai</llmProvider>
    <llmUrl>https://api.openai.com/v1</llmUrl>
    <llmToken>${env.OPENAI_API_KEY}</llmToken>
    <llmModel>gpt-4o</llmModel>
</configuration>
```

Anthropic:

```xml
<configuration>
    <llmProvider>anthropic</llmProvider>
    <llmUrl>https://api.anthropic.com</llmUrl>
    <llmToken>${env.ANTHROPIC_API_KEY}</llmToken>
    <llmModel>claude-sonnet-4-20250514</llmModel>
</configuration>
```

Azure OpenAI:

```xml
<configuration>
    <llmProvider>openai</llmProvider>
    <llmUrl>https://your-resource.openai.azure.com</llmUrl>
    <llmToken>${env.AZURE_OPENAI_KEY}</llmToken>
    <llmModel>your-deployment-name</llmModel>
</configuration>
```

After a translation run, the plugin reports:
- Successful: Number of files successfully translated
- Failed: Number of files that failed to translate
- Skipped: Number of files already up-to-date
- Input tokens: Total tokens sent to the LLM
- Output tokens: Total tokens received from the LLM
- Always run `check` first: Ensure all files are committed and links are valid before translating.
- Use dry run: Preview changes before executing translations, especially for large documentation sets.
- Start with limits: Use `-Dcomenius.limit=5` to test with a few files before full runs.
- Version control translations: Commit translated files to track changes over time.
- Review incrementally: When updating existing translations, review the diff to ensure quality.
- Secure your tokens: Never commit API tokens. Use environment variables or CI/CD secrets.
- Monitor token usage: Track input/output tokens to manage API costs.
- Use appropriate parallelism: Adjust `-Dcomenius.parallelism` based on your API rate limits.
For documents exceeding 32kB, the plugin automatically splits them into smaller chunks based on heading structure and translates each chunk separately. This prevents LLM context window limitations and improves translation quality.
- Target chunk size: 32kB (+/- 20% tolerance, i.e., 25.6kB - 38.4kB per chunk)
- Heading preference: H1 > H2 > H3 > H4 > H5 > H6 (higher-level headings are preferred split points)
- Sequential translation: Chunks are translated one at a time to maintain consistency and respect rate limits
The algorithm prefers splitting at higher-level headings (H1, H2) to maintain logical document sections. If no suitable heading exists within the acceptable size range, it will split at the next available heading.
- Documents under 32kB are translated as a single unit (unchanged behavior)
- Documents over 32kB are automatically split at heading boundaries
- Translated chunks are rejoined in original order
- If any chunk fails, the entire document translation fails
- Content before the first heading (intro content) may become its own chunk if large enough
A 100kB document with this structure:
```markdown
# Introduction
(content)
# Getting Started
(content)
## Installation
(content)
## Configuration
(content)
# Advanced Topics
(content)
```

It would be split preferentially at H1 headings (`# Introduction`, `# Getting Started`, `# Advanced Topics`),
resulting in approximately 3 chunks that are each translated separately.
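
A simplified Python sketch of heading-aware chunking is shown below. It is a greedy approximation written for illustration, not the plugin's algorithm: the name `split_into_chunks`, the byte-based size measure, and the tie-breaking (highest-level heading first, then the earliest one) are all assumptions.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s")


def split_into_chunks(markdown: str, target: int = 32 * 1024) -> list[str]:
    """Greedily cut `markdown` at headings so chunks land near `target` bytes."""
    low, high = target * 0.8, target * 1.2   # the +/- 20% tolerance window
    lines = markdown.splitlines(keepends=True)
    sizes = [len(line.encode("utf-8")) for line in lines]
    chunks: list[str] = []
    start = 0
    while start < len(lines):
        if sum(sizes[start:]) <= high:        # the remainder fits in one chunk
            chunks.append("".join(lines[start:]))
            break
        in_window, fallback, size = [], None, 0
        for i in range(start, len(lines)):
            match = HEADING.match(lines[i])
            if match and i > start:
                if low <= size <= high:
                    in_window.append((len(match.group(1)), i))  # (level, line index)
                elif size > high:
                    fallback = i              # first heading past the window
                    break
            size += sizes[i]
        if in_window:
            cut = min(in_window)[1]           # lowest level number wins (H1 before H2)
        elif fallback is not None:
            cut = fallback
        else:
            cut = len(lines)                  # no heading left to split at
        chunks.append("".join(lines[start:cut]))
        start = cut
    return chunks
```

Joining the returned chunks reproduces the input exactly, which mirrors the requirement that translated chunks are rejoined in original order. The sketch ignores `#` lines inside fenced code blocks, which a real implementation would need to handle.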
When updating existing translations, the plugin uses a section-based approach that compares document sections by content hash and only retranslates sections that have actually changed.
- The source document (both old and new versions) and the existing translation are split into sections at heading boundaries
- Each section's content is hashed using SHA-256
- Old and new source sections are aligned using a Longest Common Subsequence (LCS) algorithm
- Sections are classified as UNCHANGED, MODIFIED, ADDED, or DELETED
- Only MODIFIED and ADDED sections are sent to the LLM for translation
- UNCHANGED sections reuse the existing translation as-is
- DELETED sections are removed from the output
- Reduced token usage: Only changed sections are sent to the LLM, not the entire document
- Stable translations: Unchanged sections keep their existing translations, avoiding unnecessary drift
- Context-aware: Each section is translated with surrounding already-translated sections as context
- Structural safety: Heading structure is validated after each section translation
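
The hash-and-align step can be sketched in Python as follows. The SHA-256 hashing and LCS alignment follow the description above; how the unmatched sections between two LCS anchors are labeled (paired by position as MODIFIED, leftovers as ADDED or DELETED) is an assumption of this sketch, and all function names are hypothetical.

```python
import hashlib
from typing import Optional

Change = tuple[str, Optional[int], Optional[int]]  # (status, old index, new index)


def digest(section: str) -> str:
    return hashlib.sha256(section.encode("utf-8")).hexdigest()


def lcs_pairs(old: list[str], new: list[str]) -> list[tuple[int, int]]:
    """Index pairs of one longest common subsequence of `old` and `new`."""
    rows, cols = len(old), len(new)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < rows and j < cols:
        if old[i] == new[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def classify(old_sections: list[str], new_sections: list[str]) -> list[Change]:
    old_hashes = [digest(s) for s in old_sections]
    new_hashes = [digest(s) for s in new_sections]
    sentinel = (len(old_hashes), len(new_hashes))   # flushes the trailing gap
    result: list[Change] = []
    prev_old = prev_new = 0
    for old_i, new_j in lcs_pairs(old_hashes, new_hashes) + [sentinel]:
        gap_old = list(range(prev_old, old_i))
        gap_new = list(range(prev_new, new_j))
        for k in range(max(len(gap_old), len(gap_new))):
            if k < len(gap_old) and k < len(gap_new):
                result.append(("MODIFIED", gap_old[k], gap_new[k]))
            elif k < len(gap_new):
                result.append(("ADDED", None, gap_new[k]))
            else:
                result.append(("DELETED", gap_old[k], None))
        if old_i < len(old_hashes):
            result.append(("UNCHANGED", old_i, new_j))
        prev_old, prev_new = old_i + 1, new_j + 1
    return result
```

Only the entries marked MODIFIED or ADDED would then be sent for translation; UNCHANGED entries carry the old index needed to look up the existing translated section.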
- Documents are split at ATX-style heading boundaries (`#`, `##`, `###`, etc.)
- Content before the first heading becomes an "intro" section
- Each heading starts a new section
- Sections are flat (not hierarchical) - every heading at any level starts a new section
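
These splitting rules are simple enough to sketch in a few lines of Python. The function `split_sections` is hypothetical; skipping `#` lines inside fenced code blocks is an extra precaution added in this sketch and is not stated in the rules above.

```python
import re

ATX_HEADING = re.compile(r"^#{1,6}[ \t]+\S")


def split_sections(body: str) -> list[str]:
    """Split a markdown body into flat sections, one per ATX heading."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in body.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence          # toggle on every fence delimiter
        is_heading = not in_fence and ATX_HEADING.match(line) is not None
        if is_heading and current:
            sections.append("".join(current))  # close the previous section
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections
```

If the body starts with text rather than a heading, the first returned element is the "intro" section; otherwise every element begins with a heading line.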
When translating a section, the LLM receives:
- The section content wrapped in `[[TRANSLATE]]...[[/TRANSLATE]]` markers
- Preceding already-translated sections (for reference only)
- Following already-translated sections (for reference only)
- Custom translation instructions (if configured)
This ensures translation consistency across sections.
After each section is translated, the plugin validates that the LLM preserved the heading structure (same number of headings at the same levels). If validation fails:
- The plugin retries once with a corrective prompt specifying the exact heading structure required
- If the retry also fails, the translation job is marked as failed
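
The validate-then-retry behavior can be sketched in Python like this. The `translate` callable, its `(source, hint)` signature, and the wording of the corrective hint are hypothetical stand-ins for the actual LLM call and prompt.

```python
import re
from typing import Callable, Optional

HEADING = re.compile(r"^(#{1,6})[ \t]+\S", flags=re.MULTILINE)


def heading_levels(markdown: str) -> list[int]:
    """Levels of all ATX headings, in document order."""
    return [len(match.group(1)) for match in HEADING.finditer(markdown)]


def translate_with_validation(
    source: str,
    translate: Callable[[str, Optional[str]], str],
    max_retries: int = 1,
) -> str:
    expected = heading_levels(source)
    result = translate(source, None)
    for _ in range(max_retries):
        if heading_levels(result) == expected:
            return result
        # Corrective prompt: spell out the exact structure that is required.
        hint = "Keep exactly these headings, in order: " + ", ".join("#" * n for n in expected)
        result = translate(source, hint)
    if heading_levels(result) != expected:
        raise ValueError("translation changed the heading structure")
    return result
```

Comparing only the sequence of heading levels keeps the check language-independent: the heading text is expected to change, the structure is not.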
If the existing translation's heading structure doesn't match the old source (e.g., the translation was created before heading validation was added), the plugin falls back to a full retranslation of the entire document instead of section-based updates.
If no source sections have changed (all hashes match), the existing translation is preserved completely unchanged with no LLM calls for the body.
Each file is translated in two separate LLM calls:
- Phase 1 - Front Matter: Translatable front matter fields (e.g., `title`, `perex`) are extracted and sent to the LLM as structured `[[fieldName]]...[[/fieldName]]` blocks. The LLM returns translated fields in the same format. This phase is skipped if no `translatableFrontMatterFields` are configured.
- Phase 2 - Body: The markdown body is translated. For new files, the entire body is sent (or chunked if over 32kB). For incremental updates, section-based translation is used.
The two phases use separate prompt templates, allowing the LLM to focus on each task independently. Token usage from both phases is accumulated and reported in the summary.
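
The `[[fieldName]]...[[/fieldName]]` exchange format from Phase 1 can be sketched in Python as below. The exact layout the plugin uses (one block per line, whitespace handling) is not specified here, so the helpers `encode_fields` and `decode_fields` reflect an assumed layout.

```python
import re

FIELD_BLOCK = re.compile(
    r"\[\[(?P<name>[^\]/]+)\]\](?P<value>.*?)\[\[/(?P=name)\]\]",
    flags=re.DOTALL,
)


def encode_fields(fields: dict[str, str]) -> str:
    """Serialize front matter fields as [[name]]value[[/name]] blocks."""
    return "\n".join(f"[[{name}]]{value}[[/{name}]]" for name, value in fields.items())


def decode_fields(text: str) -> dict[str, str]:
    """Parse the blocks back, tolerating surrounding whitespace in values."""
    return {m.group("name"): m.group("value").strip() for m in FIELD_BLOCK.finditer(text)}
```

A backreference in the pattern ensures that a block only counts when its closing marker names the same field as its opening marker, so stray markers in an LLM reply are ignored rather than misparsed.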
When the translate action runs, it executes the following phases:
- File Collection: Scans the source directory, respects `fileRegex`, `excludedFilePatterns`, and `limit`
- Job Creation: For each file, creates a `TranslateNewJob` (no existing translation) or `TranslateIncrementalJob` (updating existing translation). Files already up-to-date are skipped.
- Translation Execution (skipped if `dryRun=true`): Translates files in parallel using a ForkJoinPool. Each job goes through the two-phase translation pipeline.
- Link Correction: Fixes asset paths and anchor references in all newly translated files
- Cross-Document Anchor Correction: Scans other existing translated files for stale anchor references pointing to retranslated files, and updates them using fuzzy matching
If the LLM returns a permanent error (e.g., authentication failure, quota exceeded), the plugin:
- Signals the LLM client to reject new requests
- Shuts down the parallel executor immediately
- Reports remaining jobs as cancelled
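
The fail-fast behavior can be approximated in Python with a shared stop flag. This is a loose analogy to the Java executor shutdown described above, not a port of it: `PermanentError` and `run_jobs` are hypothetical names, and pending jobs are short-circuited rather than removed from the pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class PermanentError(Exception):
    """Hypothetical marker for errors such as bad credentials or exhausted quota."""


def run_jobs(jobs: list[Callable[[], None]], parallelism: int = 4) -> dict[str, int]:
    stop = threading.Event()
    summary = {"successful": 0, "failed": 0, "cancelled": 0}

    def guarded(job: Callable[[], None]) -> str:
        if stop.is_set():          # a permanent error was already seen
            return "cancelled"
        try:
            job()
            return "successful"
        except PermanentError:
            stop.set()             # reject everything that has not started yet
            return "failed"
        except Exception:
            return "failed"        # transient failure: other jobs keep running

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for outcome in pool.map(guarded, jobs):
            summary[outcome] += 1
    return summary
```

Jobs that are already running when the flag is set still finish, so the number of cancelled jobs depends on timing when `parallelism` is greater than one.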
After translating files, the plugin scans all other translated files in the target directory for anchor references that may have become stale due to heading changes in the retranslated files.
- During translation, the plugin tracks which files had heading anchor changes
- After all translations complete, it scans existing translated files (excluding just-translated ones)
- For each link pointing to a retranslated file, it checks if the anchor still exists
- Stale anchors are corrected using a multi-strategy approach:
  - Exact match: Anchor exists in the new document - no change needed
  - Position-based mapping: If old and new heading counts match, maps by position
  - Fuzzy matching: Levenshtein distance (threshold: `max(2, anchor.length() / 3)`) and token overlap (split on hyphens, requires strict majority match)
- Uncorrectable anchors are logged as warnings
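
The first two strategies, exact match and position-based mapping, can be sketched in Python as below. The function `remap_anchor` and its inputs (the heading slugs of the linked file before and after retranslation) are hypothetical; the fuzzy fallback is left out and represented by returning `None`.

```python
from typing import Optional


def remap_anchor(anchor: str, old_slugs: list[str], new_slugs: list[str]) -> Optional[str]:
    """Return the anchor to use after retranslation, or None if unresolved."""
    if anchor in new_slugs:
        return anchor                                  # still valid, nothing to fix
    if anchor in old_slugs and len(old_slugs) == len(new_slugs):
        return new_slugs[old_slugs.index(anchor)]      # same position, new slug
    return None                                        # fuzzy matching would go here
```

Position-based mapping is only attempted when both heading lists have the same length, since an inserted or removed heading would shift every later position.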
Ensure sourceDir is set either in pom.xml configuration or via -Dcomenius.sourceDir.
The plugin requires Git for change tracking. Initialize a Git repository or ensure your source directory is within one.
Run the check action and fix reported issues:
- Commit uncommitted files
- Fix or remove broken links

To improve translation quality:
- Add custom instructions with terminology glossaries
- Use more capable models (e.g., `gpt-4o` instead of `gpt-4o-mini`)
- Provide context through instruction files
MIT License