
fix: improve Calibre artifact cleanup #5

Open
fredchu wants to merge 1 commit into deusyu:main from fredchu:fix/calibre-text-cleanup

@fredchu fredchu commented Mar 22, 2026

Summary

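Extends clean_calibre_markers() to handle more artifacts left behind by Pandoc/Calibre conversion. Each rule comes with a before/after sample.

This is the second change split out of PR #2 (text cleanup). Following the review feedback, the overly aggressive global replacements have been corrected.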

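Changes

| Rule | Before | After |
| --- | --- | --- |
| Attribute block (including `#id`) | `## Title {#calibre_link-0 .calibre3}` | `## Title` |
| Escaped bracket wrapping | `\[Some paragraph.\]` | `Some paragraph.` |
| Empty heading | `##` | (removed) |
| Heading wrapped in `[*text*]` | `## [*Some Title*]` | `## Some Title` |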
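
The four rules in the table can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `clean_calibre_markers_sketch`, the attribute-block pattern, and the heading pattern are assumptions made for the example. Only the two line-boundary bracket patterns and the `[*text*]` pattern are quoted from the PR description.

```python
import re

# Hypothetical pattern: a {...} block that starts with '#' or '.' followed by
# a letter, plus any spaces in front of it. The PR's real pattern may differ.
ATTR_BLOCK = re.compile(r"[ \t]*\{[#.][A-Za-z_][^{}\n]*\}")

# Quoted from the PR description: escaped brackets only at line boundaries.
LINE_START_BRACKET = re.compile(r"(?m)^\\?\\\[")
LINE_END_BRACKET = re.compile(r"(?m)\\?\\\]$")

# Hypothetical pattern: an ATX heading marker, optionally followed by text.
HEADING = re.compile(r"^(#{1,6})(?:[ \t]+(.*))?$")

# Quoted from the PR description: a title wrapped exactly as [*text*].
WRAPPED_TITLE = re.compile(r"^\[\s*\*(.+?)\*\s*\]$")


def clean_calibre_markers_sketch(text: str) -> str:
    """Apply the four cleanup rules from the table, in order."""
    text = ATTR_BLOCK.sub("", text)
    text = LINE_START_BRACKET.sub("", text)
    text = LINE_END_BRACKET.sub("", text)

    cleaned = []
    for line in text.split("\n"):
        heading = HEADING.match(line)
        if heading:
            title = (heading.group(2) or "").strip()
            if not title:
                continue  # empty heading: drop the whole line
            wrapped = WRAPPED_TITLE.match(title)
            if wrapped:
                title = wrapped.group(1).strip()
            line = f"{heading.group(1)} {title}"
        cleaned.append(line)
    return "\n".join(cleaned)
```

With these assumptions, `clean_calibre_markers_sketch("## Title {#calibre_link-0 .calibre3}")` returns `"## Title"`, and a line such as `Let \[x+1\] hold.` passes through unchanged because its brackets are not at a line boundary.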

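Fixes in response to review feedback

  1. `\[` / `\]` are no longer replaced globally: the rule now uses `(?m)^\\?\\\[` and `(?m)\\?\\\]$`, which match only at the start and end of a line, so LaTeX display math `\[...\]` in the middle of a line is not broken.
  2. `strip('[]* ')` has been removed: heading cleanup now uses `re.match(r'^\[\s*\*(.+?)\*\s*\]$', text)` to match the `[*text*]` pattern exactly, so legitimate characters are no longer stripped by mistake.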
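
To make the two fixes concrete, the sketch below contrasts the earlier approaches with the anchored ones. The helper names are hypothetical, and the "global" and "strip" variants are reconstructions based only on the wording of this description, not the code that was actually removed.

```python
import re

math_line = r"Mid-line math \[x+1\] must survive."
wrapped_paragraph = r"\[Some paragraph.\]"


def strip_globally(text: str) -> str:
    # Assumed shape of the earlier behaviour: remove every \[ and \].
    return text.replace("\\[", "").replace("\\]", "")


def strip_at_line_boundaries(text: str) -> str:
    # Patterns quoted from the PR description.
    text = re.sub(r"(?m)^\\?\\\[", "", text)
    return re.sub(r"(?m)\\?\\\]$", "", text)


def unwrap_with_strip(title: str) -> str:
    # Assumed shape of the earlier behaviour: strip a character set.
    return title.strip("[]* ")


def unwrap_with_match(title: str) -> str:
    # Pattern quoted from the PR description: only a full [*text*] wrapper.
    m = re.match(r"^\[\s*\*(.+?)\*\s*\]$", title)
    return m.group(1) if m else title


print(strip_globally(math_line))            # math delimiters are lost
print(strip_at_line_boundaries(math_line))  # unchanged
print(unwrap_with_strip("*Important* note [1]"))  # loses '*' and ']'
print(unwrap_with_match("*Important* note [1]"))  # unchanged
```

Note that the anchors protect only mid-line math: a formula whose `\[` starts a line or whose `\]` ends one still matches these patterns.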

Test plan

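  • Attribute block removal: `text{.calibre5}` → `text`
  • Escaped brackets: `\[...\]` at the start and end of a line is removed
  • LaTeX protection: `\[x+1\]` in the middle of a line is left untouched
  • Heading cleanup: `## [*Some Title*]` → `## Some Title`
  • Bold brackets: `[**Chapter One**]` → `**Chapter One**`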

🤖 Generated with Claude Code

Expand cleanup to handle more Pandoc/Calibre conversion artifacts:
- Attribute blocks with #id and .class (e.g. {#calibre_link-0 .calibre3})
- Escaped bracket wrapping \[...\] — only at line boundaries to
  preserve LaTeX display math mid-line
- Empty headings (# with no text)
- Heading [*Title*] wrapping via specific regex match

Each rule includes before/after samples in the docstring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>