grimmory-tools/epub4j

epub4j

Java library for EPUB read, validate, repair, normalize, transform, and write workflows.

What it does

  • Read EPUB from path, stream, or resources
  • Write EPUB with package and metadata updates
  • Lazy-load resources for lower memory usage
  • Validate structure, metadata, manifest, spine, references, and accessibility
  • Run diagnostics with severity, error codes, and auto-fix hints
  • Auto-repair common issues in malformed EPUB files
  • Prune broken TOC entries and promote valid child entries
  • Remove unreferenced JavaScript resources from the manifest
  • Remove common non-content artifact files (iTunes metadata, authoring tool bookmarks, OS leftovers)
  • Validate EPUB mimetype entry and report strict/recover behavior
  • Normalize invalid language tags and remove stray img tags with missing src
  • Rebuild and normalize spine reading order from manifest XHTML resources
  • Reconcile spine href/idref alias drift to canonical manifest resources
  • Harden XHTML pre-parse well-formedness before downstream XML processing
  • Repair broken internal href/src/url link graph using safe alias rewrite heuristics
  • Generate KOReader-compatible partial MD5 checksums for dedupe/progress-sync IDs
  • Normalize mixed encodings to UTF-8
  • Normalize metadata fields and infer missing metadata
  • Detect cover and synthesize missing table of contents
  • Manipulate spine and split or merge XHTML
  • Run search and replace across content resources
  • Estimate word count
  • Deduplicate resources
  • Convert to kepub
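
As an illustration of the "prune broken TOC entries and promote valid child entries" item above, here is a minimal sketch using only the JDK. `TocPruneSketch`, `TocEntry`, and `prune` are hypothetical names chosen for this example and are not epub4j's API; the validity rule (the href, minus any fragment, must be a known resource path) is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class TocPruneSketch {
    /** Hypothetical TOC node: a title, a target href, and nested children. */
    record TocEntry(String title, String href, List<TocEntry> children) {}

    /**
     * Returns a pruned copy of the entries. An entry whose href (ignoring any
     * #fragment) is not in knownHrefs is dropped, and its surviving children
     * take its place at the same level.
     */
    static List<TocEntry> prune(List<TocEntry> entries, Set<String> knownHrefs) {
        List<TocEntry> out = new ArrayList<>();
        for (TocEntry entry : entries) {
            // Prune the subtree first so that promoted children are already clean.
            List<TocEntry> kids = prune(entry.children(), knownHrefs);
            if (isValid(entry.href(), knownHrefs)) {
                out.add(new TocEntry(entry.title(), entry.href(), kids));
            } else {
                out.addAll(kids); // broken parent: promote its valid children
            }
        }
        return out;
    }

    private static boolean isValid(String href, Set<String> knownHrefs) {
        if (href == null || href.isBlank()) {
            return false;
        }
        int hash = href.indexOf('#');
        String path = hash >= 0 ? href.substring(0, hash) : href;
        return knownHrefs.contains(path);
    }
}
```

Pruning bottom-up keeps the original reading order: promoted children appear exactly where their broken parent used to be.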

Reliability and safety

  • Strict and recover processing modes
  • Archive path traversal protection
  • Duplicate entry detection
  • Archive-level byte budget
  • Per-entry byte budget
  • Total uncompressed byte budget
  • Bounded stream copy for input streams
  • Case-stable path deduplication using Locale.ROOT
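
Two of the safeguards listed above, path traversal protection and case-stable deduplication, can be sketched with plain JDK string handling. This is an illustrative sketch, not epub4j's implementation: `ArchiveEntryGuardSketch` and its methods are hypothetical names, and the exact rejection rules (absolute paths, drive prefixes, `..` segments that climb above the root) are assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ArchiveEntryGuardSketch {
    private final Set<String> seen = new HashSet<>();

    /** True if the entry name is absolute or climbs above the archive root. */
    static boolean escapesRoot(String entryName) {
        String name = entryName.replace('\\', '/');
        if (name.startsWith("/") || name.matches("[A-Za-z]:.*")) {
            return true; // absolute path or Windows drive prefix
        }
        int depth = 0;
        for (String segment : name.split("/")) {
            if (segment.isEmpty() || segment.equals(".")) {
                continue;
            }
            if (segment.equals("..")) {
                if (--depth < 0) {
                    return true; // climbed above the root
                }
            } else {
                depth++;
            }
        }
        return false;
    }

    /** Case-folded key; Locale.ROOT keeps the result independent of the JVM default locale. */
    static String dedupKey(String entryName) {
        return entryName.replace('\\', '/').toLowerCase(Locale.ROOT);
    }

    /** Returns false if an entry with the same case-folded name was already seen. */
    boolean firstOccurrence(String entryName) {
        return seen.add(dedupKey(entryName));
    }
}
```

Folding case with `Locale.ROOT` matters because locale-sensitive lowercasing can differ between machines (for example, under a Turkish default locale `"I"` does not lowercase to `"i"`), which would make duplicate detection depend on where the code runs.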

Quick start

import org.grimmory.epub4j.domain.Book;
import org.grimmory.epub4j.epub.EpubProcessingPolicy;
import org.grimmory.epub4j.epub.EpubReader;

// Cap the archive at 256 MiB, each entry at 32 MiB, and total uncompressed data at 512 MiB.
EpubProcessingPolicy policy = EpubProcessingPolicy.defaultPolicy()
    .withMaxArchiveBytes(256L * 1024 * 1024)
    .withMaxEntryBytes(32L * 1024 * 1024)
    .withMaxTotalUncompressedBytes(512L * 1024 * 1024);

EpubReader reader = new EpubReader(null, policy);
Book book = reader.readEpub(java.nio.file.Path.of("book.epub"));
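
The byte budgets configured above imply a copy loop that stops as soon as a limit is crossed, rather than after the whole stream has been buffered. A minimal sketch of that idea, using only the JDK; `BoundedCopySketch` and `copyBounded` are hypothetical names, and how epub4j enforces its budgets internally is not shown in this README.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BoundedCopySketch {
    /**
     * Copies in to out and returns the number of bytes copied.
     * Fails with an IOException as soon as more than maxBytes have been read,
     * so an oversized or decompression-bomb entry is never fully written.
     */
    static long copyBounded(InputStream in, OutputStream out, long maxBytes) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("Byte budget exceeded: more than " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return total;
    }
}
```

Checking the running total before each write means the budget is enforced on the uncompressed bytes actually produced, not on the size an archive header claims.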

Strict mode

import org.grimmory.epub4j.epub.EpubProcessingPolicy;
import org.grimmory.epub4j.epub.EpubReader;

EpubReader reader = new EpubReader(null, EpubProcessingPolicy.strictPolicy());
var book = reader.readEpubStrict(java.nio.file.Path.of("book.epub"));

Recover mode report

import org.grimmory.epub4j.epub.EpubReader;

EpubReader reader = new EpubReader();
EpubReader.ReadResult result = reader.readEpubWithReport(java.nio.file.Path.of("book.epub"));

if (result.report().hasWarnings()) {
    result.report().warnings().forEach(w ->
        System.out.println(w.code() + ": " + w.message())
    );
}

if (result.report().hasCorrections()) {
    result.report().corrections().forEach(System.out::println);
}
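
To make the difference between strict and recover modes concrete, the sketch below applies both behaviors to one check: the EPUB container format requires the `mimetype` entry to contain exactly `application/epub+zip`. The class, the `Warning` record, and the `MIMETYPE_INVALID` code are hypothetical and only loosely mirror the code/message pairs printed above; the sketch checks content only, not the entry's position or compression.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MimetypeCheckSketch {
    static final String EXPECTED = "application/epub+zip";

    /** Hypothetical warning shape: a stable code plus a human-readable message. */
    record Warning(String code, String message) {}

    /**
     * Strict mode throws on the first violation; recover mode records a
     * warning and lets processing continue.
     */
    static List<Warning> check(byte[] mimetypeBytes, boolean strict) {
        List<Warning> warnings = new ArrayList<>();
        String actual = mimetypeBytes == null
                ? null
                : new String(mimetypeBytes, StandardCharsets.US_ASCII);
        if (EXPECTED.equals(actual)) {
            return warnings; // exact match, nothing to report
        }
        String message = actual == null
                ? "mimetype entry is missing"
                : "mimetype entry must be exactly \"" + EXPECTED + "\"";
        if (strict) {
            throw new IllegalStateException(message);
        }
        warnings.add(new Warning("MIMETYPE_INVALID", message)); // hypothetical code
        return warnings;
    }
}
```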

Advanced repair pass

BookRepair now includes a stricter cleanup pass for XHTML content:

  • Guarded lowercasing of legacy HTML tag and attribute names in XHTML resources
  • Preservation of namespaced attributes (for example xlink:href)
  • Removal of Adobe DRM meta markers and inline script artifacts
  • Pruning of broken TOC references against actual XHTML resources
  • Optional JavaScript resource pruning when files are no longer referenced
  • Removal of common non-content artifact files
  • Mimetype validation with strict failure or recover-mode warnings
  • Language tag normalization and stray <img> cleanup
  • Ebooklib-style spine normalization: drop invalid/duplicate/non-XHTML spine refs and append missing XHTML content docs
  • Manifest/spine alias reconciliation for href/idref drift in mixed-encoding paths
  • XHTML pre-parse hardening inspired by html5lib/lxml/xmllint defensive parsing workflows
  • Link graph repair pass for broken internal href/src/url targets with conservative rewrites

import org.grimmory.epub4j.epub.BookRepair;

BookRepair repair = new BookRepair();
BookRepair.RepairResult repaired = repair.repair(book);

repaired.actions().forEach(a ->
    System.out.println(a.code() + " -> " + a.description())
);
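
The spine normalization bullet above (drop invalid, duplicate, and non-XHTML spine refs, then append missing XHTML content docs) can be sketched as follows. This is a simplified model under stated assumptions, not BookRepair's actual code: the manifest is reduced to an ordered map of id to media type, and `SpineNormalizeSketch` is a hypothetical name. A real pass would likely also treat the navigation document specially.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SpineNormalizeSketch {
    static final String XHTML = "application/xhtml+xml";

    /**
     * manifest: item id -> media type, iterated in manifest order.
     * spine: idrefs in their current reading order.
     */
    static List<String> normalize(Map<String, String> manifest, List<String> spine) {
        Set<String> result = new LinkedHashSet<>();
        for (String idref : spine) {
            // Unknown ids map to null and non-XHTML items fail the check;
            // the set silently drops repeated idrefs.
            if (XHTML.equals(manifest.get(idref))) {
                result.add(idref);
            }
        }
        for (Map.Entry<String, String> item : manifest.entrySet()) {
            // Append XHTML documents the spine never referenced.
            if (XHTML.equals(item.getValue())) {
                result.add(item.getKey());
            }
        }
        return new ArrayList<>(result);
    }
}
```

Using a `LinkedHashSet` preserves the existing reading order for valid refs while guaranteeing each document appears once.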

KOReader-compatible checksum

Ported from CWA/KOReader checksum behavior for lightweight file identity workflows:

import org.grimmory.epub4j.util.KoReaderChecksum;

var byPath = KoReaderChecksum.calculate(java.nio.file.Path.of("book.epub"));
var byBytes = KoReaderChecksum.calculate(epubBytes);
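
For readers unfamiliar with the partial MD5 scheme, the sketch below shows the general idea: hash a handful of 1 KiB samples at exponentially spaced offsets instead of the whole file. The sampling schedule (offset 0, then 1024 << (2 * i) for i from 0 to 10) reflects my understanding of KOReader's behavior and should be treated as an assumption to verify against KOReader's source; `PartialMd5Sketch` is a hypothetical name and is not the `KoReaderChecksum` implementation.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class PartialMd5Sketch {
    private static final int SAMPLE_SIZE = 1024;

    /** Returns the lowercase hex MD5 of up to 12 samples of at most 1 KiB each. */
    static String partialMd5(byte[] data) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        for (int i = -1; i <= 10; i++) {
            // Assumed schedule: i = -1 samples offset 0, then 1 KiB, 4 KiB, 16 KiB, ... up to 1 GiB.
            long offset = i < 0 ? 0L : 1024L << (2 * i);
            if (offset >= data.length) {
                break; // past the end of the data: stop sampling
            }
            int length = (int) Math.min(SAMPLE_SIZE, data.length - offset);
            md5.update(data, (int) offset, length);
        }
        return HexFormat.of().formatHex(md5.digest());
    }
}
```

Because only a few kilobytes are hashed, the result is cheap to compute for large files, but it is an identity hint for dedupe and progress sync rather than an integrity check: bytes between the sampled offsets do not affect it.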

Roadmap

  • Broken link validation and auto-repair for guide, TOC, and in-document href/src
  • Unused CSS and unused image detection/removal
  • OPF metadata schema cleanup and stronger namespace normalization
  • Optional OPF2 to OPF3 upgrade helpers with nav document regeneration
  • Batch/background job execution API for large repair/validation runs
  • Metadata backup snapshot export and restore hooks
  • Ingest-safe MIME/content sniffing beyond extension checks
  • Optional duplicate detection heuristics for library hygiene

Build

./gradlew build

Runtime and toolchain requirements

  • Java 25
  • JVM flags for preview and native interop paths:
--enable-preview --enable-native-access=ALL-UNNAMED

Quality workflow

Run the verification path used by CI:

./gradlew check --warning-mode all

For focused module checks while iterating:

./gradlew :comic4j:check

About

A Java library for reading, writing, and manipulating EPUB files, with improvements based on epublib and epub4j.
