109 changes: 109 additions & 0 deletions blog/2026-03-11-DocWire-up-to-date-2026.md
@@ -0,0 +1,109 @@
---
title: 'DocWire: A 2025 Retrospective'
authors: krzysztof
tags: [news, update]
# Review comment (Contributor): I saw different tags in the source article
---

For a while, this blog has been quiet — but DocWire hasn’t.

Behind the scenes, 2025 became a defining year for the project. What started as a document-processing toolkit gradually transformed into something far more ambitious: a modular, message-driven framework capable of powering modern data pipelines and AI-driven workflows.

This post is a short retrospective of that transformation — and a look at the engineering decisions that made it possible.


<!--truncate-->

## A New Core: Message-Driven Architecture

The most fundamental shift happened deep inside the SDK.

Earlier versions relied on tightly coupled data representations, which made extending the system increasingly difficult. Adding new data types often required changes across multiple components.

In 2025, DocWire introduced a message-driven core — a decoupled model where processing stages communicate through generic messages flowing across pipelines.

This shift unlocked several key benefits:
- clearer data flow across processing stages
- independent, composable components
- easier extensibility without breaking existing pipelines
- the ability to expose pipelines as standalone microservices

Rather than a collection of utilities, DocWire started to behave like a coherent processing platform.
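
As a rough illustration of the idea (not DocWire's actual API), a message-driven pipeline can be sketched in a few lines of Python. The names `Message`, `Stage`, `split_paragraphs`, `uppercase_paragraphs`, and `run_pipeline` are hypothetical and exist only for this example. Each stage receives one generic message and emits zero or more messages, and passes through anything it does not recognize.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Message:
    """Generic envelope (hypothetical): stages look at `kind` and ignore the rest."""
    kind: str
    payload: Any


# A stage maps one incoming message to zero or more outgoing messages.
Stage = Callable[[Message], Iterable[Message]]


def split_paragraphs(msg: Message) -> Iterable[Message]:
    # Turns one "document" message into several "paragraph" messages.
    if msg.kind != "document":
        return [msg]  # pass unknown message kinds through untouched
    parts = [p.strip() for p in msg.payload.split("\n\n")]
    return [Message("paragraph", p) for p in parts if p]


def uppercase_paragraphs(msg: Message) -> Iterable[Message]:
    # Only touches "paragraph" messages; everything else flows through.
    if msg.kind != "paragraph":
        return [msg]
    return [Message("paragraph", msg.payload.upper())]


def _apply(stage: Stage, stream: Iterable[Message]) -> Iterator[Message]:
    for msg in stream:
        yield from stage(msg)


def run_pipeline(source: Iterable[Message], stages: Sequence[Stage]) -> Iterator[Message]:
    # Chain the stages lazily; no stage knows about its neighbours.
    stream: Iterable[Message] = source
    for stage in stages:
        stream = _apply(stage, stream)
    yield from stream
```

Because stages only agree on the envelope, a new message kind (an image, for instance) can travel through the pipeline without any existing stage being modified, which is the extensibility property described above.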



## AI Integration: Local and Cloud by Design

The new architecture enabled deeper integration with language models and semantic processing.

DocWire pipelines can now leverage both:
- local models for offline and privacy-sensitive scenarios
- cloud-based models for advanced conversational or generative workflows

This hybrid approach gives developers flexibility to choose the right execution context for each task — from fully offline semantic search to cloud-augmented analysis pipelines.

Even lightweight local models can provide multilingual embeddings and semantic understanding directly within applications, without network dependencies.
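
One way to picture the hybrid choice is a small routing rule that decides, per task, where it may run. The sketch below is an assumption-laden illustration, not DocWire code: the `Task` fields and the `choose_backend` policy are hypothetical and only encode the two scenarios listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description used only for this sketch."""
    name: str
    sensitive: bool         # data must not leave the machine
    needs_generation: bool  # conversational or generative workload


def choose_backend(task: Task, online: bool) -> str:
    # Privacy-sensitive or offline work always stays local.
    if task.sensitive or not online:
        return "local"
    # Otherwise, send only generative workloads to the cloud (assumed policy).
    return "cloud" if task.needs_generation else "local"
```

For example, `choose_backend(Task("semantic-search", sensitive=True, needs_generation=False), online=True)` stays local, while a non-sensitive generative task is routed to the cloud when a connection is available.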



## High-Fidelity Document Understanding

Alongside architectural changes, the core parsing capabilities evolved significantly.

Throughout 2025, DocWire introduced:
- a redesigned HTML parsing pipeline
- a new PDF engine with improved stability
- image extraction from complex document formats
- position-aware layout reconstruction

This progression culminated in what we describe as high-fidelity document understanding: the ability to capture not only textual content but also the structural and spatial context of documents.

Such layout awareness is essential for accurate indexing, semantic retrieval, and downstream AI processing.
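
To make "spatial context" concrete, here is a minimal sketch of what position-aware text might look like to downstream code. The `TextBox` structure, its coordinate convention, and the `reading_order` heuristic are assumptions made for this example, not DocWire's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text fragment with its position on the page."""
    text: str
    x: float       # left edge
    y: float       # top edge, increasing downward (assumed convention)
    width: float
    height: float


def reading_order(boxes: List[TextBox], line_tolerance: float = 2.0) -> List[str]:
    # Group boxes whose top edges are within `line_tolerance` of the first
    # box on the current line, then read each line left to right.
    lines: List[List[TextBox]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if lines and abs(lines[-1][0].y - box.y) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]
```

Without coordinates, fragments arriving out of order would be indexed as scrambled text. With them, a simple heuristic like this can already recover plain lines. Real layouts with columns and tables need considerably more than this.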



## Developer Experience and Production Readiness

A major focus of the year was making DocWire not only powerful but also pleasant to build with.

Key improvements included:
- a zero-overhead diagnostic system for development-time insights
- structured JSON logging for easier observability
- richer error reporting with contextual metadata
- safer assertions and defensive programming utilities
- modernization of the build system and CI workflows

These changes may be less visible than new features, but they are critical for building reliable production systems.
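
For readers unfamiliar with structured logging, the idea is that every log entry is a machine-parseable JSON object, with error context carried as fields instead of being embedded in the message string. The following sketch uses Python's standard `logging` module purely as an illustration. `JsonFormatter`, `make_logger`, and the `context` field are names chosen for this example and say nothing about DocWire's own log format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative format)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Contextual metadata supplied through `extra={"context": {...}}`.
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("pipeline-sketch", buffer)
log.error("parse failed", extra={"context": {"file": "report.pdf", "page": 3}})
```

Each line written to `buffer` can be loaded with `json.loads` and filtered by field, which is what makes this style convenient for log aggregation tools.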



## Watch: 2025 Engineering Recap

To illustrate this transformation visually, we prepared a short overview of the architectural and functional changes introduced throughout the year.

[![DocWire 2025 engineering recap video](https://img.youtube.com/vi/vBgrIh04R-I/hqdefault.jpg)](https://youtu.be/vBgrIh04R-I)



## What This Means

Looking back, 2025 was less about incremental improvement and more about redefinition.

DocWire is no longer just a library integrated into applications. The foundations now exist for treating it as an application backbone — a processing layer capable of running independently as part of larger data infrastructures.

This evolution opens the door to new usage patterns:
- pipeline-based document ingestion
- service-oriented deployment models
- local AI processing workflows
- backend systems centered around structured data extraction

In short, the toolkit matured into a framework.



## What Comes Next

This retrospective sets the stage for what follows.

In the next post, we’ll introduce a concept that helps describe this evolution at a higher level: DocWire as the invisible engine behind secure data pipelines.

For now, whether you’ve been following the project or are discovering it for the first time, 2025 represents a turning point worth exploring.
5 changes: 5 additions & 0 deletions blog/authors.yml
@@ -2,3 +2,8 @@ ferid:
  name: Ferid Obeidat
  title: Author
  image_url: /img/Ferid-headshot.png
krzysztof:
  name: Krzysztof Nowicki
  title: Author
  image_url: /img/krzysztof.png

# Review comment (medium priority): This blank line with trailing whitespace is unnecessary and should be removed to maintain a clean codebase and prevent potential parsing issues with some tools.

Binary file added static/img/krzysztof.png