
Use libxml to parse feeds #559

Draft

aavanian wants to merge 3 commits into skeeto:master from aavanian:libxml

Conversation


@aavanian aavanian commented Mar 24, 2026

This is an attempt at using libxml to parse feeds for performance gains. The idea came while I was reviewing this post for ways to reduce the freeze while updating feeds; the discussion in the post pointed toward XML parsing.

Diagnostics, benchmarking, and code changes were done with the help of Claude Code.
If that's a non-starter, I'll cancel the PR.

Archeology

I noticed afterwards that there are at least two old (12-year-old) issues mentioning that no performance gains were found using libxml, plus a related 13-year-old branch (prefer-libxml).

Tests

In my tests, the per-feed speedup ranged from 2× on tiny feeds (< 1 KB) to 20× on larger feeds. The largest feed (16.9 MB) went from 1096 ms to 75 ms per parse.
In absolute terms these gains are not that large; my updates are still blocking, only a bit less so, but enough to make them less frustrating in my case.

The PR

The code change is pretty trivial: four of the five parsing differences (whitespace handling and node splitting) are absorbed by elfeed's post-processing; the last one is detailed below.
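
The shape of the change can be sketched roughly like this (the wrapper name is illustrative, not the actual diff; only `libxml-parse-xml-region` and `xml-parse-region` are real built-ins):

```elisp
;; Sketch: prefer the C-level libxml parser when Emacs was built with
;; libxml2, falling back to the pure-Elisp parser otherwise.
(defun my/parse-feed-region (beg end)
  "Parse the XML between BEG and END, using libxml if available."
  (if (fboundp 'libxml-parse-xml-region)
      ;; `libxml-parse-xml-region' returns a single root node, while
      ;; `xml-parse-region' returns a list of top-level nodes; wrap the
      ;; result so both branches have the same shape.
      (list (libxml-parse-xml-region beg end))
    (xml-parse-region beg end)))
```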

An idea, if it helps: the use of libxml could be controlled by a customizable variable, leaving the default on the current code even when libxml is available.
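
That knob could look something like this (the variable name is hypothetical; the default keeps the current behavior, as suggested):

```elisp
;; Sketch only: an opt-in switch, defaulting to the historical parser.
(defcustom elfeed-use-libxml nil
  "When non-nil and libxml is available, parse feeds with
`libxml-parse-xml-region' instead of the Elisp `xml-parse-region'."
  :group 'elfeed
  :type 'boolean)
```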

Parsing difference

There was actually a long-standing bug in elfeed since the introduction of the file:// protocol in cc9d3b2. `curl -D-` outputs synthetic headers (Content-Length, Accept-Ranges, Last-Modified) for file:// URLs, but %{size_header} reports 0. This causes the pseudo-headers to be included in the content region.

  • xml-parse-region would strip the unexpected headers and return correct XML, so the bug wasn't visible.
  • libxml-parse-xml-region is stricter: it chokes on them and returns nil.

Hence the fix: strip the headers manually for the file:// protocol.
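
A minimal sketch of that stripping, assuming the header block (if present) starts at the beginning of the content region and ends at the first blank line (the helper name is hypothetical, not the code in the PR):

```elisp
;; Sketch: for file:// responses, curl still emits synthetic headers
;; with -D-, but %{size_header} reports 0, so the header block leaks
;; into what elfeed treats as content.  Skip past the blank line that
;; terminates the header block to find the real start of the body.
(defun my/strip-file-pseudo-headers (beg)
  "Return the position after curl's synthetic headers at BEG, or BEG.
Only meant for file:// responses where %{size_header} is 0."
  (save-excursion
    (goto-char beg)
    (if (looking-at "\\(?:[A-Za-z][A-Za-z-]*: .*\r?\n\\)+\r?\n")
        (match-end 0)
      beg)))
```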

Benchmark & Analysis

The two later commits are not necessarily meant to be part of the PR, unless desired:

  • 238099a: a set of scripts to run against an org-defined list of feeds (i.e. elfeed-org) to benchmark performance and compare parsing differences between the current code and the proposed code.
  • d252f10: the summarized results for my own feeds.

IMPORTANT: the parsing differences have only been assessed ex post on my examples and a few other tests. No bottom-up analysis has been made.

I'm leaving this PR in draft while I run my daily use off my branch for a bit.

Other points noted during the analysis

  1. Parallelization of XML parsing (whichever library is used). Not really worth it, as the serialization/deserialization costs can be quite high: a test on a medium-sized feed showed it 7-8× slower than direct parsing. The exception would be restructuring a lot of the code so that only the strictly essential data is passed back from the async part, which is too deep a change for me to consider.
  2. Content ref writes (elfeed-db.el:533-548). For every new entry, elfeed-deref-entry calls elfeed-ref, which computes (secure-hash 'sha1 ...), checks file-exists-p, and writes content via with-temp-file. That's synchronous file I/O per entry. Since that touches DB insertion, I didn't want to consider it, especially not in trying to batch or parallelize updates.
  3. Callback storm (elfeed-curl.el:397-411). When a curl process finishes, elfeed-curl--sentinel schedules all callbacks via (run-at-time 0 nil ...). These all fire back-to-back on the next event-loop iteration. If a consolidated curl process fetched several feeds, their XML parsing and DB writes happen sequentially with no yielding to input events. This would be easy to change, but it's a direct trade-off between update speed and keeping Emacs responsive during the update, and not the optimization I was looking for.
  4. Search buffer updates (elfeed-search.el:657-721). elfeed-search-update walks the entire AVL tree with with-elfeed-db-visit and re-renders the buffer. If the search buffer is visible during the update storm, mode-line/header refreshes from elfeed-goodies compound the freeze. Same trade-off as above. I note there's Debounce search buffer updates during feed fetches #558 that would tackle this part 👍.

aavanian and others added 3 commits March 27, 2026 11:21
libxml2's C-level parser is ~6x faster than the Elisp xml-parse-region,
reducing UI blocking during feed updates. The encoding detection/recode
path is preserved for both backends since Emacs's libxml binding only
handles UTF-8 content correctly.

A new helper elfeed-xml--libxml-unwrap handles the synthetic `top' node
that libxml produces when comments or processing instructions precede
the root element.

Falls back to xml-parse-region when Emacs is built without libxml2.

Includes a fix for handling curl's pseudo-headers for `file://` urls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scripts to download real feeds, parse with both xml-parse-region and
libxml-parse-xml-region, compare correctness of all xml-query patterns
elfeed uses, and measure performance.

Usage:
  emacs --batch -Q -l bench/extract-urls.el /path/to/feeds.org > urls.txt
  bash bench/run.sh urls.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
183 real-world feeds tested, 5.8x overall speedup with libxml2,
up to 20x on larger feeds. All elfeed xml-query patterns produce
equivalent results after two well-understood transformations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>