
Use libxml to parse feeds #559

Draft

aavanian wants to merge 3 commits into skeeto:master from aavanian:libxml

Conversation


@aavanian aavanian commented Mar 24, 2026

This is an attempt at using libxml to parse feeds for performance gains. The idea came while I was reviewing this post for ways to reduce the freeze while updating feeds; the discussion in the post pointed toward XML parsing.

Diagnostics, benchmarking, and code changes were done with the help of Claude Code.
If that's a non-starter, I'll cancel the PR.

Archeology

I noticed afterwards that there are at least two old (12-year-old) issues mentioning that no performance gains were found using libxml, plus a related 13-year-old branch (prefer-libxml).

Tests

In my tests, the per-feed speedup ranged from 2× on tiny feeds (< 1 KB) to 20× on larger feeds. The largest feed (16.9 MB) went from 1096 ms to 75 ms per parse.
In absolute terms these gains are not that large; my updates are still blocking, only a bit less so, but enough to make them less frustrating in my case.

The PR

The code change is pretty trivial: four of the five parsing differences (whitespace handling and node splitting) are absorbed by elfeed's post-processing; the last one is detailed below.
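
The shape of the change can be sketched roughly like this (the wrapper name is illustrative, not the actual diff; only `libxml-parse-xml-region` and `xml-parse-region` are real built-ins):

```elisp
;; Sketch: prefer the C-level libxml parser when Emacs was built with
;; libxml2, falling back to the pure-Elisp parser otherwise.
(defun my/parse-feed-region (beg end)
  "Parse the XML between BEG and END, using libxml if available."
  (if (fboundp 'libxml-parse-xml-region)
      ;; `libxml-parse-xml-region' returns a single root node, while
      ;; `xml-parse-region' returns a list of top-level nodes; wrap the
      ;; result so both branches have the same shape.
      (list (libxml-parse-xml-region beg end))
    (xml-parse-region beg end)))
```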

An idea, if it helps: the use of libxml could be controlled by a customizable variable, leaving the default on the current code even when libxml is available.
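
That knob could look something like this (the variable name is hypothetical; the default keeps the current behavior, as suggested):

```elisp
;; Sketch only: an opt-in switch, defaulting to the historical parser.
(defcustom elfeed-use-libxml nil
  "When non-nil and libxml is available, parse feeds with
`libxml-parse-xml-region' instead of the Elisp `xml-parse-region'."
  :group 'elfeed
  :type 'boolean)
```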

Parsing difference

There was actually a long-standing bug in elfeed since the introduction of the file:// protocol in cc9d3b2. `curl -D-` outputs synthetic headers (Content-Length, Accept-Ranges, Last-Modified) for file:// URLs, but %{size_header} reports 0. This causes the pseudo-headers to be included in the content region.

  • xml-parse-region would strip the unexpected headers and return correct XML, so the bug wasn't visible.
  • libxml-parse-xml-region is stricter: it chokes on them and returns nil.

Hence the fix: strip the headers manually for the file:// protocol.
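
A minimal sketch of that stripping, assuming the header block (if present) starts at the beginning of the content region and ends at the first blank line (the helper name is hypothetical, not the code in the PR):

```elisp
;; Sketch: for file:// responses, curl still emits synthetic headers
;; with -D-, but %{size_header} reports 0, so the header block leaks
;; into what elfeed treats as content.  Skip past the blank line that
;; terminates the header block to find the real start of the body.
(defun my/strip-file-pseudo-headers (beg)
  "Return the position after curl's synthetic headers at BEG, or BEG.
Only meant for file:// responses where %{size_header} is 0."
  (save-excursion
    (goto-char beg)
    (if (looking-at "\\(?:[A-Za-z][A-Za-z-]*: .*\r?\n\\)+\r?\n")
        (match-end 0)
      beg)))
```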

Benchmark & Analysis

The two later commits are not necessarily meant to be part of the PR, unless desired:

  • 238099a: a set of scripts to run against an org-defined list of feeds (i.e. elfeed-org) to benchmark performance and compare parsing differences between the current code and the proposed code.
  • d252f10: the summarized results for my own feeds.

IMPORTANT: the parsing differences have only been assessed ex post on my examples and a few other tests. No bottom-up analysis has been made.

I'm leaving this PR in draft while I run my daily use off my branch for a bit.

Other points noted during the analysis

  1. Parallelization of XML parsing (whichever library is used). Not really worth it, as the serialization/deserialization costs can be quite high: a test on a medium-sized feed showed it 7-8× slower than direct parsing. The exception would be restructuring a lot of the code so that only the strictly essential data is passed back from the async part, which is too deep a change for me to consider.
  2. Content ref writes (elfeed-db.el:533-548). For every new entry, elfeed-deref-entry calls elfeed-ref, which computes (secure-hash 'sha1 ...), checks file-exists-p, and writes content via with-temp-file. That's synchronous file I/O per entry. Since that touches DB insertion, I didn't want to consider it, especially not in trying to batch or parallelize updates.
  3. Callback storm (elfeed-curl.el:397-411). When a curl process finishes, elfeed-curl--sentinel schedules all callbacks via (run-at-time 0 nil ...). These all fire back-to-back on the next event-loop iteration. If a consolidated curl process fetched several feeds, their XML parsing and DB writes happen sequentially with no yielding to input events. This would be easy to change, but it's a direct trade-off between update speed and keeping Emacs responsive during the update, and not the optimization I was looking for.
  4. Search buffer updates (elfeed-search.el:657-721). elfeed-search-update walks the entire AVL tree with with-elfeed-db-visit and re-renders the buffer. If the search buffer is visible during the update storm, mode-line/header refreshes from elfeed-goodies compound the freeze. Same trade-off as above. I note there's Debounce search buffer updates during feed fetches #558 that would tackle this part 👍.

aavanian and others added 3 commits March 27, 2026 11:21
libxml2's C-level parser is ~6x faster than the Elisp xml-parse-region,
reducing UI blocking during feed updates. The encoding detection/recode
path is preserved for both backends since Emacs's libxml binding only
handles UTF-8 content correctly.

A new helper elfeed-xml--libxml-unwrap handles the synthetic `top' node
that libxml produces when comments or processing instructions precede
the root element.

Falls back to xml-parse-region when Emacs is built without libxml2.

Includes a fix for handling curl's pseudo-headers for `file://` urls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scripts to download real feeds, parse with both xml-parse-region and
libxml-parse-xml-region, compare correctness of all xml-query patterns
elfeed uses, and measure performance.

Usage:
  emacs --batch -Q -l bench/extract-urls.el /path/to/feeds.org > urls.txt
  bash bench/run.sh urls.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
183 real-world feeds tested, 5.8x overall speedup with libxml2,
up to 20x on larger feeds. All elfeed xml-query patterns produce
equivalent results after two well-understood transformations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>