libxml2's C-level parser is ~6x faster than the Elisp xml-parse-region, reducing UI blocking during feed updates. The encoding detection/recode path is preserved for both backends since Emacs's libxml binding only handles UTF-8 content correctly. A new helper elfeed-xml--libxml-unwrap handles the synthetic `top' node that libxml produces when comments or processing instructions precede the root element. Falls back to xml-parse-region when Emacs is built without libxml2. Includes a fix for handling curl's pseudo-headers for `file://` urls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scripts to download real feeds, parse with both xml-parse-region and libxml-parse-xml-region, compare correctness of all xml-query patterns elfeed uses, and measure performance.
Usage:
  emacs --batch -Q -l bench/extract-urls.el /path/to/feeds.org > urls.txt
  bash bench/run.sh urls.txt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
183 real-world feeds tested, 5.8x overall speedup with libxml2, up to 20x on larger feeds. All elfeed xml-query patterns produce equivalent results after two well-understood transformations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This is an attempt at using libxml to parse feeds for performance gains. The idea came while I was reviewing this post for ways to reduce the freeze while updating feeds; the discussion in the post pointed toward XML parsing.
Diagnostic, benchmarking and code changes done with the help of Claude Code.
If that's a non-starter, I'll cancel the PR.
Archeology
I noticed afterwards that there are at least two old (12-year-old) issues mentioning that no performance gains were found using libxml, as well as a related 13-year-old branch (prefer-libxml).
Tests
In my tests, per-feed speedup ranged from 2× on tiny feeds (< 1 KB) to 20× on larger feeds. The largest feed (16.9 MB) went from 1096ms to 75ms per parse.
In absolute terms, these changes are not that large: my updates are still blocking, only a bit less so, but enough to make it less frustrating in my case.
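For anyone who wants to reproduce this kind of per-feed timing on a saved feed, here is a minimal sketch. It is not the benchmark script from this PR; `my/bench-xml-parsers` is a hypothetical helper name, and it only measures parse time on an already-downloaded file.

```emacs-lisp
(require 'xml)
(require 'benchmark)

(defun my/bench-xml-parsers (file &optional repetitions)
  "Return (ELISP-SECONDS . LIBXML-SECONDS), the mean time per parse of FILE.
LIBXML-SECONDS is nil when libxml2 support is not available.
Hypothetical helper for illustration; not part of elfeed."
  (let ((n (or repetitions 10)))
    (with-temp-buffer
      (insert-file-contents file)
      (cons
       ;; Pure Elisp parser.  `benchmark-run' returns
       ;; (ELAPSED GC-COUNT GC-ELAPSED); only ELAPSED is used here.
       (/ (car (benchmark-run n
                 (xml-parse-region (point-min) (point-max))))
          (float n))
       ;; libxml2-backed parser, only when Emacs was built with it.
       (when (and (fboundp 'libxml-available-p) (libxml-available-p))
         (/ (car (benchmark-run n
                   (libxml-parse-xml-region (point-min) (point-max))))
            (float n)))))))
```

Calling `(my/bench-xml-parsers "/path/to/feed.xml" 20)` (the path is a placeholder) returns both means in seconds, so the ratio of the two gives the speedup for that feed. Garbage collection time is included in the elapsed figure, which matters for large feeds.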
The PR
The code change is pretty trivial, since 4 of the 5 parsing differences (whitespace and node splitting) are absorbed by `elfeed` post-processing; the last one is detailed below.
Idea, if it helps: the use of libxml could be controlled by a customizable variable, leaving the default on the current code even if libxml is available.
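A rough sketch of what such an opt-in could look like, assuming nothing about the actual patch: `my/elfeed-use-libxml`, `my/libxml-unwrap` and `my/parse-feed-region` are hypothetical names, and the unwrap logic assumes the synthetic `top` node holds only comment nodes followed by the root element.

```emacs-lisp
(require 'xml)
(require 'seq)

(defcustom my/elfeed-use-libxml nil
  "Non-nil means parse feeds with libxml2 when it is available.
Hypothetical option; the nil default keeps the current Elisp parser."
  :type 'boolean
  :group 'elfeed)

(defun my/libxml-unwrap (node)
  "Return the root element of NODE, skipping a synthetic `top' wrapper.
Assumption: inside the wrapper, the root is the first child that is
a list and is not a `comment' node."
  (if (eq (car-safe node) 'top)
      (seq-find (lambda (child)
                  (and (consp child)
                       (not (eq (car child) 'comment))))
                (cddr node))
    node))

(defun my/parse-feed-region (beg end)
  "Parse the XML between BEG and END into a list holding the root node.
The list shape mirrors the return value of `xml-parse-region'."
  (or (and my/elfeed-use-libxml
           (fboundp 'libxml-available-p)
           (libxml-available-p)
           (let ((root (my/libxml-unwrap
                        (libxml-parse-xml-region beg end))))
             (and root (list root))))
      ;; Option disabled, libxml2 missing, or libxml2 returned nil:
      ;; use the Elisp parser.
      (xml-parse-region beg end)))
```

For example, `(my/libxml-unwrap '(top nil (comment nil "c") (rss nil)))` returns `(rss nil)`. One design choice in this sketch is that a nil result from libxml2 silently falls through to `xml-parse-region`; that hides strictness differences such as the `file://` one described below, so a real implementation might prefer to signal them instead.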
Parsing difference
There was actually a long-standing bug in `elfeed` since the introduction of the `file://` protocol in cc9d3b2. `curl -D-` outputs synthetic headers (`Content-Length`, `Accept-ranges`, `Last-Modified`) for `file://` URLs, but `%{size_header}` reports 0. This causes the pseudo-headers to be included in the content region. `xml-parse-region` would strip the unexpected headers and return correct XML, so the bug wasn't visible. `libxml-parse-xml-region` is stricter: it chokes on them and returns nil, hence the fix to strip the headers manually for the `file` protocol.
Benchmark & Analysis
The two later commits are not necessarily meant to be part of the PR, unless it is desired:
IMPORTANT: The parsing differences have only been assessed ex-post on my examples and a few other tests. No bottom-up analysis has been made.
I'm leaving this PR in draft while I run my daily use off my branch for a bit.
Other points noted during the analysis