HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
- Install
- Which API should I use?
- TypeScript definitions
- API Reference
- FAQ
- Version history
- GitHub repository
$ npm install parse5
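To parse an HTML document string, use the parse5.parse method.
To parse an HTML fragment string, use the parse5.parseFragment method.
To serialize a node to an HTML string, use the parse5.serialize method.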
"I need to parse HTML streamed from the network or from a file." or "I need to implement <script> execution and document.write."
Use the parse5.ParserStream class.
"I don't need a document tree, but I need basic information about tags or attributes." or "I need to extract text content from a huge number of documents." or "I need to analyze content that is going through my proxy server."
Use the parse5.SAXParser class.
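To parse a plain text file as an HTML document the way browsers do, use the parse5.PlainTextConversionStream class.
To serialize a node to HTML and stream the result, use the parse5.SerializerStream class.
To get source location information for parsed nodes, use the locationInfo options: ParserOptions.locationInfo, SAXParserOptions.locationInfo.
To switch the output tree format, use the treeAdapter options (ParserOptions.treeAdapter and SerializerOptions.treeAdapter) with one of the two built-in tree formats.
To use your own tree format, implement the TreeAdapter interface and then pass it to the parser or serializer via the treeAdapter option.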
The parse5 package includes a TypeScript definition file, so you don't need to install any typings to use parse5 in TypeScript files. Note that since parse5 supports multiple output tree formats, you need to manually cast generic node interfaces to the appropriate tree format to get access to the properties:
import * as parse5 from 'parse5';

// Using default tree adapter.
var document1 = parse5.parse('<div></div>') as parse5.AST.Default.Document;

// Using htmlparser2 tree adapter.
var document2 = parse5.parse('<div></div>', {
    treeAdapter: parse5.TreeAdapters.htmlparser2
}) as parse5.AST.HtmlParser2.Document;

You can find documentation for interfaces in the API reference.
You can create a custom tree adapter, so that parse5 can work with your own DOM-tree implementation.
Then pass it to the parser or serializer via the treeAdapter option:
const parse5 = require('parse5');

const myTreeAdapter = {
    // Adapter methods...
};

const document = parse5.parse('<div></div>', { treeAdapter: myTreeAdapter });
const html = parse5.serialize(document, { treeAdapter: myTreeAdapter });

Refer to the API reference for the description of methods that should be exposed by the tree adapter, as well as links to their default implementation.
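To give a feel for what an adapter looks like, here is a partial sketch that is exercised by hand rather than by parse5. The method names follow the TreeAdapter interface, but the node shape (`kind`, `children`, `parent`, `value`) is purely hypothetical, and a real adapter has to implement every method listed in the API reference, not just the handful shown here.

```javascript
// Partial, illustrative tree adapter. Only the method names come from the
// TreeAdapter interface; the node layout below is an assumption made up
// for this sketch.
const myTreeAdapter = {
    createDocument() {
        return { kind: 'document', children: [], parent: null };
    },
    createElement(tagName, namespaceURI, attrs) {
        // attrs is an array of { name, value } pairs.
        return { kind: 'element', tagName, namespaceURI, attrs, children: [], parent: null };
    },
    appendChild(parentNode, newNode) {
        parentNode.children.push(newNode);
        newNode.parent = parentNode;
    },
    insertText(parentNode, text) {
        const last = parentNode.children[parentNode.children.length - 1];
        if (last && last.kind === 'text') {
            // Merge with the preceding text node instead of creating a new one.
            last.value += text;
        } else {
            myTreeAdapter.appendChild(parentNode, { kind: 'text', value: text, parent: null });
        }
    },
    getChildNodes(node) { return node.children; },
    getParentNode(node) { return node.parent; },
    getTagName(element) { return element.tagName; },
    isTextNode(node) { return node.kind === 'text'; },
    isElementNode(node) { return node.kind === 'element'; }
};

// Build <div>Hello, world</div> by calling the adapter directly.
const doc = myTreeAdapter.createDocument();
const div = myTreeAdapter.createElement('div', 'http://www.w3.org/1999/xhtml', []);
myTreeAdapter.appendChild(doc, div);
myTreeAdapter.insertText(div, 'Hello, ');
myTreeAdapter.insertText(div, 'world');

console.log(myTreeAdapter.getChildNodes(div).length); // 1
console.log(myTreeAdapter.getChildNodes(div)[0].value); // Hello, world
```

Because the parser and serializer only touch nodes through these methods, the node objects themselves can have any shape that suits your DOM implementation.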
To use parse5 in the browser, compile it with browserify and you're set.
Q: I'm parsing <img src="foo"> with the SAXParser and I expect the selfClosing flag to be true for the <img> tag. But it's not. Is there something wrong with the parser?
No. A self-closing tag is a tag that has a / before the closing bracket, e.g. <br/>, <meta/>.
In the provided example, the tag simply doesn't have an end tag. Self-closing tags and tags without end tags are treated differently by the
parser: for a self-closing tag, the parser does not look for the corresponding closing tag and expects the element not to have any content.
But if a start tag is not self-closing, the parser treats everything that follows it (with a few exceptions) as the element content.
However, if the start tag is in the list of void elements, the parser expects the corresponding
element not to have content and behaves the same way as if the element were self-closing. So, semantically, if an element is
void, self-closing tags and tags without closing tags are equivalent, but this is not true for other tags.
TL;DR: selfClosing is a part of lexical information and is set only if the tag has / before the closing bracket in the source code.
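The difference between the lexical flag and the void-element rule can be illustrated with a toy helper. This is not parse5 code and not how its tokenizer works: `describeStartTag` and `VOID_ELEMENTS` are hypothetical names, the void-element list is not guaranteed to be exhaustive, and the regular expression ignores edge cases such as attribute values that contain `>` or unquoted values that end in `/`.

```javascript
// Common void elements (assumed list for this sketch; check the HTML
// specification for the authoritative one).
const VOID_ELEMENTS = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Toy helper: describes a single start tag given as source text.
function describeStartTag(source) {
    const match = /^<([a-zA-Z][^\s/>]*)[^>]*>$/.exec(source);

    if (!match) {
        throw new Error('Not a start tag: ' + source);
    }

    const tagName = match[1].toLowerCase();

    return {
        tagName,
        // Lexical information: true only if "/" directly precedes ">".
        selfClosing: source.endsWith('/>'),
        // Semantic information: a void element never has content.
        isVoid: VOID_ELEMENTS.has(tagName)
    };
}

console.log(describeStartTag('<img src="foo">'));
// { tagName: 'img', selfClosing: false, isVoid: true }
console.log(describeStartTag('<br/>'));
// { tagName: 'br', selfClosing: true, isVoid: true }
console.log(describeStartTag('<p>'));
// { tagName: 'p', selfClosing: false, isVoid: false }
```

For `<img src="foo">` the lexical flag is false even though the element is void, which mirrors the SAXParser behavior described in the answer above.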
If the parser output looks weird, it is most likely not a bug. There are a lot of weird edge cases in the HTML5 parsing algorithm, e.g.:

<b>1<p>2</b>3</p>

will be parsed as

<b>1</b><p><b>2</b>3</p>

Just try it in the latest version of your browser before submitting an issue.
- Fixed: location.startTag is not available if end tag is missing (GH #181).
- Fixed: MarkupData.Location.col description in TypeScript definition file (GH #170).
- Added: parse5 now ships with TypeScript definitions from which new documentation website is generated (GH #125).
- Added: PlainTextConversionStream (GH #135).
- Updated: Significantly reduced initial memory consumption (GH #52).
- Updated (breaking): Added support for limited quirks mode. The document.quirksMode property was replaced with the document.mode property, which can have 'no-quirks', 'quirks' and 'limited-quirks' values. Tree adapter setQuirksMode and isQuirksMode methods were replaced with setDocumentMode and getDocumentMode methods (GH #83).
- Updated (breaking): AST collections (e.g. attributes dictionary) don't have prototype anymore (GH #119).
- Updated (breaking): Doctype now always serialized as <!DOCTYPE html> as per spec (GH #137).
- Fixed: Incorrect line for __location.endTag when the start tag contains newlines (GH #166) (by @webdesus).
- Fixed: Incorrect LocationInfo.endOffset for non-implicitly closed elements (refix for GH #109) (by @wooorm).
- Fixed: Incorrect location info for text in SAXParser (GH #153).
- Fixed: Incorrect LocationInfo.endOffset for implicitly closed <p> element (GH #109).
- Fixed: Infinite input data buffering in streaming parsers. Now parsers try to not buffer more than 64K of input data. However, there are still some edge cases left that will lead to significant memory consumption, but they are quite exotic and extremely rare in the wild (GH #102, GH #130).
- Fixed: SAXParser HTML integration point handling for adjustable SVG tags.
- Fixed: SAXParser now adjusts SVG tag names for end tags.
- Fixed: Location info line calculation on tokenizer character unconsumption (by @ChadKillingsworth).
- SAXParser (by @RReverser)
  - Fixed: Handling of \n in <pre>, <textarea> and <listing>.
  - Fixed: Tag names and attribute names adjustment in foreign content (GH #99).
  - Fixed: Handling of <image>.
- Latest spec changes
  - Updated: <isindex> no longer has special handling (GH #122).
  - Updated: Adoption agency algorithm now preserves lexical order of text nodes (GH #129).
  - Updated: <menuitem> now behaves like <option>.
- Fixed: Element nesting corrections now take namespaces into consideration.
- Fixed: ParserStream accidentally hangs up on scripts (GH #101).
- Fixed: Keep ParserStream sync for the inline scripts (GH #98 follow up).
- Fixed: Synchronously calling resume() leads to crash (GH #98).
- Fixed: SAX parser silently exits on big files (GH #97).
- Fixed: location info not attached for empty attributes (GH #96) (by @yyx990803).
- Added: location info for attributes (GH #43) (by @sakagg and @yyx990803).
- Fixed: parseFragment with locationInfo regression when parsing <template> (GH #90) (by @yyx990803).
- Fixed: yet another case of incorrect parseFragment arguments fallback (GH #84).
- Fixed: parseFragment arguments processing (GH #82).
- Added: ParserStream with scripting support (GH #26).
- Added: SerializerStream (GH #26).
- Added: Line/column location info (GH #67).
- Update (breaking): Location info properties start and end were renamed to startOffset and endOffset respectively.
- Update (breaking): SimpleApiParser was renamed to SAXParser.
- Update (breaking): SAXParser is now a transform stream (GH #26).
- Update (breaking): SAXParser handler subscription is done via events now.
- Added: SAXParser.stop() (GH #47).
- Add (breaking): parse5.parse() and parse5.parseFragment() methods as replacement for the Parser class.
- Add (breaking): parse5.serialize() method as replacement for the Serializer class.
- Updated: parsing algorithm was updated with the latest HTML spec changes.
- Removed (breaking): decodeHtmlEntities and encodeHtmlEntities options (GH #75).
- Add (breaking): TreeAdapter.setTemplateContent() and TreeAdapter.getTemplateContent() methods (GH #78).
- Update (breaking): default tree adapter now stores <template> content in template.content property instead of template.childNodes[0].
- Fixed: Qualified tag name emission in Serializer (GH #79).
- Added: Location info for the element start and end tags (by @sakagg).
- Fixed: htmlparser2 tree adapter DocumentType.data property rendering (GH #45).
- Fixed: Location info handling for the implicitly generated <html> and <body> elements (GH #44).
- Added: Parser decodeHtmlEntities option.
- Added: SimpleApiParser decodeHtmlEntities option.
- Added: Parser locationInfo option.
- Added: SimpleApiParser locationInfo option.
- Fixed: <form> processing in <template> (GH #40).
- Fixed: text node in <template> serialization problem with custom tree adapter (GH #38).
- Added: Serializer encodeHtmlEntities option.
- Added: <template> support.
- parseFragment now uses <template> as default contextElement. This leads to the more "forgiving" parsing manner.
- TreeSerializer was renamed to Serializer. However, serializer is accessible as parse5.TreeSerializer for backward compatibility.
- Fixed: apply latest changes to the htmlparser2 tree format (DOM Level1 node emulation).
- Added: jsdom-specific parser with scripting support. Undocumented for jsdom internal use only.
- Added: logo
- Fixed: use fake document element for fragment parsing (required by jsdom).
- Development files (e.g. .travis.yml, .editorconfig) are removed from NPM package.
- Fixed: crash on Linux due to upper-case leading character in module name used in require().
- Added: SimpleApiParser.
- Fixed: new line serialization in <pre>.
- Fixed: SYSTEM-only DOCTYPE serialization.
- Fixed: quotes serialization in DOCTYPE IDs.
- First stable release, switch to semantic versioning.
- Fixed: siblings calculation bug in appendChild in htmlparser2 tree adapter.
- Added: TreeSerializer.
- Added: htmlparser2 tree adapter.
- Fixed: incorrect <menuitem> handling in <body>.
- Initial release.
