A Node.js package for basic HTML parsing and CSS selector queries.
htmlSoup.parse(htmlString, trimText = true) -> DOM
- htmlString: The HTML to parse (string or Buffer). If an & is used followed by an alphanumeric character or #, it will be assumed to start an HTML escape sequence. If a tag that is supposed to have a closing tag does not have one, it will be assumed to continue until a closing tag that doesn't close an inner element or the end of the document is reached. Closing tags will close the innermost open tag preceding them regardless of whether the types match.
- trimText: Whether to trim all text (removing leading or trailing whitespace) between HTML tags. If the trimmed text is empty, no text node will be created.
- DOM format: Either a single TextNode or HtmlTag or an array of instances of either class.
  - TextNode has a single field, text, containing the text inside.
  - HtmlTag has the following fields:
    - type: The HTML tag type, e.g. div. If the document uses an uppercase tag, this field's value will be uppercased as well.
    - attributes: An Object mapping attribute names to string values if provided, or true if no value is provided. For example, <input type = "checkbox" checked /> gives an attributes value of {type: 'checkbox', checked: true}. Attributes are automatically lower-cased.
    - children: An Array of child nodes. Each is either a TextNode or HtmlTag.
    - parent: The parent HtmlTag. On the root node, this field has the value null.
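
The closing-tag rule described above can be pictured with a small stack-style sketch. This is not the package's implementation: the `nest` and `outline` helpers and the simplified token objects are hypothetical, and ignoring a stray closing tag at the root is an assumption made here for illustration. Void elements such as `<br>` are not modeled.

```javascript
// Hypothetical helper, not part of the package: nests a simplified token list
// to illustrate "a closing tag closes the innermost open tag".
function nest(tokens) {
  const root = { type: '', children: [], parent: null }
  let current = root
  for (const token of tokens) {
    if (token.open) {
      // Opening tag: becomes a child of the current tag and the new innermost tag
      const tag = { type: token.open, children: [], parent: current }
      current.children.push(tag)
      current = tag
    } else if (token.close !== undefined) {
      // Closing tag: close the innermost open tag, whatever its type.
      // Assumption: a stray close with nothing open is ignored.
      if (current.parent) current = current.parent
    } else {
      current.children.push({ text: token.text })
    }
  }
  // Tags still open here simply run to the end of the document
  return root
}

// Hypothetical helper: renders the tree as a compact string for inspection
function outline(node) {
  if ('text' in node) return JSON.stringify(node.text)
  return `${node.type}(${node.children.map(outline).join(' ')})`
}

const tree = nest([
  { open: 'div' },
  { open: 'b' },
  { text: 'Hi' },
  { close: 'i' }, // mismatched type: still closes <b>
  { text: 'there' }
  // no </div>: the div continues to the end of the input
])
```

Here `outline(tree)` yields `(div(b("Hi") "there"))`: the mismatched `</i>` closed `<b>`, and the unclosed `<div>` absorbed the trailing text.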
When navigating the DOM tree, you can use htmlTag.child to get the first child of a tag. htmlTag.classes will give a Set of the tag's classes.
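
As a rough illustration of working with this tree shape, the sketch below builds plain objects with the documented field names (`type`, `attributes`, `children`, `parent`, `text`) instead of calling the package. The `textContent` and `classSet` helpers are hypothetical; `classSet` only shows one way a set of classes could be derived from the `class` attribute and is not necessarily how `htmlTag.classes` is computed.

```javascript
// Plain objects mimicking the documented fields; not produced by the package
const greeting = { text: 'Hello ' }
const name = { text: 'world' }
const strong = { type: 'b', attributes: {}, children: [name], parent: null }
const paragraph = {
  type: 'p',
  attributes: { class: 'intro lead' },
  children: [greeting, strong],
  parent: null
}
strong.parent = paragraph

// Hypothetical helper: concatenates all text beneath a node, depth-first
function textContent(node) {
  if ('text' in node) return node.text
  return node.children.map(textContent).join('')
}

// Hypothetical helper: splits the class attribute on whitespace.
// A missing or valueless (true) class attribute gives an empty Set.
function classSet(tag) {
  const value = tag.attributes.class
  return new Set(typeof value === 'string' ? value.split(/\s+/).filter(Boolean) : [])
}
```

With these objects, `textContent(paragraph)` is `'Hello world'` and `classSet(paragraph)` contains `'intro'` and `'lead'`.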
htmlSoup.select(dom, selectorString) -> Set<HtmlTag>
- dom: DOM tree to search through (presumably an output of htmlSoup.parse())
- selectorString: A CSS selector string specifying which elements to select. Allowed parts of the selector (can be combined):
  - *: select elements of any type
  - tag: select elements of type tag (case-insensitive)
  - .class: select elements of class class
  - #id: select elements of id id
  - selector1 selector2: select elements matching selector2 that are descendants of elements matching selector1
  - selector1 > selector2: select elements matching selector2 that are children of elements matching selector1
  - selector1 + selector2: select elements matching selector2 that are siblings of and directly follow elements matching selector1
  - selector1 ~ selector2: select elements matching selector2 that are siblings of and follow elements matching selector1
  - selector1, selector2: select elements matching either selector1 or selector2
  - [attr]: select elements with attribute attr present
  - [attr=val] or [attr="val"]: select elements with attribute attr having the value val
  - [attr~=val] or [attr~="val"]: select elements with attribute attr's value containing val, with val preceded by a hyphen, a space, or the start of the value, and followed by a hyphen, a space, or the end of the value
  - [attr|=val] or [attr|="val"]: select elements with attribute attr's value starting with val, followed by a hyphen, a space, or the end of the value
  - [attr^=val] or [attr^="val"]: select elements with attribute attr's value starting with val
  - [attr$=val] or [attr$="val"]: select elements with attribute attr's value ending with val
  - [attr*=val] or [attr*="val"]: select elements with attribute attr's value containing val
  - These CSS pseudo-classes are also supported: :checked, :disabled, :empty, :first-child, :first-of-type, :indeterminate, :last-child, :last-of-type, :nth-child(), :nth-last-child(), :nth-last-of-type(), :nth-of-type(), :only-child, :only-of-type, :optional, :required, :root
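
The attribute operators listed above can be read as simple string tests. The sketch below is one possible reading of those descriptions, not the package's own code: `matchesAttribute` and `escapeRegExp` are hypothetical names, and treating an absent or valueless attribute (stored as `true`) as never matching a value comparison is an assumption.

```javascript
// Escape regex metacharacters so val is matched literally
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: tests one attribute value against an operator and val,
// following the wording of the list above.
function matchesAttribute(value, operator, val) {
  // Assumption: absent or valueless (true) attributes never match a value test
  if (typeof value !== 'string') return false
  const v = escapeRegExp(val)
  switch (operator) {
    case '=': return value === val
    // val bounded on both sides by a hyphen, a space, or an end of the value
    case '~=': return new RegExp(`(^|[- ])${v}([- ]|$)`).test(value)
    // val at the start, then a hyphen, a space, or the end of the value
    case '|=': return new RegExp(`^${v}([- ]|$)`).test(value)
    case '^=': return value.startsWith(val)
    case '$=': return value.endsWith(val)
    case '*=': return value.includes(val)
    default: throw new Error(`Unknown operator: ${operator}`)
  }
}
```

For example, `matchesAttribute('btn btn-primary', '~=', 'btn')` is `true`, while `matchesAttribute('button', '~=', 'btn')` is `false`. Because this follows the descriptions above, `~=` also accepts hyphen boundaries, which is broader than the whitespace-only word matching defined for `~=` in the CSS specification.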

let dom = htmlSoup.parse('<div id="one">Hi</div>')
/*
HtmlTag {
  type: 'div',
  attributes: { id: 'one' },
  parent: HtmlTag {
    type: '',
    attributes: {},
    parent: null,
    children: [ [Circular] ]
  },
  children: [ TextNode { text: 'Hi' } ]
}
*/
let text = dom.child // TextNode { text: 'Hi' }

let firstYellow = htmlSoup.select(
  htmlSoup.parse(`
    <p>One</p>
    <p class="red yellow">Two</p>
    <p class="yellow">Three</p>
  `),
  'p.yellow:first-of-type'
)
/*
Set {
  HtmlTag {
    type: 'p',
    attributes: { class: 'red yellow' },
    parent: HtmlTag {
      type: '',
      attributes: {},
      parent: null,
      children: [Array]
    },
    children: [ [TextNode] ]
  }
}
*/

let {classes} = htmlSoup.parse('<div class="one two three"></div>')
// Set { 'one', 'two', 'three' }