HtmlHandler, for normalizing tag cases #24

kirbysayshi · 2011-06-01T04:59:56Z

As I thought through #20 and #22, I realized that the problem was not with the parser itself but with the results it produced. Rather than hacking on the parser and risking breakage in things like RSS/XML support, I decided a better approach would be to create another handler, called HtmlHandler. It embraces the case-insensitive nature of HTML tags and calls toUpperCase() on all tag names to respect the standard. When reserializing, the printHtml method (provided by tomdz) now calls toLowerCase() on all tags, because it's printing HTML, not XML/RSS.
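To make the idea concrete, here is a rough sketch of the two halves: upper-casing tag names in the parsed tree and lower-casing them again when printing. It is not the code from this pull request. The node shape (`type`, `name`, `attribs`, `children`, `data`) and the function names `normalizeTagNames` and `printLowerCaseHtml` are assumptions made up for illustration.

```javascript
// Hypothetical sketch, not the HtmlHandler from this pull request.
// Assumed node shapes: { type: "tag", name, attribs, children } and { type: "text", data }.

// Upper-case every tag name in the tree, in place, and return the tree.
function normalizeTagNames(nodes) {
  for (const node of nodes) {
    if (node.type === "tag") {
      node.name = node.name.toUpperCase();
      normalizeTagNames(node.children || []);
    }
  }
  return nodes;
}

// Serialize the tree back to HTML, lower-casing tag names on the way out.
// Simplified: no attribute escaping and no special handling of void elements.
function printLowerCaseHtml(nodes) {
  return nodes.map(function (node) {
    if (node.type === "text") return node.data;
    const name = node.name.toLowerCase();
    const attrs = Object.keys(node.attribs || {})
      .map(function (key) { return " " + key + '="' + node.attribs[key] + '"'; })
      .join("");
    return "<" + name + attrs + ">" + printLowerCaseHtml(node.children || []) + "</" + name + ">";
  }).join("");
}

// Mixed-case input, as a parser might report it.
const dom = normalizeTagNames([
  { type: "tag", name: "DiV", attribs: { id: "a" }, children: [
    { type: "tag", name: "sPaN", attribs: {}, children: [{ type: "text", data: "hi" }] }
  ] }
]);

console.log(dom[0].name);             // DIV
console.log(printLowerCaseHtml(dom)); // <div id="a"><span>hi</span></div>
```

Keeping the normalization in a handler rather than in the parser means XML/RSS consumers, where case matters, are left untouched.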

I've updated all the tests and added a few that cover tags with mixed case. This fork is currently in production on https://citational.com.

Please let me know your thoughts, as I'm more than willing to hear alternate opinions!

fb55 and others added 30 commits June 2, 2012 20:45

removed switch in Stream.js

7750ec1

fixed whitespace

04476a0

quick fix for tautologistics#19

18d3f37

Fix getOuterHTML for directives

69c9f0f

Merge pull request tautologistics#21 from lahmatiy/master

f8e6aad

fix of htmlparser.DomUtils.getOuterHTML for directives

added lowerCaseAttributeNames option

82455a9

yep, it's insanely short

2.3.0

e0d359e

Added a onopentagend event

a8c13c8

to get a signal when there won't be any more attributes coming

moved DomHandler & DomUtils to their own module

c1dfdda

they are now available as `domhandler`

Updated readme

c0b7eda

2.3.1

a928109

publish the element types from DomHandler

b90c1e6

use numeric element types

b6c4a73

'cause numbers are faster to compare; NOT breaking due to last commit

don't expose HandlerModule

401cc09

fixed travis badge

f5925c9

stylistic changes

181c31b

use the new dom modules, 2.5.0

84012d6

Attention: The DOM changes slightly.

Made the attribute regular expression more correct with regards to un…

b3bc413

…quoted attribute values. Require self-closing tags to be void

I didn't understand how RegExps worked in this way, and was desynchin…

0f71a49

…g the attributes count. Here's a different way to accomplish the same thing.

Revert "stylistic changes"

f7b6d54

This reverts commit 181c31b.

Revert "Revert "stylistic changes""

c75da20

This reverts commit f7b6d54.

added missing comma in benchmark script

6730fde

domelementtype must be version 1.x (not 1.0)

840291e

2.5.1

46cd546

Merge branch 'master' of https://github.com/fb55/node-htmlparser

a68f329

Better handling of implied close tags. A list is given of tags whose …

a83c708

…close is implied by other tags being opened, and these are closed when those tags are opened. This helps correctly parse things like lists and tables with unterminated LI or TD tags.

spaces -> tabs, thought the merge would update my local files to the …

a1777a9

…correct spacing (and tried to match that)

Derp.

a126b18

added missing comma in benchmark script

5a72c28

domelementtype must be version 1.x (not 1.0)

eca12d8

fb55 added 29 commits August 18, 2013 20:07

3.2.2

36ee76e

[tokenizer] reintroduced _special, removed IN_SCRIPT and IN_STYLE

cce466c

also fixed some semantics

3.2.3

effc3a9

only respect self-closing tags in XML mode

e4fb613

[parser] properly removed self-closing tag support

80a1ecb

also replaced call to `Array#slice` with setting the stack's `length` property

[tests] read files in the tests file, improved os interoperability of…

0347cd7

… stream test

[tests] added helper.getCallback method

be0dafa

[tests] converted tests to mocha

b948e86

[tests] renamed tests dir to test

8737bf1

as required by mocha

[package] run mocha as the test script

96a00fb

Delete .DS_Store

41ad914

[tokenizer] emit onattribdata in _handleTrailingData

fc22b7d

fixes tautologistics#66

[tests] simplifications

336af9b

3.2.4

fc0918c

[readme] updated performance characteristics

7b1e4c9

[tokenizer] handle << correctly

76643d3

fixes cheeriojs/cheerio#247

3.2.5

2f24491

[tests] added test case for cheeriojs/cheerio#247

834d6d2

update to [email protected], updated FeedHandler accordingly, bump

994cfda

[tests] write only single characters for testing chunked data

11eba28

failed previously (only for FeedHandler tests), fixed now due to DomHandler upgrade (which removed the `ignoreWhitespace` option)

[package] require [email protected]

029c565

as requested in fb55/css-select#11

package: update readable-stream

e6418c2

package: use simple license field

0e5775c

replace non-breaking space with regular space

2c568d3

as requested in tautologistics#70

index: pass options argument to constructors

c9d4abe

tests: remove unused cb argument

298546c

feedhandler: wrap assignments

f9bc72f

tests: changed indentation to tabs

5f244df

package: updated dom module versions, 3.4.0

7153b27

kirbysayshi closed this Feb 22, 2021
