ssc-gen is a Python-based DSL for describing parsers of HTML documents. A schema is translated into a standalone parsing module.
- HTML CSS selectors (CSS3 standard at minimum) and XPath
- regular expressions (PCRE)
- designed for parsing SSR (server-side rendered) HTML pages, NOT for REST API or GraphQL endpoints
- reduces boilerplate code
- generates modules that are independent of the project and can be reused
- generates docstring documentation and the signature of the parser output
- for a better IDE experience, generates typedefs and type annotations (if the target programming language supports them)
- supports annotating and parsing JSON-like strings from a document
- AST API for code generation, for developing your own converter
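The JSON-like strings feature above targets data that pages embed inside markup, for example a JSON blob in a script tag. The following is a minimal stdlib-only sketch of that kind of extraction, not ssc-gen's own API: the sample HTML, the `id="state"` attribute, and the `extract_state` helper are all assumptions made for illustration.

```python
import json
import re

# Hypothetical page with a JSON blob embedded in a <script> tag.
HTML = """
<html><head><title>Demo</title>
<script type="application/json" id="state">
{"user": {"name": "alice", "id": 7},
 "tags": ["a", "b"]}
</script></head><body></body></html>
"""

# re.DOTALL lets "." match newlines, so the multi-line JSON body is captured.
SCRIPT_RE = re.compile(
    r'<script[^>]*id="state"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_state(html: str) -> dict:
    """Return the JSON object embedded in the <script id="state"> tag."""
    match = SCRIPT_RE.search(html)
    if match is None:
        raise ValueError("embedded JSON not found")
    # json.loads tolerates the surrounding whitespace and newlines.
    return json.loads(match.group(1))


state = extract_state(HTML)
print(state["user"]["name"], state["tags"])
```

A regular expression first isolates the string, and a JSON decoder then turns it into typed data, which is the two-step idea the feature describes.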
Currently supported converters:
| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | Formatter dependency |
|---|---|---|---|---|---|---|
| Python (3.8-3.13) | bs4, lxml (typing_extensions if py < 3.10) | N | Y | Y | TypedDict (1), list, dict | ruff |
| ... | parsel (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) (typing_extensions if py < 3.10) | N | Y | N | ... | ... |
| ... | lxml (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| js (ES6) (2) | pure (firefox/chrome extension/nodejs) | Y | Y | Y | JSDoc | prettier |
| go (1.10+) (UNSTABLE) | goquery, gjson (4) | N | Y | N | struct(+json anchors), array, map | gofmt |
| lua (5.2+), luajit(2+) (UNSTABLE) (5) | lua-htmlparser, lrexlib(opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |
- CSS3 means support for the following selectors:
  - basic: (`tag`, `.class`, `#id`, `tag1,tag2`)
  - combined: (`div p`, `ul > li`, `h2 + p`, `title ~ head`)
  - attribute: (`a[href]`, `input[type='text']`, `a[href*='...']`, ...)
  - CSS3 pseudo classes: (`:nth-child(n)`, `:first-child`, `:last-child`)
- CSS4 means support for the following selectors:
  - `:nth-of-type()`, `:where()`, `:is()`, `:not()` etc.
- (1) This annotation type was deliberately chosen as a compromise. Python has many ways of serialization: namedtuple, dataclass, attrs, pydantic, msgspec, etc.
  - TypedDict behaves like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
- (2) ES2018 (ES9) or newer is required if you need the PCRE `re.S | re.DOTALL` flag
- (3) js output excludes built-in serialization methods and uses the standard Array and Map types. Focus on the signature documentation!
- (4) golang has not been tested much, so there may be issues
- formatter dependency: an optional dependency used to prettify the generated code and fix its code style
- (5) lua
  - Experimental research PoC; performance and stability are not guaranteed
  - Priority is on generating pure Lua without C library dependencies, using mva/htmlparser and dhkolf/dkjson
  - Translates unsupported CSS3 selectors into equivalent function calls:
    - for example, `div +p` is equivalent to `CssExt.combine_plus(root:select("div"), "p")`
  - Translates PCRE regex to string pattern matching (with restrictions); for more information see lua_re_compat.py
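Footnote (1) says that an adapter from TypedDict to another structure is easy to write. A minimal sketch of such an adapter follows, assuming a parser result with the same keys as the HelloWorld example in this README. The names `T_HelloWorld`, `Page`, and `to_page` are hypothetical and are not taken from generated code.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class T_HelloWorld(TypedDict):
    """Hypothetical shape of a parser result (the name is an assumption)."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class Page:
    """Application-side structure that we want to adapt to."""

    title: str
    links: List[str]


def to_page(item: T_HelloWorld) -> Page:
    # A TypedDict is a plain dict at runtime, so normal key access works.
    return Page(title=item["title"], links=list(item["a_hrefs"]))


raw: T_HelloWorld = {
    "title": "Example Domain",
    "a_hrefs": ["https://www.iana.org/domains/example"],
}
page = to_page(raw)
print(page)
```

The same pattern should carry over to attrs, pydantic, or msgspec models, since only `to_page` knows about both shapes.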
For maximum portability of the configuration to the target language:
- If possible, use CSS selectors: they are guaranteed to be converted to XPath
- Unlike JavaScript, most HTML parser libs implement only the CSS3 selectors standard, and they may not fully implement it! Check the HTML parser lib documentation about CSS selectors before implementing code. Examples:
  - Several libs do not support `+` operations (e.g. selectolax (modest), dart.universal_html)
    - For research purposes, the lua_htmlparser converter includes a translation for unsupported CSS3 query syntax
  - HTML parser libs may not support the attribute selectors `*=`, `~=`, `|=`, `^=`, `$=`
  - Several libs do not support pseudo classes (e.g. the standard dart.html lib lacks this feature).
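When a parser library lacks the `+` combinator, the query can be emulated with a plain function, which is the idea behind the Lua `CssExt.combine_plus` translation mentioned earlier. The sketch below illustrates that idea in Python with the stdlib `xml.etree.ElementTree` on well-formed markup. The function name is borrowed for readability, but its signature and behavior here are assumptions, not the ssc-gen implementation.

```python
import xml.etree.ElementTree as ET
from typing import List

# Small well-formed sample document (ElementTree needs valid XML).
DOC = ET.fromstring(
    "<body>"
    "<div>first</div><p>after first div</p>"
    "<span>x</span><p>not adjacent to a div</p>"
    "<div>second</div><p>after second div</p>"
    "</body>"
)


def combine_plus(root: ET.Element, left: str, right: str) -> List[ET.Element]:
    """Emulate the CSS query `left + right` for simple tag selectors."""
    found = []
    # ElementTree nodes have no sibling pointers, so walk every parent.
    for parent in root.iter():
        children = list(parent)
        # Pair each child with its immediate next sibling.
        for current, following in zip(children, children[1:]):
            if current.tag == left and following.tag == right:
                found.append(following)
    return found


matches = combine_plus(DOC, "div", "p")
print([p.text for p in matches])
```

Only tag names are compared here; a real fallback would also have to handle classes, ids, and attributes on both sides of the combinator.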
ssc_gen requires Python 3.10 or higher.

pip:

pip install ssc_codegen

uv:

uv pip install ssc_codegen

Example schema (the commands below assume it is saved as schema.py):

from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')

Note: this tool is developed for testing purposes, not for web-scraping tasks.
Download any HTML file and pass it as an argument:

ssc-gen parse-from-file index.html -t schema.py:HelloWorld

Short option descriptions:
- `-t`, `--target`: config schema file and class from which to start the parser

The same works with a URL or through Chrome:

ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld

Note: if the script cannot find the Chrome executable, provide it manually:

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium

Convert to code for use in projects:

Note: js is used for this example because it can be quickly tested in the browser developer console.

ssc-gen js schema.py -o .

Code output looks like this:

// autogenerated by ssc-gen DO NOT_EDIT
/***
*
* {
* "title": "String",
* "a_hrefs": "Array<String>"
* }*/
class HelloWorld {
constructor(doc) {
if (typeof doc === "string") {
this._doc = new DOMParser().parseFromString(doc, "text/html");
} else if (doc instanceof Document || doc instanceof Element) {
this._doc = doc;
} else {
throw new Error("Invalid input: Expected a Document, Element, or string");
}
}
_parseTitle(v) {
let v0 = v.querySelector("title");
return typeof v0.textContent === "undefined"
? v0.documentElement.textContent
: v0.textContent;
}
_parseAHrefs(v) {
let v0 = Array.from(v.querySelectorAll("a"));
return v0.map((e) => e.getAttribute("href"));
}
parse() {
return {
title: this._parseTitle(this._doc),
a_hrefs: this._parseAHrefs(this._doc),
};
}
}

Print output:

alert(JSON.stringify(new HelloWorld(document).parse()));

You can use any HTML source:
- parse from html files
- parse from http responses
- parse from browsers: playwright, selenium, chrome-cdp, etc.
- call curl in shell and parse STDIN
- use in STDIN pipelines with third-party tools like projectdiscovery/httpx
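As an illustration of the STDIN use case above, here is a minimal sketch of a script that reads an HTML document from a text stream and prints JSON. Because the API of a generated Python module is not shown in this README, `parse_html` is a hypothetical stdlib-only stand-in that returns the same keys as the JS example (`title`, `a_hrefs`). `main` and `SAMPLE` are also assumptions.

```python
import io
import json
import sys
from html.parser import HTMLParser
from typing import Dict, List, TextIO, Union


class _Collector(HTMLParser):
    """Collects the <title> text and every <a href> value."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.a_hrefs: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            # Anchors without an href are skipped in this sketch.
            if href is not None:
                self.a_hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(html: str) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical stand-in for a generated parser module."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return {"title": collector.title, "a_hrefs": collector.a_hrefs}


def main(stream: TextIO = sys.stdin) -> str:
    """Read a whole HTML document from a text stream and return JSON."""
    return json.dumps(parse_html(stream.read()))


# Demo with an in-memory stream so the sketch runs without a pipe.
SAMPLE = (
    "<html><head><title>Hello</title></head>"
    '<body><a href="/a">a</a><a href="/b">b</a></body></html>'
)
print(main(io.StringIO(SAMPLE)))
```

In a shell pipeline, replace the demo call with `print(main())` and run something like `curl -s https://example.com | python parse_stdin.py` (the file name is hypothetical).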
Documentation:
- Brief: about CSS selectors and regular expressions
- Explain: a short document on how to understand the DSL syntax
- LLM: an experimental prompt for generating code
- Explain: a short note on how to read and explain ssc-gen schema configs
- Quickstart: about CSS selectors and regular expressions
- Tutorial: basic usage of ssc-gen
- AST reference: about generating code from the AST


