Commit
Showing 72 changed files with 12,833 additions and 500 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
all: | ||
@echo "See Makefile for possible targets!" | ||
|
||
build: | ||
python3 setup.py sdist bdist_wheel | ||
|
||
upload: | ||
python3 -m twine upload dist/* | ||
|
||
clean: | ||
rm -rf dist/ build/ *.egg-info/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,190 @@ | ||
# HTML2TEI | ||
Map the HTML schema of portals to valid TEI XML with the tags and structures used in them using small manual portal-specific configurations | ||
|
||
Map the HTML schema of portals to valid [TEI XML](https://tei-c.org/) with the tags and structures used in them using | ||
small manual portal-specific configurations. | ||
|
||
The portal-specific configuration is created manually with the help of three different tools which aid evaluating | ||
the inventory of the tags and structures used in the HTML code. The manual evaluation of such structures | ||
enables one to create a valid TEI XML from the HTML source keeping all desired (text) schema elements | ||
in a fine-grained way carefully supervised by the user. In addition to converting the article body, | ||
the metadata can be converted to the [Schema.org](https://schema.org/) standard. | ||
|
||
The conversion process is automatic and scales well on large portals with the same schema. | ||
|
||
## Requirements | ||
|
||
- Python 3.6+ | ||
- For Newspaper3k, the installation of the following packages must precede the installation of this program: | ||
`python3-dev libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng12-dev` | ||
|
||
## Install | ||
|
||
### pip | ||
|
||
`pip3 install html2tei` | ||
|
||
### Manual | ||
|
||
1. `git clone https://github.com/ELTE-DH/HTML2TEI.git` | ||
2. Run `python3 setup.py install` (you may have to use `sudo` at the beginning of this command) | ||
|
||
## Usage | ||
|
||
This program is designed to be used with [WebArticleCurator](https://github.com/elte-dh/WebArticleCurator/). | ||
The article WARC files should be placed in a directory (`warc-dir`) and a configuration YAML must | ||
map the WARC files to the specific portal configuration. | ||
The program can be run from the command line or from the Python API; see the details below. | ||
|
||
### Modes | ||
|
||
There are five modes of the program: | ||
|
||
- Create _HTML Content Tree_ (`content-tree`): Read all the articles to summarize all the structures that occur | ||
in the portal schema. Finally, the accumulated information represents the tree structure as a nested YAML dictionary | ||
(for manual inspection) | ||
- The _Tag Inventory Maker_ (`inventory-maker`): Create the tag tables from the articles with their | ||
gathered information (it will be the basis for manual configuration of renaming) | ||
- The _Tag Bigrams Maker_ (`bigram-maker`): Create the bigram tag table from the articles with their | ||
gathered information (this table is an add-on that can be used to map the schema) | ||
- The _Portal Article Cleaner_ (`cleaner`): Create the TEI XMLs from the site-specific configuration and | ||
from the tables supplemented with new label names | ||
- _Diff Tag Tables_ (`diff-tables`): Compare and update the generated (and modified) tables if there are new data | ||
for the same portal | ||
|
||
### Command Line Arguments | ||
|
||
#### Common Arguments | ||
|
||
- `-i`, `--input-config`: WARC filename to portal name mapping in YAML | ||
- `-c`, `--configs-dir`: The directory for portal-specific configs | ||
- `-l`, `--log-dir`: The directory for putting logs | ||
- `-w`, `--warc-dir`: The directory to read WARCs from | ||
- `-o`, `--output-dir`: The directory to put output files | ||
- `-L`, `--log-level`: Log verbosity level (default: INFO) | ||
|
||
The files and directories must be present. All arguments except `log-level` are mandatory for the following four modes. | ||
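
The common arguments above can be mirrored in a few lines of `argparse`. This is only an illustrative sketch and not the program's actual parser: the function name `build_common_parser` and the example paths are assumptions made for this example, while the option names and the `INFO` default come from the list above.

```python
import argparse


def build_common_parser():
    """Hypothetical parser that mirrors the common arguments listed above."""
    parser = argparse.ArgumentParser(description='Common arguments (illustrative sketch)')
    parser.add_argument('-i', '--input-config', required=True,
                        help='WARC filename to portal name mapping in YAML')
    parser.add_argument('-c', '--configs-dir', required=True,
                        help='The directory for portal-specific configs')
    parser.add_argument('-l', '--log-dir', required=True,
                        help='The directory for putting logs')
    parser.add_argument('-w', '--warc-dir', required=True,
                        help='The directory to read WARCs from')
    parser.add_argument('-o', '--output-dir', required=True,
                        help='The directory to put output files')
    # The only optional common argument
    parser.add_argument('-L', '--log-level', default='INFO',
                        help='Log verbosity level')
    return parser


# Placeholder file and directory names, assumed for the example only
example_args = build_common_parser().parse_args(
    ['-i', 'warc_to_portal.yaml', '-c', 'configs', '-l', 'logs', '-w', 'warcs', '-o', 'output'])
print(example_args.input_config, example_args.log_level)
```

Leaving out any of the five mandatory options makes `argparse` exit with a usage error, which matches the requirement stated above.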
|
||
#### HTML Content Tree (`content-tree`) | ||
|
||
- `-t`, `--task-name`: The name of the task to appear in the logs (default: HTML Content Tree) | ||
|
||
#### Tag Inventory Maker (`inventory-maker`) | ||
|
||
- `-t`, `--task-name`: The name of the task to appear in the logs (default: Tag Inventory Maker) | ||
- `-r`, `--recursive`: Use just direct descendants or all (default: True) | ||
|
||
#### Tag Bigrams Maker (`bigram-maker`) | ||
|
||
- `-t`, `--task-name`: The name of the task to appear in the logs (default: Tag Bigrams Maker) | ||
- `-r`, `--recursive`: Use just direct descendants or all (default: True) | ||
|
||
#### Portal Article Cleaner (`cleaner`) | ||
|
||
- `-m`, `--write-out-mode`: The schema removal tool to use (ELTEDH, JusText, Newspaper3k) (default: eltedh) | ||
- `-t`, `--task-name`: The name of the task to appear in the logs (default: Portal Article Cleaner) | ||
- `-O`, `--output-debug`: Normal output generation (validate-hash-compress and UUID file names) or print into | ||
the output directory without validation using human-friendly names (default: False, normal output) | ||
- `-p`, `--run-parallel`: Run processing in parallel or run all operations sequentially | ||
(default: True, parallel) | ||
- `-d`, `--with-specific-dicts`: Load portal-specific dictionaries (tables) (default: True) | ||
- `-b`, `--with-specific-base-tei`: Load portal-specific base TEI XML (default: True) | ||
|
||
#### Diff Tag Tables (`diff-tables`) | ||
|
||
- `--diff-dir`: The directory which contains the directories | ||
- `--old-filename`: The filename for the old table | ||
- `--new-filename`: The filename for the new table | ||
- `--merge-filename`: The filename for the merged table | ||
|
||
### Python API | ||
|
||
#### Helper functions for the Configs | ||
|
||
- `parse_date(date_raw, date_format, locale='hu_HU.UTF-8')`: Parse date according to the parameters | ||
(locale and date format) | ||
- `BASIC_LINK_ATTRS`: A basic list of html tags that contain attributes to preserve. It can be overwritten based on | ||
the set of the given portal | ||
- `decompose_listed_subtrees_and_mark_media_descendants(article_dec, decomp, media_list)`: | ||
Mark the lower level of the media blocks and delete tags to be deleted | ||
- `tei_defaultdict(mandatory_keys=('sch:url', 'sch:name'), missing_value=None)`: | ||
Create a defaultdict preinitialized with the mandatory Schema.org keys set to default | ||
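
To make the intent of two of these helpers concrete, here is a minimal sketch written only from the one-line descriptions above. It is not the library's implementation: the `_sketch` names, returning `None` for unparsable dates, and the fallback when the requested locale is not installed are all assumptions.

```python
import locale
from collections import defaultdict
from datetime import datetime


def parse_date_sketch(date_raw, date_format, locale_name='hu_HU.UTF-8'):
    """Parse date_raw with date_format under the given locale (assumed behaviour)."""
    saved = locale.setlocale(locale.LC_TIME)  # Query the current setting
    try:
        try:
            locale.setlocale(locale.LC_TIME, locale_name)
        except locale.Error:
            pass  # Assumption: keep the current locale if the requested one is missing
        try:
            return datetime.strptime(date_raw, date_format)
        except ValueError:
            return None  # Assumption: signal unparsable input with None
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # Always restore the original locale


def tei_defaultdict_sketch(mandatory_keys=('sch:url', 'sch:name'), missing_value=None):
    """Create a defaultdict whose mandatory keys already exist with the default value."""
    ret = defaultdict(lambda: missing_value)
    for key in mandatory_keys:
        ret[key] = missing_value
    return ret


metadata = tei_defaultdict_sketch()
metadata['sch:datePublished'] = parse_date_sketch('2020-01-27', '%Y-%m-%d')
print(sorted(metadata))
```

Preinitializing the mandatory keys means they show up when the dictionary is iterated even if a portal-specific config never sets them, which is presumably why the helper exists.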
|
||
#### For the Main Python API | ||
|
||
- `run_main(warc_filename, configs_dir, log_dir, warc_dir, output_dir, init_portal_fun, | ||
run_params=None, logfile_level='INFO', console_level='INFO')`: Main runner function | ||
- `WRITE_OUT_MODES`: A dictionary to add custom write-out modes when needed | ||
- `diff_all_tag_table(diff_dir, old_filename, new_filename, out_filename)`: The main function to update tables | ||
- `tag_bigrams_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params, | ||
rest_config_params)`: The portal initiator function as called from CLI argument | ||
- `content_tree_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params, | ||
rest_config_params)`: The portal initiator function as called from CLI argument | ||
- `tag_inventory_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params, | ||
rest_config_params)`: The portal initiator function as called from CLI argument | ||
- `portal_article_cleaner_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params, | ||
rest_config_params)`: The portal initiator function as called from CLI argument | ||
|
||
#### For the Low-level API: Defining Custom Modes | ||
|
||
- `init_output_writer(output_dir, portal_name, output_debug, tei_logger)`: Initialises the class for writing output | ||
(into a zipfile or a directory) | ||
- `create_new_tag_with_string(beauty_xml, tag_string, tag_name, append_to=None)`: Helper function to create | ||
a new XML tag containing string in it. If provided append the newly created tag to a parent tag | ||
- `immediate_text(tag)`: Count the number of words (non-whitespace text) immediately under | ||
the parameter tag excluding comments | ||
- `to_friendly(ch, excluded_tags_fun)`: Convert tag name and sorted attributes to string in order to use it later | ||
(e.g. tag_freezer in the tables) | ||
- `run_single_process(warc_filename, file_names_and_modes, main_function, sub_functions, after_function, after_params)`: | ||
Read a WARC file and sequentially process all articles in it with main_function (multi-page articles are handled | ||
as one entry) and yield the results after filtering them through `after_function` | ||
- `run_multiple_process(warc_filename, file_names_and_modes, main_function, sub_functions, after_function, | ||
after_params)`: Read a WARC file and process all articles in it with main_function in parallel, preserving | ||
ordering (multi-page articles are handled as one entry), and yield the results after filtering them through `after_function` | ||
- `dummy_fun(*_)`: A function that always returns None no matter how many arguments are given | ||
- `process_article`: A generic article processing skeleton used by multiple targets | ||
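
The "process in parallel, preserve ordering, then pass through `after_function`" pattern described for the runner functions can be sketched as follows. This is a simplified stand-in, not the library's code: it uses a thread pool from the standard library instead of reading a WARC file, the name `run_in_order` and the `max_workers` parameter are made up for the example, and it assumes that `after_function` returns `None` for results that should be dropped.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_order(articles, main_function, after_function, max_workers=4):
    """Apply main_function to every article in parallel and yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # regardless of which worker finishes first
        for result in pool.map(main_function, articles):
            filtered = after_function(result)
            if filtered is not None:  # Assumption: None marks a dropped result
                yield filtered


# Toy stand-ins for article processing and post-filtering
processed = list(run_in_order(['a', 'bb', 'ccc'], len, lambda n: n if n != 2 else None))
print(processed)
```

Because the function is a generator, results become available one by one, which keeps memory use bounded for large inputs.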
|
||
# Licence | ||
|
||
This project is licensed under the terms of the GNU LGPL 3.0 license. | ||
|
||
# References | ||
|
||
The DOI of the code is: TODO | ||
|
||
If you use this program, please cite the following paper: | ||
|
||
[__The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata__ Balázs Indig, Árpád Knap, | ||
Zsófia Sárközi-Lindner, Mária Timári, Gábor Palkó _In the Proceedings of the 12th Web as Corpus Workshop (WAC XII)_, | ||
pages 33-41 Marseille, France 2020](https://www.aclweb.org/anthology/2020.wac-1.5.pdf) | ||
|
||
``` | ||
@inproceedings{indig-etal-2020-elte, | ||
title = "The {ELTE}.{DH} Pilot Corpus {--} Creating a Handcrafted {G}igaword Web Corpus with Metadata", | ||
author = {Indig, Bal{\'a}zs and | ||
Knap, {\'A}rp{\'a}d and | ||
S{\'a}rk{\"o}zi-Lindner, Zs{\'o}fia and | ||
Tim{\'a}ri, M{\'a}ria and | ||
Palk{\'o}, G{\'a}bor}, | ||
booktitle = "Proceedings of the 12th Web as Corpus Workshop", | ||
month = may, | ||
year = "2020", | ||
address = "Marseille, France", | ||
publisher = "European Language Resources Association", | ||
url = "https://www.aclweb.org/anthology/2020.wac-1.5", | ||
pages = "33--41", | ||
abstract = "In this article, we present the method we used to create a middle-sized corpus using | ||
targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful | ||
for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies | ||
rule-based components that allow the curation of the text and the metadata content. The curated data can thereon | ||
serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and | ||
customisation. Our concept can also be applied to other genres of portals by using the discovered patterns | ||
in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus, | ||
our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal | ||
manpower and computational resources. Reproducing the corpus is possible if changes are introduced | ||
to the text-extraction process. The standard TEI format and Schema.org encoded metadata is used | ||
for the output format, but we stress that placing the corpus in a digital repository system is recommended | ||
in order to be able to define semantic relations between the segments and to add rich annotation.", | ||
language = "English", | ||
ISBN = "979-10-95546-68-9", | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?> | ||
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" | ||
schematypens="http://purl.oclc.org/dsdl/schematron"?> | ||
<TEI xmlns="http://www.tei-c.org/ns/1.0"> | ||
<teiHeader> | ||
<fileDesc> | ||
<titleStmt> | ||
<title></title> | ||
</titleStmt> | ||
<editionStmt> | ||
<edition>ELTE-DH webcrawling</edition> | ||
<respStmt> | ||
<resp>creator</resp> | ||
<orgName>ELTE-DH<ref type="http://elte-dh.hu"/> | ||
</orgName> | ||
</respStmt> | ||
<respStmt> | ||
<resp>project director</resp> | ||
<persName><surname>Palkó</surname><forename>Gábor</forename> | ||
<ref>https://orcid.org/0000-0002-4394-8577</ref></persName> | ||
</respStmt> | ||
<respStmt> | ||
<resp>chief programmer</resp> | ||
<persName><surname>Indig</surname><forename>Balázs</forename> | ||
<ref>https://orcid.org/0000-0001-8090-3661</ref></persName> | ||
</respStmt> | ||
<respStmt> | ||
<resp>TEI expert</resp> | ||
<persName><surname>Fellegi</surname><forename>Zsófia</forename> | ||
<ref>https://orcid.org/0000-0001-9199-1759</ref></persName> | ||
</respStmt> | ||
<respStmt> | ||
<resp>programmer</resp> | ||
<persName><surname>Sárközi-Lindner</surname><forename>Zsófia</forename> | ||
<ref>https://orcid.org/0000-0002-2558-0633</ref></persName> | ||
</respStmt> | ||
</editionStmt> | ||
<publicationStmt> | ||
<publisher> | ||
<orgName>ELTE-DH</orgName> | ||
<ref type="url">http://elte-dh.hu/</ref> | ||
</publisher> | ||
<pubPlace>Budapest <ref type="url">http://www.geonames.org/3054643</ref> | ||
</pubPlace> | ||
<date>2020</date> | ||
<availability> | ||
<p>Metadata: IN COPYRIGHT - NON-COMMERCIAL USE PERMITTED<ref type="url" | ||
>http://rightsstatements.org/vocab/InC-NC/1.0/</ref></p> | ||
<p>Text: IN COPYRIGHT <ref type="url" | ||
>http://rightsstatements.org/vocab/InC/1.0/</ref> | ||
</p> | ||
</availability> | ||
<idno type="PID"></idno> | ||
</publicationStmt> | ||
<sourceDesc> | ||
<bibl> | ||
<title></title> | ||
<publisher><orgName>Real Reporting Foundation</orgName> | ||
<placeName>1377-C Spencer Avenue, Lancaster, PA 17603</placeName> | ||
<ref type="url" source="https://abcug.hu/impresszum/">https://doi.org/10.5281/zenodo.3974489</ref> | ||
<date when="2020-10-01"/> | ||
</publisher> | ||
<pubPlace> | ||
Budapest | ||
<ref type="url">http://www.geonames.org/3054643</ref> | ||
</pubPlace> | ||
<availability><p>Copyright © 2017 · Newslanc.com LLC. Minden jog fenntartva.</p> | ||
<p><ref type="url" source="https://abcug.hu/impresszum/">https://doi.org/10.5281/zenodo.3974489</ref> | ||
<date when="2020-10-01"/></p> | ||
</availability> | ||
<date></date> | ||
</bibl> | ||
</sourceDesc> | ||
</fileDesc> | ||
<xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" | ||
xmlns:sch="https://schema.org" xmlns:skos="http://www.w3.org/2008/05/skos-xl#"> | ||
<rdf:RDF> | ||
<rdf:Description rdf:about=""> | ||
<sch:type rdf:resource="https://schema.org/NewsArticle"/> | ||
<sch:ispartOf rdf:resource="https://abcug.hu/">Abcúg</sch:ispartOf> | ||
<sch:inLanguage>hun</sch:inLanguage> | ||
<sch:license rdf:resource="http://rightsstatements.org/vocab/InC-EDU/1.0/">In Copyright</sch:license> | ||
</rdf:Description> | ||
</rdf:RDF> | ||
</xenoData> | ||
<xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" | ||
xmlns:sch="https://schema.org"> | ||
<rdf:RDF> | ||
<rdf:Description rdf:about="https://doi.org/10.5281/zenodo.3974489"> | ||
<sch:type>WARC/1.1</sch:type> | ||
<sch:sdDatePublished> | ||
2020-01-27T18:58:23/2020-01-27T22:58:20 | ||
</sch:sdDatePublished> | ||
<sch:identifier rdf:about="https://doi.org/10.5281/zenodo.3974489"/> | ||
<sch:identifier> [{"checksum": "9e98422362b60eae233f0e569faaee3e", "filename": | ||
"abcug-archive_new.warc.gz", "filesize": 2936627, "id": | ||
"b84379de-7f79-422c-b06f-9babb7e311ec", "links": {"download": | ||
"https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/abcug-archive_new.warc.gz", | ||
"self": | ||
"https://zenodo.org/api/deposit/depositions/3974489/files/b84379de-7f79-422c-b06f-9babb7e311ec"}}, | ||
{"checksum": "bbb88779b071590ac08188d3c80742bb", "filename": | ||
"abcug-articles_new.warc.gz", "filesize": 40344565, "id": | ||
"5aa84456-b59d-4d48-804e-694a70e86df7", "links": {"download": | ||
"https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/abcug-articles_new.warc.gz", | ||
"self": | ||
"https://zenodo.org/api/deposit/depositions/3974489/files/5aa84456-b59d-4d48-804e-694a70e86df7"}}, | ||
{"checksum": "f044cd977eb9d214dd270ac879e76da0", "filename": "log.log", | ||
"filesize": 24638, "id": "925a1387-226e-449b-bc55-aba9042e71b1", "links": | ||
{"download": | ||
"https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/log.log", | ||
"self": | ||
"https://zenodo.org/api/deposit/depositions/3974489/files/925a1387-226e-449b-bc55-aba9042e71b1"}}, | ||
{"checksum": "2e6ca52568a04f9977fc82ecb9bf2bc0", "filename": "logs.tar.gz", | ||
"filesize": 2562, "id": "43b691af-e41e-4961-80fd-3289877eb0f6", "links": | ||
{"download": | ||
"https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/logs.tar.gz", | ||
"self": | ||
"https://zenodo.org/api/deposit/depositions/3974489/files/43b691af-e41e-4961-80fd-3289877eb0f6"}}, | ||
{"checksum": "0ca66bb77ce5179dc6b25ccd754a0c69", "filename": "script.sh", | ||
"filesize": 384, "id": "b385442f-e525-43d8-9b8e-2a14e25d67f0", "links": | ||
{"download": | ||
"https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/script.sh", | ||
"self": | ||
"https://zenodo.org/api/deposit/depositions/3974489/files/b385442f-e525-43d8-9b8e-2a14e25d67f0"}}] </sch:identifier> | ||
<sch:identifier>urn:uuid:930b53b8-ef8e-4406-be4d-37ad64e9a549</sch:identifier> </rdf:Description> | ||
</rdf:RDF> | ||
</xenoData> | ||
<xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" | ||
xmlns:sch="https://schema.org"> | ||
<rdf:RDF> | ||
<rdf:Description rdf:about="teiPid"> | ||
<sch:identifier></sch:identifier> | ||
<sch:type>TEI</sch:type> | ||
<sch:sdDatePublished></sch:sdDatePublished> | ||
<sch:lastReviewed></sch:lastReviewed> | ||
<sch:contributor rdf:resource="https://orcid.org/0000-0002-4394-8577">Palkó Gábor</sch:contributor> | ||
<sch:contributor rdf:resource="https://orcid.org/0000-0001-8090-3661">Indig Balázs</sch:contributor> | ||
<sch:contributor rdf:resource="https://orcid.org/0000-0001-9199-1759">Fellegi Zsófia</sch:contributor> | ||
<sch:contributor rdf:resource="https://orcid.org/0000-0002-2558-0633">Sárközi-Lindner Zsófia</sch:contributor> | ||
<sch:license rdf:resource="http://rightsstatements.org/vocab/InC/1.0/"/> | ||
</rdf:Description> | ||
</rdf:RDF> | ||
</xenoData> | ||
<revisionDesc> | ||
<change source="teiPID">TEI file created</change> | ||
</revisionDesc> | ||
</teiHeader> | ||
<text> | ||
<body> | ||
</body> | ||
</text> | ||
</TEI> | ||
|