Initial public version

ELTE-DH · May 13, 2021 · 9794ac1
1 parent ff404af
commit 9794ac1
Showing 72 changed files with 12,833 additions and 500 deletions.
diff --git a/.gitignore b/.gitignore
@@ -127,3 +127,13 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+#PyCharm
+.idea
+
+# Special
+*/*.warc.gz
+
+# *.txt
+tei_maker_output
+preparatory_output
diff --git a/LICENSE b/LICENSE
diff --git a/Makefile b/Makefile
@@ -0,0 +1,11 @@
+all:
+	@echo "See Makefile for possible targets!"
+
+build:
+	python3 setup.py sdist bdist_wheel
+
+upload:
+	python3 -m twine upload dist/*
+
+clean:
+	rm -rf dist/ build/ *.egg-info/
diff --git a/README.md b/README.md
@@ -1,2 +1,190 @@
 # HTML2TEI
-Map the HTML schema of portals to valid TEI XML with the tags and structures used in them using  small manual portal-specific configurations
+
+Map the HTML schema of portals to valid [TEI XML](https://tei-c.org/) with the tags and structures used in them using
+ small manual portal-specific configurations.
+
+The portal-specific configuration is created manually with the help of three different tools which aid in evaluating
+ the inventory of the tags and structures used in the HTML code. The manual evaluation of such structures
+ enables one to create a valid TEI XML from the HTML source keeping all desired (text) schema elements
+ in a fine-grained way carefully supervised by the user. In addition to converting the article body,
+ the metadata can be converted to the [Schema.org](https://schema.org/) standard.
+
+The conversion process is automatic and scales well on large portals with the same schema.
+
+## Requirements
+
+- Python 3.6+
+- For Newspaper3k, the installation of the following packages must precede the installation of this program:
+  `python3-dev libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng12-dev`
+
+## Install
+
+### pip
+
+`pip3 install html2tei`
+
+### Manual
+
+1. `git clone https://github.com/ELTE-DH/HTML2TEI.git`
+2. Run `python3 setup.py install` (you may have to use `sudo` at the beginning of this command)
+
+## Usage
+
+This program is designed to be used with [WebArticleCurator](https://github.com/elte-dh/WebArticleCurator/).
+The article WARC files should be placed in a directory (`warc-dir`) and a configuration YAML must
+ map the WARC files to the specific portal configuration.
+The program can be run from the command line or from the Python API; see the details below.
+
+### Modes
+
+There are five modes of the program:
+
+- Create _HTML Content Tree_ (`content-tree`): Read all the articles to summarize all the structures that occur
+  in the portal schema. Finally, the accumulated information is represented as a tree structure in a nested
+  YAML dictionary (for manual inspection)
+- The _Tag Inventory Maker_ (`inventory-maker`): Create the tag tables from the articles with their
+  gathered information (it will be the basis for manual configuration of renaming)
+- The _Tag Bigrams Maker_ (`bigram-maker`): Create the bigram tag table from the articles with their
+  gathered information (this table is an add-on that can be used to map the schema)
+- The _Portal Article Cleaner_ (`cleaner`): Create the TEI XMLs from the site-specific configuration and
+  from the tables supplemented with new label names
+- _Diff Tag Tables_ (`diff-tables`): Compare and update the generated (and modified) tables if there are new data
+  for the same portal
+
+### Command Line Arguments 
+
+#### Common Arguments
+
+- `-i`, `--input-config`: WARC filename to portal name mapping in YAML
+- `-c`, `--configs-dir`: The directory for portal-specific configs
+- `-l`, `--log-dir`: The directory for putting logs
+- `-w`, `--warc-dir`: The directory to read WARCs from
+- `-o`, `--output-dir`: The directory to put output files
+- `-L`, `--log-level`: Log verbosity level (default: INFO)
+
+The files and directories must exist. All arguments except `--log-level` are mandatory for the following four modes.
+
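The common arguments listed above could be modelled with `argparse` roughly as follows. This is a minimal sketch built only from the documented flags; it is not the parser shipped with the package, and the function name `build_common_parser` as well as the sample file and directory names are assumptions made for illustration.

```python
import argparse


def build_common_parser():
    """Build a parser mirroring the documented common options (illustrative only)."""
    parser = argparse.ArgumentParser(description='Sketch of the common HTML2TEI options')
    parser.add_argument('-i', '--input-config', required=True,
                        help='WARC filename to portal name mapping in YAML')
    parser.add_argument('-c', '--configs-dir', required=True,
                        help='The directory for portal-specific configs')
    parser.add_argument('-l', '--log-dir', required=True,
                        help='The directory for putting logs')
    parser.add_argument('-w', '--warc-dir', required=True,
                        help='The directory to read WARCs from')
    parser.add_argument('-o', '--output-dir', required=True,
                        help='The directory to put output files')
    # The only optional common argument, as stated in the README
    parser.add_argument('-L', '--log-level', default='INFO',
                        help='Log verbosity level')
    return parser


# Hypothetical file and directory names, used only to exercise the parser
args = build_common_parser().parse_args(
    ['-i', 'warcs.yaml', '-c', 'configs', '-l', 'logs', '-w', 'warcs', '-o', 'out'])
print(args.input_config, args.log_level)
```

Marking every option except `--log-level` as required reflects the rule that all other common arguments are mandatory; checking that the paths exist is left out of the sketch.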
+#### HTML Content Tree (`content-tree`)
+
+- `-t`, `--task-name`: The name of the task to appear in the logs (default: HTML Content Tree)
+
+#### Tag Inventory Maker (`inventory-maker`)
+
+- `-t`, `--task-name`: The name of the task to appear in the logs (default: Tag Inventory Maker)
+- `-r`, `--recursive`: Use just direct descendants or all (default: True)
+
+#### Tag Bigrams Maker (`bigram-maker`)
+
+- `-t`, `--task-name`: The name of the task to appear in the logs (default: Tag Bigrams Maker)
+- `-r`, `--recursive`: Use just direct descendants or all (default: True)
+
+#### Portal Article Cleaner (`cleaner`)
+
+- `-m`, `--write-out-mode`: The schema removal tool to use (ELTEDH, JusText, Newspaper3k) (default: eltedh)
+- `-t`, `--task-name`: The name of the task to appear in the logs (default: Portal Article Cleaner)
+- `-O`, `--output-debug`: Normal output generation (validate-hash-compress and UUID file names) or print into
+  the output directory without validation using human-friendly names (default: False, normal output)
+- `-p`, `--run-parallel`: Run processing in parallel or perform all operations sequentially
+  (default: True, parallel)
+- `-d`, `--with-specific-dicts`: Load portal-specific dictionaries (tables) (default: True)
+- `-b`, `--with-specific-base-tei`: Load portal-specific base TEI XML (default: True)
+
+#### Diff Tag Tables (`diff-tables`)
+
+- `--diff-dir`: The directory which contains the directories
+- `--old-filename`: The filename for the old table 
+- `--new-filename`: The filename for the new table
+- `--merge-filename`: The filename for the merged table
+
+### Python API
+
+#### Helper functions for the Configs
+
+- `parse_date(date_raw, date_format, locale='hu_HU.UTF-8')`: Parse date according to the parameters
+  (locale and date format) 
+- `BASIC_LINK_ATTRS`: A basic list of HTML tags that contain attributes to preserve. It can be overwritten based on
+  the set of the given portal
+- `decompose_listed_subtrees_and_mark_media_descendants(article_dec, decomp, media_list)`: 
+  Mark the lower level of the media blocks and delete tags to be deleted
+- `tei_defaultdict(mandatory_keys=('sch:url', 'sch:name'), missing_value=None)`:
+  Create a defaultdict preinitialized with the mandatory Schema.org keys set to default
+
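The behaviour described for `tei_defaultdict` can be pictured with the standard `collections.defaultdict`. The sketch below is an assumption about the semantics based only on the one-line description above, not the package's implementation; the name `tei_defaultdict_sketch` and the sample title are hypothetical.

```python
from collections import defaultdict


def tei_defaultdict_sketch(mandatory_keys=('sch:url', 'sch:name'), missing_value=None):
    """Return a defaultdict in which the mandatory keys exist from the start."""
    # Assumption: unknown keys fall back to missing_value instead of raising KeyError
    metadata = defaultdict(lambda: missing_value)
    for key in mandatory_keys:
        # Pre-create the mandatory keys so they show up even if never filled in
        metadata[key] = missing_value
    return metadata


meta = tei_defaultdict_sketch()
meta['sch:name'] = 'Example article'  # hypothetical value
print(sorted(meta.items(), key=lambda item: item[0]))
```

Pre-creating the mandatory keys means a later serialisation step can rely on them being present, while optional keys appear only when a portal configuration sets or reads them.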
+#### For the Main Python API
+
+- `run_main(warc_filename, configs_dir, log_dir, warc_dir, output_dir, init_portal_fun,
+            run_params=None, logfile_level='INFO', console_level='INFO')`: Main runner function
+- `WRITE_OUT_MODES`: A dictionary to add custom write-out modes when needed
+- `diff_all_tag_table(diff_dir, old_filename, new_filename, out_filename)`: The main function to update tables
+- `tag_bigrams_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params,
+                           rest_config_params)`: The portal initiator function as called from the CLI argument
+- `content_tree_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params,
+                            rest_config_params)`: The portal initiator function as called from the CLI argument
+- `tag_inventory_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params,
+                             rest_config_params)`: The portal initiator function as called from the CLI argument
+- `portal_article_cleaner_init_portal(log_dir, output_dir, run_params, portal_name, tei_logger, warc_level_params,
+                                      rest_config_params)`: The portal initiator function as called from the CLI argument
+
+#### For the Low-level API: Defining Custom Modes
+
+- `init_output_writer(output_dir, portal_name, output_debug, tei_logger)`: Initialises the class for writing output
+  (into a zipfile or a directory)
+- `create_new_tag_with_string(beauty_xml, tag_string, tag_name, append_to=None)`: Helper function to create
+  a new XML tag containing a string. If `append_to` is provided, append the newly created tag to that parent tag
+- `immediate_text(tag)`: Count the number of words (non-whitespace text) immediately under
+  the parameter tag excluding comments
+- `to_friendly(ch, excluded_tags_fun)`: Convert tag name and sorted attributes to string in order to use it later
+  (e.g. tag_freezer in the tables)
+- `run_single_process(warc_filename, file_names_and_modes, main_function, sub_functions, after_function, after_params)`:
+  Read a WARC file and sequentially process all articles in it with main_function (multi-page articles are handled
+  as one entry) and yield the result after filtering it through `after_function`
+- `run_multiple_process(warc_filename, file_names_and_modes, main_function, sub_functions, after_function,
+  after_params)`: Read a WARC file and process all articles in it with main_function in parallel, preserving
+  ordering (multi-page articles are handled as one entry) and yield the result after filtering it through `after_function`
+- `dummy_fun(*_)`: A function that always returns None no matter how many arguments are given
+- `process_article`: A generic article processing skeleton used by multiple targets
+
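The idea of processing entries in parallel while still yielding results in input order, then passing each result through an `after_function`, can be sketched with the standard library. This is not the package's `run_multiple_process`: it skips WARC reading entirely, swaps in a thread pool for simplicity, and the name `run_in_order` and the sample inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_order(items, main_function, after_function, max_workers=4):
    """Apply main_function to items concurrently and yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map yields results in the order of the inputs,
        # regardless of which task happens to finish first
        for result in executor.map(main_function, items):
            yield after_function(result)


# Hypothetical stand-ins for article contents and processing steps
articles = ['a', 'b', 'c']
results = list(run_in_order(articles, str.upper, lambda text: text + '!'))
print(results)
```

Because the results come back in input order, a caller can write them out sequentially without tracking which worker produced which entry.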
+## Licence
+
+This project is licensed under the terms of the GNU LGPL 3.0 license.
+
+## References
+
+The DOI of the code is: TODO
+
+If you use this program, please cite the following paper:
+
+[__The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata__ Balázs Indig, Árpád Knap, 
+Zsófia Sárközi-Lindner, Mária Timári, Gábor Palkó _In the Proceedings of the 12th Web as Corpus Workshop (WAC XII)_,
+pages 33-41 Marseille, France 2020](https://www.aclweb.org/anthology/2020.wac-1.5.pdf)
+
+```
+@inproceedings{indig-etal-2020-elte,
+    title = "The {ELTE}.{DH} Pilot Corpus {--} Creating a Handcrafted {G}igaword Web Corpus with Metadata",
+    author = {Indig, Bal{\'a}zs  and
+      Knap, {\'A}rp{\'a}d  and
+      S{\'a}rk{\"o}zi-Lindner, Zs{\'o}fia  and
+      Tim{\'a}ri, M{\'a}ria  and
+      Palk{\'o}, G{\'a}bor},
+    booktitle = "Proceedings of the 12th Web as Corpus Workshop",
+    month = may,
+    year = "2020",
+    address = "Marseille, France",
+    publisher = "European Language Resources Association",
+    url = "https://www.aclweb.org/anthology/2020.wac-1.5",
+    pages = "33--41",
+    abstract = "In this article, we present the method we used to create a middle-sized corpus using
+     targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful
+     for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies
+     rule-based components that allow the curation of the text and the metadata content. The curated data can thereon
+     serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and
+     customisation. Our concept can also be applied to other genres of portals by using the discovered patterns
+     in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus,
+     our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal
+     manpower and computational resources. Reproducing the corpus is possible if changes are introduced
+     to the text-extraction process. The standard TEI format and Schema.org encoded metadata is used
+     for the output format, but we stress that placing the corpus in a digital repository system is recommended
+     in order to be able to define semantic relations between the segments and to add rich annotation.",
+    language = "English",
+    ISBN = "979-10-95546-68-9",
+}
+```
diff --git a/configs/abcug/abcug_BASE.xml b/configs/abcug/abcug_BASE.xml
@@ -0,0 +1,154 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
+<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
+	schematypens="http://purl.oclc.org/dsdl/schematron"?>
+<TEI xmlns="http://www.tei-c.org/ns/1.0">
+    <teiHeader>
+        <fileDesc>
+            <titleStmt>
+                <title></title>
+            </titleStmt>
+            <editionStmt>
+                <edition>ELTE-DH webcrawling</edition>
+                <respStmt>
+                    <resp>creator</resp>
+                    <orgName>ELTE-DH<ref type="http://elte-dh.hu"/>
+                    </orgName>
+                </respStmt>
+                <respStmt>
+                    <resp>project director</resp>
+                    <persName><surname>Palkó</surname><forename>Gábor</forename>
+                        <ref>https://orcid.org/0000-0002-4394-8577</ref></persName>
+                </respStmt>
+                <respStmt>
+                    <resp>chief programmer</resp>
+                    <persName><surname>Indig</surname><forename>Balázs</forename>
+                        <ref>https://orcid.org/0000-0001-8090-3661</ref></persName>
+                </respStmt>
+                <respStmt>
+                    <resp>TEI expert</resp>
+                    <persName><surname>Fellegi</surname><forename>Zsófia</forename>
+                        <ref>https://orcid.org/0000-0001-9199-1759</ref></persName>
+                </respStmt>
+                <respStmt>
+                    <resp>programmer</resp>
+                    <persName><surname>Sárközi-Lindner</surname><forename>Zsófia</forename>
+                        <ref>https://orcid.org/0000-0002-2558-0633</ref></persName>
+                </respStmt>
+            </editionStmt>
+            <publicationStmt>
+                <publisher>
+                    <orgName>ELTE-DH</orgName>
+                    <ref type="url">http://elte-dh.hu/</ref>
+                </publisher>
+                <pubPlace>Budapest <ref type="url">http://www.geonames.org/3054643</ref>
+                </pubPlace>
+                <date>2020</date>
+                <availability>
+                    <p>Metadata: IN COPYRIGHT - NON-COMMERCIAL USE PERMITTED<ref type="url"
+                        >http://rightsstatements.org/vocab/InC-NC/1.0/</ref></p>
+                    <p>Text: IN COPYRIGHT <ref type="url"
+                        >http://rightsstatements.org/vocab/InC/1.0/</ref>
+                    </p>
+                </availability>
+                <idno type="PID"></idno>
+            </publicationStmt>
+            <sourceDesc>
+                <bibl>
+                    <title></title>
+                    <publisher><orgName>Real Reporting Foundation</orgName>
+                        <placeName>1377-C Spencer Avenue, Lancaster, PA 17603</placeName>
+                        <ref type="url" source="https://abcug.hu/impresszum/">https://doi.org/10.5281/zenodo.3974489</ref>
+                        <date when="2020-10-01"/>
+                    </publisher>
+                    <pubPlace>
+                        Budapest
+                        <ref type="url">http://www.geonames.org/3054643</ref>
+                    </pubPlace>
+                    <availability><p>Copyright © 2017 · Newslanc.com LLC. Minden jog fenntartva.</p>
+                        <p><ref type="url" source="https://abcug.hu/impresszum/">https://doi.org/10.5281/zenodo.3974489</ref>
+                            <date when="2020-10-01"/></p>
+                    </availability>
+                    <date></date>
+                </bibl>
+            </sourceDesc>
+        </fileDesc>
+        <xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+            xmlns:sch="https://schema.org" xmlns:skos="http://www.w3.org/2008/05/skos-xl#">
+            <rdf:RDF>
+                <rdf:Description rdf:about="">
+                    <sch:type rdf:resource="https://schema.org/NewsArticle"/>
+                    <sch:ispartOf rdf:resource="https://abcug.hu/">Abcúg</sch:ispartOf>
+                    <sch:inLanguage>hun</sch:inLanguage>
+                    <sch:license rdf:resource="http://rightsstatements.org/vocab/InC-EDU/1.0/">In Copyright</sch:license>
+                </rdf:Description>
+            </rdf:RDF>
+        </xenoData>
+        <xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+            xmlns:sch="https://schema.org">
+            <rdf:RDF>
+                <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.3974489">
+                    <sch:type>WARC/1.1</sch:type>
+                    <sch:sdDatePublished>
+                        2020-01-27T18:58:23/2020-01-27T22:58:20
+                    </sch:sdDatePublished>
+                    <sch:identifier rdf:about="https://doi.org/10.5281/zenodo.3974489"/>
+                    <sch:identifier> [{"checksum": "9e98422362b60eae233f0e569faaee3e", "filename":
+                        "abcug-archive_new.warc.gz", "filesize": 2936627, "id":
+                        "b84379de-7f79-422c-b06f-9babb7e311ec", "links": {"download":
+                        "https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/abcug-archive_new.warc.gz",
+                        "self":
+                        "https://zenodo.org/api/deposit/depositions/3974489/files/b84379de-7f79-422c-b06f-9babb7e311ec"}},
+                        {"checksum": "bbb88779b071590ac08188d3c80742bb", "filename":
+                        "abcug-articles_new.warc.gz", "filesize": 40344565, "id":
+                        "5aa84456-b59d-4d48-804e-694a70e86df7", "links": {"download":
+                        "https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/abcug-articles_new.warc.gz",
+                        "self":
+                        "https://zenodo.org/api/deposit/depositions/3974489/files/5aa84456-b59d-4d48-804e-694a70e86df7"}},
+                        {"checksum": "f044cd977eb9d214dd270ac879e76da0", "filename": "log.log",
+                        "filesize": 24638, "id": "925a1387-226e-449b-bc55-aba9042e71b1", "links":
+                        {"download":
+                        "https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/log.log",
+                        "self":
+                        "https://zenodo.org/api/deposit/depositions/3974489/files/925a1387-226e-449b-bc55-aba9042e71b1"}},
+                        {"checksum": "2e6ca52568a04f9977fc82ecb9bf2bc0", "filename": "logs.tar.gz",
+                        "filesize": 2562, "id": "43b691af-e41e-4961-80fd-3289877eb0f6", "links":
+                        {"download":
+                        "https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/logs.tar.gz",
+                        "self":
+                        "https://zenodo.org/api/deposit/depositions/3974489/files/43b691af-e41e-4961-80fd-3289877eb0f6"}},
+                        {"checksum": "0ca66bb77ce5179dc6b25ccd754a0c69", "filename": "script.sh",
+                        "filesize": 384, "id": "b385442f-e525-43d8-9b8e-2a14e25d67f0", "links":
+                        {"download":
+                        "https://zenodo.org/api/files/8d361780-d716-41b7-9795-03292d78ceee/script.sh",
+                        "self":
+                        "https://zenodo.org/api/deposit/depositions/3974489/files/b385442f-e525-43d8-9b8e-2a14e25d67f0"}}] </sch:identifier>
+                    <sch:identifier>urn:uuid:930b53b8-ef8e-4406-be4d-37ad64e9a549</sch:identifier> </rdf:Description>
+            </rdf:RDF>
+        </xenoData>
+        <xenoData xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+            xmlns:sch="https://schema.org">
+            <rdf:RDF>
+                <rdf:Description rdf:about="teiPid">
+                    <sch:identifier></sch:identifier>
+                    <sch:type>TEI</sch:type>
+                    <sch:sdDatePublished></sch:sdDatePublished>
+                    <sch:lastReviewed></sch:lastReviewed>
+                    <sch:contributor rdf:resource="https://orcid.org/0000-0002-4394-8577">Palkó Gábor</sch:contributor>
+                    <sch:contributor rdf:resource="https://orcid.org/0000-0001-8090-3661">Indig Balázs</sch:contributor>
+                    <sch:contributor rdf:resource="https://orcid.org/0000-0001-9199-1759">Fellegi Zsófia</sch:contributor>
+                    <sch:contributor rdf:resource="https://orcid.org/0000-0002-2558-0633">Sárközi-Lindner Zsófia</sch:contributor>
+                    <sch:license rdf:resource="http://rightsstatements.org/vocab/InC/1.0/"/>
+                </rdf:Description>
+            </rdf:RDF>
+        </xenoData>
+        <revisionDesc>
+            <change source="teiPID">TEI file created</change>
+        </revisionDesc>
+    </teiHeader>
+    <text>
+        <body>
+        </body>
+    </text>
+</TEI>
+