
Accompanying code for our paper From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization


Abstractive and extractive summarization for Hungarian

Links to the HunSum-2 dataset and our baseline models:

Links to the HunSum-1 dataset and our baseline models:

Setup

conda create --name my-env python=3.8.13
conda activate my-env

conda install -c conda-forge pandoc
pip install -e .

Install the LSH package used for deduplication

git clone https://github.com/mattilyra/LSH
cd LSH
git checkout fix/filter_duplicates
pip install -e .

How to create the corpus

Download data from Common Crawl

Install CommonCrawl Downloader

git clone [email protected]:DavidNemeskey/cc_corpus.git
cd cc_corpus
pip install -e .

Download data

Arguments:

  • text file containing the URLs to download: indexes_to_download.txt
  • path of the cc_corpus repository
  • output directory

scripts/download_data.sh indexes_to_download.txt ../cc_corpus/ ../CommonCrawl/

Parse articles

Arguments:

  • downloaded data
  • output directory
  • config file

The cleaned articles will be written to the directory specified by clean_out_dir in the config file.

cd summarization
python entrypoints/parse_warc_pages.py ../../CommonCrawl ../../articles preprocess.yaml

Calculate document embeddings for the leads and articles (used during cleaning)

Arguments:

  • config file

cd summarization
python entrypoints/calc_doc_similarities.py preprocess.yaml
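For intuition, the sketch below shows the kind of quantity this step produces. It is not the repository's implementation: the embedding model and example texts are placeholders, and the real settings live in preprocess.yaml.

# Illustrative sketch only: cosine similarity between a lead and its article text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')  # placeholder model

lead = 'Rövid összefoglaló a cikk tartalmáról.'
article = 'A cikk teljes szövege, amelyet az összefoglalóval vetünk össze.'

lead_emb, article_emb = model.encode([lead, article], convert_to_tensor=True)
similarity = util.cos_sim(lead_emb, article_emb).item()
print(f'lead-article similarity: {similarity:.3f}')
# During cleaning, lead-article pairs with very low similarity can be treated as noisy and dropped.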

Clean articles

Arguments:

  • config file

cd summarization
python entrypoints/clean.py preprocess.yaml

Deduplicate articles

Arguments:

  • config file

cd summarization
python entrypoints/deduplicate.py preprocess.yaml
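For intuition, the self-contained sketch below illustrates the MinHash + LSH banding idea behind near-duplicate detection. It is purely illustrative: deduplicate.py uses the LSH package installed earlier, and every parameter here (shingle size, number of hashes, number of bands, example texts) is a made-up value.

# Toy MinHash + LSH banding sketch for near-duplicate detection (illustration only).
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, n=5):
    # character n-grams after simple whitespace normalisation
    text = ' '.join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def signature(shingle_set, num_hashes=100):
    # MinHash signature: for each seeded hash function, the minimum hash over all shingles
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), 'big')
            for s in shingle_set))
    return sig

def candidate_duplicates(docs, num_hashes=100, bands=20):
    # documents whose signatures agree on at least one band become candidate pairs
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = signature(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    'a': 'A miniszterelnök hétfőn sajtótájékoztatót tartott a fővárosban.',
    'b': 'A miniszterelnök hétfőn sajtótájékoztatót tartott a fővárosban tegnap.',  # near copy
    'c': 'Teljesen más témájú cikk a hazai sportéletről.',
}
print(candidate_duplicates(docs))  # 'a' and 'b' are expected to collide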

How to add your own parser

To add a new parser for the corpus creation process, follow these steps:

1. Create a new parser class for the specific website.

Place the parser in the html_parsers package. The parser should inherit from the ParserBase class and implement the following methods:

from datetime import datetime
from typing import List, Optional, Set

from bs4 import Tag

# ParserBase and InvalidPageError are provided by the html_parsers package.

class MyNewWebsiteParser(ParserBase):
    def check_page_is_valid(self, url, soup):
        # Implement logic to check if the page is valid
        # (e.g. check if the page is a gallery page if it's not indicated by the URL)
        # if needed raise InvalidPageError(url, 'problem description')
        ...

    def get_title(self, url, soup) -> str:
        # Implement logic to extract the title
        ...

    def get_lead(self, soup) -> str:
        # Implement logic to extract the lead
        ...

    def get_article_text(self, url, soup) -> str:
        # Implement logic to extract the main article text
        ...

    def get_date_of_creation(self, soup) -> Optional[datetime]:
        # Implement logic to extract the date of creation
        ...

    def get_tags(self, soup) -> Set[str]:
        # Implement logic to extract tags
        ...

    def get_html_tags_to_remove(self, soup) -> List[Tag]:
        # Implement logic to specify which HTML tags to remove
        ...

    def remove_unnecessary_text_from_article(self, article) -> str:
        # Implement logic to remove unnecessary text from the article (e.g. ads that cannot be removed by HTML tags)
        return article
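
For illustration, a hypothetical get_title implementation for a site whose headline lives in an <h1> element could look like this (the selector and fallback behaviour are made up, not taken from the repository):

    def get_title(self, url, soup) -> str:
        # Hypothetical example: take the first <h1>; raise InvalidPageError when it is missing
        title_tag = soup.find('h1')
        if title_tag is None:
            raise InvalidPageError(url, 'title not found')
        return title_tag.get_text(strip=True)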

2. Register your parser in the HtmlParserFactory.

class HtmlParserFactory:
    parsers = {
        ...
        'mywebsite': MyNewWebsiteParser  # Register your new parser here
        ...
    }

You're all set to start parsing your articles with the parse_warc_pages.py script. If you only want to parse your new website, just use the --sites mywebsite option.
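
For example, to parse only the newly added site with the same arguments as above (the flag name comes from the note above; its exact placement relative to the positional arguments is an assumption):

cd summarization
python entrypoints/parse_warc_pages.py ../../CommonCrawl ../../articles preprocess.yaml --sites mywebsite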

Citation

If you use our dataset or models, please cite the following papers:

@inproceedings{HunSum-1,
    title = {{HunSum-1: an Abstractive Summarization Dataset for Hungarian}},
    booktitle = {XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023)},
    year = {2023},
    publisher = {Szegedi Tudományegyetem, Informatikai Intézet},
    address = {Szeged, Magyarország},
    author = {Barta, Botond and Lakatos, Dorina and Nagy, Attila and Nyist, Mil{\'{a}}n Konor and {\'{A}}cs, Judit},
    pages = {231--243}
}
@inproceedings{barta-etal-2024-news-summaries,
    title = "From News to Summaries: Building a {H}ungarian Corpus for Extractive and Abstractive Summarization",
    author = "Barta, Botond  and
      Lakatos, Dorina  and
      Nagy, Attila  and
      Nyist, Mil{\'a}n Konor  and
      {\'A}cs, Judit",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.662",
    pages = "7503--7509",
    abstract = "Training summarization models requires substantial amounts of training data. However for less resourceful languages like Hungarian, openly available models and datasets are notably scarce. To address this gap our paper introduces an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus undergoing thorough cleaning, preprocessing and deduplication. In addition to abstractive summarization we generate sentence-level labels for extractive summarization using sentence similarity. We train baseline models for both extractive and abstractive summarization using the collected dataset. To demonstrate the effectiveness of the trained models, we perform both quantitative and qualitative evaluation. Our models and dataset will be made publicly available, encouraging replication, further research, and real-world applications across various domains.",
}
