german-news

german-news offers utilities for building topic-specific news corpora from online German media outlets. Articles relevant to a given topic are retrieved based on keyword stems.

Features

Built using the standard Scrapy project setup and layout.
Provides spiders for 42 German media outlets.
Can be extended with new spiders for other outlets, as well as custom pipelines, extensions, and middlewares.

Extracted information

german-news extracts the following attributes from news articles:

headline
abstract (lead paragraph)
body (main text)
URL
name(s) of author(s)
publication date
modification date
news keywords
recommendations (i.e. links to other articles suggested by the outlet)
query keywords (i.e. keywords used for determining whether the article is relevant to the topic)

Usage

Crawling an outlet

Configurations for the desired spider can be set in settings.py.

The following topic-specific conditions are currently supported and need to be specified:

Stopping condition: item count or timeout
Topic
Publication date timeframe
Minimum article length
Minimum keyword frequency
Minimum distance between keywords in text
Keywords
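
As a rough illustration of how these conditions might look in settings.py, here is a minimal sketch. CLOSESPIDER_ITEMCOUNT and CLOSESPIDER_TIMEOUT are standard Scrapy settings; every other name, all values, and the prefix-based stem matching are assumptions made for this example and may differ from what the project actually uses. The minimum keyword distance is not modeled here.

```python
import re

# Standard Scrapy settings (CloseSpider extension): stop after a number
# of scraped items or after a number of seconds.
CLOSESPIDER_ITEMCOUNT = 1000
CLOSESPIDER_TIMEOUT = 3600  # seconds

# Hypothetical names and placeholder values for the topic-specific
# conditions; the identifiers in the real settings.py may differ.
TOPIC = "refugees_migration"
START_DATE = "2019-01-01"        # placeholder publication date timeframe
END_DATE = "2020-12-31"
MIN_ARTICLE_LENGTH = 150         # assumed unit: words
MIN_KEYWORD_FREQUENCY = 2
KEYWORD_STEMS = ["flüchtl", "migra", "asyl"]


def matched_stems(text, stems=KEYWORD_STEMS):
    """Return the tokens in `text` that start with one of the keyword stems."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if any(t.startswith(s) for s in stems)]


def is_relevant(text, min_length=MIN_ARTICLE_LENGTH,
                min_frequency=MIN_KEYWORD_FREQUENCY):
    """Check the length and keyword-frequency conditions for one article."""
    n_words = len(re.findall(r"\w+", text))
    return n_words >= min_length and len(matched_stems(text)) >= min_frequency
```

With these assumed helpers, is_relevant(article_body) would return True only for texts that are long enough and contain enough tokens beginning with one of the stems.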

Run the code

scrapy crawl $OUTLET

Creating a dataset from scraped articles

python preprocess_data.py

optional arguments:
--topic                                     Topic for which the dataset should be created (default: refugees_migration)
--create_processed                          Indicate whether to create the processed or the raw data (default: True)
--drop_duplicates                           Indicate whether to drop duplicates from the dataset (default: True)
--drop_non_german_articles                  Indicate whether to drop non-German articles from the dataset (default: True)
--drop_outliers                             Indicate whether to drop outlier articles (e.g. too long, too short) (default: True)
--drop_news_ticker                          Indicate whether to drop news tickers (i.e. articles with more than a predefined number of subheaders) from the dataset (default: True)
--subheaders_threshold                      Minimum number of subheaders an article should have to be considered a news ticker (default: 10)
--drop_articles_with_forbidden_patterns     Indicate whether to drop articles containing a predefined regular expression from the dataset (default: True)
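
To make the effect of some of these options more concrete, the following sketch shows one way duplicate removal, news ticker filtering, and forbidden-pattern filtering could work together. It is not the code in preprocess_data.py: the function names, the dictionary keys (url, body, subheaders), deduplication by URL, and treating the threshold as inclusive are all assumptions for illustration.

```python
import re


def is_news_ticker(subheaders, threshold=10):
    """Assumed rule: at least `threshold` subheaders marks a news ticker."""
    return len(subheaders) >= threshold


def preprocess(articles, subheaders_threshold=10, forbidden_patterns=()):
    """Filter a list of article dicts (hypothetical keys: url, body, subheaders)."""
    seen_urls = set()
    kept = []
    for article in articles:
        # Drop duplicates; identifying them by URL is an assumption.
        if article["url"] in seen_urls:
            continue
        seen_urls.add(article["url"])
        # Drop news tickers based on the number of subheaders.
        if is_news_ticker(article.get("subheaders", []), subheaders_threshold):
            continue
        # Drop articles whose body matches any forbidden regular expression.
        if any(re.search(p, article["body"]) for p in forbidden_patterns):
            continue
        kept.append(article)
    return kept
```

For example, calling preprocess(articles, forbidden_patterns=[r"^Anzeige"]) would keep only the first copy of each URL and discard tickers and articles whose body starts with "Anzeige".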

Requirements

This code is implemented in Python 3. The requirements can be installed from requirements.txt.

pip3 install -r requirements.txt

License

The code is licensed under the MIT License.

Contact

Author: Andreea Iana

Affiliation: University of Mannheim

E-mail: [email protected]
