labhackercd/fetch-speeches

Fetching speeches from Babel

This repository contains a CLI that retrieves speech data from the Babel API.

Requirements

  • Python 3
  • The latest version of pip

Install

sudo pip install pipenv # Install pipenv on your system
pipenv install # Install all requirements in a virtual environment
pipenv shell # Enter the virtualenv created above

Usage

speeches.py [OPTIONS] INITIAL_DATE END_DATE

Options:
  -s, --stage TEXT  Initials from speech stage. For example, PE to 'Pequeno
                    Expediente'
  --help            Show this message and exit.
  • INITIAL_DATE and END_DATE must be in yyyy-mm-dd format.
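
The date format above can be checked before calling the script. The following is a minimal sketch, not code from this repository: the helper names `parse_cli_date` and `validate_range` are hypothetical, and the rule that INITIAL_DATE must not come after END_DATE is an assumption the README does not state.

```python
import re
from datetime import date, datetime
from typing import Tuple

# Exactly four digits, two digits, two digits, separated by hyphens.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_cli_date(value: str) -> date:
    """Parse a strict yyyy-mm-dd string; raise ValueError otherwise."""
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError("expected yyyy-mm-dd, got %r" % value)
    # strptime also rejects impossible dates such as 2018-02-30.
    return datetime.strptime(value, "%Y-%m-%d").date()


def validate_range(initial: str, end: str) -> Tuple[date, date]:
    """Parse both dates and check their order (ordering rule is an assumption)."""
    start, stop = parse_cli_date(initial), parse_cli_date(end)
    if start > stop:
        raise ValueError("INITIAL_DATE must not be after END_DATE")
    return start, stop
```

Calling `validate_range("2018-01-01", "2018-01-31")` returns the two parsed dates, while a value such as `01/02/2018` raises `ValueError`.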

After retrieving and processing all speech data in the given date range, this script creates a CSV file called speeches.csv.

Preprocessing

After fetching the speeches you need, you can preprocess them. Preprocessing removes all numbers, accents, and stopwords (as well as words that appear in more than 90% of the documents or in fewer than 1% of them) and stems all remaining tokens. To do this, run:

./pre_process.py
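
To make the preprocessing steps more concrete, here is a rough sketch of the cleaning and document-frequency filtering described above. It is not the code in pre_process.py: the function names, the tiny stopword list, and the tokenization rule are all assumptions made for illustration, and stemming is left out because it needs a stemmer library.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stopword list (assumption); the script's real list is not shown.
STOPWORDS = {"a", "o", "de", "que", "e", "para"}


def normalize(text):
    """Lowercase the text, strip accents, and replace digits with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"\d+", " ", without_accents)


def tokenize(text):
    """Split normalized text into alphabetic tokens, dropping stopwords."""
    return [t for t in re.findall(r"[a-z]+", normalize(text)) if t not in STOPWORDS]


def filter_by_document_frequency(docs, min_df=0.01, max_df=0.90):
    """Drop tokens found in more than max_df or fewer than min_df of the docs."""
    total = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {t for t, count in doc_freq.items() if min_df <= count / total <= max_df}
    return [[t for t in doc if t in keep] for doc in docs]
```

With three documents, a token present in all three has a document frequency of 100% and is removed, while tokens present in one or two documents are kept.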

The pre_process.py script reads speeches.csv, generated by the previous script, and generates four CSV files:

  • stem.csv - list of all stems used (format: id,stem)
  • stemmed-speeches.csv - list of all preprocessed speeches. Each speech takes two rows: the first lists the stem ids and the second lists the frequency of each of those stems. Both rows start with the speech ID
  • metadatas.csv - metadata for every speech (format: id,author_name,author_party,author_region,date,updated_at,stage)
  • full-speeches.csv - list of all speeches without any processing (format: id,original)
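
The two-rows-per-speech layout of stemmed-speeches.csv can be read back into a per-speech mapping of stem id to frequency. The sketch below assumes the file is comma-separated, has no header row, and stores frequencies as integers; none of this is confirmed by the README. The function name `read_stemmed_speeches` and the sample rows are made up for illustration.

```python
import csv
import io


def read_stemmed_speeches(handle):
    """Return {speech_id: {stem_id: frequency}} from paired rows."""
    rows = [row for row in csv.reader(handle) if row]
    if len(rows) % 2:
        raise ValueError("expected two rows per speech")
    speeches = {}
    # Even-indexed rows hold stem ids, odd-indexed rows hold frequencies.
    for ids_row, freq_row in zip(rows[0::2], rows[1::2]):
        if ids_row[0] != freq_row[0] or len(ids_row) != len(freq_row):
            raise ValueError("mismatched rows for speech %s" % ids_row[0])
        speeches[ids_row[0]] = {
            stem_id: int(freq) for stem_id, freq in zip(ids_row[1:], freq_row[1:])
        }
    return speeches


# Made-up sample in the described layout: speech ID first, then the values.
sample = io.StringIO("101,3,7,12\n101,2,1,4\n102,7,9\n102,5,1\n")
bag = read_stemmed_speeches(sample)
```

For a real file, pass an open file object instead, for example `open("stemmed-speeches.csv", newline="")`. The stem ids can then be joined with stem.csv (format: id,stem) to recover the stems themselves.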
