Welcome to the Plenary Protocol Data Pipeline! This repository provides a structured pipeline to process data from plenary logs. Below is an overview of how to set up and run the pipeline, as well as a description of each stage.
This pipeline is fully customizable to fit your requirements:
- Date Range: Adjust the fetching parameters to define the desired time period for plenary minutes.
- Processing Steps: Modify the scripts for separation, preparation, or evaluation as needed.
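As a rough illustration of the date range setting, the sketch below validates a start and end date and lists every day in between. The names `START_DATE`, `END_DATE`, and `iter_days` are hypothetical and are not taken from this repository; the actual fetching parameters may be named and stored differently.

```python
from datetime import date, timedelta


def iter_days(start: date, end: date):
    """Yield every calendar day from start to end, inclusive."""
    if end < start:
        raise ValueError("end date must not be before start date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Hypothetical parameters; adjust these to the period you want to fetch.
START_DATE = date(2024, 1, 1)
END_DATE = date(2024, 1, 7)

days = list(iter_days(START_DATE, END_DATE))
print(f"{len(days)} days from {days[0]} to {days[-1]}")
```

Validating the range up front means a swapped start and end date fails immediately instead of producing an empty download.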
Start by cloning this repository to your local machine:
git clone <repository-url>
cd <repository-folder>
Install the required dependencies using pip:
pip install -r requirements.txt
Sometimes you need to install dvc separately:
pip install dvc
Initialize DVC (Data Version Control) in the repository:
dvc init --no-scm
Run the entire pipeline by executing:
dvc repro
📥 Description: Downloads plenary minutes for a defined time period.
- Input: Date range parameters.
- Output: A raw CSV file containing plenary minutes.
🔍 Description: Splits the contents of the plenary protocols into smaller units, such as speeches or speaker contributions.
- Output: A CSV file with separated units.
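A minimal sketch of how such a split could work, assuming each contribution starts on its own line in the form `Name (Party):`. That convention, the regular expression, and the function name `split_speeches` are assumptions made for illustration; the repository's separation script may rely on different markers.

```python
import re
from typing import Dict, List

# Assumed convention: a contribution starts with a line like "Name (Party):".
SPEAKER_LINE = re.compile(
    r"^(?P<speaker>[^\n:()]+?) \((?P<party>[^)\n]+)\):[ \t]*$",
    re.MULTILINE,
)


def split_speeches(protocol_text: str) -> List[Dict[str, str]]:
    """Split one protocol into a list of speaker contributions."""
    matches = list(SPEAKER_LINE.finditer(protocol_text))
    speeches = []
    for i, match in enumerate(matches):
        # A contribution runs until the next speaker line or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(protocol_text)
        speeches.append(
            {
                "speaker": match.group("speaker").strip(),
                "party": match.group("party"),
                "text": protocol_text[match.end():end].strip(),
            }
        )
    return speeches


sample = (
    "Anna Beispiel (Party A):\n"
    "Thank you. I would like to open the debate.\n"
    "Max Muster (Party B):\n"
    "I agree with the previous speaker.\n"
)
speeches = split_speeches(sample)
for speech in speeches:
    print(speech["speaker"], "->", speech["text"])
```

Each dictionary maps directly to one row of the output CSV, for example via `csv.DictWriter`.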
🩹 Description: Processes the split units further with steps like:
- Data cleansing 🩼
- Formatting 🖍
- Normalization 🔄
- Output: Cleaned and processed data in a CSV format.
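The three steps above can be sketched as a single function. This is an illustration under assumptions, not the repository's actual preparation code: it treats Unicode NFC as the normalization, removal of bracketed interjections such as "(Applause)" as the cleansing, and whitespace collapsing as the formatting. The name `prepare_text` is hypothetical.

```python
import re
import unicodedata


def prepare_text(raw: str) -> str:
    """Apply normalization, cleansing, and formatting to one text unit."""
    # Normalization: bring all characters into one Unicode form (NFC).
    text = unicodedata.normalize("NFC", raw)
    # Cleansing (assumed rule): drop bracketed interjections like "(Applause)".
    text = re.sub(r"\([^)]*\)", " ", text)
    # Formatting: collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text


cleaned = prepare_text("Thank  you.\n(Applause)  We continue\twith the agenda. ")
print(cleaned)
```

Keeping each step on its own line makes it easy to disable or reorder individual rules when the source format changes.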
📊 Description: Evaluates the prepared data for:
- Analysis 🔍, e.g. word count and unique values ✅
- Input: Processed data from the preparation stage.
- Output: Evaluation reports and analysis stored in the ressources folder.
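To make the two example metrics concrete, here is a small sketch that computes word counts and unique values over a few in-memory rows. The column names `speaker` and `text` and the sample rows are made up for illustration; the real evaluation script and its report format may differ.

```python
from collections import Counter

# Hypothetical rows, shaped like the output of the preparation stage.
rows = [
    {"speaker": "Anna Beispiel", "text": "Thank you. We continue with the agenda."},
    {"speaker": "Max Muster", "text": "I agree."},
    {"speaker": "Anna Beispiel", "text": "Thank you all."},
]


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in a text."""
    return len(text.split())


# Word count per contribution.
word_counts = [word_count(row["text"]) for row in rows]

# Unique values of one column, here the speakers.
unique_speakers = sorted({row["speaker"] for row in rows})

# Total words per speaker.
words_per_speaker = Counter()
for row in rows:
    words_per_speaker[row["speaker"]] += word_count(row["text"])

print(word_counts, unique_speakers, dict(words_per_speaker))
```

Splitting on whitespace counts punctuation as part of a word, which is usually close enough for a first overview; a tokenizer would give stricter counts.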
🎉 Enjoy processing your plenary data!
Feel free to contribute, suggest improvements, or raise issues. 🙌