FCC MBA Ingestion & Processing Pipeline

This repository automates downloading, cleaning, aggregating, and ingesting FCC Measuring Broadband America (MBA) raw data into Elasticsearch, plus an operator-identification step.

Project Structure

.
├── data/                        # Download & conversion helpers
├── elastic/                     # Elasticsearch security assets
├── logging/                     # Logs for each stage
├── filter.py                    # Filter, timezone-adjust, classify on/off-net
├── aggregate.py                 # Compute throughput, latency, jitter aggregates
├── ingest.py                    # Bulk-load aggregated Parquet into Elasticsearch
├── operator_identification.py   # Map each unit to its ISP and technology
├── run.sh                       # One/Multi-year pipeline: filter → aggregate → ingest
├── identify_operators.sh        # One/Multi-year operator-ID pass
├── run_all.sh                   # Loop over all years, choose mode via `--op-id`
├── constants.py                 # File names, filter rules, hostname mappings
└── requirements.txt             # Python dependencies

Prerequisites

Python 3.8+
Bash shell
Install dependencies:

  pip install -r requirements.txt

End-to-End Pipeline

Download & convert: Place raw CSVs under data/<YYYYMM> (see data/README.md).

Operator identification: Independently classify each unit_id by ISP and technology:
```
bash identify_operators.sh 2023
```
This generates data/2023MM/unit_id_mapping.csv for every month MM of 2023.

Run transform & ingest: For a single year:
```
bash run.sh 2023
```
- filter.py applies QA filters, timezone localization, and on/off-net classification.
- aggregate.py groups by unit_id/month and computes summary metrics.
- ingest.py pushes Parquet to Elasticsearch indices aggregated-<metric>-<YYYYMM>.
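
The grouping that aggregate.py performs can be pictured with a small standard-library sketch. This is not the repository's implementation: the field names `dtime` and `value`, the timestamp format, and the choice of count/mean/median as summary metrics are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_monthly(rows):
    """Group measurements by (unit_id, YYYYMM) and summarize each group.

    Assumes each row is a dict with hypothetical keys "unit_id",
    "dtime" (formatted like "2023-01-15 20:30:00"), and "value".
    """
    groups = defaultdict(list)
    for row in rows:
        # The first seven characters of the timestamp are "YYYY-MM".
        month = row["dtime"][:7].replace("-", "")
        groups[(row["unit_id"], month)].append(row["value"])
    return {
        key: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in groups.items()
    }


# Made-up sample measurements, not real MBA data.
rows = [
    {"unit_id": 101, "dtime": "2023-01-15 20:30:00", "value": 90.0},
    {"unit_id": 101, "dtime": "2023-01-16 08:00:00", "value": 110.0},
    {"unit_id": 101, "dtime": "2023-02-01 12:00:00", "value": 95.0},
    {"unit_id": 202, "dtime": "2023-01-03 21:15:00", "value": 40.0},
]
summary = aggregate_monthly(rows)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

Each (unit_id, month) pair yields one summary record, which matches the one-row-per-unit-per-month shape described above.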

Process all years: run_all.sh is the master orchestrator.

  # Operator identification only
  bash run_all.sh --op-id
  # Filter, aggregate, and ingest
  bash run_all.sh
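
The naming conventions used in the steps above (data/<YYYYMM> directories, unit_id_mapping.csv, and aggregated-<metric>-<YYYYMM> indices) can be summarized in a short sketch. The helper names `month_dirs`, `mapping_path`, and `index_name` are hypothetical and do not come from the repository, and the exact metric strings used in real index names are an assumption.

```python
from pathlib import PurePosixPath


def month_dirs(year, data_root="data"):
    """Return the twelve data/<YYYYMM> directories for one year."""
    return [PurePosixPath(data_root) / f"{year}{month:02d}" for month in range(1, 13)]


def mapping_path(year, month, data_root="data"):
    """Return where the operator-ID mapping for one month is expected."""
    return PurePosixPath(data_root) / f"{year}{month:02d}" / "unit_id_mapping.csv"


def index_name(metric, yyyymm):
    """Build an index name following aggregated-<metric>-<YYYYMM>."""
    return f"aggregated-{metric}-{yyyymm}"


print(month_dirs(2023)[0])
print(mapping_path(2023, 7))
print(index_name("latency", "202301"))
```

Keeping these patterns in one place makes it easier to check that a given month has both its raw input directory and its mapping file before the transform steps run.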

Scripts Overview

- filter.py: Applies file-specific filters (constants.FILTER_MAP), fills missing timestamps via unit timezones, and labels peak/off-peak or weekend.
- aggregate.py: Aggregates throughput, latency, or jitter data into monthly summaries; outputs Parquet to processed_data/.
- ingest.py: Reads processed_data/*.parquet and bulk-indexes into Elasticsearch (config in elastic/security.ini).
- operator_identification.py: Merges target-hostname classification with unit profiles to build a comprehensive unit→ISP mapping.
- run.sh / identify_operators.sh: Invoke the above in sequence for the specified years.
- run_all.sh: Batch-processes all years (2011–2023), with an optional --op-id flag for operator identification only.
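
The mode selection in run_all.sh can be sketched in Python as follows. This is only an illustration of the logic described above, not a translation of the actual script: the function name `build_commands` is hypothetical, and calling the per-year script once per year (rather than passing several years in one call) is an assumption.

```python
def build_commands(years, op_id=False):
    """Return one command per year, choosing the script by mode.

    op_id=True mirrors the --op-id flag (operator identification only);
    otherwise the filter -> aggregate -> ingest pipeline is selected.
    """
    script = "identify_operators.sh" if op_id else "run.sh"
    return [["bash", script, str(year)] for year in years]


# 2011 through 2023 inclusive, the range stated for run_all.sh.
ALL_YEARS = range(2011, 2024)

commands = build_commands(ALL_YEARS, op_id=False)
for cmd in commands:
    # A real runner could pass each list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Building the command list separately from executing it makes a dry run trivial: print the list, inspect it, and only then hand it to a runner.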

Logging

All stages write logs under logging/:

filter.log
aggregate.log
ingest.log
operator_identification.log
pipeline.log

Monitor these for errors or processing summaries.
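
A quick way to check the logs after a run is to scan them for error lines. The sketch below assumes that problems are logged on lines containing the word ERROR, which is a common convention but is not confirmed for this pipeline; the function name `find_error_lines` is hypothetical.

```python
from pathlib import Path

# Log files listed in this section.
LOG_NAMES = [
    "filter.log",
    "aggregate.log",
    "ingest.log",
    "operator_identification.log",
    "pipeline.log",
]


def find_error_lines(log_dir="logging", marker="ERROR"):
    """Return {log name: [matching lines]} for logs that contain the marker.

    Missing log files are skipped, so this can run before every stage
    has produced output.
    """
    hits = {}
    for name in LOG_NAMES:
        path = Path(log_dir) / name
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            matches = [line.rstrip("\n") for line in fh if marker in line]
        if matches:
            hits[name] = matches
    return hits


for name, lines in find_error_lines().items():
    print(f"{name}: {len(lines)} line(s) containing ERROR")
```

If the stages use a different level name or format, pass another `marker` value.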
