This repository automates downloading, cleaning, aggregating, and ingesting FCC Measuring Broadband America (MBA) raw data into Elasticsearch, plus an operator-identification step.
.
├── data/ # Download & conversion helpers
├── elastic/ # Elasticsearch security assets
├── logging/ # Logs for each stage
├── filter.py # Filter, timezone-adjust, classify on/off-net
├── aggregate.py # Compute throughput, latency, jitter aggregates
├── ingest.py # Bulk-load aggregated Parquet into Elasticsearch
├── operator_identification.py # Map each unit to its ISP and technology
├── run.sh # One/Multi-year pipeline: filter → aggregate → ingest
├── identify_operators.sh # One/Multi-year operator-ID pass
├── run_all.sh # Loop over all years, choose mode via `--op-id`
├── constants.py # File names, filter rules, hostname mappings
└── requirements.txt # Python dependencies
- Python 3.8+
- Bash shell
- Install dependencies:
pip install -r requirements.txt-
Download & convert Place raw CSVs under
data/<YYYYMM>(seedata/README.md) -
Operator Identification Independently classify each unit_id by ISP and technology:
bash identify_operators.sh 2023
Generates
data/2023MM/unit_id_mapping.csvfor all monthsMMfor year2023. -
Run transform & ingest For a single year:
bash run.sh 2023
- filter.py applies QA filters, timezone localization, and on/off-net classification.
- aggregate.py groups by unit_id/month and computes summary metrics.
- ingest.py pushes Parquet to Elasticsearch indices
aggregated-<metric>-<YYYYMM>.
-
Process all years Master orchestrator:
# For operator identification mode bash run_all.sh --op-id # For processing and ingesting data bash run_all.sh
-
filter.py Applies file-specific filters (
constants.FILTER_MAP), fills missing timestamps via unit timezones, and labels peak/off-peak or weekend. -
aggregate.py Aggregates throughput, latency, or jitter data into monthly summaries; outputs Parquet to
processed_data/. -
ingest.py Reads
processed_data/*.parquetand bulk-indexes into Elasticsearch (config inelastic/security.ini). -
operator_identification.py Merges target-hostname classification with unit profiles to build a comprehensive unit→ISP mapping.
-
run.sh / identify_operators.sh Invoke the above in sequence for specified years.
-
run_all.sh Batch-processes all years (2011–2023), with optional
--op-idflag for operator identification only.
All stages write logs under logging/:
filter.logaggregate.logingest.logoperator_identification.logpipeline.log
Monitor these for errors or processing summaries.