Note: This repository was created with the aid of Cursor using Claude 4 Sonnet; the content was workshopped with ChatGPT o3 and edited with ChatGPT 4o.
We first want to thank the volunteers and participants who made this hackathon possible. The amount of work that went into this was incredible, and we are grateful for the time and effort that everyone put in.
- CareSet (https://careset.com/)
- Taoti Creative (https://taoti.com/)
- Prefect (https://prefect.io/)
- Thunder Compute (https://thundercompute.com/)
- TealWolf Consulting (https://tealwolf.consulting/)
- Moravian University (https://www.moravian.edu/) - Maintaining the Regulations.gov mirror dataset
- DataKindDC (https://www.meetup.com/datakind-dc/)
This hackathon was made possible by our sponsors, partners, the evaluation panel, and all participants who dedicated their time and creativity to building tools for regulatory transparency.
0 · Final Results
- Breakdown: Median Scores: Impact 4, Novelty 3, Amplification 3, Open Source 3, Usability 3, Continuity 3
- Highlights: Built a proof-of-concept for duplicate and bot comment detection, laying groundwork for campaign analysis.
- Breakdown: Median Scores: Impact 4.5, Novelty 4, Amplification 4, Open Source 4, Usability 3, Continuity 3.5
- Highlights: Converted Mirrulations JSON to Hive-partitioned Parquet, enabling efficient SQL querying for large datasets.
- Breakdown: Median Scores: Impact 5, Novelty 4, Amplification 4, Open Source 4, Usability 4, Continuity 4
- Highlights: Combined Retrieval-Augmented Generation (RAG) with vector embeddings for semantic search and summarization. Well-documented.
- Breakdown: Median Scores: Impact 5, Novelty 3.5, Amplification 5, Open Source 5, Usability 4.5, Continuity 4
- Highlights: Packaged Mirrulations fetch and CSV tools into a polished CLI for easy installation and use, greatly improving accessibility.
- Breakdown: Median Scores: Impact 4, Novelty 4, Amplification 4.25, Open Source 3.25, Usability 4, Continuity 3.5
- Highlights: Used LLMs to cluster sentiment and topics in public comments, making regulatory feedback more digestible for non-technical stakeholders.
- Breakdown: Median Scores: Impact 3.5, Novelty 3, Amplification 3, Open Source 2.5, Usability 2.5, Continuity 3.25
- Highlights: Developed a pipeline for comments, extracting text from PDFs, DOCX, and images, addressing data quality challenges.
- Breakdown: Median Scores: Impact 4.5, Novelty 3.5, Amplification 3.75, Open Source 3, Usability 3, Continuity 3.25
- Highlights: Developed FCC and SEC scraping workflows, expanding the Mirrulations dataset beyond Regulations.gov.
- Breakdown: Median Scores: Impact 4, Novelty 3, Amplification 4, Open Source 3, Usability 2, Continuity 3
- Highlights: Created a conceptual pipeline for mapping comment influence on regulatory changes.
- Ben Coleman – Professor of Computer Science at Moravian University and primary maintainer of the Mirrulations project.
- Fred Trotter – Healthcare Data Technologist at CMS Digital Service; healthcare informatics and open data expert.
- Evan Tung – Software Engineer at AWS and Civic Tech DC organizer.
- Gautami Nadkarni – Senior Customer Engineer at Google Cloud, focused on AI/ML and data modernization.
- Santhosh Kumar Veeramalla – Senior Scala Developer at Optum with deep expertise in Spark and data engineering.
- Melanie Kourbage – Lead Specialist at APHL; veteran in public health informatics and federal-state data systems.
- Taylor Wilson – VP of Applied Statistics at Reveal Global Consulting; leads DataKind DC.
- Michael Deeb – Principal Consultant at TealWolf, CTO of Keeplist.io, and Director at Civic Tech DC.
On July 26, 2025, 80 policy experts, data engineers, and civic technologists gathered at Taoti Creative to build open-source tools that unlock public-comment data from Regulations.gov and agency-specific portals.
This repo now serves as the permanent archive:
- final project snapshots
- evaluation results & slide decks
- cleaned datasets & helper scripts
- the original problem briefs (for future contributors)
Photo Albums:
- Photos by Alex Gurvich, Dean Eby, and the Civic Tech DC team: Album
Everything is licensed to encourage reuse and continuation.
Mirrulations (MIRRor of regULATIONS.gov) is a comprehensive ecosystem developed by Moravian University Computer Science to ingest, process, store, and serve U.S. federal regulatory data from Regulations.gov. It provides a robust, scalable, and accessible way for researchers, developers, and the public to interact with this complex dataset. The system overcomes the API’s 1,000 items/hour limit by using donated API keys to maintain a continuously updated mirror — about 27 million items — including text extracted from PDFs.
Item | Details |
---|---|
Bucket | s3://mirrulations (AWS Open-Data) |
Size | ≈ 2.3 TB / 27 M items (JSON + attachments) |
Docs | https://github.com/awslabs/open-data-registry/blob/main/datasets/mirrulations.yaml |
CLI | `mirrulations-fetch`, `mirrulations-query`, `mirrulations-csv` |
Contact | Prof. Ben Coleman • [email protected] |
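Because the bucket is part of the AWS Open Data program, it can typically be read without AWS credentials. A minimal sketch using the standard AWS CLI (`--no-sign-request` skips credentials; the docket path below is an illustrative placeholder, not a guaranteed layout):

```bash
# Browse the top level of the public Mirrulations bucket, no AWS account needed.
aws s3 ls --no-sign-request s3://mirrulations/

# Copy one docket's files locally. The AGENCY/DOCKET-ID path is a placeholder --
# check the bucket listing above for the real directory layout.
aws s3 cp --no-sign-request --recursive s3://mirrulations/AGENCY/DOCKET-ID/ ./sample-docket/
```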
Mirrulations mirrors Regulations.gov hourly, bypassing the API’s 1,000-items/hour throttle and extracting text from PDFs so teams can ingest data at scale.
There is a sample slice of the dataset in this repo.
Track | Problem - How can we ... |
---|---|
Entity Resolution | identify and unify organization names across dockets, accounting for aliases and inconsistent naming conventions? |
Campaign Detection | detect duplicate or template-driven comment submissions, including coordinated campaigns and potential bot activity? |
Position & Sentiment Analysis | extract nuanced positions and sentiments from comments beyond simple for/against categorizations? |
Influence Mapping | link public comments to specific changes in final rules and identify which commenters influenced regulatory outcomes? |
Docket-Level Analysis | build clear, digestible summaries and insights from tens of thousands of comments within a single docket? |
Cross-Docket Analysis | map related dockets (RFI → Proposed Rule → Final Rule) and enable search across multiple agencies and rulemaking cycles? |
Data Accessibility | make the mirrored Regulations.gov dataset easier to explore and analyze for researchers and non-technical stakeholders? |
Agency-Specific Data | scrape, integrate, and standardize public comment data from non-Regulations.gov portals (e.g., FCC, SEC, FERC)? |
Usability for Non-Technical Users | create interfaces, visualizations, or summaries that make complex regulatory data understandable to advocates, journalists, and the public? |
Regulatory Document Navigation | surface and summarize the most relevant sections of lengthy, technical regulatory documents to support timely public engagement? |
For more details, see the Problem Space Documentation.
Track | Team Name | Description |
---|---|---|
Campaign Detection | CanOfSpam | A data analysis tool for detecting fraudulent bot comments in federal regulatory rule dockets using temporal patterns, submission metadata, and content analysis. Identifies coordinated manipulation campaigns through statistical analysis of comment timing bursts and duplicate detection. Built with Python and Marimo notebooks. |
Cross-Docket Analysis & Influence Mapping | Within Docket Dataset | Links public comments to specific regulatory documents they respond to within a single docket, using metadata analysis, time-window heuristics, and semantic similarity techniques. Helps understand how public comments influence changes from proposed rules to final rules. |
Data Accessibility | Hive-partitioned Parquet | Transforms regulatory data into Hive-partitioned Parquet files for fast and efficient queries using DuckDB. Enables direct querying from S3 with better performance for large-scale regulatory data analysis (see the query sketch after this table). |
| Mirrulations CLI | Published Python package incorporating scripts from Prof. Ben Coleman to make downloading regulatory data more accessible. Easy to install via pip or use with uvx for streamlined access to Mirrulations data. |
| LLM.gov (CMS Docket Assistant) | An LLM wrapper that utilizes RAG queries to answer general questions about dockets. Transforms complex JSON text into machine-readable vector embeddings stored in S3, enabling semantic search and providing a simple chat interface for non-technical users. |
Data Quality & Derived Layers | Taskmasters | Extracts data from different document types (PDFs, images, documents) while implementing keyword extraction on comments. Converts JSON files to parquet format using AWS S3, Glue, and Athena services for improved data processing efficiency. |
| Team Velogear | A command-line tool written in Go that parses text from PDF files and outputs to CSV, JSON, and Parquet formats. Uses pdftotext from poppler-utils for better accessibility of regulatory documents. |
Docket-Level Analysis / Topic & Sentiment | Rules Talk | Policy Comment Analyzer leveraging Google Gemini API to automate analysis of public comments on policy proposals. Extracts key policy information, analyzes comments for specific issues and sentiment, and generates comprehensive reports showing how organizations' critiques and support fit in the conversation. |
Entity Resolution | Entity Resolution Team | Extracts and cleans organization information from comments to group submissions together, even when organization names weren't explicitly listed or had inconsistent naming conventions. Used Jupyter Notebooks for analysis. |
External Agency Scraping | The Scrapers | Created basic code to scrape other government websites (FCC and SEC) and documented the challenges one may face. Focuses on different scraping methodologies for accessing government data sources. |
Regulatory Document Discovery | USPF1 | FDA Docket Classification System addressing "Docket Blindness" by automatically analyzing FDA docket comments and generating tags indicating what type of information is needed (Scientific/Technical, Policy/Regulatory, Procedural, etc.). |
| Expanded Search | A comprehensive search platform that enables citizens to discover relevant regulatory dockets based on their interests. Features Python backend with spaCy NLP keyword extraction, SQLite database, and Angular frontend with Material UI. (Work in progress) |
Topic & Sentiment Analysis | Team Topic Modeling | Automated analysis of regulatory documents to identify regulated topics and extract meaningful keywords. Uses spaCy and RAKE for NLP processing, generates visualizations including bar charts and word clouds for each regulatory topic. |
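As referenced in the Hive-partitioned Parquet row above, this layout lets DuckDB query the files directly from S3. A minimal sketch using the DuckDB CLI; the bucket path, file layout, and `agency` partition column are hypothetical placeholders rather than the team's actual schema, and public-bucket reads may additionally need region or credential configuration:

```bash
# Query Hive-partitioned Parquet straight from S3 with DuckDB.
# hive_partitioning=1 turns directory names like agency=EPA/ into columns.
# The bucket and layout below are illustrative placeholders.
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT agency, count(*) AS n_items
FROM read_parquet('s3://example-bucket/mirrulations/**/*.parquet', hive_partitioning = 1)
GROUP BY agency
ORDER BY n_items DESC;
"
```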
See /projects/ for complete code snapshots, data samples, and demo videos.
This was not a competition, but projects were evaluated asynchronously to decide which to feature, and prizes were raffled off to the top 3 projects.
Projects were evaluated based on the following dimensions:
Projects were scored on a 1–5 scale in each category and then weighted per the rubric (see /evaluations/methodology.md).
- Impact & Relevance – Directly tackles an official problem statement and demonstrates clear civic value or policy impact.
- Novelty – How unique is the approach? How does it differ from existing tools?
- Amplification – What is the potential for this project to be used by others?
- Open Source Practices – Public repo with an OSI‑approved license, focus on open-source tool usage, thorough README, install script, contribution guide, and passing tests/CI.
- Usability & Design – Non‑technical users can run the tool or interpret results unaided; thoughtful UX or reporting artifacts provided.
- Continuity Potential – Road‑map or issues list, maintainers committed, and a deployment or next‑steps plan that makes ongoing work realistic.
/README.md – You are here
/LICENSE – MIT for code, CC-BY 4.0 for docs, CC0 for data samples
/CODE_OF_CONDUCT.md – Community guidelines
/.github/
workflows/ – CI for markdown lint and link checking
/docs/ – Event recap, photos, press, sponsor credits
/docs/problem_space.md – Problem space and details
/docs/additional_problem_details/ – Additional problem details
bot_detection.md
campaign_detection.md
cross_docket_commenter_threads.md
docket_mapping.md
entity_mapping.md
external_agency_scraping.md
llm_integration.md
rule_backlinking.md
topic_sentiment.md
download_tools.md
/docs/images/ – Images
/evaluations/
results.csv – Raw scoring data
methodology.md – Rubric and evaluator details
summaries.md – Project descriptions
submissions/ – Archived individual submissions
/projects/
team_name/ – Team project snapshots
README.md – Project documentation
snapshot/ – Code frozen at hackathon
upstream/ – Live development (submodule)
/datasets/ – Data samples and documentation
/scripts/ – Utility scripts for project maintenance
We use git subtrees with squashed history for hackathon snapshots because they:
- Preserve working code even if original repos disappear
- Allow simple cloning without extra commands
- Keep repository size manageable
For active development, teams can optionally add submodules alongside subtrees.
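For reference, a minimal sketch of the snapshot workflow (the team-name prefix and repository URL are placeholders):

```bash
# Pull a team's repo into /projects as a single squashed commit,
# so the code survives even if the upstream repo is later deleted.
git subtree add --prefix=projects/team_name/snapshot \
    https://github.com/example-org/example-repo.git main --squash

# Later, refresh the snapshot from upstream (still squashed history).
git subtree pull --prefix=projects/team_name/snapshot \
    https://github.com/example-org/example-repo.git main --squash
```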
This hackathon is only the beginning.
- Next Steps: We're planning part 2 of the hackathon, to be announced soon for fall or spring.
- Project Archive: All team projects are archived in /projects as subtrees, preserving a snapshot even if upstream repos disappear.
- Ongoing Development: We’ll showcase standout projects at Civic Tech DC meetups and help connect teams with partners to keep building.
- Join Us: Stay involved through Civic Tech DC meetups and our Slack (see /docs/ for an invite).
- Event Questions: [email protected]
- Technical Issues: Open a GitHub issue
- Press Inquiries: [email protected]
This repository serves as a permanent archive of Civic Hack DC 2025. For the latest civic tech initiatives in DC, visit civictechdc.org.