This repository presents a comprehensive framework for collecting, documenting, and analyzing urban transportation datasets, with a focus on New York City data sources. It integrates spatial data science theory with practical implementation aligned with U.S. Department of Transportation (DOT) data standards.
This repository tracks the progress of data generation and enhancement. The goal is to generate geospatial datasets across three key categories: Infrastructure, Traveler Behavior/Safety, and Context, to support decision-making tools for multimodal transportation planning. For detailed progress tracking, refer to the GitHub issue.
🤗 All uploaded datasets are available on our Hugging Face Hub.
```
OSCUR-data/
├── metadata/                    # YAML specifications describing each data source
├── code/                        # Scripts to download, process, and upload data
│   ├── metadata_generators/     # Generate standardized metadata YAML files
│   ├── downloaders/             # Raw data acquisition from various APIs
│   ├── processors/              # Data cleaning, transformation, and validation
│   └── upload_to_hugging_face/  # Utilities for uploading datasets to Hugging Face
├── data_profiles/               # JSON summaries/statistics of datasets
└── examples/                    # Jupyter notebooks demonstrating dataset usage
```
To contribute a new dataset to this repository, follow these steps:
1. Metadata
   - Store dataset metadata as individual YAML files in the metadata/ directory.
   - Refer to the guide in code/metadata_generators for how to create or generate a metadata file (a minimal sketch follows this list).
   - Ensure all required metadata fields (e.g., title, description, source, license) are completed.
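A minimal sketch of how such a metadata file could be generated, in the spirit of the metadata generator scripts described in step 2. The Socrata dataset ID is a placeholder and only the required fields are populated; treat the guide in code/metadata_generators as the authoritative reference for the actual schema.

```python
# generate_metadata.py -- illustrative sketch only.
# Assumes the source dataset is hosted on NYC Open Data (Socrata); the dataset
# ID is a placeholder, and only the required metadata fields are populated.
import requests
import yaml

DATASET_ID = "xxxx-xxxx"  # placeholder Socrata dataset ID
VIEW_URL = f"https://data.cityofnewyork.us/api/views/{DATASET_ID}.json"

# Fetch the dataset's descriptive metadata from the Socrata views endpoint.
view = requests.get(VIEW_URL, timeout=30).json()

metadata = {
    "title": view.get("name"),
    "description": view.get("description"),
    "source": f"https://data.cityofnewyork.us/d/{DATASET_ID}",
    "license": (view.get("license") or {}).get("name", "unknown"),
}

# Write one YAML file per dataset into the metadata/ directory.
with open(f"metadata/{DATASET_ID}.yaml", "w") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)
```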
2. Code
   - **Metadata Generators:**
     - Add or modify scripts in code/metadata_generators to generate standardized YAML metadata files.
     - These scripts can use NYC Open Data APIs or other APIs to extract metadata and save it in the metadata/ directory (see the sketch after step 1).
   - **Downloaders:**
     - Add a Python script that collects raw data from the source to code/downloaders (a sketch follows this list).
     - If multiple scripts are needed, create a subdirectory named after the dataset ID (e.g., code/downloaders/your_dataset_id/).
   - **Processors:**
     - Add a Python script for cleaning, transforming, and validating the data to code/processors (a sketch follows this list).
     - If necessary, group related scripts under a folder named after the dataset ID.
   - **Uploader:**
     - Upload your cleaned dataset to the OSCUR Hugging Face Hub repository (a sketch follows this list).
     - Follow the guide in upload_to_hugging_face/README.md for instructions.
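As a companion to the **Downloaders** item above, here is a minimal sketch of a downloader script. It assumes the source is the NYC Open Data (Socrata) SODA API and uses a placeholder resource ID; adapt the endpoint and output path to your dataset.

```python
# code/downloaders/your_dataset_id/download.py -- illustrative sketch only.
# The resource ID is a placeholder; point BASE_URL at the dataset's actual
# NYC Open Data (Socrata) endpoint.
import json
import requests

BASE_URL = "https://data.cityofnewyork.us/resource/xxxx-xxxx.json"  # placeholder
PAGE_SIZE = 50_000

def download_all():
    """Page through the SODA API and return every record as a list of dicts."""
    records, offset = [], 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"$limit": PAGE_SIZE, "$offset": offset},
            timeout=60,
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        records.extend(page)
        offset += PAGE_SIZE
    return records

if __name__ == "__main__":
    with open("your_dataset_id_raw.json", "w") as f:
        json.dump(download_all(), f)
```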
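Similarly, a minimal sketch of a processor script. The column names and the NYC bounding-box check are assumptions for illustration; real processors should validate against the dataset's actual schema.

```python
# code/processors/your_dataset_id/process.py -- illustrative sketch only.
# Column names (latitude/longitude) and the bounding-box filter are assumptions.
import pandas as pd

def process(raw_path: str, out_path: str) -> None:
    df = pd.read_json(raw_path)

    # Basic cleaning: drop exact duplicates and rows missing coordinates.
    df = df.drop_duplicates()
    df = df.dropna(subset=["latitude", "longitude"])

    # Normalize types and keep only points that fall roughly within NYC.
    df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
    df = df[df["latitude"].between(40.4, 41.0) & df["longitude"].between(-74.3, -73.6)]

    # Parquet output is one reasonable choice; requires pyarrow or fastparquet.
    df.to_parquet(out_path, index=False)

if __name__ == "__main__":
    process("your_dataset_id_raw.json", "your_dataset_id.parquet")
```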
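Finally, a minimal sketch of pushing the cleaned file to the Hub with the huggingface_hub client. The repo_id and file paths are placeholders; the guide in upload_to_hugging_face/README.md describes the actual target repository and layout.

```python
# upload.py -- illustrative sketch only; repo_id and file paths are placeholders.
# Assumes you are already authenticated (e.g., via `huggingface-cli login`).
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="your_dataset_id.parquet",
    path_in_repo="your_dataset_id/your_dataset_id.parquet",
    repo_id="your-org/your-dataset-repo",  # placeholder, see the uploader guide
    repo_type="dataset",
)
```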
3. Data Profile
   - Generate a profile summary of the dataset (recommended: use datamart-profiler) and save it as a .json file in data_profiles/.
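A minimal sketch of generating such a profile with datamart-profiler, assuming the cleaned file from the previous step; the file names are placeholders.

```python
# profile.py -- illustrative sketch only; file names are placeholders.
import json

import datamart_profiler
import pandas as pd

df = pd.read_parquet("your_dataset_id.parquet")
profile = datamart_profiler.process_dataset(df)  # also accepts a path to a CSV file

with open("data_profiles/your_dataset_id.json", "w") as f:
    json.dump(profile, f, indent=2, default=str)
```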
4. Usage Example
   - Provide a Jupyter notebook demonstrating how to use or visualize the dataset.
   - Save it to examples/.
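A minimal sketch of what such a notebook could start with, assuming the dataset was uploaded as a Parquet file; the repo_id, filename, and column names are placeholders, and plotting assumes matplotlib is installed.

```python
# First cells of examples/your_dataset_id.ipynb -- illustrative sketch only.
# repo_id and filename are placeholders; point them at the actual dataset on
# the OSCUR Hugging Face Hub.
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-org/your-dataset-repo",
    filename="your_dataset_id/your_dataset_id.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
df.head()

# Quick spatial sanity check (assumes latitude/longitude columns and matplotlib).
df.plot.scatter(x="longitude", y="latitude", s=1)
```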
We welcome contributions to enhance the dataset collection and improve the tools! Please:
- Check the GitHub issue for current progress and to avoid duplicating efforts.
- Submit pull requests with new datasets, scripts, or documentation updates.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.