This repository contains various datasets for data analysis, machine learning, and educational purposes. Below is a brief description of each dataset available in this repository.
- Go to a CSV file in a repository of your choice
- In the bar at the top right, just above the file contents, click the "Raw" button
- A page will appear showing the comma-separated data with no styling
- Copy the page URL
- Make a folder on your desktop
- Open that folder in your favourite code editor and create a simple Python file inside it. Name it as you please.
- Copy this code [From the section below]
- Run the Python file
- The CSV file will be downloaded after some time, depending on the file size
- Now you are ready to use it locally!
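import requests
import pandas as pd

# Replace the placeholder with the raw URL you copied
url = '{(copied url here)}'
res = requests.get(url, allow_redirects=True)
res.raise_for_status()  # stop here if the download failed

# Save the response body to a local CSV file
with open('download_file_name.csv', 'wb') as file:
    file.write(res.content)

# Load the saved file into a pandas DataFrame
download_file_name = pd.read_csv('download_file_name.csv')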
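
If you would rather skip the manual "Raw" click, the steps above can be sketched with the standard library alone. This is a minimal sketch, not the repository's official method: the helper names `to_raw_url` and `download_csv` are hypothetical, and the conversion assumes the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` page URL shape.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def to_raw_url(page_url: str) -> str:
    """Convert a GitHub file-page URL to its raw.githubusercontent.com form.

    URLs that do not match github.com/<owner>/<repo>/blob/... are returned unchanged.
    """
    parsed = urlparse(page_url)
    parts = parsed.path.strip("/").split("/")
    # Assumed shape: <owner>/<repo>/blob/<branch>/<path to file>
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        owner, repo, _, *rest = parts
        return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])
    return page_url


def download_csv(url: str, destination: str) -> Path:
    """Stream the file at `url` to `destination` and return the saved path."""
    target = Path(destination)
    # urlopen raises an exception on HTTP errors such as 404
    with urlopen(to_raw_url(url)) as response, target.open("wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target
```

For example, `download_csv("https://github.com/example-user/example-repo/blob/main/data/iris.csv", "iris.csv")` would fetch the raw file, where the owner, repository, and path are placeholders rather than real locations. The saved file can then be loaded with `pd.read_csv("iris.csv")` as before.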
- Contains Body Mass Index (BMI) data.
- Useful for health and fitness analysis.
- Contains department-related information.
- Useful for organizational data processing.
- Contains employee details.
- Can be used for HR analytics and workforce management.
- Classic Iris dataset for machine learning.
- Contains different species of iris flowers with their measurements.
- Contains item similarity data.
- Useful for recommendation system development.
- Dataset containing information about movies.
- Useful for movie recommendation models.
- Contains music genre classification data.
- Can be used for genre prediction models.
- Not a dataset; this file is for an AVR custom marker.
- Sample dataset for practicing pandas library operations.
- Useful for learning data manipulation.
- Another dataset for pandas tutorials.
- Contains structured data for training purposes.
- Contains user ratings for various items.
- Useful for collaborative filtering and recommendation systems.
- A sample dataset.
- Can be used for testing and learning purposes.
- A test dataset.
- Used for validation and experimentation.
Explore More Datasets on my Kaggle
These datasets can be used for:
- Machine learning projects
- Data analysis and visualization
- Educational and tutorial purposes
If you have additional datasets to contribute, feel free to upload them and update this README with the necessary descriptions.
These datasets are provided for educational and research purposes. Please check individual datasets for any specific license information.
For any questions or suggestions, feel free to raise an issue or contact Lovnish Verma.
A list of public datasets for machine learning, AI, data science, and analytics projects.
- UCI Machine Learning Repository - Classic datasets used in academic ML research.
- Kaggle Datasets - User-contributed datasets with competitions and notebooks.
- Google Dataset Search - Dataset-specific search engine.
- AWS Open Data Registry - Public datasets hosted on AWS.
- Microsoft Azure Open Datasets - Curated datasets for training on Azure.
- OpenML - Collaborative platform for sharing datasets and experiments.
- Papers with Code (Datasets) - ML benchmarks tied to research papers.
- Hugging Face Datasets - NLP, vision, and multimodal datasets.
- Zenodo - Scientific datasets with citation support.
- Figshare - Open-access research datasets.
- Data World - Community platform for data sharing.
- Awesome Public Datasets (GitHub) - Curated list across domains.
- FiveThirtyEight Data - Datasets used in data journalism.
- Quandl - Financial and economic data.
- India AI (Dataset Repository) - Indian AI project datasets.
- Data.gov.in - Indian government open data.
- Data.gov (USA) - US federal open datasets.
- EU Open Data Portal - Data from European institutions.
- UK Data Service - Economic and social research datasets (UK).
- Canada Open Government - Datasets from Canada.
- Australia Data Portal - Australian government datasets.
- ImageNet - Large-scale image classification dataset.
- COCO Dataset - Object detection, segmentation, and captioning.
- Open Images Dataset - Annotated image data.
- Stanford Dogs Dataset - Fine-grained image classification.
- Common Crawl - Large-scale web crawl data.
- Wikipedia Dumps - Raw Wikipedia text.
- Project Gutenberg - Public domain books for NLP.
- TREC Question Classification - NLP benchmark dataset.
- PhysioNet - Physiological and clinical data.
- MIMIC-III - ICU medical data (de-identified).
- NIH Biomedical Data - NIH open data portal.
- Cancer Imaging Archive - Medical imaging data for cancer research.
- OpenSLR - Speech recognition datasets.
- LibriSpeech ASR - Audiobook dataset for speech recognition.
- OpenStreetMap (Geofabrik) - Extracts of OSM data.
- Google Open Buildings - Global building footprints.
| Name | Domain | Link |
|---|---|---|
| UCI ML Repo | General | Link |
| Kaggle | General | Link |
| IndiaAI | Govt (India) | Link |
| Data.gov.in | Govt (India) | Link |
| Data.gov | Govt (USA) | Link |
| Data World | General | Link |
| Hugging Face | NLP/ML | Link |
| Papers with Code | Benchmarks | Link |
| Zenodo | Research | Link |
For code integration and automatic downloads, you can often use Python libraries such as:
from datasets import load_dataset
dataset = load_dataset("imdb")  # Hugging Face example

You can also automate downloads from Kaggle via API:

kaggle datasets download -d username/dataset-name

Feel free to contribute more sources via pull request!
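
The Kaggle command-line download mentioned above can also be driven from a Python script. The following is a rough sketch under a few assumptions: the `kaggle` CLI is installed and your API credentials are already configured, and the function names `build_kaggle_command` and `download_kaggle_dataset` are hypothetical helpers, not part of the Kaggle package.

```python
import shutil
import subprocess
from typing import List


def build_kaggle_command(dataset: str, dest: str = ".", unzip: bool = True) -> List[str]:
    """Build the argument list for `kaggle datasets download`."""
    if dataset.count("/") != 1:
        raise ValueError("dataset must look like 'username/dataset-name'")
    # -d selects the dataset, -p sets the download folder
    command = ["kaggle", "datasets", "download", "-d", dataset, "-p", dest]
    if unzip:
        command.append("--unzip")  # extract the archive after downloading
    return command


def download_kaggle_dataset(dataset: str, dest: str = ".") -> None:
    """Run the Kaggle CLI; assumes it is installed and credentials are set up."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found; install it with 'pip install kaggle'")
    # check=True raises CalledProcessError if the CLI exits with an error
    subprocess.run(build_kaggle_command(dataset, dest), check=True)
```

Calling `download_kaggle_dataset("username/dataset-name", "data")` would then run the same command as above, saving into a `data` folder. Building the command separately from running it makes it easy to print and inspect before anything is downloaded.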