Currently Missing Datasets #40

Open
61 of 77 tasks
ap0nia opened this issue Jun 8, 2023 · 23 comments

Comments

@ap0nia

ap0nia commented Jun 8, 2023

A list of dataset files we believe are missing. Will be updated as they're reported / found. Feel free to comment to report additional ones.

  • 108, Waveform Database Generator (Version 2)
    Version 1, don't know where to find version 2
  • 346, Educational Process Mining (EPM): A Learning Analytics Data Set
  • 375, Miskolc IIS Hybrid IPS
  • 533, Detect Malware Types
    Popular dataset
  • 534, Wave Energy Converters
  • 538, Sattriya_Dance_Single_Hand_Gestures Dataset
  • 546, Vehicle routing and scheduling problems
  • 548, Breath Metabolomics
  • 564, Unmanned Aerial Vehicle (UAV) Intrusion Detection
  • 568, Refractive Errors
  • 578, Intelligent Media Accelerometer and Gyroscope (IM-AccGyro) Dataset
  • 584, Labeled Text forum threads dataset
  • 595/599, LastFM Asia Social Network
  • 596, Wheat Kernels
  • 611, Hierarchical Sales Data
  • 613, Smartphone Dataset for Anomaly Detection in Crowds
  • 16, Breast Cancer Wisconsin (Prognostic)
  • 17, Breast Cancer Wisconsin (Diagnostic)
    Only has a Graphics folder in its zip file.
    Located with the original breast-cancer-wisconsin dataset files prefixed with wdbc
  • 20, census income
    Duplicate of Adult dataset
  • 28, Japanese Credit Screening
  • 75, Musk (Version 2)
  • 84, Prodigy
  • 91, Soybean (Small)
  • 96, SPECTF Heart
  • 98, Statlog Project
  • 100, Teaching Assistant Evaluation
  • 116, US Census data 1990
  • 117, Census Income KDD
  • 143, Statlog Australian Credit Approval
  • 145, Statlog Heart
  • 146, Statlog Landsat Satellite
  • 147, Statlog Image Segmentation
  • 149, Statlog Vehicle Silhouettes
  • 150, Connectionist Bench Nettalk Corpus
    datasets 150-155 found under machine-learning-databases/undocumented
  • 151, Connectionist Bench Sonar
  • 152, Connectionist Bench Vowel Recognition Deterding Data
  • 153, Economic Sanctions
  • 154, Protein Data
  • 155, Cloud
  • 157, Dodgers Loop Sensor
  • 163, Reuters Transcribed Subset
  • 241, One-hundred plant species leaves data set
  • 271, Activities of Daily Living (ADLs) Recognition Using Binary Sensors
  • 285, Wilt
  • 286, User Identification From Walking Activity
  • 288, Leaf
  • 295, Urban Land Cover
  • 315, Geographical Origin of Music
  • 341, smartphone+based+recognition+of+human+activities+and+postural+transitions
  • 316, Condition Based Maintenance of Naval Propulsion Plants
  • 329, Diabetic retinopathy debrecen
  • 358, Improved Spiral Test Using Digitized Graphics Tablet for Monitoring Parkinson's Disease
  • 384, Quality Assessment of Digital Colposcopies
  • 400, Crowdsourced Mapping
  • 406, Anuran Calls MFCCS
  • 415, DSRC Vehicle Communications
  • 416, mturk+user+perceived+clusters+over+images
  • 419, Autistic Spectrum Disorder Screening Data for Children. Wrong zip file name.
  • 420, Autistic Spectrum Disorder Screening Data for Adolescent. Wrong zip file name.
  • 426, Autism Screening Adult. Wrong zip file name.
  • 452, GNFUV Unmanned Surface Vehicles Sensor Data
  • 459, Avila
  • 483, Behavior of the urban traffic of the city of Sao Paulo in Brazil
  • 496, Alcohol QCM Sensor Dataset
  • 524, Swarm Behavior
    Folder exists directly under dataweb2/ml, but I don't have permission to access it
  • 535, Youtube cookery channels viewers comments in Hinglish
  • 539, Divorce Predictors (same as 497)
  • 543, User Profiling and Abusive Language Detection Dataset
  • 544, Estimation of Obesity Levels
    Found at dataweb2/ml/files
  • 545, Rice (Cammeo and Osmacik)
    Recovered from Kaggle
  • 553, clickstream data for online shopping
  • 559, Bar Crawl Detection Heavy Drinking
  • 563, Iranian Churn dataset
  • 572, Taiwanese Bankruptcy Prediction
    Recovered from Kaggle
  • 594, Shoulder Implant Manufacture Classification
  • 598, Multi-view Brain Networks
  • 608, traffic+flow+forecasting. This should be the same as 734, traffic+flow+forecasting+1
@rkost

rkost commented Jul 18, 2023

Dataset 148 currently only links to a zip archive with no data but one empty folder called Graphics. The downloaded archive is only 116 bytes in size.
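
For anyone who wants to check other downloads for the same symptom, here is a minimal Python sketch. It is not part of the repository tooling; the function name `is_placeholder_archive` and the default folder name are assumptions made for illustration. It reports whether a zip contains no files outside a `Graphics` folder.

```python
import io
import zipfile


def is_placeholder_archive(zip_bytes: bytes, placeholder_dir: str = "Graphics") -> bool:
    """Return True if the archive holds no files outside `placeholder_dir`.

    Hypothetical helper: an archive with only an empty Graphics folder, or
    with only images inside Graphics, is treated as carrying no data.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            top_level = info.filename.split("/", 1)[0]
            # A file at the top level, or under any other folder, counts as data.
            if "/" not in info.filename or top_level != placeholder_dir:
                return False
    return True
```

The bytes of a downloaded archive can be passed in directly, for example `is_placeholder_archive(open("statlog+shuttle.zip", "rb").read())`, where the local file name is only an example.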

@ap0nia
Author

ap0nia commented Jul 18, 2023

Thank you for informing us. This should be rectified now. I can find 4 files in the downloaded zip file. Please let us know if this isn't the case for you, thanks!

Link to the dataset page: https://archive.ics.uci.edu/dataset/148/statlog+shuttle

@rkost

rkost commented Jul 19, 2023

Hey, thanks so much for the fast response! Yes, all 4 files are there.

I wonder if it is intentional that the training data is still compressed after unzipping the downloaded archive while the test data is not? One can get the original data by running uncompress shuttle.trn.Z on any Unix system; I'm not sure about Windows users.

Edit. Ah, just saw that the index file also lists the training data as a compressed file, disregard then :)
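
For readers without the `uncompress` command, a hedged Python sketch follows. It assumes a `gzip` executable is on the PATH (gzip can read files written by Unix `compress`), which is not a given on Windows unless something like Git for Windows or WSL is installed. The helper name `decompress_z` is hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def decompress_z(source: Path, target: Path) -> None:
    """Decompress a Unix `compress` (.Z) file into `target` via the gzip CLI.

    Assumption: a `gzip` binary is installed; the Python standard library
    has no decoder for the .Z format.
    """
    if not source.is_file():
        raise FileNotFoundError(source)
    gzip_path = shutil.which("gzip")
    if gzip_path is None:
        raise RuntimeError("gzip was not found on PATH")
    with target.open("wb") as out:
        # -d decompresses, -c writes to stdout so the original .Z file is kept.
        subprocess.run([gzip_path, "-dc", str(source)], stdout=out, check=True)
```

Usage would look like `decompress_z(Path("shuttle.trn.Z"), Path("shuttle.trn"))`.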

@markellekelly

Same issue with Census Income (#20)—the zip only contains a "Graphics" folder

@ptruong0

Hi Markelle, the abstract of the Census Income dataset says that it is the same as the Adult dataset. We can either copy the Adult files to the Census Income dataset, or remove Census Income altogether. How should we handle this?

@markellekelly

Since this dataset is well known under both names, let's have the data available under both for now (i.e., go ahead and copy the Adult files); we can discuss combining the two later. Thanks!

@maxxu05

maxxu05 commented Sep 4, 2023

@ptruong0

ptruong0 commented Sep 4, 2023

@maxxu05 Fixed, thanks for letting us know.

@jundsp

jundsp commented Sep 20, 2023

There is missing data from Dataset 301 "Parkinson Speech Dataset with Multiple Types of Sound Recordings":
https://archive.ics.uci.edu/dataset/301/parkinson+speech+dataset+with+multiple+types+of+sound+recordings

It used to include a .rar file that contained the audio files (~20 MB), but now it only includes a couple of text files. For example, this snapshot from 2015 shows the full dataset:
https://web.archive.org/web/20150208025709/http://archive.ics.uci.edu/ml/machine-learning-databases/00301/

@Wamadahama

Dataset 28 - Japanese Credit Screening at https://archive.ics.uci.edu/dataset/28/japanese+credit+screening appears to be missing its data files; the download contains only an empty Graphics folder.

@AhmedGHDev

Dataset 84 [Prodigy] currently only links to a zip archive with no data but one empty folder called Graphics.

@AhmedGHDev

AhmedGHDev commented Oct 21, 2023

Dataset 157 [Dodgers Loop Sensor] currently only links to a zip archive with no data, just one folder called Graphics containing two images (the images for the dataset).

Just to mention that the file https://archive.ics.uci.edu/static/public/156/calit2+building+people+counts.zip contains 6 files. I think two of them belong to the [Dodgers Loop Sensor] dataset, namely:

  • Dodgers.data
  • Dodgers.events

@AhmedGHDev

Dataset 75 [Musk (Version 2)] currently only links to a zip archive with no data but one empty folder called Graphics.

Just to mention that the file https://archive.ics.uci.edu/static/public/74/musk+version+1.zip contains 7 files. I think three of them belong to the [Musk (Version 2)] dataset, namely:

  • clean2.data.Z
  • clean2.info
  • clean2.names

@AhmedGHDev

AhmedGHDev commented Oct 21, 2023

Dataset 91 [Soybean (Small)] currently only links to a zip archive with no data but one empty folder called Graphics.

Just to mention that the file https://archive.ics.uci.edu/static/public/90/soybean+large.zip contains 12 files. I think two of them belong to the [Soybean (Small)] dataset, namely:

  • soybean-small.data
  • soybean-small.names

@AhmedGHDev

Dataset 96 [SPECTF Heart] currently only links to a zip archive with no data but one empty folder called Graphics.

Just to mention that the file https://archive.ics.uci.edu/static/public/95/spect+heart.zip contains 8 files. I think two of them belong to the [SPECTF Heart] dataset, namely:

  • SPECTF.train
  • SPECTF.test
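
The reports above for datasets 157, 75, 91 and 96 share one pattern: the files seem to sit in the neighbouring dataset's zip. As a possible stopgap, here is a minimal Python sketch that pulls named files out of such an archive. The helper name `pick_members` is hypothetical, and matching by base name is an assumption, since the folder layout inside the real archives is not stated in this thread.

```python
import io
import zipfile
from pathlib import PurePosixPath


def pick_members(zip_bytes: bytes, wanted: set[str]) -> dict[str, bytes]:
    """Return {basename: content} for archive members whose basename is in `wanted`.

    Hypothetical helper: matches on the file's base name so it works whether
    or not the archive nests its files inside a folder.
    """
    found: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            name = PurePosixPath(info.filename).name
            if not info.is_dir() and name in wanted:
                found[name] = archive.read(info)
    return found
```

For example, calling it on the bytes of spect+heart.zip with `{"SPECTF.train", "SPECTF.test"}` would return just those two files if they are present, and an empty dict otherwise.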

@AhmedGHDev

AhmedGHDev commented Oct 21, 2023

Another question, please:
The website currently contains 657 datasets, but the dataset IDs reach 892.
Are there private datasets?

@ptruong0

When datasets are donated, they have to be approved by admins. There are currently 657 approved datasets, and 892 datasets in total including pending & rejected datasets.

@mirfan83

Hello,
Dataset 613, Smartphone Dataset for Anomaly Detection in Crowds, is also missing.

Thanks.

@rlongjohn

Also missing: Connectionist Bench (Sonar, Mines vs. Rocks)

@markellekelly

We used to have the PIMA Indians dataset (many other websites, e.g., Kaggle, attribute it to us); not sure what happened to it.

@ptruong0

ptruong0 commented Apr 3, 2024

@markellekelly The owners of the PIMA dataset replaced the files with a note.txt that says "Thank you for your interest in the Pima Indians Diabetes dataset. The dataset is no longer available due to permission restrictions."

@evelina-crypto

I also cannot access my dataset and get "DatasetNotFoundError: Error reading data csv file for "Cirrhosis Patient Survival Prediction" dataset (id=878)."
