GigaMIDI
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
- Repository: Anonymized during the peer review process
- Point of Contact: Anonymized during the peer review process
The GigaMIDI dataset is a corpus of over 1 million MIDI files covering all music genres.
We provide three subsets: drums-only, which contains MIDI files made exclusively of drum tracks; no-drums, for MIDI files containing any MIDI program except drums (channel 10); and all-instruments-with-drums, for MIDI files containing multiple MIDI programs including drums. The all subset combines the three to give the full dataset.
The datasets library allows you to load and pre-process your dataset in pure Python at scale. The dataset can be downloaded and prepared on your local drive in one call to the load_dataset function.
from datasets import load_dataset
dataset = load_dataset("Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True)
You can load combinations of specific subsets by using the subsets keyword argument when loading the dataset:
from datasets import load_dataset
dataset = load_dataset("Metacreation/GigaMIDI", "music", subsets=["no-drums", "all-instruments-with-drums"], trust_remote_code=True)
Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Streaming mode loads individual samples one at a time, rather than downloading the entire dataset to disk.
from datasets import load_dataset
dataset = load_dataset(
    "Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True, streaming=True, split="train"
)
print(next(iter(dataset)))
Bonus: create a PyTorch dataloader directly with your own datasets (local/streamed).
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler
dataset = load_dataset("Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True, split="train")
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False)
dataloader = DataLoader(dataset, batch_sampler=batch_sampler)
from datasets import load_dataset
from torch.utils.data import DataLoader
dataset = load_dataset("Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True, split="train")
dataloader = DataLoader(dataset, batch_size=32)
MIDI files can be easily loaded and tokenized with Symusic and MidiTok respectively.
from datasets import load_dataset
from symusic import Score
dataset = load_dataset("Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True, split="train")
score = Score.from_midi(dataset[0]["music"]["bytes"])
The dataset can be processed by using the dataset.map and dataset.filter methods.
from pathlib import Path

from datasets import load_dataset
from miditok.constants import SCORE_LOADING_EXCEPTION
from miditok.utils import get_bars_ticks
from symusic import Score


def is_score_valid(
    score: Score | Path | bytes, min_num_bars: int, min_num_notes: int
) -> bool:
    """
    Check if a ``symusic.Score`` is valid, i.e. contains enough bars and notes.

    :param score: ``symusic.Score`` to inspect, path to a MIDI file, or raw MIDI bytes.
    :param min_num_bars: minimum number of bars the score should contain.
    :param min_num_notes: minimum number of notes that score should contain.
    :return: boolean indicating if ``score`` is valid.
    """
    # Load the score first if a path or raw bytes were given;
    # files that cannot be parsed are treated as invalid.
    if isinstance(score, Path):
        try:
            score = Score(score)
        except SCORE_LOADING_EXCEPTION:
            return False
    elif isinstance(score, bytes):
        try:
            score = Score.from_midi(score)
        except SCORE_LOADING_EXCEPTION:
            return False

    return (
        len(get_bars_ticks(score)) >= min_num_bars and score.note_num() > min_num_notes
    )


dataset = load_dataset("Metacreation/GigaMIDI", "all-instruments-with-drums", trust_remote_code=True, split="train")
# Keep only the files that parse correctly and are long enough.
dataset = dataset.filter(
    lambda ex: is_score_valid(ex["music"]["bytes"], min_num_bars=8, min_num_notes=50)
)
A typical data sample comprises the md5 of the file, which corresponds to its file name, and a music entry containing a dictionary with the file's absolute path and its bytes, which can be loaded with symusic as score = Score.from_midi(dataset[sample_idx]["music"]["bytes"]).
Metadata accompanies each file and is introduced in the next section.
A data sample indexed from the dataset may look like this (the bytes entry is deliberately shortened):
{
'md5': '0211bbf6adf0cf10d42117e5929929a4',
'music': {'path': '/Users/nathan/.cache/huggingface/datasets/downloads/extracted/cc8e36bbe8d5ec7ecf1160714d38de3f2f670c13bc83e0289b2f1803f80d2970/0211bbf6adf0cf10d42117e5929929a4.mid', 'bytes': b"MThd\x00\x00\x00\x06\x00\x01\x00\x05\x01\x00MTrk\x00"},
'is_drums': False,
'sid_matches': {'sid': ['065TU5v0uWSQmnTlP5Cnsz', '29OG7JWrnT0G19tOXwk664', '2lL9TiCxUt7YpwJwruyNGh'], 'score': [0.711, 0.8076, 0.8315]},
'mbid_matches': {'sid': ['065TU5v0uWSQmnTlP5Cnsz', '29OG7JWrnT0G19tOXwk664', '2lL9TiCxUt7YpwJwruyNGh'], 'mbids': [['43d521a9-54b0-416a-b15e-08ad54982e63', '70645f54-a13d-4123-bf49-c73d8c961db8', 'f46bba68-588f-49e7-bb4d-e321396b0d8e'], ['43d521a9-54b0-416a-b15e-08ad54982e63', '70645f54-a13d-4123-bf49-c73d8c961db8'], ['3a4678e6-9d8f-4379-aa99-78c19caf1ff5']]},
'artist_scraped': 'Bach, Johann Sebastian',
'title_scraped': 'Contrapunctus 1 from Art of Fugue',
'genres_scraped': ['classical', 'romantic'],
'genres_discogs': {'genre': ['classical', 'classical---baroque'], 'count': [14, 1]},
'genres_tagtraum': {'genre': ['classical', 'classical---baroque'], 'count': [1, 1]},
'genres_lastfm': {'genre': [], 'count': []},
'median_metric_depth': [0, 0, 0, 0]
}
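The shortened bytes value in the sample above begins with the standard MIDI header chunk (MThd), so a few basic file properties can be read without any third-party package. The sketch below decodes that header with the Python standard library only; the helper name parse_midi_header is hypothetical and is not part of the dataset tooling.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Decode the 14-byte MThd chunk at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("not enough bytes for a MIDI header")
    # Big-endian layout: 4-byte chunk id, 4-byte chunk length,
    # then three 16-bit fields (format, number of tracks, division).
    chunk_id, length, midi_format, num_tracks, division = struct.unpack(
        ">4sIHHH", data[:14]
    )
    if chunk_id != b"MThd":
        raise ValueError("missing MThd chunk id")
    return {
        "header_length": length,
        "format": midi_format,
        "num_tracks": num_tracks,
        "division": division,
    }


# Shortened bytes copied from the data sample above.
sample_bytes = b"MThd\x00\x00\x00\x06\x00\x01\x00\x05\x01\x00MTrk\x00"
header = parse_midi_header(sample_bytes)
print(header)
```

For this sample the header reports a format 1 file with five tracks and a division of 256 ticks per quarter note. Anything beyond the header (notes, tempo, programs) is better read with symusic.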
The GigaMIDI dataset includes the MetaMIDI dataset. Consequently, it also contains the MetaMIDI metadata, which we compiled here in a convenient and easy-to-use dataset format. The fields of each data entry are:
- md5 (string): hash of the MIDI file, corresponding to its file name;
- music (dict): a dictionary containing the absolute path to the downloaded file and the file content as bytes, to be loaded with an external Python package such as symusic;
- is_drums (boolean): whether the sample comes from the drums subset, which can be useful when working with the all subset;
- sid_matches (dict[str, list[str] | list[float16]]): ids of the Spotify entries matched and their scores;
- mbid_matches (dict[str, str | list[str]]): ids of the MusicBrainz entries matched with the Spotify entries;
- artist_scraped (string): scraped artist of the entry;
- title_scraped (string): scraped song title of the entry;
- genres_scraped (list[str]): scraped genres of the entry;
- genres_discogs (dict[str, list[str] | list[int16]]): Discogs genres matched from the AcousticBrainz dataset;
- genres_tagtraum (dict[str, list[str] | list[int16]]): Tagtraum genres matched from the AcousticBrainz dataset;
- genres_lastfm (dict[str, list[str] | list[int16]]): Lastfm genres matched from the AcousticBrainz dataset;
- median_metric_depth (list[int16]).
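Several of these fields (sid_matches and the three genres_* dictionaries) store two parallel lists rather than a list of pairs. As a minimal sketch of how to pick the highest-ranked entry from such a field, the example below uses values copied from the data sample shown earlier; the helper top_entry is hypothetical.

```python
from typing import Optional, Sequence


def top_entry(labels: Sequence[str], weights: Sequence[float]) -> Optional[str]:
    """Return the label with the largest weight, or None if the lists are empty."""
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    if not labels:
        return None
    # On ties, max() keeps the first index, i.e. the first label listed.
    best_index = max(range(len(labels)), key=lambda i: weights[i])
    return labels[best_index]


# Values copied from the data sample shown earlier.
sample = {
    "sid_matches": {
        "sid": ["065TU5v0uWSQmnTlP5Cnsz", "29OG7JWrnT0G19tOXwk664", "2lL9TiCxUt7YpwJwruyNGh"],
        "score": [0.711, 0.8076, 0.8315],
    },
    "genres_discogs": {"genre": ["classical", "classical---baroque"], "count": [14, 1]},
    "genres_lastfm": {"genre": [], "count": []},
}

best_sid = top_entry(sample["sid_matches"]["sid"], sample["sid_matches"]["score"])
main_genre = top_entry(sample["genres_discogs"]["genre"], sample["genres_discogs"]["count"])
no_genre = top_entry(sample["genres_lastfm"]["genre"], sample["genres_lastfm"]["count"])
print(best_sid, main_genre, no_genre)
```

Returning None for empty lists matters in practice, since fields such as genres_lastfm can be empty, as in the sample.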
The dataset has been subdivided into portions for training (train), validation (validation) and testing (test).
The validation and test splits each contain 10% of the dataset, while the training split contains the rest (about 80%).
Available for research purposes via Hugging Face hub.
The GigaMIDI dataset consists of MIDI files acquired via the aggregation of previously available datasets and web scraping from publicly available online sources. Each subset is accompanied by source links, copyright information when available, and acknowledgments. File names are anonymized using MD5 hashes. We acknowledge and cite the work from the previous dataset papers that we aggregate and analyze as part of the GigaMIDI subsets. This data has been collected, used, and distributed under Fair Dealing [ref to country and law copyright act anonymized]. Fair Dealing permits the limited use of copyright-protected material without the risk of infringement and without having to seek the permission of copyright owners. It is intended to provide a balance between the rights of creators and the rights of users. As per instructions of the Copyright Office of [anonymized University], two protective measures have been put in place that are deemed sufficient given the nature of the data (accessible online):
- We explicitly state that this dataset has been collected, used, and is distributed under Fair Dealing [ref to law/country removed here for anonymity].
- On the Hugging Face hub, we advertise that the data is available for research purposes only and collect the user's legal name and email as proof of agreement before granting access.
We thus decline any responsibility for misuse.
To justify the fair use of MIDI data for research purposes, we try to follow the FAIR (Findable, Accessible, Interoperable, and Reusable) principles in our MIDI data collection. These principles are widely recognized and frequently cited within the data research community, providing a robust framework for ethical data management. By following the FAIR principles, we ensure that our dataset is managed responsibly, supporting its use in research while maintaining high standards of accessibility, interoperability, and reusability.
In navigating the use of MIDI datasets for research and creative explorations, it is imperative to consider the ethical implications inherent in dataset bias. MIDI dataset bias often reflects the prevailing practices in Western contemporary music production, where certain instruments, notably the piano and drums, dominate due to their inherent MIDI compatibility. The piano is a primary compositional tool and a ubiquitous MIDI controller and keyboard, facilitating input for a wide range of virtual instruments and synthesizers. Similarly, drums, whether through drum machines or MIDI drum pads, enjoy widespread use for rhythm programming and beat production. This prevalence stems from their intuitive interface and versatility within digital audio workstations. Consequently, MIDI datasets tend to be skewed towards piano and drums, with fewer representations of other instruments, particularly those that may require more nuanced interpretation or are less commonly played using MIDI controllers.
A potential issue with the detected loops in the GigaMIDI dataset arises from the possibility that similar note content may appear, particularly in loop-focused applications. To mitigate this, we implemented an additional deduplication process for the detected loops. This process involved using MD5 checksums based on the extracted music loop content to ensure that identical loops are not provided to users.
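The idea behind this deduplication step can be illustrated as follows. This is only a sketch, not the implementation used to build the dataset: the loop representation (a list of (pitch, start, duration) tuples) and both function names are assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Tuple

# Assumed note layout for this example: (pitch, start tick, duration in ticks).
Note = Tuple[int, int, int]


def loop_checksum(notes: Iterable[Note]) -> str:
    """MD5 checksum of a loop's note content, independent of note order."""
    canonical = ";".join(f"{p},{s},{d}" for p, s, d in sorted(notes))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def deduplicate_loops(loops: List[List[Note]]) -> List[List[Note]]:
    """Keep the first occurrence of each loop with a distinct checksum."""
    seen = set()
    unique = []
    for loop in loops:
        digest = loop_checksum(loop)
        if digest not in seen:
            seen.add(digest)
            unique.append(loop)
    return unique


loops = [
    [(36, 0, 120), (38, 480, 120)],
    [(38, 480, 120), (36, 0, 120)],  # same notes as above, different order
    [(36, 0, 120), (42, 240, 120)],
]
unique_loops = deduplicate_loops(loops)
print(len(unique_loops))
```

A checksum only catches loops whose content is identical. Loops that differ by a single note or a small timing shift produce different checksums and are kept.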
Another potential issue with the dataset is the album-effect. When using the dataset for machine learning tasks, a random split may result in data with nearly identical note content appearing in both the evaluation and training splits of the GigaMIDI dataset. To address this potential issue, we provide metadata, including the composer's name, uniform piece title, performer’s name, and genre, where available. Additionally, the GigaMIDI dataset includes a substantial portion of drum grooves, which are single-track MIDI files; such files typically do not contribute to the album-effect.
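One way to use this metadata is to place every file by the same artist in the same split, so that near-identical pieces cannot end up on both sides of the training and evaluation boundary. The sketch below is an illustration under stated assumptions, not the procedure used to build the provided splits: it groups on the artist_scraped field, hashes the group key to obtain a stable assignment, and defaults to the 10% validation and 10% test proportions described earlier. The helper assign_split and the placeholder md5 values are hypothetical.

```python
import hashlib


def assign_split(group_key: str, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Map a group key (e.g. an artist name) to a split deterministically."""
    # Normalise so that trivial spelling differences land in the same group.
    normalised = group_key.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).digest()
    # Turn the first 8 bytes of the digest into a roughly uniform number between 0 and 1.
    position = int.from_bytes(digest[:8], "big") / 2**64
    if position < test_fraction:
        return "test"
    if position < test_fraction + val_fraction:
        return "validation"
    return "train"


# Placeholder entries: only the artist_scraped values matter here.
examples = [
    {"md5": "a", "artist_scraped": "Bach, Johann Sebastian"},
    {"md5": "b", "artist_scraped": "bach, johann sebastian "},
    {"md5": "c", "artist_scraped": "Some Other Artist"},
]
splits = {ex["md5"]: assign_split(ex["artist_scraped"]) for ex in examples}
print(splits)
```

Because the assignment depends only on the group key, it is reproducible across runs and machines. Files with missing artist metadata would all share one key under this scheme, so a fallback key such as the md5 is needed for them.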
Lastly, all source data is duly acknowledged and cited in accordance with fair use and ethical standards. More than 50% of the dataset was collected through web scraping and author-led initiatives, which include manual data collection from online sources and retrieval from Zenodo and GitHub. To ensure transparency and prevent misuse, links to data sources for each subset are systematically organized and provided in our GitHub repository, enabling users to identify and verify the datasets used.
The FAIR (Findable, Accessible, Interoperable, Reusable) principles serve as a framework to ensure that data is well-managed, easily discoverable, and usable for a broad range of purposes in research. These principles are particularly important in the context of data management to facilitate open science, collaboration, and reproducibility.
- Findable: Data should be easily discoverable by both humans and machines. This is typically achieved through proper metadata, traceable source links and searchable resources. Applying this to MIDI data, each subset of MIDI files collected from public domain sources should be accompanied by clear and consistent metadata. For example, organizing the source links of each data subset, as done with the GigaMIDI dataset, ensures that each source can be easily traced and referenced, improving discoverability.
- Accessible: Once found, data should be easily retrievable using standard protocols. Accessibility does not necessarily imply open access, but it does mean that data should be available under well-defined conditions. For the GigaMIDI dataset, hosting the data on platforms like Hugging Face Hub improves accessibility, as these platforms provide efficient data retrieval mechanisms, especially for large-scale datasets. Ensuring that MIDI data is accessible for public use, while respecting any applicable licenses, supports wider research and analysis in music computing.
- Interoperable: Data should be structured in such a way that it can be integrated with other datasets and used by various applications. MIDI data, being a widely accepted format in music research, is inherently interoperable, especially when standardized metadata and file formats are used. By ensuring that the GigaMIDI dataset complies with widely adopted standards and supports integration with state-of-the-art libraries in symbolic music processing, such as Symusic (https://github.com/Yikai-Liao/symusic) and MidiTok (https://github.com/Natooz/MidiTok), the dataset enhances its utility for music researchers and practitioners working across different platforms and systems.
- Reusable: Data should be well-documented and licensed so it can be reused in future research. Reusability is ensured through proper metadata, clear licenses, and documentation of provenance. In the case of GigaMIDI, aggregating all subsets from public domain sources and linking them to the original sources strengthens the reproducibility and traceability of the data. This practice allows future researchers to not only use the dataset but also verify and expand upon it by referring to the original data sources.
In summary, applying FAIR principles to managing MIDI data, such as the GigaMIDI dataset, ensures that the data is organized in a manner that promotes reproducibility and traceability. By clearly documenting the source links of each subset and ensuring the dataset is findable, accessible, interoperable, and reusable, the data becomes a robust resource for the research community.