
GitHub repository for the paper "The GigaMIDI dataset with loops and expressive music performance detection"

Summary

Research on artificial intelligence applications in music computing has gained significant traction with the progress of deep learning. Musical Instrument Digital Interface (MIDI) data and its associated metadata are fundamental to the advancement of models that perform tasks such as music generation and transcription with high efficiency and quality. Most public music datasets contain audio data, and symbolic music datasets are comparatively small. However, MIDI data offers advantages over audio, such as providing an editable version of musical content that is independent of its sonic rendering. MIDI data can be quantized or interpreted with variations in micro-timing and velocity, but only limited metadata and few algorithms exist to differentiate expressive symbolic music data, performed by a musician, from non-expressive data that can be assimilated to music scores. To address these challenges, we present the GigaMIDI dataset, a comprehensive corpus comprising over 1.43M MIDI files, 5.3M tracks, and 1.8B notes, along with annotations for loops and metadata for expressive performance detection. To detect which tracks reflect expressive human interpretation, we introduce a new heuristic called note onset median metric level (NOMML), which allowed us to identify with 99.5% accuracy that 31% of GigaMIDI tracks are expressive. Detecting loops, or repetitions of musical patterns, presents a challenge when tracks exhibit expressive timing variations, as repeated patterns may not be strictly identical. To address this issue, we mark MIDI loops for non-expressive music tracks, which allows us to identify 7M loops. The GigaMIDI dataset is accessible for research purposes on the Hugging Face Hub [https://huggingface.co/datasets/Metacreation/GigaMIDI] in a user-friendly way, for convenience and reproducibility.

Repository Layout

/GigaMIDI: Code for creating the full GigaMIDI dataset from source files, and a README with example code for loading and processing the dataset using the datasets library (see the loading sketch after this list)

/loops_nomml: Source files for the loop detection and expressive performance detection algorithms

/scripts: Scripts and code notebooks for analyzing the GigaMIDI dataset and the loop dataset

/tests: E2E tests for expressive performance detection and loop extraction

Analysis of Evaluation Set and Optimal Threshold Selection including Machine Learning Models: This archive includes CSV files corresponding to our curated evaluation set, which comprises both a training set and a testing set. These files contain the percentile calculations used to determine the optimal thresholds for each heuristic in expressive music performance detection. Percentiles of the data distribution are used to establish clear boundaries between non-expressive and expressive tracks, based on the values of our heuristic features. Additionally, we provide pre-trained models in .pkl format, developed using features derived from our novel heuristics. The hyperparameter setup is detailed in the section titled Pipeline Configuration below.

Data Source Links for the GigaMIDI Dataset: Data source links for each collected subset of the GigaMIDI dataset are organized and provided as a PDF.
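
As referenced above, a minimal sketch for loading the dataset with the datasets library is shown below. The subset name and the use of streaming are assumptions; consult the /GigaMIDI README and the dataset card on the Hugging Face Hub for the authoritative loading code.

from datasets import load_dataset

# The subset/config name is an assumption; check the dataset card for
# the exact configuration names available.
dataset = load_dataset(
    "Metacreation/GigaMIDI",
    "all-instruments-with-drums",
    split="train",
    streaming=True,  # stream examples instead of downloading the full corpus
)
example = next(iter(dataset))
print(example.keys())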

Running MIDI-based Loop Detection

Included with the GigaMIDI dataset is a collection of all loops identified in the dataset, between 4 and 32 bars in length, with a minimum density of 0.5 notes per beat. For our purposes, we consider a segment of a track to be loopable if it is bookended by a repeated phrase of a minimum length (at least 2 beats and 4 note events).

Loop example

Starter Code

To run loop detection on a single MIDI file, use the detect_loops function:

from loops_nomml import detect_loops
from symusic import Score

# Use forward slashes so the path also works outside Windows
score = Score("tests/midi_files/Mr. Blue Sky.mid")
loops = detect_loops(score)
print(loops)

The output contains all the metadata needed to locate each loop within the file. Start and end times are given in MIDI ticks, and density in notes per beat:

{'track_idx': [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 5], 'instrument_type': ['Piano', 'Piano', 'Piano', 'Piano', 'Piano', 'Piano', 'Piano', 'Piano', 'Piano', 'Drums', 'Drums', 'Drums', 'Drums', 'Drums', 'Piano', 'Piano'], 'start': [238080, 67200, 165120, 172800, 1920, 97920, 15360, 216960, 276480, 7680, 195840, 122880, 284160, 117120, 49920, 65280], 'end': [241920, 82560, 180480, 188160, 3840, 99840, 17280, 220800, 291840, 9600, 211200, 138240, 291840, 130560, 51840, 80640], 'duration_beats': [8.0, 32.0, 32.0, 32.0, 4.0, 4.0, 4.0, 8.0, 32.0, 4.0, 32.0, 32.0, 16.0, 28.0, 4.0, 32.0], 'note_density': [0.75, 1.84375, 0.8125, 0.8125, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.8125, 2.46875, 2.4375, 2.5, 0.5, 0.6875]}  
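
As an illustration, this metadata can be used to slice a loop's notes out of the score. The following is a minimal sketch, not part of the repository's API; it only assumes the fields shown above and symusic's note attributes:

from loops_nomml import detect_loops
from symusic import Score

score = Score("tests/midi_files/Mr. Blue Sky.mid")
loops = detect_loops(score)

# Locate the first loop: its track index and tick boundaries
track_idx = loops["track_idx"][0]
start, end = loops["start"][0], loops["end"][0]

# Keep the notes whose onset falls inside the loop window
track = score.tracks[track_idx]
loop_notes = [note for note in track.notes if start <= note.time < end]
print(f"Loop spans ticks {start}-{end} with {len(loop_notes)} notes")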

Batch Processing Loops

We also provide a script, main.py, that batch-extracts all loops in a dataset. This requires that you have downloaded GigaMIDI; see the dataset README for instructions. Once you have downloaded the dataset, update the DATA_PATH and METADATA_NAME globals to reflect the location of GigaMIDI on your machine, then run the script:

python main.py
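
For reference, a minimal sketch of what such a batch run could look like is shown below. The DATA_PATH value, the output file name, and the error handling are illustrative assumptions; the actual main.py may differ:

import json
from pathlib import Path

from loops_nomml import detect_loops
from symusic import Score

DATA_PATH = Path("/path/to/GigaMIDI")  # assumed location of the downloaded dataset

all_loops = {}
for midi_path in DATA_PATH.rglob("*.mid"):
    try:
        all_loops[str(midi_path)] = detect_loops(Score(str(midi_path)))
    except Exception as exc:
        # Skip files that symusic cannot parse
        print(f"Skipping {midi_path}: {exc}")

with open("loops.json", "w") as f:  # hypothetical output file
    json.dump(all_loops, f)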

Instructions for using the code for the note onset median metric level (NOMML) heuristic

Install the Python libraries required by the NOMML code:

pip install numpy tqdm symusic

Note: the symusic library is used for MIDI parsing.

Using the command line

usage:

python nomml.py [-h] --folder FOLDER [--force] [--nthreads NTHREADS]
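
For example, to process a folder of MIDI files with four worker threads (the folder path is illustrative):

python nomml.py --folder path/to/midi_files --nthreads 4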

Note: if the script runs successfully, it generates a .json file with the appropriate metadata.

Pipeline Configuration

The following pipeline configuration was determined through hyperparameter tuning using leave-one-out cross-validation and GridSearchCV for the logistic regression model:

# Hyperparameters
{'C': 0.046415888336127774}

# Logistic Regression Instance
LogisticRegression(random_state=0, C=0.046415888336127774, max_iter=10000, tol=0.1)

# Pipeline
Pipeline(steps=[('scaler', StandardScaler(with_std=False)),
                ('logistic',
                 LogisticRegression(C=0.046415888336127774, max_iter=10000,
                                    tol=0.1))])
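
For reproducibility, the sketch below reconstructs and fits this pipeline end to end. The feature matrix X and labels y are random placeholders standing in for the NOMML-derived features and expressive/non-expressive labels of the curated evaluation set, and the search grid over C is an illustrative assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: substitute the heuristic features and labels
# from the curated evaluation set
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = rng.integers(0, 2, 100)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler(with_std=False)),  # center features, keep scale
    ("logistic", LogisticRegression(C=0.046415888336127774, random_state=0,
                                    max_iter=10000, tol=0.1)),
])
pipeline.fit(X, y)

# Leave-one-out grid search over C, as described above (grid values assumed)
search = GridSearchCV(pipeline, {"logistic__C": np.logspace(-4, 4, 25)},
                      cv=LeaveOneOut())
search.fit(X, y)
print(search.best_params_)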

Acknowledgement

We gratefully acknowledge the support and contributions that have directly or indirectly aided this research. This work was supported in part by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Social Sciences and Humanities Research Council of Canada (SSHRC). We also extend our gratitude to the School of Interactive Arts and Technology (SIAT) at Simon Fraser University (SFU) for providing resources and an enriching research environment. Additionally, we thank the Centre for Digital Music (C4DM) at Queen Mary University of London (QMUL) for fostering collaborative opportunities and supporting our engagement with interdisciplinary research initiatives.

Special thanks are extended to Dr. Cale Plut for his meticulous manual curation of musical styles and to Dr. Nathan Fradet for his invaluable assistance in developing the Hugging Face Hub page for the GigaMIDI dataset, ensuring it is accessible and user-friendly for music computing and MIR researchers. We also sincerely thank our research interns, Paul Triana and Davide Rizotti, for their thorough proofreading of the manuscript.

Finally, we express our heartfelt appreciation to the individuals and communities who generously shared their MIDI files for research purposes. Their contributions have been instrumental in advancing this work and fostering collaborative knowledge in the field.
