# MusicBrainz: reduce memory used by processing in chunks #165
**Open** — Yueqiao12Zhang wants to merge 47 commits into `main` from `musicbrainz-reduce-memory-used`.
Commits (47 total; changes shown from 41 commits):
- `aca1887` refactor: more ignored keys (Yueqiao12Zhang)
- `110263b` refactor: avoid new allocation (Yueqiao12Zhang)
- `f9d2091` doc: explain if statement (Yueqiao12Zhang)
- `f72eb1e` refactor: extract and convert in chunks (Yueqiao12Zhang)
- `48adb26` fix: write header (Yueqiao12Zhang)
- `87d9fe5` refactor: if not first level, then don't extract name (Yueqiao12Zhang)
- `6659d28` refactor: refresh values list (Yueqiao12Zhang)
- `e79de48` Update convert_to_csv.py (Yueqiao12Zhang)
- `6c565be` Revert "Update convert_to_csv.py" (Yueqiao12Zhang)
- `088c40d` refactor: read jsonl in chunks (Yueqiao12Zhang)
- `9ff292f` Merge branch 'main' into musicbrainz-reduce-memory-used (Yueqiao12Zhang)
- `2cbea9b` test: add print tests (Yueqiao12Zhang)
- `7ae3914` Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D… (Yueqiao12Zhang)
- `04b3c41` fix: delete buggy code (Yueqiao12Zhang)
- `8203c71` style: delete print test statements (Yueqiao12Zhang)
- `6e51f96` fix: memory bug and header writting (Yueqiao12Zhang)
- `e23ed58` Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D… (Yueqiao12Zhang)
- `740d365` Merge branch 'main' into musicbrainz-reduce-memory-used (Yueqiao12Zhang)
- `c0d1c78` Merge branch 'main' into musicbrainz-reduce-memory-used (Yueqiao12Zhang)
- `6c6f920` test: delete unused test files (candlecao)
- `99ff276` Update genre.csv (Yueqiao12Zhang)
- `e757df8` fix: header problem by making a temp csv (candlecao)
- `805c464` Update mapping.json (candlecao)
- `81f34e7` mapping: full columns mapping (candlecao)
- `92bfaf1` mapping: fill the empty mappings for updated CSVs (Yueqiao12Zhang)
- `581e86a` refactor: ignore iso codes since they are duplicated info (Yueqiao12Zhang)
- `6422088` refactor: recognize special math character (Yueqiao12Zhang)
- `89371de` doc: add made-up URLs (Yueqiao12Zhang)
- `e93147e` optimize: simplify input of convert_to_csv.py (Yueqiao12Zhang)
- `8dc8814` fix: syntax error (Yueqiao12Zhang)
- `397e95e` fix: entity type is the last part of file path (Yueqiao12Zhang)
- `0759de7` fix: output pathname (candlecao)
- `10f62c2` doc: update manual based on changes that removed commandline arguments (Yueqiao12Zhang)
- `738a7ed` Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D… (candlecao)
- `ceffbf1` Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D… (Yueqiao12Zhang)
- `66133e8` style: delete unused test code (Yueqiao12Zhang)
- `7cb07d0` doc: update docstring input to the correct number (Yueqiao12Zhang)
- `3f61a23` refactor: use writer instead of f.write() (Yueqiao12Zhang)
- `5a1783b` Revert "Update mapping.json" (Yueqiao12Zhang)
- `d47751e` doc: style update according to GPT (Yueqiao12Zhang)
- `3bad21e` doc: add specification for chunk_size (Yueqiao12Zhang)
- `6231da5` refactor: change list to set (Yueqiao12Zhang)
- `29dc943` fix: filename bug, readlines bug, opening wrong file bug (Yueqiao12Zhang)
- `28243a5` fix: auto escape by using \t not comma (Yueqiao12Zhang)
- `37201ab` feat: add quotechar to distinguish ", and , separator (Yueqiao12Zhang)
- `72b92e6` refactor: temp should be a tsv (Yueqiao12Zhang)
- `b64da8a` fix: correct a few temp.csv error (Yueqiao12Zhang)
@@ -1,27 +1,68 @@
# 1: The procedure:
* Since all IDs and Wikidata links are already reconciled in the conversion process, there's no need to turn to OpenRefine.
- Steps:
  1. Navigate to the ```linkedmusic-datalake/musicbrainz/csv``` folder.
  2. Run ```python3 fetch.py``` to get the latest tar.xz files from the MusicBrainz public data dumps into the local ```data/raw/``` folder.
  3. Run ```python3 untar.py``` to unzip the files and extract the needed jsonl files into the local ```data/raw/extracted_jsonl/mbdump/``` folder.
  4. Run convert_to_csv.py, specifying the JSON file as the first argument and the entity type as the second argument.
     * Example command line:
       ```python3 convert_to_csv.py data/raw/extracted_jsonl/mbdump/area area```
  5. A CSV file named after its entity type will be generated in the ```data/output/``` folder. It can be used for further operations.

# 2: The data details:
- From the link provided below (it offers multiple versions; we usually choose the latest, such as 20240626-001001/), we can download archived files ending with the suffix ".tar.xz". Unzipping any of them reveals an "mbdump" folder containing a file named by its entity type, without an extension. This is the dump in "JSON Lines" format: each line represents one record.
- The name of the file is the type of the entity (the class of an instance) in the database. For example, there are types such as area, artist, event, instrument, label, place, etc.
- Every line has an attribute named "id", which is the primary key of the record. When converting to CSV, we rename it to "{entity_type}_id" to make clear which entity type we are working with.
- During the conversion process, for all IDs of the different entity types (genre_id, artist_id, area_id, etc.), we add the MusicBrainz reference in the format "https://musicbrainz.org/{entity_type}/{id}", which turns the id into a URI reference.
- If a record has been reconciled with a Wikidata link by MusicBrainz bots, it has an object under "relations" > "resources" > "url" whose value is the Wikidata link. When present, it is extracted into the CSV file.

# DEPRECATED
# 3: As experiment data sets:
- For experimental purposes, you should use only a small portion of each data dump.
- For example, to extract 3000 entries of an entity, open a terminal in the "mbdump" folder and execute:
  head -n 3000 "area" > "test_area"
  to get the first 3000 lines of the area data dump.
- The same procedure applies to all other data dumps.
## 1: Procedure

### Prerequisites:
- All IDs and Wikidata links are already reconciled during the conversion process, eliminating the need for OpenRefine.

### Steps:
1. **Navigate to the target folder:**
   - Go to the `linkedmusic-datalake/musicbrainz/csv` directory.

2. **Fetch the latest data:**
   - Run the following command to download the latest tar.xz files from the MusicBrainz public data dumps:
     ```bash
     python3 fetch.py
     ```
   - The files will be saved in the local `data/raw/` folder.

3. **Extract the required files:**
   - Unzip and extract the necessary JSON Lines (jsonl) files by running:
     ```bash
     python3 untar.py
     ```
   - The extracted files will be located in the `data/raw/extracted_jsonl/mbdump/` folder.

4. **Convert data to CSV:**
   - Execute the conversion script:
     ```bash
     python3 convert_to_csv.py
     ```
   - This will generate a CSV file, named according to its entity type, in the `data/output/` folder.

5. **Output:**
   - The generated CSV files are ready for further processing.
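The commit history ("read jsonl in chunks", "extract and convert in chunks") indicates that this PR reduces memory by never holding the whole dump in memory at once. A minimal sketch of that technique follows; the function name, parameters, and the `chunk_size` default are illustrative, not the actual `convert_to_csv.py` interface, and it assumes flat records with a known column list.

```python
import csv
import json

def convert_jsonl_to_csv(jsonl_path, csv_path, fieldnames, chunk_size=1000):
    """Convert a JSON Lines dump to CSV without loading the whole file.

    Hypothetical sketch: `fieldnames` and `chunk_size` are illustrative
    parameters, not the actual convert_to_csv.py signature.
    """
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        chunk = []
        for line in src:              # file iteration is lazy: one line at a time
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                writer.writerows(chunk)
                chunk = []            # release the processed rows
        if chunk:                     # flush the final partial chunk
            writer.writerows(chunk)
```

Peak memory is then bounded by `chunk_size` records rather than by the size of the dump.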
---
## 2: Data Details

### Overview:
- The data can be downloaded from the provided link, typically selecting the latest version (e.g., `20240626-001001/`).
- The downloaded `.tar.xz` files contain an `mbdump` folder with files named by entity type (e.g., `area`, `artist`, `event`, `instrument`, `label`, `place`). Each file is in "JSON Lines" format, with each line representing a single record.

### Important Notes:
- **ID Attributes:** Each record has an `id` attribute, which serves as the primary key. When converting to CSV, this `id` is renamed to `{entity_type}_id` for clarity.
- **URI Conversion:** All IDs (e.g., `genre_id`, `artist_id`, `area_id`) are converted to URIs in the format: `https://musicbrainz.org/{entity_type}/{id}`.
- **Wikidata Links:** If a record is linked to a Wikidata entry by MusicBrainz bots, the link can be found under `"relations" > "resources" > "url"`. These are also extracted into the CSV.
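The renaming and URI conversion described above can be sketched as a small helper; the function name is hypothetical and the record shown is made up for illustration.

```python
def record_ids_to_uris(record, entity_type):
    """Rename `id` to `{entity_type}_id` and expand it to a MusicBrainz URI.

    Illustrative sketch of the convention described above, not the
    project's actual implementation.
    """
    out = dict(record)                 # leave the input record untouched
    rec_id = out.pop("id")             # `id` is the primary key of every record
    out[f"{entity_type}_id"] = f"https://musicbrainz.org/{entity_type}/{rec_id}"
    return out

# Hypothetical record: renames "id" to "area_id" and produces
# "https://musicbrainz.org/area/89a675c2" as its value.
row = record_ids_to_uris({"id": "89a675c2", "name": "Europe"}, "area")
```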
---

## 3: Mapping

### Custom Predicate URLs:
- The following made-up predicate URLs are used in the data conversion:
  - `"packaging"`: `https://musicbrainz.org/packaging`
  - `"packaging-id"`: `https://musicbrainz.org/packaging`
  - `"media_pregap_id"`: `https://musicbrainz.org/pregap`
  - `"media_discs_id"`: `https://musicbrainz.org/disc`
||
--- | ||
|
||
## Deprecated: Experiment Data Sets | ||
|
||
### Experiment Guidelines: | ||
- For experimental purposes, it is recommended to use a small portion of each data dump. | ||
- Use the following bash command to extract the first 3000 entries of a specific entity (e.g., `area`): | ||
```bash | ||
head -n 3000 "area" > "test_area" | ||
``` | ||
- Apply the same process to other data dumps if needed. |
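The same sampling step can be applied to every dump in one pass. A possible loop, run from inside the `mbdump` folder; the entity-name list is illustrative, so adjust it to the dumps actually present:

```bash
# Take the first 3000 lines of each entity dump as a test sample.
# Entity names here are examples; missing files are skipped silently.
for entity in area artist event instrument label place; do
    if [ -f "$entity" ]; then
        head -n 3000 "$entity" > "test_$entity"
    fi
done
```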
**Review comment:** There's no need to loop here. Make `ignore_column` a set, and then just do `if key in ignore_column`; since it's a set, the lookup is O(1). Probably not a huge difference here, but it's good to build these sorts of optimizations into your normal repertoire.
**Reply:** A key can be longer than the ignored string and merely *contain* it; we don't want those keys either, so an exact membership test would miss them.
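The two positions in this exchange correspond to different checks: exact membership (where a set gives O(1) lookup) versus substring matching (where every ignored token must be scanned). A sketch of the distinction, with made-up key and token names:

```python
IGNORED = {"iso-3166-1-codes", "sort-name"}  # hypothetical ignore list

def ignored_exact(key):
    # Reviewer's suggestion: O(1) set membership, but it matches whole keys only.
    return key in IGNORED

def ignored_substring(key):
    # Author's case: keys like "area_iso-3166-1-codes" merely contain an
    # ignored token, so each token must be checked as a substring.
    return any(token in key for token in IGNORED)

ignored_exact("area_iso-3166-1-codes")      # False: not an exact match
ignored_substring("area_iso-3166-1-codes")  # True: contains an ignored token
```

If substring semantics are required, the set still avoids duplicate tokens, but the per-key cost is O(number of tokens), not O(1).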