Musicbrainz reduce memory used by processing by chunks #165

Open
wants to merge 47 commits into base: main

Commits
aca1887
refactor: more ignored keys
Yueqiao12Zhang Aug 19, 2024
110263b
refactor: avoid new allocation
Yueqiao12Zhang Aug 19, 2024
f9d2091
doc: explain if statement
Yueqiao12Zhang Aug 19, 2024
f72eb1e
refactor: extract and convert in chunks
Yueqiao12Zhang Aug 19, 2024
48adb26
fix: write header
Yueqiao12Zhang Aug 19, 2024
87d9fe5
refactor: if not first level, then don't extract name
Yueqiao12Zhang Aug 19, 2024
6659d28
refactor: refresh values list
Yueqiao12Zhang Aug 19, 2024
e79de48
Update convert_to_csv.py
Yueqiao12Zhang Aug 20, 2024
6c565be
Revert "Update convert_to_csv.py"
Yueqiao12Zhang Aug 20, 2024
088c40d
refactor: read jsonl in chunks
Yueqiao12Zhang Aug 20, 2024
9ff292f
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 21, 2024
2cbea9b
test: add print tests
Yueqiao12Zhang Aug 21, 2024
7ae3914
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 21, 2024
04b3c41
fix: delete buggy code
Yueqiao12Zhang Aug 22, 2024
8203c71
style: delete print test statements
Yueqiao12Zhang Aug 22, 2024
6e51f96
fix: memory bug and header writting
Yueqiao12Zhang Aug 23, 2024
e23ed58
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 23, 2024
740d365
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 23, 2024
c0d1c78
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 23, 2024
6c6f920
test: delete unused test files
candlecao Aug 23, 2024
99ff276
Update genre.csv
Yueqiao12Zhang Aug 23, 2024
e757df8
fix: header problem by making a temp csv
candlecao Aug 23, 2024
805c464
Update mapping.json
candlecao Aug 30, 2024
81f34e7
mapping: full columns mapping
candlecao Aug 30, 2024
92bfaf1
mapping: fill the empty mappings for updated CSVs
Yueqiao12Zhang Aug 30, 2024
581e86a
refactor: ignore iso codes since they are duplicated info
Yueqiao12Zhang Aug 30, 2024
6422088
refactor: recognize special math character
Yueqiao12Zhang Aug 30, 2024
89371de
doc: add made-up URLs
Yueqiao12Zhang Aug 30, 2024
e93147e
optimize: simplify input of convert_to_csv.py
Yueqiao12Zhang Aug 30, 2024
8dc8814
fix: syntax error
Yueqiao12Zhang Aug 30, 2024
397e95e
fix: entity type is the last part of file path
Yueqiao12Zhang Aug 30, 2024
0759de7
fix: output pathname
candlecao Aug 30, 2024
10f62c2
doc: update manual based on changes that removed commandline arguments
Yueqiao12Zhang Aug 30, 2024
738a7ed
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
candlecao Aug 30, 2024
ceffbf1
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 30, 2024
66133e8
style: delete unused test code
Yueqiao12Zhang Aug 30, 2024
7cb07d0
doc: update docstring input to the correct number
Yueqiao12Zhang Aug 30, 2024
3f61a23
refactor: use writer instead of f.write()
Yueqiao12Zhang Aug 30, 2024
5a1783b
Revert "Update mapping.json"
Yueqiao12Zhang Aug 30, 2024
d47751e
doc: style update according to GPT
Yueqiao12Zhang Aug 30, 2024
3bad21e
doc: add specification for chunk_size
Yueqiao12Zhang Aug 30, 2024
6231da5
refactor: change list to set
Yueqiao12Zhang Sep 6, 2024
29dc943
fix: filename bug, readlines bug, opening wrong file bug
Yueqiao12Zhang Sep 6, 2024
28243a5
fix: auto escape by using \t not comma
Yueqiao12Zhang Sep 9, 2024
37201ab
feat: add quotechar to distinguish ", and , separator
Yueqiao12Zhang Sep 13, 2024
72b92e6
refactor: temp should be a tsv
Yueqiao12Zhang Sep 13, 2024
b64da8a
fix: correct a few temp.csv error
Yueqiao12Zhang Sep 13, 2024
116 changes: 74 additions & 42 deletions musicbrainz/csv/convert_to_csv.py
@@ -33,9 +33,7 @@
header = [f"{entity_type}_id"]
values = []

# the file must be from MusicBrainz's JSON data dumps.
with open(inputpath, "r", encoding="utf-8") as f:
json_data = [json.loads(m) for m in f]
IGNORE_COLUMN = ["alias", "tags", "sort-name", "disambiguation", "annotation"]


def extract(data, value: dict, first_level: bool = True, key: str = ""):
@@ -52,9 +50,10 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
if key != "":
first_level = False

if "aliases" in key or "tags" in key or "sort-name" in key:
            # ignore aliases, tags, and sort-name to make output simpler
return
for i in IGNORE_COLUMN:
Member:
There's no need to loop here. Make IGNORE_COLUMN a set, and then just do "if key in ignore_column". Since it's a set, the lookup is O(1).

Probably not a huge difference here, but it's good to build these sorts of optimizations into your normal repertoire.

Contributor Author:
The key can be longer and merely contain the ignored string as a substring, so an exact set-membership check would miss it.

if i in key:
            # ignore keys listed in IGNORE_COLUMN to keep the output simpler
return
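The distinction argued in the thread above can be sketched side by side. This is an illustrative snippet, not the PR's code; the helper names `is_ignored_exact` and `is_ignored_substring` are made up, while the column names come from the diff:

```python
# Two lookup strategies for deciding whether a flattened key should be skipped.
IGNORE_COLUMN = {"alias", "tags", "sort-name", "disambiguation", "annotation"}

def is_ignored_exact(key: str) -> bool:
    # O(1) set membership: matches only when the key IS an ignored word.
    return key in IGNORE_COLUMN

def is_ignored_substring(key: str) -> bool:
    # Substring scan: also catches compound keys such as "artist_sort-name".
    return any(ignored in key for ignored in IGNORE_COLUMN)

print(is_ignored_exact("artist_sort-name"))      # False
print(is_ignored_substring("artist_sort-name"))  # True
```

This is why the loop survives the review suggestion: the flattened keys are compound (`parent_child` paths), so a plain set lookup on the whole key would miss most of the entries the author wants to drop.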

if isinstance(data, dict):
# the input JSON Lines format is lines of dictionaries, and the input data should be
@@ -79,7 +78,7 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):

# after extracting every entry of the current line, append it to the list and empty it.
values.append(copy.deepcopy(value))
value = {}
value.clear()

else:
# if this dictionary is nested, then we do not extract all info,
@@ -101,12 +100,12 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
key + "_id",
)

if k == "name":
extract(data["name"], value, first_level, key + "_name")
# if k == "name":
# extract(data["name"], value, first_level, key + "_name")

if isinstance(data[k], dict) or isinstance(data[k], list):
# if there is still a nested instance, extract further
if key.split('_')[-1] not in [
if key.split("_")[-1] not in [
"area",
"artist",
"event",
Expand All @@ -115,6 +114,7 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
"recording",
"genres",
]:
# avoid extracting duplicate data
extract(data[k], value, first_level, key + "_" + k)

elif isinstance(data, list):
@@ -152,7 +152,7 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
return


def convert_dict_to_csv(dictionary_list: list, filename: str) -> None:
def convert_dict_to_csv(dictionary_list: list) -> None:
"""
(list, str) -> None
Writes a list of dictionaries into the given file.
@@ -163,40 +163,72 @@ def convert_dict_to_csv(dictionary_list: list, filename: str) -> None:
dictionary_list: the list of dictionary that contains all the data
filename: the destination filename
"""
with open(filename, mode="w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(header)
# Find the maximum length of lists in the dictionary

for dictionary in dictionary_list:
max_length = max(
len(v) if isinstance(v, list) else 1 for v in dictionary.values()
)

for i in range(max_length):
row = [dictionary[f"{entity_type}_id"]]
for key in header:
if key == f"{entity_type}_id":
continue

if key in dictionary:
if isinstance(dictionary[key], list):
# Append the i-th element of the list,
# or an empty string if index is out of range
row.append(
(dictionary[key])[i] if i < len(dictionary[key]) else ""
)
else:
# Append the single value
# (for non-list entries, only on the first iteration)
row.append(dictionary[key] if i == 0 else "")

# Find the maximum length of lists in the dictionary
for dictionary in dictionary_list:
max_length = max(
len(v) if isinstance(v, list) else 1 for v in dictionary.values()
)

for i in range(max_length):
row = [dictionary[f"{entity_type}_id"]]
for key in header:
if key == f"{entity_type}_id":
continue

if key in dictionary:
if isinstance(dictionary[key], list):
# Append the i-th element of the list,
# or an empty string if index is out of range
row.append(
(dictionary[key])[i] if i < len(dictionary[key]) else ""
)
else:
row.append("")
# Append the single value
# (for non-list entries, only on the first iteration)
row.append(dictionary[key] if i == 0 else "")
else:
row.append("")

with open(
"temp.csv", mode="a", newline="", encoding="utf-8"
) as csv_records:
writer_records = csv.writer(csv_records)
writer_records.writerow(row)

writer.writerow(row)

CHUNK_SIZE = 4096
Contributor:
Could this go at the top with the other configuration constant?

Member:
Can you also add a comment on why you chose 4096?

Contributor Author:
I don't know exactly; I think this could be any number that isn't too large, but GPT and Stack Overflow examples all use something like 4096 or 8192, so I decided to use 4096.

Member:
So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed to some other option.
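One way to address both review comments above is to hoist the constant next to the other configuration and justify the value inline. This is a suggestion sketch, not the PR's final code:

```python
# Number of JSONL lines buffered per processing pass. 4096 is a
# conventional buffer size (ChatGPT and Stack Overflow examples typically
# use 4096 or 8192); any moderate value works, trading peak memory use
# against per-chunk function-call overhead.
CHUNK_SIZE = 4096
```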


if __name__ == "__main__":
extract(json_data, {})

convert_dict_to_csv(values, outputpath)
# the file must be from MusicBrainz's JSON data dumps.
chunk = []

with open(inputpath, "r", encoding="utf-8") as f:
for line in f:
line_data = json.loads(line) # Parse each line as a JSON object
chunk.append(line_data) # Add the JSON object to the current chunk

# When the chunk reaches the desired size, process it
if len(chunk) == CHUNK_SIZE:
extract(chunk, {})
chunk.clear() # Reset the chunk
convert_dict_to_csv(values)

values.clear()

# Process any remaining data in the last chunk
if chunk:
extract(chunk, {})
chunk.clear()
convert_dict_to_csv(values)
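The read-parse-flush loop above can be expressed as a standalone pattern. A minimal sketch, assuming nothing beyond the standard library; `process_in_chunks` and the `StringIO` stand-in for the JSONL dump are illustrative, not part of the PR:

```python
import json
from io import StringIO

def process_in_chunks(f, chunk_size, handle):
    # Buffer parsed JSONL lines; hand each full chunk to `handle`
    # so at most `chunk_size` records are held in memory at once.
    chunk = []
    for line in f:
        chunk.append(json.loads(line))
        if len(chunk) == chunk_size:
            handle(chunk)
            chunk = []  # fresh list: don't mutate what was handed off
    if chunk:
        handle(chunk)  # flush the final partial chunk

# Usage with an in-memory stand-in for the JSONL dump:
stream = StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
chunks = []
process_in_chunks(stream, 2, chunks.append)
print(chunks)  # [[{'id': 1}, {'id': 2}], [{'id': 3}]]
```

Note the rebind (`chunk = []`) rather than `chunk.clear()` here: the handler keeps a reference to the list, so clearing it in place would empty the data it just received.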

with open(outputpath, "w", encoding="utf-8") as f:
with open("temp.csv", "r", encoding="utf-8") as f_temp:
f.write(",".join(header))
f.write("\n")

for line in f_temp:
f.write(line)

os.remove("temp.csv")
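The header-then-temp-file dance in the final block could also be packaged as one helper. A sketch under the same assumptions as the PR (rows accumulate in a temp file before the header is complete); `finalize_csv` and the sample column names are made up for illustration:

```python
import csv
import os
import shutil

def finalize_csv(outputpath, header, temp_path="temp.csv"):
    # Rows were appended to the temp file chunk by chunk; now that the
    # full header is known, write it first and stream the rows after it.
    with open(outputpath, "w", newline="", encoding="utf-8") as out:
        csv.writer(out).writerow(header)
        with open(temp_path, "r", encoding="utf-8") as f_temp:
            shutil.copyfileobj(f_temp, out)
    os.remove(temp_path)

# Usage: simulate one chunk of rows already accumulated in the temp file.
with open("temp.csv", "a", newline="", encoding="utf-8") as t:
    csv.writer(t).writerows([["1", "x"], ["2", "y"]])
finalize_csv("out.csv", ["artist_id", "name"])
with open("out.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['artist_id,name', '1,x', '2,y']
os.remove("out.csv")
```

`shutil.copyfileobj` streams the temp file in buffered pieces, so the fix keeps the same bounded-memory property the PR is after.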