
MusicBrainz: reduce memory used by processing in chunks #165

Merged: 67 commits, Oct 25, 2024

Changes from 21 commits

Commits (67)
aca1887
refactor: more ignored keys
Yueqiao12Zhang Aug 19, 2024
110263b
refactor: avoid new allocation
Yueqiao12Zhang Aug 19, 2024
f9d2091
doc: explain if statement
Yueqiao12Zhang Aug 19, 2024
f72eb1e
refactor: extract and convert in chunks
Yueqiao12Zhang Aug 19, 2024
48adb26
fix: write header
Yueqiao12Zhang Aug 19, 2024
87d9fe5
refactor: if not first level, then don't extract name
Yueqiao12Zhang Aug 19, 2024
6659d28
refactor: refresh values list
Yueqiao12Zhang Aug 19, 2024
e79de48
Update convert_to_csv.py
Yueqiao12Zhang Aug 20, 2024
6c565be
Revert "Update convert_to_csv.py"
Yueqiao12Zhang Aug 20, 2024
088c40d
refactor: read jsonl in chunks
Yueqiao12Zhang Aug 20, 2024
9ff292f
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 21, 2024
2cbea9b
test: add print tests
Yueqiao12Zhang Aug 21, 2024
7ae3914
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 21, 2024
04b3c41
fix: delete buggy code
Yueqiao12Zhang Aug 22, 2024
8203c71
style: delete print test statements
Yueqiao12Zhang Aug 22, 2024
6e51f96
fix: memory bug and header writing
Yueqiao12Zhang Aug 23, 2024
e23ed58
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 23, 2024
740d365
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 23, 2024
c0d1c78
Merge branch 'main' into musicbrainz-reduce-memory-used
Yueqiao12Zhang Aug 23, 2024
6c6f920
test: delete unused test files
candlecao Aug 23, 2024
99ff276
Update genre.csv
Yueqiao12Zhang Aug 23, 2024
e757df8
fix: header problem by making a temp csv
candlecao Aug 23, 2024
805c464
Update mapping.json
candlecao Aug 30, 2024
81f34e7
mapping: full columns mapping
candlecao Aug 30, 2024
92bfaf1
mapping: fill the empty mappings for updated CSVs
Yueqiao12Zhang Aug 30, 2024
581e86a
refactor: ignore iso codes since they are duplicated info
Yueqiao12Zhang Aug 30, 2024
6422088
refactor: recognize special math character
Yueqiao12Zhang Aug 30, 2024
89371de
doc: add made-up URLs
Yueqiao12Zhang Aug 30, 2024
e93147e
optimize: simplify input of convert_to_csv.py
Yueqiao12Zhang Aug 30, 2024
8dc8814
fix: syntax error
Yueqiao12Zhang Aug 30, 2024
397e95e
fix: entity type is the last part of file path
Yueqiao12Zhang Aug 30, 2024
0759de7
fix: output pathname
candlecao Aug 30, 2024
10f62c2
doc: update manual based on changes that removed commandline arguments
Yueqiao12Zhang Aug 30, 2024
738a7ed
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
candlecao Aug 30, 2024
ceffbf1
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
Yueqiao12Zhang Aug 30, 2024
66133e8
style: delete unused test code
Yueqiao12Zhang Aug 30, 2024
7cb07d0
doc: update docstring input to the correct number
Yueqiao12Zhang Aug 30, 2024
3f61a23
refactor: use writer instead of f.write()
Yueqiao12Zhang Aug 30, 2024
5a1783b
Revert "Update mapping.json"
Yueqiao12Zhang Aug 30, 2024
d47751e
doc: style update according to GPT
Yueqiao12Zhang Aug 30, 2024
3bad21e
doc: add specification for chunk_size
Yueqiao12Zhang Aug 30, 2024
6231da5
refactor: change list to set
Yueqiao12Zhang Sep 6, 2024
29dc943
fix: filename bug, readlines bug, opening wrong file bug
Yueqiao12Zhang Sep 6, 2024
28243a5
fix: auto escape by using \t not comma
Yueqiao12Zhang Sep 9, 2024
37201ab
feat: add quotechar to distinguish ", and , separator
Yueqiao12Zhang Sep 13, 2024
72b92e6
refactor: temp should be a tsv
Yueqiao12Zhang Sep 13, 2024
b64da8a
fix: correct a few temp.csv errors
Yueqiao12Zhang Sep 13, 2024
241dea3
fix: file suffix bug
candlecao Sep 20, 2024
956f924
Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…
candlecao Sep 20, 2024
3146891
mappings: update by Junjun
candlecao Sep 20, 2024
b58e618
Update .gitignore
candlecao Sep 20, 2024
d6e48bf
test: only include genre.csv since it's not updated often and is slow…
candlecao Sep 20, 2024
23a6861
test: remove old test files
candlecao Sep 20, 2024
222e715
Update .gitignore
candlecao Sep 20, 2024
dcaf73e
Update .gitignore
candlecao Sep 20, 2024
63c8914
Merge branch 'main' into musicbrainz-reduce-memory-used
candlecao Sep 20, 2024
8d4ef18
doc: update lock file based on pyproject.toml
candlecao Sep 27, 2024
ed26803
refactor: visualize file processing
candlecao Sep 27, 2024
622cefe
feat: add multi-file export option
candlecao Sep 27, 2024
8e99189
mapping: add reconciled columns
candlecao Sep 27, 2024
84e4b1b
doc: add other reconciled columns
candlecao Sep 27, 2024
798d448
doc: add specification about multi-file output
candlecao Sep 27, 2024
13e8601
update mapping.json
candlecao Oct 21, 2024
6c6401e
refactor: add type recognition to musicbrainz
Yueqiao12Zhang Oct 25, 2024
d380d46
ignore: add ignore separate ttls
candlecao Oct 25, 2024
c4be2a5
Merge branch 'main' into musicbrainz-reduce-memory-used
candlecao Oct 25, 2024
9fe4df5
test: remove test print
candlecao Oct 25, 2024
116 changes: 74 additions & 42 deletions musicbrainz/csv/convert_to_csv.py
@@ -33,9 +33,7 @@
header = [f"{entity_type}_id"]
values = []

# the file must be from MusicBrainz's JSON data dumps.
with open(inputpath, "r", encoding="utf-8") as f:
json_data = [json.loads(m) for m in f]
IGNORE_COLUMN = ["alias", "tags", "sort-name", "disambiguation", "annotation"]


def extract(data, value: dict, first_level: bool = True, key: str = ""):
@@ -52,9 +50,10 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
if key != "":
first_level = False

if "aliases" in key or "tags" in key or "sort-name" in key:
# ignore aliases, tags, and sort-name to make output simplier
return
for i in IGNORE_COLUMN:
Member:

There's no need to loop here. Make ignore_column a set, and then just do "if key in ignore_column". Since it's a set the lookup is O(1). Probably not a huge difference here, but it's good to build these sorts of optimizations into your normal repertoire.

Contributor Author:

The key can be longer and merely contain one of the ignored strings, and we still want to drop it; a whole-key membership test would miss those, so the substring check stays.

if i in key:
# ignore aliases, tags, sort-name, etc., to keep the output simpler
return
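As an aside on that exchange, here is a minimal sketch of the two approaches (the constant name and sample key are hypothetical, not from the PR): the set membership test the reviewer proposed is O(1) but only matches whole keys, while the substring scan the author defends also catches compound keys.

```python
IGNORED_TOKENS = {"alias", "tags", "sort-name", "disambiguation", "annotation"}

def should_ignore(key: str) -> bool:
    # Substring scan: catches compound keys such as "artist_tags_count",
    # which the plain membership test below would let through.
    return any(token in key for token in IGNORED_TOKENS)

assert should_ignore("artist_tags_count") is True
assert ("artist_tags_count" in IGNORED_TOKENS) is False  # membership alone misses it
```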

if isinstance(data, dict):
# the input JSON Lines format is lines of dictionaries, and the input data should be
@@ -79,7 +78,7 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):

# after extracting every entry of the current line, append it to the list and empty it.
values.append(copy.deepcopy(value))
value = {}
value.clear()
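A small aside on the `value = {}` to `value.clear()` change above (an earlier commit message calls it "avoid new allocation"): `clear()` reuses the existing dict, and, unlike rebinding, it also empties the dict seen by any caller holding the same reference. A toy illustration with hypothetical names:

```python
def reset_by_rebinding(d: dict) -> None:
    d = {}  # rebinds the local name only; the caller's dict is untouched

def reset_in_place(d: dict) -> None:
    d.clear()  # mutates the shared dict, so stale entries cannot leak

row = {"artist_id": "a1"}
reset_by_rebinding(row)
print(row)  # {'artist_id': 'a1'} -- the old data survives
reset_in_place(row)
print(row)  # {}
```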

else:
# if this dictionary is nested, then we do not extract all info,
@@ -101,12 +100,12 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
key + "_id",
)

if k == "name":
extract(data["name"], value, first_level, key + "_name")
# if k == "name":
# extract(data["name"], value, first_level, key + "_name")

if isinstance(data[k], dict) or isinstance(data[k], list):
# if there is still a nested instance, extract further
if key.split('_')[-1] not in [
if key.split("_")[-1] not in [
"area",
"artist",
"event",
Expand All @@ -115,6 +114,7 @@ def extract(data, value: dict, first_level: bool = True, key: str = ""):
"recording",
"genres",
]:
# avoid extracting duplicate data
extract(data[k], value, first_level, key + "_" + k)

elif isinstance(data, list):
@@ -152,7 +152,7 @@
return


def convert_dict_to_csv(dictionary_list: list, filename: str) -> None:
def convert_dict_to_csv(dictionary_list: list) -> None:
"""
(list) -> None
Writes a list of dictionaries to a temporary CSV file.
@@ -163,40 +163,72 @@ def convert_dict_to_csv(dictionary_list: list) -> None:
dictionary_list: the list of dictionaries containing all the data
"""
with open(filename, mode="w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(header)
# Find the maximum length of lists in the dictionary

for dictionary in dictionary_list:
max_length = max(
len(v) if isinstance(v, list) else 1 for v in dictionary.values()
)

for i in range(max_length):
row = [dictionary[f"{entity_type}_id"]]
for key in header:
if key == f"{entity_type}_id":
continue

if key in dictionary:
if isinstance(dictionary[key], list):
# Append the i-th element of the list,
# or an empty string if index is out of range
row.append(
(dictionary[key])[i] if i < len(dictionary[key]) else ""
)
else:
# Append the single value
# (for non-list entries, only on the first iteration)
row.append(dictionary[key] if i == 0 else "")

# Find the maximum length of lists in the dictionary
for dictionary in dictionary_list:
max_length = max(
len(v) if isinstance(v, list) else 1 for v in dictionary.values()
)

for i in range(max_length):
row = [dictionary[f"{entity_type}_id"]]
for key in header:
if key == f"{entity_type}_id":
continue

if key in dictionary:
if isinstance(dictionary[key], list):
# Append the i-th element of the list,
# or an empty string if index is out of range
row.append(
(dictionary[key])[i] if i < len(dictionary[key]) else ""
)
else:
row.append("")
# Append the single value
# (for non-list entries, only on the first iteration)
row.append(dictionary[key] if i == 0 else "")
else:
row.append("")

with open(
"temp.csv", mode="a", newline="", encoding="utf-8"
) as csv_records:
writer_records = csv.writer(csv_records)
writer_records.writerow(row)

writer.writerow(row)
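To make the fan-out logic above concrete, a standalone sketch (the sample record and header are made up): list-valued fields expand into one row per element, the id is repeated on every row, and scalar fields appear only on the first row.

```python
record = {"artist_id": "a1", "name": "X", "genres": ["rock", "pop"]}
header = ["artist_id", "name", "genres"]

max_length = max(len(v) if isinstance(v, list) else 1 for v in record.values())
for i in range(max_length):
    row = [record["artist_id"]]  # the id is repeated on every fan-out row
    for key in header:
        if key == "artist_id":
            continue
        v = record.get(key, "")
        if isinstance(v, list):
            # i-th element, or an empty string once this list is exhausted
            row.append(v[i] if i < len(v) else "")
        else:
            # scalar values are written only on the first row
            row.append(v if i == 0 else "")
    print(row)
# ['a1', 'X', 'rock']
# ['a1', '', 'pop']
```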

CHUNK_SIZE = 4096
Contributor:

Could this go at the top with the other configuration constant?

Member:

Can you also add a comment on why you chose 4096?

Contributor Author:

I don't know exactly; I think this could be any number that isn't too large, but GPT and Stack Overflow examples all use something like 4096 or 8192, so I decided to use 4096.

Member:

So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192." In general, always comment (justify) why you chose a value or a method as opposed to some other option.
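One way to act on that feedback, as a sketch (the placement and comment wording are assumed, not taken from the PR):

```python
# --- configuration constants (module top) ---

# Records per batch when streaming the JSON Lines dump. The exact value is
# not critical; 4096 follows the sizes commonly seen in ChatGPT and
# Stack Overflow chunking examples (4096 or 8192) and keeps memory bounded.
CHUNK_SIZE = 4096
```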


if __name__ == "__main__":
extract(json_data, {})

convert_dict_to_csv(values, outputpath)
# the file must be from MusicBrainz's JSON data dumps.
chunk = []

with open(inputpath, "r", encoding="utf-8") as f:
for line in f:
line_data = json.loads(line) # Parse each line as a JSON object
chunk.append(line_data) # Add the JSON object to the current chunk

# When the chunk reaches the desired size, process it
if len(chunk) == CHUNK_SIZE:
extract(chunk, {})
chunk.clear() # Reset the chunk
convert_dict_to_csv(values)

values.clear()

# Process any remaining data in the last chunk
if chunk:
extract(chunk, {})
chunk.clear()
convert_dict_to_csv(values)
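The read loop above batches lines by hand; an equivalent, slightly more compact pattern pulls at most `CHUNK_SIZE` lines at a time with `itertools.islice` (a sketch, not the PR's code):

```python
import itertools
import json

def iter_chunks(path: str, chunk_size: int):
    """Yield lists of parsed JSON objects, at most chunk_size per list."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # islice consumes the next chunk_size lines without loading the whole file
            chunk = [json.loads(line) for line in itertools.islice(f, chunk_size)]
            if not chunk:
                return
            yield chunk
```

The main loop would then reduce to `for chunk in iter_chunks(inputpath, CHUNK_SIZE): ...`, with no trailing-remainder special case.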

with open(outputpath, "w", encoding="utf-8") as f:
with open("temp.csv", "r", encoding="utf-8") as f_temp:
f.write(",".join(header))
f.write("\n")

for line in f_temp:
f.write(line)

os.remove("temp.csv")
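One possible tightening of the header-then-append step above, sketched under the same temp-file scheme (filenames hypothetical): writing the header through `csv.writer` keeps its quoting consistent with the buffered rows, and `shutil.copyfileobj` streams the temp file without reading it whole.

```python
import csv
import os
import shutil

def finalize_csv(header: list, temp_path: str = "temp.csv", out_path: str = "out.csv") -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        # Same writer settings as the row writer, so the header is escaped identically.
        csv.writer(f_out).writerow(header)
        with open(temp_path, "r", encoding="utf-8") as f_temp:
            shutil.copyfileobj(f_temp, f_out)  # stream rows; no full read into memory
    os.remove(temp_path)
```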