MusicBrainz: reduce memory used by processing in chunks #165
base: main
Conversation
This reverts commit e79de48.
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
Can you explain in the PR description how this solves the issue? Or this commit (fix: delete buggy code) would be a great place to have a message about what the buggy code was!
You could also try using a more efficient JSON library like ujson or orjson. The built-in json library is well known to be fine for quick tasks but not great for speed or memory use.
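A minimal sketch of the reviewer's suggestion, assuming orjson is available as an optional dependency: expose `loads`/`dumps` wrappers that use orjson when installed and fall back to the stdlib otherwise (the wrapper names here are illustrative, not part of the PR).

```python
import json

# Prefer orjson for the hot path; fall back to the stdlib json
# module if it is not installed.
try:
    import orjson

    def loads(data):
        return orjson.loads(data)

    def dumps(obj):
        # orjson.dumps returns bytes, unlike json.dumps
        return orjson.dumps(obj).decode("utf-8")
except ImportError:
    loads = json.loads

    def dumps(obj):
        # compact separators roughly match orjson's output
        return json.dumps(obj, separators=(",", ":"))

record = {"id": 1, "name": "recording"}
assert loads(dumps(record)) == record
```

The rest of the script can then call `loads`/`dumps` without caring which backend is active.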
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
musicbrainz/csv/convert_to_csv.py
Outdated
CHUNK_SIZE = 4096
Could this go at the top with the other configuration constant?
Can you also add a comment on why you chose 4096?
I don't know exactly; I think it could be any number that isn't too large, but ChatGPT and Stack Overflow examples all use something like 4096 or 8192, so I decided to use 4096.
So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed to some other option.
change the type of recording from Q49017950 to Q482994
@candlecao You should commit 805c464 to a different branch...
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
This reverts commit 805c464.
fixed!
if "aliases" in key or "tags" in key or "sort-name" in key:
    # ignore aliases, tags, and sort-name to make output simpler
    return
for i in IGNORE_COLUMN:
There’s no need to loop here. Make ignore_column a set, and then just do “if key in ignore_column”. Since it’s a set the lookup is O(1).
probably not a huge difference here but it’s good to build these sorts of optimizations into your normal repertoire.
The key can be longer and merely contain the ignored string as a substring, so an exact set lookup would miss it. We don't want that.
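A small sketch of the distinction being discussed, using an illustrative `IGNORE_SUBSTRINGS` tuple (not the PR's actual constant): a set membership test only matches exact keys, while the PR needs substring matching, so a short `any(...)` loop is the right shape here.

```python
IGNORE_SUBSTRINGS = ("aliases", "tags", "sort-name")

def is_ignored(key: str) -> bool:
    # Substring match: a key like "artist-aliases" must also be
    # skipped, so an O(1) set lookup on the full key is not enough.
    return any(part in key for part in IGNORE_SUBSTRINGS)

assert is_ignored("artist-aliases")
assert is_ignored("tags")
assert not is_ignored("title")
```

If the ignored names were ever matched exactly, converting the tuple to a set and testing `key in IGNORE_SUBSTRINGS` would be the faster option the reviewer describes.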
musicbrainz/csv/convert_to_csv.py
Outdated
@@ -195,7 +195,7 @@ def convert_dict_to_csv(dictionary_list: list) -> None:
     with open(
         "temp.csv", mode="a", newline="", encoding="utf-8"
     ) as csv_records:
-        writer_records = csv.writer(csv_records)
+        writer_records = csv.writer(csv_records, delimiter="\t")
This is no longer comma-separated now... it's a different format.
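To illustrate the reviewer's point: with a tab delimiter the `csv` module no longer quotes commas, so the output is tab-separated values, not CSV (the in-memory buffer below is just for demonstration).

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerow(["a", "b,c"])
# The comma is no longer special, so "b,c" is written unquoted;
# the field separator is now a tab.
assert buf.getvalue() == "a\tb,c\r\n"
```

If the tab delimiter is kept, renaming the output from `temp.csv` to something like `temp.tsv` would make the format explicit.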
The bug encountered: I moved the code for reading the JSON to another place but forgot to delete the original part, which caused Python to read the JSON twice.
After fixing that, I read the file metadata and parse it into chunks of 4096 records, which solves the memory problem.
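The chunked approach described above can be sketched as follows, assuming the dump is in JSON Lines format (one JSON object per line); `iter_chunks` is a hypothetical helper, not the PR's actual function.

```python
import json
from itertools import islice

CHUNK_SIZE = 4096  # any moderate value works; common examples use 4096 or 8192

def iter_chunks(path):
    """Yield lists of at most CHUNK_SIZE parsed records from a
    JSON Lines dump, so only one chunk is held in memory at a time."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = [json.loads(line) for line in islice(f, CHUNK_SIZE)]
            if not chunk:
                break
            yield chunk
```

Each chunk can then be converted and appended to the output file before the next one is read, keeping peak memory proportional to `CHUNK_SIZE` rather than the full dump.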