Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Musicbrainz reduce memory used by processing by chunks #165

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

Yueqiao12Zhang
Copy link
Contributor

@Yueqiao12Zhang Yueqiao12Zhang commented Aug 22, 2024

The bug encountered: I moved the code for reading json to other places, but forgot to delete the original part. It causes python to read the JSON twice.
After this is fixed, I read the file metadata and parse it into chunks of 4096 records. This solves the problem.

@Yueqiao12Zhang Yueqiao12Zhang self-assigned this Aug 22, 2024
@Yueqiao12Zhang Yueqiao12Zhang linked an issue Aug 22, 2024 that may be closed by this pull request
@dchiller
Copy link
Contributor

dchiller commented Aug 22, 2024

Can you explain in the PR description how this solves the issue?

Or this commit (fix: delete buggy code) would be a great place to have a message about what the buggy code was!

@ahankinson
Copy link
Member

You could also try using a more efficient JSON library like ujson or orjson. The built-in json library is well known to be fine for quick tasks but mot great for speed or memory use.

musicbrainz/csv/convert_to_csv.py Outdated Show resolved Hide resolved

CHUNK_SIZE = 4096
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this go at the top with the other configuration constant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a comment on why you chose 4096?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a comment on why you chose 4096?

I don't know exactly, I think this could be any number not too large, but GPT and stack overflow all use something like 4096 and 8192, so I decided to use 4096.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a comment on why you chose 4096?

I don't know exactly, I think this could be any number not too large, but GPT and stack overflow all use something like 4096 and 8192, so I decided to use 4096.

So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed some other option.

musicbrainz/csv/convert_to_csv.py Outdated Show resolved Hide resolved
musicbrainz/csv/convert_to_csv.py Outdated Show resolved Hide resolved
change the type of recording from Q49017950 to Q482994
@dchiller
Copy link
Contributor

@candlecao You should commit 805c464 to a different branch...

@Yueqiao12Zhang
Copy link
Contributor Author

@candlecao You should commit 805c464 to a different branch...

fixed!

if "aliases" in key or "tags" in key or "sort-name" in key:
# ignore aliases, tags, and sort-name to make output simplier
return
for i in IGNORE_COLUMN:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no need to loop here. Make ignore_column a set, and then just do “if key in ignore_column”. Since it’s a set the lookup is O(1).

probably not a huge difference here but it’s good to build these sorts of optimizations into your normal repertoire.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key can be longer and only contains the ignored string. We don't want it.

@@ -195,7 +195,7 @@ def convert_dict_to_csv(dictionary_list: list) -> None:
with open(
"temp.csv", mode="a", newline="", encoding="utf-8"
) as csv_records:
writer_records = csv.writer(csv_records)
writer_records = csv.writer(csv_records, delimiter="\t")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer comma-separated now... it's a different format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Musicbrainz conversion: cannot append header after chunk reading Cannot open the file: MusicBrainz releases
5 participants