MusicBrainz: reduce memory used by processing in chunks #165
base: main
Conversation
This reverts commit e79de48.
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
Can you explain in the PR description how this solves the issue? Or this commit (fix: delete buggy code) would be a great place to have a message about what the buggy code was!
You could also try using a more efficient JSON library like ujson or orjson. The built-in json library is well known to be fine for quick tasks but not great for speed or memory use.
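A minimal sketch of the reviewer's suggestion, assuming orjson is available as an optional dependency: expose `loads`/`dumps` wrappers that use orjson when installed and fall back to the stdlib otherwise (the wrapper names here are illustrative, not part of the PR).

```python
import json

# Prefer orjson for the hot path; fall back to the stdlib json
# module if it is not installed.
try:
    import orjson

    def loads(data):
        return orjson.loads(data)

    def dumps(obj):
        # orjson.dumps returns bytes, unlike json.dumps
        return orjson.dumps(obj).decode("utf-8")
except ImportError:
    loads = json.loads

    def dumps(obj):
        # compact separators roughly match orjson's output
        return json.dumps(obj, separators=(",", ":"))

record = {"id": 1, "name": "recording"}
assert loads(dumps(record)) == record
```

The rest of the script can then call `loads`/`dumps` without caring which backend is active.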
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
musicbrainz/csv/convert_to_csv.py
Outdated
CHUNK_SIZE = 4096
Could this go at the top with the other configuration constant?
Can you also add a comment on why you chose 4096?
I don't know exactly; I think it could be any number that isn't too large, but ChatGPT and Stack Overflow examples all use something like 4096 or 8192, so I decided to use 4096.
So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed to some other option.
change the type of recording from Q49017950 to Q482994
@candlecao You should commit 805c464 to a different branch...
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
This reverts commit 805c464.
fixed!
if "aliases" in key or "tags" in key or "sort-name" in key:
    # ignore aliases, tags, and sort-name to make output simpler
    return
for i in IGNORE_COLUMN:
There’s no need to loop here. Make ignore_column a set, and then just do “if key in ignore_column”. Since it’s a set the lookup is O(1).
probably not a huge difference here but it’s good to build these sorts of optimizations into your normal repertoire.
The key can be longer and merely contain the ignored string as a substring, so an exact set lookup would miss it. We don't want that.
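A small sketch of the distinction being discussed, using an illustrative `IGNORE_SUBSTRINGS` tuple (not the PR's actual constant): a set membership test only matches exact keys, while the PR needs substring matching, so a short `any(...)` loop is the right shape here.

```python
IGNORE_SUBSTRINGS = ("aliases", "tags", "sort-name")

def is_ignored(key: str) -> bool:
    # Substring match: a key like "artist-aliases" must also be
    # skipped, so an O(1) set lookup on the full key is not enough.
    return any(part in key for part in IGNORE_SUBSTRINGS)

assert is_ignored("artist-aliases")
assert is_ignored("tags")
assert not is_ignored("title")
```

If the ignored names were ever matched exactly, converting the tuple to a set and testing `key in IGNORE_SUBSTRINGS` would be the faster option the reviewer describes.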
musicbrainz/csv/convert_to_csv.py
Outdated
@@ -195,7 +195,7 @@ def convert_dict_to_csv(dictionary_list: list) -> None:
     with open(
         "temp.csv", mode="a", newline="", encoding="utf-8"
     ) as csv_records:
-        writer_records = csv.writer(csv_records)
+        writer_records = csv.writer(csv_records, delimiter="\t")
This is no longer comma-separated now... it's a different format.
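To illustrate the reviewer's point: with a tab delimiter the `csv` module no longer quotes commas, so the output is tab-separated values, not CSV (the in-memory buffer below is just for demonstration).

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerow(["a", "b,c"])
# The comma is no longer special, so "b,c" is written unquoted;
# the field separator is now a tab.
assert buf.getvalue() == "a\tb,c\r\n"
```

If the tab delimiter is kept, renaming the output from `temp.csv` to something like `temp.tsv` would make the format explicit.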
The bug encountered: I moved the code for reading the JSON to another place but forgot to delete the original part, which caused Python to read the JSON twice.
After fixing that, I read the file metadata and parse it into chunks of 4096 records, which solves the memory problem.
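The chunked approach described above can be sketched as follows, assuming the dump is in JSON Lines format (one JSON object per line); `iter_chunks` is a hypothetical helper, not the PR's actual function.

```python
import json
from itertools import islice

CHUNK_SIZE = 4096  # any moderate value works; common examples use 4096 or 8192

def iter_chunks(path):
    """Yield lists of at most CHUNK_SIZE parsed records from a
    JSON Lines dump, so only one chunk is held in memory at a time."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = [json.loads(line) for line in islice(f, CHUNK_SIZE)]
            if not chunk:
                break
            yield chunk
```

Each chunk can then be converted and appended to the output file before the next one is read, keeping peak memory proportional to `CHUNK_SIZE` rather than the full dump.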