
Compact listen dumps in HDFS#3211

Merged
amCap1712 merged 2 commits into master from dumps-compaction
Mar 4, 2025

Conversation

@amCap1712
Member

Full dumps, when imported into Spark, are partitioned by listened_at's year and month for storage in HDFS. Incremental dumps are imported every day and appended to a single incremental.parquet. Deleted listens are similarly stored in deleted-listens.parquet and deleted-user-listen-history.parquet. At run time, the full dump partitions are read and concatenated with the incremental dumps, and the deleted listens are filtered out of the union. When a new full dump is imported, it contains all listens up to that time with all deleted listens already removed, so the additional parquet files for incremental and deleted listens are deleted. This currently happens on a biweekly schedule.
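The read-time assembly described above can be sketched as a toy in-memory model (function and field names are hypothetical; the real pipeline reads parquet files from HDFS with Spark):

```python
# Toy model of read-time assembly: union full-dump partitions with
# incremental listens, then filter out deleted listens. Hypothetical
# names; stands in for Spark reading parquet files from HDFS.

def assemble_listens(full_dump_partitions, incremental, deleted_ids):
    """Concatenate all (year, month) partitions with the incremental
    listens and drop any listen whose id is in the deleted set."""
    all_listens = []
    for partition in full_dump_partitions:  # one list per (year, month)
        all_listens.extend(partition)
    all_listens.extend(incremental)         # rows from incremental.parquet
    return [l for l in all_listens if l["id"] not in deleted_ids]

# Example: two monthly partitions, one incremental batch, one deletion.
partitions = [
    [{"id": 1, "track": "a"}, {"id": 2, "track": "b"}],  # e.g. 2024-01
    [{"id": 3, "track": "c"}],                            # e.g. 2024-02
]
incremental = [{"id": 4, "track": "d"}]
deleted = {2}

result = assemble_listens(partitions, incremental, deleted)
print([l["id"] for l in result])  # → [1, 3, 4]
```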

Full dumps are cumbersome to produce, so we want to reduce our dependence on them inside ListenBrainz and Spark. After an initial full dump import to seed the cluster, we intend to drop the biweekly full dump imports and rely on incremental dumps continuously. Hence, we need to rethink how incremental listens are stored in the Spark cluster and how deletions are implemented.

The solution I have come up with is to replace the full dump import step with a compaction step, which reads all the partitioned base listens, combines them with the incremental listens, removes the deleted listens, and writes the result back to HDFS in the same partitioned format. Everything else remains the same.
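A minimal sketch of such a compaction step, again as a toy in-memory model under assumed names (the real step would rewrite the partitioned parquet files via Spark): it folds the incremental listens into the (year, month) partitions keyed by listened_at and drops deleted listens, after which the incremental and deleted-listens files can be truncated.

```python
# Toy model of the compaction step (hypothetical names): merge incremental
# listens into the year/month partitions and filter out deletions.
from collections import defaultdict
from datetime import datetime, timezone

def compact(partitions, incremental, deleted_ids):
    """Return new (year, month) -> listens partitions with incremental
    listens merged in and deleted listens removed."""
    merged = defaultdict(list)
    for key, listens in partitions.items():
        merged[key].extend(listens)
    for listen in incremental:
        # Route each incremental listen to the partition for its
        # listened_at year and month.
        ts = datetime.fromtimestamp(listen["listened_at"], tz=timezone.utc)
        merged[(ts.year, ts.month)].append(listen)
    # After this rewrite, incremental.parquet and the deleted-listens
    # files would be emptied, since their contents are now folded in.
    return {
        key: [l for l in listens if l["id"] not in deleted_ids]
        for key, listens in merged.items()
    }
```

With this in place, the read-time path stays unchanged: it still unions partitions with (now empty) incremental data, so compaction is transparent to consumers.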

@amCap1712 amCap1712 changed the title Dumps compaction Compact listen dumps in HDFS Mar 3, 2025
@amCap1712 amCap1712 requested a review from mayhem March 3, 2025 16:59
Member

@mayhem mayhem left a comment


I can't wait to see how this works!!

@amCap1712 amCap1712 merged commit 1581e68 into master Mar 4, 2025
@amCap1712 amCap1712 deleted the dumps-compaction branch March 4, 2025 11:01