
Compact listen dumps in HDFS#3211

Merged
amCap1712 merged 2 commits into master from dumps-compaction
Mar 4, 2025

Conversation

@amCap1712
Member

Full dumps, when imported into Spark, are partitioned by listened_at's year and month for storage in HDFS. Incremental dumps are imported every day and appended to a single incremental.parquet. Deleted listens are similarly stored in deleted-listens.parquet and deleted-user-listen-history.parquet. At run time, the full dump partitions are read and concatenated with the incremental dumps, and the deleted listens are filtered out of the union. When a new full dump is imported, it contains all listens up to that time with all deleted listens already removed, so the additional parquet files for incremental and deleted listens are deleted. This currently happens on a biweekly schedule.
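The read-time assembly described above can be sketched as a toy in-memory model (function and field names are hypothetical; the real pipeline reads parquet files from HDFS with Spark):

```python
# Toy model of read-time assembly: union full-dump partitions with
# incremental listens, then filter out deleted listens. Hypothetical
# names; stands in for Spark reading parquet files from HDFS.

def assemble_listens(full_dump_partitions, incremental, deleted_ids):
    """Concatenate all (year, month) partitions with the incremental
    listens and drop any listen whose id is in the deleted set."""
    all_listens = []
    for partition in full_dump_partitions:  # one list per (year, month)
        all_listens.extend(partition)
    all_listens.extend(incremental)         # rows from incremental.parquet
    return [l for l in all_listens if l["id"] not in deleted_ids]

# Example: two monthly partitions, one incremental batch, one deletion.
partitions = [
    [{"id": 1, "track": "a"}, {"id": 2, "track": "b"}],  # e.g. 2024-01
    [{"id": 3, "track": "c"}],                            # e.g. 2024-02
]
incremental = [{"id": 4, "track": "d"}]
deleted = {2}

result = assemble_listens(partitions, incremental, deleted)
print([l["id"] for l in result])  # → [1, 3, 4]
```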

Full dumps are cumbersome to produce, so we want to reduce our dependence on them inside ListenBrainz and Spark. After an initial full dump import to seed the cluster, we intend to drop the biweekly full dump imports and rely on incremental dumps continuously. Hence, we need to rethink how incremental listens are stored in the Spark cluster and how deletions are implemented.

The solution I have come up with is to replace the full dump import step with a compaction step, which reads all the partitioned base listens, combines them with the incremental listens, removes the deleted listens, and writes the result back to HDFS in the same partitioned format. Everything else remains the same.
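A minimal sketch of such a compaction step, again as a toy in-memory model under assumed names (the real step would rewrite the partitioned parquet files via Spark): it folds the incremental listens into the (year, month) partitions keyed by listened_at and drops deleted listens, after which the incremental and deleted-listens files can be truncated.

```python
# Toy model of the compaction step (hypothetical names): merge incremental
# listens into the year/month partitions and filter out deletions.
from collections import defaultdict
from datetime import datetime, timezone

def compact(partitions, incremental, deleted_ids):
    """Return new (year, month) -> listens partitions with incremental
    listens merged in and deleted listens removed."""
    merged = defaultdict(list)
    for key, listens in partitions.items():
        merged[key].extend(listens)
    for listen in incremental:
        # Route each incremental listen to the partition for its
        # listened_at year and month.
        ts = datetime.fromtimestamp(listen["listened_at"], tz=timezone.utc)
        merged[(ts.year, ts.month)].append(listen)
    # After this rewrite, incremental.parquet and the deleted-listens
    # files would be emptied, since their contents are now folded in.
    return {
        key: [l for l in listens if l["id"] not in deleted_ids]
        for key, listens in merged.items()
    }
```

With this in place, the read-time path stays unchanged: it still unions partitions with (now empty) incremental data, so compaction is transparent to consumers.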

@amCap1712 amCap1712 changed the title Dumps compaction Compact listen dumps in HDFS Mar 3, 2025
@amCap1712 amCap1712 requested a review from mayhem March 3, 2025 16:59
Member

@mayhem mayhem left a comment


I can't wait to see how this works!!

@amCap1712 amCap1712 merged commit 1581e68 into master Mar 4, 2025
@amCap1712 amCap1712 deleted the dumps-compaction branch March 4, 2025 11:01