mayhem (Member) approved these changes on Mar 4, 2025 and left a comment:
I can't wait to see how this works!!
Full dumps, when imported into Spark, are partitioned by listened_at's year and month for storage in HDFS. Incremental dumps are imported every day and appended to a single incremental.parquet file. Deleted listens are similarly stored in deleted-listens.parquet and deleted-user-listen-history.parquet. At run time, the full dump is read and concatenated with the incremental dumps, and the deleted listens are filtered out of the union. When a new full dump is imported, it contains all listens up to that point with all deleted listens already removed, so the extra parquet files for incremental and deleted listens are discarded. At the moment this happens on a biweekly schedule.
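The read path described above amounts to a union followed by an anti-join against the deleted listens. A minimal sketch of that semantics in plain Python (the real pipeline operates on Spark DataFrames, where the filter would be a left_anti join; the key fields here are illustrative, not the actual schema):

```python
# Runtime read path: union the base (full dump) listens with the
# incremental listens, then drop every listen whose key appears in
# the deleted sets. Key fields (user_id, listened_at) are illustrative.

def read_listens(base, incremental, deleted_keys):
    """base, incremental: lists of listen dicts.
    deleted_keys: set of (user_id, listened_at) keys marked as deleted."""
    union = base + incremental
    return [
        listen for listen in union
        if (listen["user_id"], listen["listened_at"]) not in deleted_keys
    ]

base = [
    {"user_id": 1, "listened_at": 100, "track": "a"},
    {"user_id": 1, "listened_at": 200, "track": "b"},
]
incremental = [{"user_id": 2, "listened_at": 300, "track": "c"}]
deleted = {(1, 200)}

# Listen (1, 200) is filtered out; the other two survive.
print(read_listens(base, incremental, deleted))
```

In Spark the same step is typically expressed as `union` followed by a `left_anti` join on the key columns, so the deletion filter never materializes the deleted rows.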
Full dumps are cumbersome to produce, so we want to reduce our dependence on them inside ListenBrainz and Spark. After an initial full dump import to seed the cluster, we intend to drop the biweekly full dump imports and rely on incremental dumps alone. This means rethinking how incremental listens are stored in the Spark cluster and how deletions are implemented.
The solution I have come up with is to replace the full dump import step with a compaction step, which reads all the partitioned base listens, combines them with the incremental listens, removes the deleted listens, and writes the result back to HDFS in the same partitioned format. Everything else stays the same.
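The compaction step could look roughly like the sketch below, again in plain Python over in-memory partitions. The actual job would read and rewrite the year/month parquet partitions in HDFS with Spark; the partition layout, key fields, and epoch-seconds listened_at are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime, timezone

def compact(base_partitions, incremental, deleted_keys):
    """base_partitions: {(year, month): [listen dicts]}, mirroring the
    HDFS layout. Returns a new partition layout with the incremental
    listens folded in and the deleted listens removed; the incremental
    and deleted-listen inputs can then be discarded, just as they are
    after a full dump import today."""
    # Union all base partitions with the incremental listens.
    merged = [listen for part in base_partitions.values() for listen in part]
    merged += incremental
    # Anti-join against the deleted keys.
    kept = [
        listen for listen in merged
        if (listen["user_id"], listen["listened_at"]) not in deleted_keys
    ]
    # Re-partition by listened_at's year and month, as in the full dump layout.
    new_partitions = defaultdict(list)
    for listen in kept:
        ts = datetime.fromtimestamp(listen["listened_at"], tz=timezone.utc)
        new_partitions[(ts.year, ts.month)].append(listen)
    return dict(new_partitions)
```

Because the output has the same shape as a freshly imported full dump, everything downstream of the import (stats jobs, the runtime union-and-filter read path) is unaffected.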