Incremental stats sitewide #3114
Conversation
Hello @amCap1712! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2025-01-22 09:32:10 UTC
Force-pushed from 46caca9 to a9a62ce (Compare)
Note that for sitewide statistics there is a slight inaccuracy in the final listen counts: the per-user listen count limit can only be enforced within each aggregate separately (to do it efficiently), so in the worst case, where both the full dump listens and the incremental listens contain the maximum allowed number of listens for a user, the actual per-user contribution can be up to 2x the desired limit.
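To make that worst case concrete, here is a minimal PySpark sketch with toy data and a made-up USER_LISTENS_LIMIT (not the PR's actual code): because the cap is applied inside each aggregate independently, a single user can contribute up to twice the limit once the aggregates are combined.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("cap-sketch").getOrCreate()

USER_LISTENS_LIMIT = 2  # tiny hypothetical cap, just to show the effect

# toy listens: user 1 has 3 listens in the full dump and 3 incremental ones
full_dump = spark.createDataFrame(
    [(1, "Artist A", i) for i in range(3)], ["user_id", "artist_name", "listened_at"])
incremental = spark.createDataFrame(
    [(1, "Artist A", 100 + i) for i in range(3)], ["user_id", "artist_name", "listened_at"])

def cap_per_user(df):
    # the limit is enforced inside a single aggregate only, for efficiency
    w = Window.partitionBy("user_id").orderBy(F.col("listened_at").desc())
    return (df.withColumn("rn", F.row_number().over(w))
              .where(F.col("rn") <= USER_LISTENS_LIMIT)
              .drop("rn"))

combined = cap_per_user(full_dump).union(cap_per_user(incremental))
# worst case: user 1 contributes 2 * USER_LISTENS_LIMIT = 4 listens
# instead of the desired 2, because the cap was applied per aggregate.
print(combined.count())  # 4
```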
What a monster PR that ended up being... tame. I think your approach makes total sense and I love how modular everything is -- consistent and easy to read in the end. Well done! I can't wait to see this in prod!
@@ -3,6 +3,11 @@
from pyspark.sql.types import StructField, StructType, ArrayType, StringType, TimestampType, FloatType, \
    IntegerType, LongType

BOOKKEEPING_SCHEMA = StructType([
What is "bookkeeping" in this context? Perhaps a comment here or elsewhere defining it would be good.
It keeps track of the from_date and the to_date used to create the partial aggregate from full dump listens. Assuming dumps are imported twice a month, the aggregates for weekly stats need to be refreshed (generated from a different range of listens in the full dump) sooner. The existing_aggregate_usable method reads these from/to dates from the bookkeeping path and compares them with today's request to determine whether the aggregate needs to be recreated.
Will add a comment.
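For context, a rough sketch of the bookkeeping check described above, with a hypothetical local JSON path and function names standing in for however the PR actually persists this metadata (presumably via the BOOKKEEPING_SCHEMA shown earlier):

```python
import json
from datetime import date
from pathlib import Path

# hypothetical location; the real code keeps this alongside the partial aggregate
BOOKKEEPING_PATH = Path("/data/stats/bookkeeping/sitewide_artists.json")

def existing_aggregate_usable(from_date: date, to_date: date) -> bool:
    """Reuse the cached partial aggregate only if it covers exactly this range."""
    if not BOOKKEEPING_PATH.exists():
        return False
    meta = json.loads(BOOKKEEPING_PATH.read_text())
    return (meta["from_date"] == from_date.isoformat()
            and meta["to_date"] == to_date.isoformat())

def save_bookkeeping(from_date: date, to_date: date) -> None:
    """Record the range used to (re)build the partial aggregate from the full dump."""
    BOOKKEEPING_PATH.parent.mkdir(parents=True, exist_ok=True)
    BOOKKEEPING_PATH.write_text(json.dumps({
        "from_date": from_date.isoformat(),
        "to_date": to_date.isoformat(),
    }))
```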
def get_partial_aggregate_schema(self):
    return StructType([
        StructField("artist_name", StringType(), nullable=False),
        StructField("artist_mbid", StringType(), nullable=True),
Why is artist_mbid nullable?
Because of unmapped listens.
In the ListenBrainz Spark cluster, full dump listens (which remain constant for ~15 days) and incremental listens (ingested daily) are the two main sources of data. Incremental listens are cleared whenever a new full dump is imported. Aggregating full dump listens daily for various statistics is inefficient since this data does not change.
To optimize this process:
1. A partial aggregate is built from the full dump listens and cached, along with bookkeeping metadata recording the from/to dates it covers; it is rebuilt only when the requested date range no longer matches (see existing_aggregate_usable above).
2. The incremental listens, which change daily, are aggregated separately on each run.
3. The two aggregates are combined to produce the final statistics (a sketch of this step follows the list).
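As a rough illustration with toy DataFrames and made-up column names (not the PR's code), the combination step amounts to a union of the two aggregates followed by a re-aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("combine-sketch").getOrCreate()

# cached partial aggregate built from the full dump listens
full_dump_agg = spark.createDataFrame(
    [("Artist A", 100), ("Artist B", 40)], ["artist_name", "listen_count"])

# aggregate of the incremental listens ingested since the last full dump
incremental_agg = spark.createDataFrame(
    [("Artist A", 5), ("Artist C", 3)], ["artist_name", "listen_count"])

# final sitewide statistic: sum the two aggregates per entity
final_stats = (full_dump_agg
               .union(incremental_agg)
               .groupBy("artist_name")
               .agg(F.sum("listen_count").alias("listen_count"))
               .orderBy(F.desc("listen_count")))
final_stats.show()
```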
For non-sitewide statistics, a further optimization is possible: if an entity's listens (e.g., a user's) are not present in the incremental listens, its statistics do not need to be recalculated. Similarly, entity-level listener stats can skip recomputation when the relevant data is absent from the incremental listens. A sketch of this shortcut follows.
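In this sketch (hypothetical column names, not the PR's code), only entities that appear in the incremental listens are routed through the stats pipeline again, which a semi/anti join against the cached results expresses directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("skip-sketch").getOrCreate()

cached_user_stats = spark.createDataFrame(
    [(1, 120), (2, 80), (3, 15)], ["user_id", "listen_count"])
incremental_listens = spark.createDataFrame(
    [(2, "Artist A"), (2, "Artist B")], ["user_id", "artist_name"])

touched_users = incremental_listens.select("user_id").distinct()

# users touched by incremental listens need their stats recomputed
stale = cached_user_stats.join(touched_users, on="user_id", how="left_semi")
# everyone else's cached stats remain valid and are left untouched
still_valid = cached_user_stats.join(touched_users, on="user_id", how="left_anti")

stale.show()        # only user 2 goes through the full stats pipeline again
still_valid.show()  # users 1 and 3 keep their existing results
```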