Incremental stats sitewide #3114

Merged · 27 commits · Jan 22, 2025

Conversation

@amCap1712 (Member) commented Jan 5, 2025

In the ListenBrainz Spark cluster, full dump listens (which remain constant for ~15 days) and incremental listens (ingested daily) are the two main sources of data. Incremental listens are cleared whenever a new full dump is imported. Aggregating full dump listens daily for various statistics is inefficient since this data does not change.

To optimize this process:

  1. A partial aggregate is generated from the full dump listens the first time a stat is requested. This partial aggregate is stored in HDFS for future use, eliminating the need for redundant full dump aggregation.
  2. Incremental listens are aggregated daily. Although all incremental listens since the full dump’s import are used (not just today’s), this introduces some redundant computation.
  3. The incremental aggregate is combined with the existing partial aggregate, forming a combined aggregate from which final statistics are generated.
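The three steps above can be sketched in plain Python. This is an illustrative toy model only (the actual implementation runs on Spark DataFrames in HDFS); the artist names and counts are made up, and `Counter` stands in for a grouped count aggregate:

```python
from collections import Counter

# Hypothetical listen batches; names and sizes are illustrative only.
full_dump_listens = ["artist_a", "artist_a", "artist_b"]
incremental_listens = ["artist_b", "artist_c"]

def aggregate(listens):
    """Build a per-entity listen-count aggregate from a batch of listens."""
    return Counter(listens)

# Step 1: partial aggregate from full dump listens (computed once, then cached).
partial_aggregate = aggregate(full_dump_listens)

# Step 2: aggregate of all incremental listens since the dump import
# (recomputed daily, hence some redundant work).
incremental_aggregate = aggregate(incremental_listens)

# Step 3: combine the two aggregates to produce the final statistics.
combined_aggregate = partial_aggregate + incremental_aggregate
print(dict(combined_aggregate))  # {'artist_a': 2, 'artist_b': 2, 'artist_c': 1}
```

The key property is that step 1 runs only once per full dump, while steps 2 and 3 operate on the much smaller incremental data.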

For non-sitewide statistics, further optimization is possible: If an entity’s listens (e.g., for a user) are not present in the incremental listens, its statistics do not need to be recalculated. Similarly, entity-level listener stats can skip recomputation when relevant data is absent in incremental listens.
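A minimal sketch of that skip logic, again in plain Python with hypothetical data (the real job works on Spark DataFrames and cached stats in HDFS): only users who appear in the incremental batch are recomputed, everyone else keeps their cached value.

```python
# Hypothetical cached per-user listen counts and a small incremental batch.
cached_user_stats = {"alice": 120, "bob": 45, "carol": 9}
incremental_listens = [("bob", "track_1"), ("bob", "track_2")]

# Entities (users) with new listens are the only ones needing recomputation.
users_with_new_listens = {user for user, _track in incremental_listens}

updated_stats = {}
for user, count in cached_user_stats.items():
    if user in users_with_new_listens:
        new_listens = sum(1 for u, _ in incremental_listens if u == user)
        updated_stats[user] = count + new_listens
    else:
        updated_stats[user] = count  # untouched: no recomputation needed
```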


pep8speaks commented Jan 5, 2025

Hello @amCap1712! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2025-01-22 09:32:10 UTC

@amCap1712 force-pushed the incremental-stats-sitewide branch from 46caca9 to a9a62ce on January 7, 2025 20:40
@amCap1712 marked this pull request as ready for review January 7, 2025 20:41
@amCap1712 (Member, Author) commented:

Note that for sitewide statistics there is a slight inaccuracy in the final listen counts: to keep the computation efficient, the per-user listen count limit can only be enforced within each aggregate separately. In the worst case (both the full dump listens and the incremental listens contain the maximum allowed number of listens for a user), a user's effective listen count can therefore be up to 2x the desired limit.
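The worst case can be seen with a small worked example. The limit value below is hypothetical; the point is only that capping per aggregate and then summing can double the intended bound:

```python
# Hypothetical per-user listen count limit.
USER_LISTEN_LIMIT = 1000

def capped_user_count(raw_count):
    # The limit is enforced per aggregate, not on the combined total.
    return min(raw_count, USER_LISTEN_LIMIT)

# Worst case: a user hits the cap in BOTH aggregates.
full_dump_count = capped_user_count(5000)    # capped to 1000
incremental_count = capped_user_count(3000)  # capped to 1000

# Combining the two capped counts can yield up to 2x the intended limit.
combined = full_dump_count + incremental_count
print(combined)  # 2000, i.e. 2 * USER_LISTEN_LIMIT
```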

@amCap1712 requested a review from mayhem January 10, 2025 09:59
@mayhem (Member) left a comment:

What a monster PR that ended up being... tame. I think your approach makes total sense and I love how modular everything is -- consistent and easy to read in the end. Well done! I can't wait to see this in prod!

@@ -3,6 +3,11 @@
from pyspark.sql.types import StructField, StructType, ArrayType, StringType, TimestampType, FloatType, \
IntegerType, LongType

BOOKKEEPING_SCHEMA = StructType([
Member commented:

What is Bookkeeping in this context? perhaps a comment here or elsewhere defining this might be good.

@amCap1712 (Member, Author) replied:

Keeping track of the from_date and the to_date used to create the partial aggregate from full dump listens. Assuming dumps are imported twice a month, the aggregates for weekly stats need to be refreshed (generated from a different range of listens in the full dump) sooner. The existing_aggregate_usable method reads this from/to date from the bookkeeping path and compares it with today's request to determine whether the aggregate needs to be recreated.
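A minimal sketch of that check in plain Python (the real method reads bookkeeping metadata from HDFS; the dict shape, dates, and exact-match comparison here are illustrative assumptions):

```python
from datetime import date

def existing_aggregate_usable(bookkeeping, requested_from, requested_to):
    """Return True if the cached partial aggregate covers exactly the
    requested listen range and can therefore be reused."""
    return (
        bookkeeping is not None
        and bookkeeping["from_date"] == requested_from
        and bookkeeping["to_date"] == requested_to
    )

# Bookkeeping recorded when the partial aggregate was created.
bookkeeping = {"from_date": date(2025, 1, 6), "to_date": date(2025, 1, 12)}

# Same week requested again: the cached partial aggregate is reused.
print(existing_aggregate_usable(bookkeeping, date(2025, 1, 6), date(2025, 1, 12)))   # True

# A new week starts: the weekly aggregate must be rebuilt from the full dump.
print(existing_aggregate_usable(bookkeeping, date(2025, 1, 13), date(2025, 1, 19)))  # False
```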

@amCap1712 (Member, Author):
Will add a comment.

def get_partial_aggregate_schema(self):
return StructType([
StructField("artist_name", StringType(), nullable=False),
StructField("artist_mbid", StringType(), nullable=True),
Member commented:

why is artist_mbid nullable?

@amCap1712 (Member, Author) replied:

Because of unmapped listens.

@amCap1712 merged commit 0854fc0 into master Jan 22, 2025
1 check failed
@amCap1712 deleted the incremental-stats-sitewide branch January 22, 2025 09:32
3 participants