Incremental stats sitewide #3114

Merged · 27 commits · Jan 22, 2025

Conversation

@amCap1712 (Member) commented Jan 5, 2025

In the ListenBrainz Spark cluster, full dump listens (which remain constant for ~15 days) and incremental listens (ingested daily) are the two main sources of data. Incremental listens are cleared whenever a new full dump is imported. Aggregating full dump listens daily for various statistics is inefficient since this data does not change.

To optimize this process:

  1. A partial aggregate is generated from the full dump listens the first time a stat is requested. This partial aggregate is stored in HDFS for future use, eliminating the need for redundant full dump aggregation.
  2. Incremental listens are aggregated daily. Although all incremental listens since the full dump’s import are used (not just today’s), this introduces some redundant computation.
  3. The incremental aggregate is combined with the existing partial aggregate, forming a combined aggregate from which final statistics are generated.
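The three steps above can be sketched in plain Python. This is an illustrative toy model only (the actual implementation runs on Spark DataFrames in HDFS); the artist names and counts are made up, and `Counter` stands in for a grouped count aggregate:

```python
from collections import Counter

# Hypothetical listen batches; names and sizes are illustrative only.
full_dump_listens = ["artist_a", "artist_a", "artist_b"]
incremental_listens = ["artist_b", "artist_c"]

def aggregate(listens):
    """Build a per-entity listen-count aggregate from a batch of listens."""
    return Counter(listens)

# Step 1: partial aggregate from full dump listens (computed once, then cached).
partial_aggregate = aggregate(full_dump_listens)

# Step 2: aggregate of all incremental listens since the dump import
# (recomputed daily, hence some redundant work).
incremental_aggregate = aggregate(incremental_listens)

# Step 3: combine the two aggregates to produce the final statistics.
combined_aggregate = partial_aggregate + incremental_aggregate
print(dict(combined_aggregate))  # {'artist_a': 2, 'artist_b': 2, 'artist_c': 1}
```

The key property is that step 1 runs only once per full dump, while steps 2 and 3 operate on the much smaller incremental data.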

For non-sitewide statistics, further optimization is possible: If an entity’s listens (e.g., for a user) are not present in the incremental listens, its statistics do not need to be recalculated. Similarly, entity-level listener stats can skip recomputation when relevant data is absent in incremental listens.
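A minimal sketch of that skip logic, again in plain Python with hypothetical data (the real job works on Spark DataFrames and cached stats in HDFS): only users who appear in the incremental batch are recomputed, everyone else keeps their cached value.

```python
# Hypothetical cached per-user listen counts and a small incremental batch.
cached_user_stats = {"alice": 120, "bob": 45, "carol": 9}
incremental_listens = [("bob", "track_1"), ("bob", "track_2")]

# Entities (users) with new listens are the only ones needing recomputation.
users_with_new_listens = {user for user, _track in incremental_listens}

updated_stats = {}
for user, count in cached_user_stats.items():
    if user in users_with_new_listens:
        new_listens = sum(1 for u, _ in incremental_listens if u == user)
        updated_stats[user] = count + new_listens
    else:
        updated_stats[user] = count  # untouched: no recomputation needed
```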


pep8speaks commented Jan 5, 2025

Hello @amCap1712! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2025-01-22 09:32:10 UTC

@amCap1712 force-pushed the incremental-stats-sitewide branch from 46caca9 to a9a62ce on January 7, 2025 20:40
@amCap1712 marked this pull request as ready for review January 7, 2025 20:41
@amCap1712 (Member, Author) commented:

Note that for sitewide statistics there is a slight inaccuracy in the final listen counts: to keep the computation efficient, the per-user listen count limit can only be enforced within each aggregate separately. In the worst case (both the full dump listens and the incremental listens contain the maximum allowed number of listens for a user), a user's effective listen count can therefore be up to 2x the desired limit.
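The worst case can be seen with a small worked example. The limit value below is hypothetical; the point is only that capping per aggregate and then summing can double the intended bound:

```python
# Hypothetical per-user listen count limit.
USER_LISTEN_LIMIT = 1000

def capped_user_count(raw_count):
    # The limit is enforced per aggregate, not on the combined total.
    return min(raw_count, USER_LISTEN_LIMIT)

# Worst case: a user hits the cap in BOTH aggregates.
full_dump_count = capped_user_count(5000)    # capped to 1000
incremental_count = capped_user_count(3000)  # capped to 1000

# Combining the two capped counts can yield up to 2x the intended limit.
combined = full_dump_count + incremental_count
print(combined)  # 2000, i.e. 2 * USER_LISTEN_LIMIT
```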

@amCap1712 requested a review from mayhem January 10, 2025 09:59
@mayhem (Member) left a comment:

What a monster PR that ended up being... tame. I think your approach makes total sense and I love how modular everything is -- consistent and easy to read in the end. Well done! I can't wait to see this in prod!

@@ -3,6 +3,11 @@
from pyspark.sql.types import StructField, StructType, ArrayType, StringType, TimestampType, FloatType, \
IntegerType, LongType

BOOKKEEPING_SCHEMA = StructType([
Member commented:

What is Bookkeeping in this context? perhaps a comment here or elsewhere defining this might be good.

@amCap1712 (Member, Author) replied:

Keeping track of the from_date and the to_date used to create the partial aggregate from full dump listens. Assuming dumps are imported twice a month, the aggregates for weekly stats need to be refreshed (generated from a different range of listens in the full dump) sooner. The existing_aggregate_usable method reads this from/to date from the bookkeeping path and compares it with today's request to determine whether the aggregate needs to be recreated.
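A minimal sketch of that check in plain Python (the real method reads bookkeeping metadata from HDFS; the dict shape, dates, and exact-match comparison here are illustrative assumptions):

```python
from datetime import date

def existing_aggregate_usable(bookkeeping, requested_from, requested_to):
    """Return True if the cached partial aggregate covers exactly the
    requested listen range and can therefore be reused."""
    return (
        bookkeeping is not None
        and bookkeeping["from_date"] == requested_from
        and bookkeeping["to_date"] == requested_to
    )

# Bookkeeping recorded when the partial aggregate was created.
bookkeeping = {"from_date": date(2025, 1, 6), "to_date": date(2025, 1, 12)}

# Same week requested again: the cached partial aggregate is reused.
print(existing_aggregate_usable(bookkeeping, date(2025, 1, 6), date(2025, 1, 12)))   # True

# A new week starts: the weekly aggregate must be rebuilt from the full dump.
print(existing_aggregate_usable(bookkeeping, date(2025, 1, 13), date(2025, 1, 19)))  # False
```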

@amCap1712 (Member, Author):
Will add a comment.

def get_partial_aggregate_schema(self):
return StructType([
StructField("artist_name", StringType(), nullable=False),
StructField("artist_mbid", StringType(), nullable=True),
Member commented:

why is artist_mbid nullable?

@amCap1712 (Member, Author) replied:

Because of unmapped listens.

@amCap1712 merged commit 0854fc0 into master Jan 22, 2025
1 check failed
@amCap1712 deleted the incremental-stats-sitewide branch January 22, 2025 09:32
3 participants