Hi all 👋
I'm using Spark Structured Streaming to stream from one Hudi table to another Hudi table.
I noticed that when the stream first started, each batch was relatively fast, with an average duration of 30 seconds, but it increased over time and stabilized at around 300 seconds.
I tested many configurations and came to the conclusion that this only happens when I have hoodie.cleaner.hours.retained=72 (3 days). When I reverted to the default (10 commits retained), batch latency returned to 30 seconds... and increasing it again made it grow over 3 days and stabilize again at 300 seconds.
Tried using async clean/archive with these lock concurrency settings:
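For reference, the kind of options this refers to look roughly like the sketch below (not my exact values; the in-process lock provider is just an example):

```scala
// Sketch of async table-service + lock options on the Hudi writer (illustrative values).
val asyncTableServiceOpts = Map(
  "hoodie.clean.async"   -> "true",  // run cleaning off the write path
  "hoodie.archive.async" -> "true",  // run timeline archival off the write path
  // async table services in the same writer process still need a lock provider:
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
)
```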
It improved a bit, but batches still take 200-250 seconds with 72h of commits.
In the Spark UI, SQL/DataFrame tab, a normal batch looks like the screenshot:
In the example, the batch took 3.3 min... but if I open all 19 related Succeeded Jobs and sum the time each one took, it adds up to only 30 seconds... apparently the rest of the time is spent on non-Spark operations, but I found nothing in the logs during these gaps (this cluster is running only this stream, so it's not waiting for resources)... the job that took the most time (26 sec) has the description: "Preparing compaction metadata".
Regarding my .hoodie/ folder, here are some stats:
- the whole Hudi table folder is 28 GB with 30k objects
- the .hoodie folder alone is 11 GB with 29k objects!
- around 5k files in .hoodie/ (filenames contain rollback, clean, commit, etc.)
- 1,347 files / 129 MB in .hoodie/archived/
- most of the data is in .hoodie/metadata/, as per the screenshot:
**Steps to reproduce the behavior:**
Set hoodie.cleaner.hours.retained=72; batch duration increased roughly 10x.
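A minimal sketch of the writer options involved (the retention value is the one above; the cleaner policy line is the hours-based policy that setting implies):

```scala
// Hours-based cleaner retention: keep 72h of commits so downstream incremental readers can catch up.
val cleanerOpts = Map(
  "hoodie.cleaner.policy"         -> "KEEP_LATEST_BY_HOURS",
  "hoodie.cleaner.hours.retained" -> "72"
)
```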
**Expected behavior**
batch duration wouldn't increase significantly
**Environment Description**
- Hudi 0.15, Spark 3.5.1, AWS S3; running on Kubernetes
- Average input rate of 50 records/second
- Table is COW and partitioned by day
- Metadata table enabled, with column_stats and record index
When you increase the number of commits retained in Hudi, it can lead to a significant increase in the streaming batch duration. This is because Hudi needs to maintain a larger commit history, which can cause the following issues:
- **Increased metadata overhead:** With more commits retained, Hudi needs to store and manage more metadata, which can lead to increased overhead in terms of storage, memory, and processing time.
- **Slower commit processing:** When there are more commits to process, Hudi's commit processing time increases, leading to slower streaming batch durations.
- **Higher memory usage:** Retaining more commits requires more memory to store the commit history, which can lead to increased memory usage and potential memory issues.
The main reason for the slowness is that Hudi needs to load the equivalent number of commits to perform index lookups, a process essential for handling updates.
Hi @rangareddy,
I'm keeping 72h of commits in order to have a 3-day window of changes, to allow downstream streams time to catch up in case of failure, similar to having 7 days of retention in Kafka. Streaming from Hudi relies on the Incremental Query feature, which in turn uses retained commits for that.
So if I reduce it to 24h and a downstream stops for more than 24h, once I restart it, won't my downstream abort, or worse, lose data? Am I thinking correctly?
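For context, a Hudi-to-Hudi structured streaming pipeline like the one described is roughly the following (a minimal sketch; paths, table names and trigger interval are placeholders, and required write options like record key and precombine field are omitted):

```scala
import org.apache.spark.sql.streaming.Trigger

// Stream incrementally from the upstream Hudi table into the downstream one.
// Paths and checkpoint location are placeholders.
val upstream = spark.readStream
  .format("hudi")
  .load("s3://bucket/upstream_table")

val query = upstream.writeStream
  .format("hudi")
  .option("hoodie.table.name", "downstream_table")
  .option("checkpointLocation", "s3://bucket/checkpoints/downstream_table")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("s3://bucket/downstream_table")

query.awaitTermination()
```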