[FLINK-37253] Add state size in application status and deployment metrics #941

Merged on Feb 7, 2025 (3 commits)

Contributor

@mxm mxm commented Feb 4, 2025

This adds state size, i.e. the size of the last completed checkpoint, to the
deployment status. It also exposes the state size as a deployment metric.
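For readers unfamiliar with the definition above, here is a minimal sketch of "state size = size of the last completed checkpoint". It is not the code from this PR: the `StateSizeSketch` and `Checkpoint` names are hypothetical, and it assumes that checkpoint ids increase over time, so the completed checkpoint with the highest id is the most recent one.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch; none of these names are taken from the operator code base.
class StateSizeSketch {

    // Minimal stand-in for one checkpoint entry as a monitoring API might report it.
    static final class Checkpoint {
        final long id;
        final boolean completed;
        final long stateSizeBytes;

        Checkpoint(long id, boolean completed, long stateSizeBytes) {
            this.id = id;
            this.completed = completed;
            this.stateSizeBytes = stateSizeBytes;
        }
    }

    // Returns the size of the completed checkpoint with the highest id,
    // or an empty result if no checkpoint has completed yet.
    // Assumption: a higher id means a more recent checkpoint.
    static OptionalLong lastCompletedStateSize(List<Checkpoint> checkpoints) {
        Checkpoint latest = null;
        for (Checkpoint c : checkpoints) {
            if (c.completed && (latest == null || c.id > latest.id)) {
                latest = c;
            }
        }
        return latest == null
                ? OptionalLong.empty()
                : OptionalLong.of(latest.stateSizeBytes);
    }
}
```

In-progress or failed checkpoints are skipped, so the reported value only changes once a newer checkpoint completes. Returning an empty result before the first completed checkpoint lets a caller leave the status field unset rather than report zero.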

@mxm mxm requested a review from gyfora February 4, 2025 16:04
@mxm mxm force-pushed the FLINK-37253 branch 5 times, most recently from f0ecf40 to dd22b3d Compare February 5, 2025 12:04
@@ -403,6 +403,28 @@ public static Long calculateClusterMemoryUsage(Configuration conf, int taskManag
        return tmTotalMemory + jmTotalMemory;
    }

    public static Long calculateClusterStateSize(Configuration conf, int taskManagerReplicas) {
Contributor

Shouldn't this be called totalClusterMemorySize instead of state?

Contributor

Also I don't think this is used anywhere... :)

Contributor Author

State size or checkpoint size isn't directly related to the cluster memory size. For the heap memory backend, we would expect the state size to be lower than the overall memory. For RocksDB, it could even exceed the cluster memory.

Contributor

Ok but I still don't get 3 things:

  • Where is this used?
  • Why do we need this bad approximation if state size metrics are available from Flink?
  • This is basically just total memory, why do we call it state size?

Contributor Author

I didn't see your comment in #941 (comment); it was somehow hidden when I replied.

Sorry, this was unused code. I have removed it.

mxm added 2 commits February 5, 2025 16:07
…rics

This adds state size, i.e. the size of the last completed checkpoint, to the
deployment status. It also exposes the state size as a deployment metric.
Contributor

gyfora commented Feb 5, 2025

Have you tested this in a (local) Kubernetes env with different Flink versions? Does it work as expected?

Contributor

@gyfora gyfora left a comment

If this was manually tested for correctness on the supported Flink versions and the e2e tests pass, then good to go.

Contributor Author

mxm commented Feb 5, 2025

Tried it out on a local k8s cluster with various Flink versions:

[Four screenshots attached]

@mxm mxm merged commit b7d6f9d into apache:main Feb 7, 2025
115 of 118 checks passed
@mxm mxm deleted the FLINK-37253 branch February 7, 2025 09:49