Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU device watermark metrics #11457

Merged
merged 4 commits into from
Sep 26, 2024
Merged

Conversation

zpuller
Copy link
Collaborator

@zpuller zpuller commented Sep 10, 2024

This PR builds on NVIDIA/spark-rapids-jni#2392 to add a metric to spark to capture the highest amount of device memory we have allocated over the lifespan of a task (as a Spark accumulator).

I have a corresponding change prepared for https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html#metrics

@sameerz sameerz marked this pull request as draft September 11, 2024 00:05
@sameerz sameerz marked this pull request as draft September 11, 2024 00:05
@sameerz sameerz marked this pull request as draft September 11, 2024 00:05
@zpuller zpuller marked this pull request as ready for review September 24, 2024 16:38
@zpuller zpuller changed the title [DRAFT] Device metrics GPU device watermark metrics Sep 24, 2024
@zpuller
Copy link
Collaborator Author

zpuller commented Sep 24, 2024

The JNI PR was just merged so I think this PR won't build until the next SNAPSHOT jar is produced for the JNI

revans2
revans2 previously approved these changes Sep 24, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits around naming.

Signed-off-by: Zach Puller <[email protected]>
revans2
revans2 previously approved these changes Sep 24, 2024
Signed-off-by: Zach Puller <[email protected]>
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

override def add(v: jl.Long): Unit = {
_value += v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Although I understand this function will be called once on task completion, I wonder if it would be more straightforward to understand if this were _value = _value.max(v). Is HighWatermarkAccumulator not just a max accumulator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't have a strong opinion about this. I think, yes, it is a max accumulator. Do others have thoughts on this?

Copy link
Collaborator

@jihoonson jihoonson Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  /**
   * Takes the inputs and accumulates.
   */
  def add(v: IN): Unit

The above snippet shows the doc of AccumulatorV2.add(). Based on that, it seems reasonable to compute a max in this function rather than a sum.

I think my comment is a nitpick though, and do not want to block this PR unnecessarily as long as it is clear in the code that this function must be called only once on task completion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only 2c for keeping it as is that we can probably consolidate such metrics into a single generic implementation potentially.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the sentiment that this is not meant to be an additive metric. I guess I am more leaning towards add setting the value, and throwing if it already had set it. E.g. we don't have a case where we want increment, and we don't think max makes sense within this scope (only at merge time). I'd throw to be sure we don't shoot ourselves in the foot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the API is not ideal, but it is what we have to work with. Also if you think of it as a collection of measurements, then add is appending a new measurement instead of adding a number to a single value. The fact that we reduce it to a single value to represent the collection is separate.

But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @revans2 on the exception being a bad idea.

I am back at what does add mean here. I think it means what @jihoonson suggested, it's a max, that's how you add it because the input is not meant to be additive, it is meant to be the max, always.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm going to leave it as is for now for expediency given that I have the approvals and I'm still not totally sure I want to change it, but definitely open to it in a follow up PR. I will be continuing work on related metrics so I'll have my eye on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.

Throwing an exception sounds reasonable to me because we will be able to catch such cases if Spark ever uses it in a different way than we expect. If we are not 100% sure whether Spark does or not in all cases, do we need the testing suggested above anyway? If add() is ever called more than once, computing a sum instead of a max in it will likely mess up the metric.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is why I approved the max version. Not the sum version. These metrics can get to be kind of complicated because they are produced on the worker and then sent back to the driver. On the driver they are written out ot the log, but then also something has to put them into the UI. I don't think that they use the accumulator for that, and that is where my questions are. I just have not looked at it enough. A max of a max is not a big deal, in reporting what we want to see if it does happen to be called multiple times. But doing a sum of a max might cause issues. They would likely be minor issues but all the same. If you want to learn what the code is doing then lets read it and do some experiments. If we want to ship it to production, then lets be defensive with the code that we ship.

@zpuller
Copy link
Collaborator Author

zpuller commented Sep 26, 2024

build

@zpuller zpuller merged commit c047707 into NVIDIA:branch-24.10 Sep 26, 2024
45 checks passed
@sameerz sameerz added the task Work required that improves the product but is not user facing label Oct 1, 2024
@zpuller zpuller deleted the device_metrics branch October 4, 2024 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants