Conversation
Force-pushed from 32723a7 to 702238f
@pguthrey is this ready for review?
Not just yet. Need to experiment more with different implementations. Will come back to this later.
Force-pushed from 050da7e to c9b19a6
Here are the results of these changes. Good improvement for CUDA. Impossibly good improvement for HIP. I checked that the results are the same as with the previous algorithm... but I might look more into what is going on with HIP.
Perhaps there could still be register spilling with CUDA, or something similar, that is making a dramatic difference. We'll have to look at the generated instructions to see what happened.
That makes some sense. If I add the memory needed by the vectors and the matrix together I get
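For rough context, here is a minimal, hypothetical CUDA sketch of the kind of per-thread storage being discussed. The kernel and its name are illustrative, not the RAJAPerf code; EB = 12 is assumed from the 12x12 matrices mentioned later in the thread. A 12x12 double matrix alone is 1152 bytes per thread (roughly 288 32-bit registers), which already exceeds the 255-register-per-thread limit on current NVIDIA GPUs, so spilling to local memory is plausible.

```cuda
// Hypothetical kernel, illustrative only -- not the RAJAPerf code.
constexpr int EB = 12;

__global__ void full_matrix_sketch(const double* in, double* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // 12 x 12 doubles = 1152 bytes per thread (~288 32-bit registers), beyond
  // the 255-register-per-thread limit, so much of this array is likely to be
  // spilled to local memory unless the compiler can keep only a slice live.
  double matrix[EB][EB];
  double vec[EB];

  for (int r = 0; r < EB; ++r) {
    vec[r] = in[i * EB + r];
    for (int c = 0; c < EB; ++c) {
      matrix[r][c] = vec[r] * static_cast<double>(c + 1);
    }
  }

  // Reduce so the compiler cannot eliminate the matrix entirely.
  double sum = 0.0;
  for (int r = 0; r < EB; ++r) {
    for (int c = 0; c < EB; ++c) {
      sum += matrix[r][c];
    }
  }
  out[i] = sum;
}
```

Compiling with `nvcc -Xptxas -v` prints, per kernel, the register count and any spill store/load bytes, which would confirm or rule out the spilling guess; the HIP toolchain has analogous resource-usage reporting.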
Updated results after fixing a bug.
It actually made CUDA slower?
const rajaperf::Real_type detj_tol,
const rajaperf::Int_type quad_type,
const rajaperf::Int_type quad_order,
rajaperf::Real_type (&matrix)[EB][EB])
There is still a full matrix?
The interface that this is used for requires having a full matrix. However, when we are computing the work at each quadrature point, we are using a symmetric matrix.
I can follow up with a study on how beneficial it would be to never create the full matrix. If the impact is major, that may be enough incentive to rewrite how things are done in the ultimate use case.
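To help scope that follow-up study, here is a minimal C++ sketch of the alternative being discussed: do the per-quadrature-point work in packed symmetric storage (EB*(EB+1)/2 = 78 doubles for EB = 12, instead of 144) and expand into the full matrix only at the interface boundary. This is a hypothetical illustration, not the RAJAPerf code; the names `sym_idx` and `expand_to_full` are made up.

```cpp
// Hypothetical sketch, not the RAJAPerf code: accumulate per-quadrature-point
// work into packed symmetric storage and expand to the full matrix only where
// the existing interface requires it.
constexpr int EB  = 12;
constexpr int SYM = EB * (EB + 1) / 2;  // 78 unique entries instead of 144

// Index into row-major packed upper-triangular storage; assumes row <= col.
constexpr int sym_idx(int row, int col)
{
  return row * EB - row * (row - 1) / 2 + (col - row);
}

// Mirror the packed entries into the full EB x EB matrix the interface expects.
void expand_to_full(const double (&packed)[SYM], double (&matrix)[EB][EB])
{
  for (int r = 0; r < EB; ++r) {
    for (int c = r; c < EB; ++c) {
      const double v = packed[sym_idx(r, c)];
      matrix[r][c] = v;
      matrix[c][r] = v;
    }
  }
}
```

Whether the final expansion step gives back the storage savings is exactly what the follow-up study would measure.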
Did a quick study measuring the performance of an implementation that has no 12x12 matrices. It's indistinguishable from the performance seen in this MR.
I would not say that is a statistically significant difference (it could be within sampling tolerance).