Conversation
Force-pushed from 32723a7 to 702238f
@pguthrey is this ready for review?
Not just yet. Need to experiment more with different implementations. Will come back to this later.
Force-pushed from 050da7e to c9b19a6
Here are the results of these changes. Good improvement for CUDA. Impossibly good improvement for HIP. I checked that the results are the same as with the previous algorithm... but I might look more into what is going on with HIP.
Perhaps there could still be register spilling with CUDA, or something similar, that is making a dramatic difference. We'll have to look at the generated instructions to see what happened.
That makes some sense. If I add the memory needed by the vectors and the matrix together I get
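For rough context, here is a minimal, hypothetical CUDA sketch of the kind of per-thread storage being discussed. The kernel and its name are illustrative, not the RAJAPerf code; EB = 12 is assumed from the 12x12 matrices mentioned later in the thread. A 12x12 double matrix alone is 1152 bytes per thread (roughly 288 32-bit registers), which already exceeds the 255-register-per-thread limit on current NVIDIA GPUs, so spilling to local memory is plausible.

```cuda
// Hypothetical kernel, illustrative only -- not the RAJAPerf code.
constexpr int EB = 12;

__global__ void full_matrix_sketch(const double* in, double* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // 12 x 12 doubles = 1152 bytes per thread (~288 32-bit registers), beyond
  // the 255-register-per-thread limit, so much of this array is likely to be
  // spilled to local memory unless the compiler can keep only a slice live.
  double matrix[EB][EB];
  double vec[EB];

  for (int r = 0; r < EB; ++r) {
    vec[r] = in[i * EB + r];
    for (int c = 0; c < EB; ++c) {
      matrix[r][c] = vec[r] * static_cast<double>(c + 1);
    }
  }

  // Reduce so the compiler cannot eliminate the matrix entirely.
  double sum = 0.0;
  for (int r = 0; r < EB; ++r) {
    for (int c = 0; c < EB; ++c) {
      sum += matrix[r][c];
    }
  }
  out[i] = sum;
}
```

Compiling with `nvcc -Xptxas -v` prints, per kernel, the register count and any spill store/load bytes, which would confirm or rule out the spilling guess; the HIP toolchain has analogous resource-usage reporting.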
Updated results after fixing a bug.
It actually made CUDA slower?
const rajaperf::Real_type detj_tol,
const rajaperf::Int_type quad_type,
const rajaperf::Int_type quad_order,
rajaperf::Real_type (&matrix)[EB][EB])
There is still a full matrix?
The interface that this is used for requires having a full matrix. However, when we are computing the work at each quadrature point, we are using a symmetric matrix.
I can follow up with a study on how beneficial it would be to never create the full matrix. If the impact is major, that may be enough incentive to rewrite how things are done in the ultimate use case.
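To help scope that follow-up study, here is a minimal C++ sketch of the alternative being discussed: do the per-quadrature-point work in packed symmetric storage (EB*(EB+1)/2 = 78 doubles for EB = 12, instead of 144) and expand into the full matrix only at the interface boundary. This is a hypothetical illustration, not the RAJAPerf code; the names `sym_idx` and `expand_to_full` are made up.

```cpp
// Hypothetical sketch, not the RAJAPerf code: accumulate per-quadrature-point
// work into packed symmetric storage and expand to the full matrix only where
// the existing interface requires it.
constexpr int EB  = 12;
constexpr int SYM = EB * (EB + 1) / 2;  // 78 unique entries instead of 144

// Index into row-major packed upper-triangular storage; assumes row <= col.
constexpr int sym_idx(int row, int col)
{
  return row * EB - row * (row - 1) / 2 + (col - row);
}

// Mirror the packed entries into the full EB x EB matrix the interface expects.
void expand_to_full(const double (&packed)[SYM], double (&matrix)[EB][EB])
{
  for (int r = 0; r < EB; ++r) {
    for (int c = r; c < EB; ++c) {
      const double v = packed[sym_idx(r, c)];
      matrix[r][c] = v;
      matrix[c][r] = v;
    }
  }
}
```

Whether the final expansion step gives back the storage savings is exactly what the follow-up study would measure.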
Did a quick study measuring the performance of an implementation that has no 12x12 matrices. It's indistinguishable from the performance seen in this MR.
I would not say that is a statistically significant difference (it could be within sampling tolerance).