The performance curve of parallel GEMM with many cores shows significant up-down #5172

Open
nakagawa-fj opened this issue Mar 7, 2025 · 4 comments · May be fixed by #5173

Comments

@nakagawa-fj

When we measured SGEMM performance on a 64-core ARM Neoverse V1,
the performance curve was not smooth and showed significant ups and downs, as shown in the following graph.
For example, performance at m=n=k=1100 drops by about 35% compared to the data points before and after it.
The size of these fluctuations depends on the number of parallel threads and the data size,
which leads me to believe that the issue is caused by the thread partitioning logic. I am planning to modify it.

Image
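
As a rough illustration of how thread partitioning alone can produce a sawtooth like this, the sketch below splits one matrix dimension across threads in chunks rounded up to a fixed granule. This is a toy model, not OpenBLAS's actual partitioning code: the function names and the granule of 8 are assumptions chosen for illustration, and the numbers it prints are not measurements.

```python
import math

def partition(n, threads, granule):
    """Split n columns into chunks whose width is a multiple of granule.

    Hypothetical scheme: the even share per thread is rounded up to the
    next multiple of granule, so some threads may receive no work.
    """
    width = math.ceil(n / threads)
    width = math.ceil(width / granule) * granule
    chunks = []
    start = 0
    while start < n:
        end = min(start + width, n)
        chunks.append(end - start)
        start = end
    return chunks

def efficiency(n, threads, granule):
    """Ideal per-thread work divided by the largest chunk actually assigned."""
    chunks = partition(n, threads, granule)
    return n / (threads * max(chunks))

if __name__ == "__main__":
    for n in (1000, 1024, 1100, 1152):
        busy = len(partition(n, 64, 8))
        print(f"n={n}: {busy} of 64 threads busy, "
              f"efficiency {efficiency(n, 64, 8):.2f}")
```

In this model, n=1024 keeps all 64 threads busy, while n=1100 leaves some threads without work, so the modeled efficiency changes abruptly between nearby sizes.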

@martin-frbg
Collaborator

Thanks. You are probably aware of yamazakimitsufumi's previous work in #4655?

@martin-frbg
Collaborator

Also note that the fix in #5133 for a misguided change I had made in #4920 was not in 0.3.29 yet. That change could cause extra slowdown and made the IBM folks really unhappy.

@nakagawa-fj
Author

> Thanks. You are probably aware of yamazakimitsufumi's previous work in #4655?

Yes, I know about yamazakimitsufumi's work in #4655.

> Also note that the fix in #5133 for a misguided change I had made in #4920 was not in 0.3.29 yet - this could cause extra slowdown and made the IBM folks really unhappy

Thanks for the note. I became aware of the changes in #5133 and #4920 through this work. I was surprised to hear that #4920 had an impact at IBM. Once my modification is ready, it may need to be reviewed by IBM as well.

@brada4
Contributor

brada4 commented Mar 9, 2025

The data size involved is about 12MB, so some cache is likely being exceeded.
Can you overlay the sawtooth curves across the full 2**n range of core counts?
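
For reference, the footprint estimate can be checked with a short calculation. This is a minimal sketch assuming three dense single-precision matrices (A, B and C) with no padding; the helper name is hypothetical.

```python
def gemm_footprint_bytes(m, n, k, elem_bytes=4):
    """Bytes occupied by A (m x k), B (k x n) and C (m x n) stored densely."""
    return (m * k + k * n + m * n) * elem_bytes

if __name__ == "__main__":
    for size in (1000, 1100):
        total = gemm_footprint_bytes(size, size, size)
        print(f"m=n=k={size}: {total / 1e6:.2f} MB")
    # prints:
    # m=n=k=1000: 12.00 MB
    # m=n=k=1100: 14.52 MB
```

Under these assumptions the 12MB figure corresponds to m=n=k=1000, and the footprint at m=n=k=1100 is somewhat larger.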

@nakagawa-fj nakagawa-fj linked a pull request Mar 11, 2025 that will close this issue
3 participants