Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI test failures #55

Closed
wants to merge 26 commits into from
Closed

Fix CI test failures #55

wants to merge 26 commits into from

Conversation

xuzhao9
Copy link
Contributor

@xuzhao9 xuzhao9 commented Nov 18, 2024

The unit test workflow seems to hang needs to be fixed: https://github.com/pytorch-labs/tritonbench/actions/runs/11898546601/job/33155282740

This PR rewrites the unit test function to run each test in an individual subprocess.

@xuzhao9 xuzhao9 temporarily deployed to docker-s3-upload November 18, 2024 21:43 — with GitHub Actions Inactive
@xuzhao9 xuzhao9 changed the title Fix testing failure Fix CI test failures Nov 18, 2024
@xuzhao9 xuzhao9 temporarily deployed to docker-s3-upload November 19, 2024 00:37 — with GitHub Actions Inactive
@@ -110,7 +110,7 @@ def triton_flash_v2(
triton_q, triton_k, triton_v = self.triton_preprocess(q, k, v)
# full fp8 will be enabled if type of q,k,v is fp8
return lambda: triton_attention(
triton_q, triton_k, triton_v, False, self.sm_scale
triton_q, triton_k, triton_v, False, self.sm_scale, "base"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @manman-ren attention_opt will compile error on the pytorch version of Triton, does it require the latest Triton main branch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the breakage. What is the error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manman-ren here is the error message: https://github.com/pytorch-labs/tritonbench/actions/runs/11903695593/job/33171153429?pr=55. By default we are using the pytorch built-in Triton in the CI.

@facebook-github-bot
Copy link
Contributor

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -29,9 +32,6 @@ jagged_layer_norm:
jagged_mean:
jagged_softmax:
jagged_sum:
layer_norm:
low_mem_dropout:
rms_norm:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Liger kernels require triton package rather than pytorch-triton. I assume triton is not conflict with pytorch-triton because pytorch-triton doesn't cover import triton. I tested in local environment and it works well. but not sure if this is a safe way to do so.

Copy link
Contributor Author

@xuzhao9 xuzhao9 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are planning to have separate tests for triton main and pytorch-triton. Our docker has two conda environments, pytorch and triton-main, so that they can be tested in the same docker.

Right now, we are only deploying tests against pytorch-triton. We will setup the triton main config as skip_tests_h100_triton_main.yaml.

Copy link
Member

@FindHao FindHao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@facebook-github-bot
Copy link
Contributor

@xuzhao9 merged this pull request in d9633be.

@xuzhao9 xuzhao9 deleted the xz9/skip-gather-gemv branch November 19, 2024 21:10
facebook-github-bot pushed a commit that referenced this pull request Nov 19, 2024
Summary:
As the test isolation is implemented in #55, we can now enable more operators in the CI.

Pull Request resolved: #56

Reviewed By: FindHao

Differential Revision: D66189246

Pulled By: xuzhao9

fbshipit-source-id: 22f01b2e5b64956f6e2985f87be785efc977e46b
xuzhao9 added a commit that referenced this pull request Nov 20, 2024
Summary:
As the test isolation is implemented in #55, we can now enable more operators in the CI.

Pull Request resolved: #56

Reviewed By: FindHao

Differential Revision: D66189246

Pulled By: xuzhao9

fbshipit-source-id: 22f01b2e5b64956f6e2985f87be785efc977e46b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants