GitHub Actions CI using EC2 GPU nodes #771
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds 4 GitHub Actions workflows that install or test
mamba_ssmon EC2 GPU nodes (using Open-Athena/ec2-gha):install.yaml: installmamba_ssmon an EC2 GPU instance (default:g4dn.xlarge)installs.yaml: runinstall.yamlon 6 recent versions of Mamba (2.2.{0,1,2,3post2,4,5})test.yaml: runmamba_ssmtests on an EC2 GPU instance (g5org6series)tests.yaml: runtest.yamlon HEAD, on ag5.2xlargeandg6.2xlargeExample runs
installs#12
tests#4
Test failures (
bfloat16precision)Both g5.2xlarge (A10G) and g6.2xlarge (L4) runs exhibited some bfloat16 precision failures with the original tolerances.
Resolution: Tests now pass with relaxed tolerances:
test_selective_state_update_with_batch_indices: rtol=0.09, atol=0.096 (was rtol=0.06, atol=0.06)test_chunk_state_varlen: rtol=0.01, atol=0.006 (was rtol=0.01, atol=0.003)Original failure details
g5.2xlarge (A10G) - 2 failures
test_selective_state_update_with_batch_indices[2048-64-True-itype2](rtol=0.06, atol=0.06)expected=1.156, got=1.242, abs_diff=0.086, rel_diff=7.4%expected=0.027, got=0.090, abs_diff=0.063, rel_diff=233%test_chunk_state_varlen[128-1-dtype2](rtol=0.01, atol=0.003)g6.2xlarge (L4) - 3 failures
test_selective_state_update_with_batch_indices[2064-32-True-itype2](rtol=0.06, atol=0.06)expected=0.318, got=0.236, abs_diff=0.082, rel_diff=25.8%test_selective_state_update_with_batch_indices[2064-64-True-itype2](rtol=0.06, atol=0.06)expected=0.006, got=-0.089, abs_diff=0.095, rel_diff=1583%(near-zero expected)expected=-1.109, got=-1.039, abs_diff=0.070, rel_diff=6.3%expected=0.957, got=0.887, abs_diff=0.070, rel_diff=7.3%test_selective_state_update_with_batch_indices[4096-64-True-itype2](rtol=0.06, atol=0.06)expected=-0.176, got=-0.250, abs_diff=0.074, rel_diff=42.0%These failures affected only 0.0015-0.012% of tensor elements and are within expected bfloat16 precision limits.
Installation issues
Installing without
--no-build-isolationpip install mamba_ssm==2.2.5(sans--no-build-isolation) succeeds, but older versions fail (cf. install#13)Pre-built wheels / PyTorch compatibility
I learned that it's important to get pre-built
mamba_ssmwheels (from GitHub Releases; they're not on PyPI):pip install 2.2.5job took 3m48s on 8/6, 52m on 8/8Motivation
I originally hit issues
pip installingmamba_ssmon EC2 GPU nodes, and wanted to understand this comment better:I made Open-Athena/ec2-gha for easier testing/verifying/MREs, and used it here in 2 GHAs.
Setup
I've set these GHA variables (on Open-Athena, but repo-level also OK):
See also example config scripts.