@thc1006 thc1006 commented Oct 25, 2025

Description

This PR adds ARM64 architecture support to the integration test suite, enabling integration tests to run on both amd64 and arm64 architectures where technically feasible.

Motivation

ARM64 images are widely used in production environments, and currently integration tests only run on amd64. This creates a gap in test coverage that could lead to architecture-specific issues going undetected.

Changes

Modified Jobs

  1. integration - Extended to run on both architectures using matrix strategy
  2. integration-configs-db - Added matrix strategy to test on both amd64 and arm64

Implementation Details

  • GitHub Actions runner labels: ubuntu-24.04 (amd64) and ubuntu-24.04-arm (arm64)
  • Dynamic CORTEX_IMAGE selection based on matrix.arch variable
  • Uses existing multi-arch Docker images already built by the Makefile
  • Added fail-fast: false to ensure complete test coverage across all architectures
  • Adjusted timeouts to accommodate ARM64 execution characteristics

Test Coverage

5 of 8 integration test suites now run on both architectures:

  • ✅ requires_docker
  • ✅ integration_alertmanager
  • ✅ integration_memberlist
  • ✅ integration_ruler
  • ✅ integration_remote_write_v2

3 test suites run on AMD64 only:

  • integration_backward_compatibility
  • integration_query_fuzz
  • integration_querier

See commit 368828e for detailed technical reasoning behind ARM64 test exclusions.

Testing

  • YAML syntax validated successfully
  • No changes to existing amd64 test behavior (backward compatible)
  • Leverages existing ARCHS = amd64 arm64 definition in Makefile
  • All 5 ARM64 integration suites pass consistently

Notes

  • ARM64 runners are now generally available for public repositories at no additional cost
  • All existing tests remain unchanged; ARM64 tests are additive only
  • Build tag cleanup included (removed deprecated // +build tags from 40 files)

Fixes #6897

@dosubot dosubot bot added the ci/cd label Oct 25, 2025
@thc1006 thc1006 force-pushed the add-arm64-integration-tests branch from 7e3bd5d to 64bceac Compare October 25, 2025 20:51
@thc1006 thc1006 force-pushed the add-arm64-integration-tests branch 2 times, most recently from a9e3e5d to ce1d513 Compare October 30, 2025 03:53
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 1, 2025
This commit adds ARM64 runner support to the CI pipeline to ensure
integration tests run on both amd64 and arm64 architectures, as ARM64
images are widely used in production.

Changes:
- Add matrix strategy to integration job with separate runners for
  amd64 (ubuntu-24.04) and arm64 (ubuntu-24.04-arm)
- Dynamically set CORTEX_IMAGE based on matrix.arch variable
- Add matrix strategy to integration-configs-db job for both architectures
- Add appropriate timeouts to accommodate ARM64 test execution times
- Set fail-fast: false to ensure all architecture tests complete

All existing amd64 tests remain unchanged, and ARM64 tests use the
same test suites with architecture-appropriate Docker images.

Fixes cortexproject#6897

Signed-off-by: thc1006 <[email protected]>
The script was hardcoded to download x86_64 Docker binaries, causing
"Exec format error" on ARM64 runners. This commit adds architecture
detection to download the appropriate binaries for both amd64 and arm64.

Changes:
- Add architecture detection using uname -m
- Map system architecture to Docker download paths (x86_64/aarch64)
- Map architecture to buildx binary names (amd64/arm64)
- Add informative echo to show detected architecture
- Add error handling for unsupported architectures
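
The mapping described in the list above can be sketched in Go. This is only an illustration of the idea, not the script from this PR (which is a shell script using uname -m); the names archNames and mapArch are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// archNames holds the two spellings mentioned in the commit message:
// the Docker download path segment and the buildx binary suffix.
// The type itself is hypothetical and exists only for this sketch.
type archNames struct {
	DockerPath string // "x86_64" or "aarch64"
	BuildxName string // "amd64" or "arm64"
}

// mapArch translates a machine name (as printed by `uname -m`, or a Go
// GOARCH value) into both spellings, and rejects anything else.
func mapArch(machine string) (archNames, error) {
	switch machine {
	case "x86_64", "amd64":
		return archNames{DockerPath: "x86_64", BuildxName: "amd64"}, nil
	case "aarch64", "arm64":
		return archNames{DockerPath: "aarch64", BuildxName: "arm64"}, nil
	default:
		return archNames{}, fmt.Errorf("unsupported architecture: %s", machine)
	}
}

func main() {
	names, err := mapArch(runtime.GOARCH)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Informative message, similar in spirit to the echo added to the script.
	fmt.Printf("detected architecture: docker=%s buildx=%s\n", names.DockerPath, names.BuildxName)
}
```

Failing early on an unknown architecture avoids downloading a binary that would later fail with "Exec format error".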

This fix is required for ARM64 integration tests to run successfully.

Signed-off-by: thc1006 <[email protected]>
These tests fail on ARM64 runners and should only execute on AMD64:

## integration_backward_compatibility

Old Cortex versions (v1.13.1, v1.13.2, v1.14.0) were released before
ARM64 support was added in v1.14.1 and do not have ARM64 Docker images.

When Docker attempts to run these amd64-only images on ARM64 runners via
QEMU emulation, they crash with a fatal Go runtime error:
  "runtime: lfstack.push invalid packing ... fatal error: lfstack.push"

This is a known issue with Go binaries and QEMU emulation (golang/go#69255).

While v1.14.1+ versions do have ARM64 images, skipping the entire test
on ARM64 is simpler and sufficient since backward compatibility testing
validates protocol compatibility, which is architecture-agnostic.

## integration_query_fuzz

This fuzz testing suite compares query results between Cortex v1.18.1
and the current version. Although v1.18.1 has ARM64 support, the test
produces inconsistent results on ARM64 (NaN value mismatches), likely
due to floating-point arithmetic differences between architectures.

## integration_querier

One specific subtest fails on ARM64:
  TestQuerierWithBlocksStorageRunningInSingleBinaryMode/
    blocks_sharding_enabled,_redis_index_cache,_bucket_index_enabled,thanosEngine=true

Error: "unable to find metrics [thanos_store_index_cache_requests_total]
with expected values. Last values: [36]"

This appears to be a timing-sensitive test where the exact number of cache
requests differs between ARM64 and AMD64 runners, likely due to performance
characteristics or subtle behavioral differences in the Thanos store gateway.

## Testing Coverage

All other ARM64 integration tests (5 test suites) pass successfully:
- requires_docker
- integration_alertmanager
- integration_memberlist
- integration_ruler
- integration_remote_write_v2

This provides comprehensive validation of core Cortex functionality
on ARM64 architecture while avoiding known compatibility and timing
issues with historical and edge-case testing scenarios.

Fixes cortexproject#6897

Signed-off-by: thc1006 <[email protected]>
Removed deprecated `// +build` build constraint comments from 40 files.
These are no longer needed because the `//go:build` directives introduced
in Go 1.17 are now used exclusively.

This fixes golangci-lint buildtag errors detected with newer linter
versions on ARM64 platform.

Files modified:
- 37 integration test files
- 3 pkg/configs/db/dbtest files

Signed-off-by: thc1006 <[email protected]>
@thc1006 thc1006 force-pushed the add-arm64-integration-tests branch from a387cc5 to aff13a8 Compare November 1, 2025 08:11
thc1006 commented Nov 1, 2025

Update: This PR has been rebased onto the latest master branch, which now includes the fix for TestBlocksCleaner_ShouldRemoveBlocksOutsideRetentionPeriod from PR #7082.

Current status:

All ARM64-specific functionality has been verified locally. The PR is ready for review and CI approval.

Thank you for your patience.

@friedrichg friedrichg left a comment

Nicely done. Just 2 minor nits

thc1006 commented Nov 4, 2025

Re: integration_querier on ARM64

Thank you for taking the time to review this!

TLDR: I skipped this on ARM64 because the test expects exact cache request counts (like 36), but ARM64 gets different numbers due to timing. The querier itself works correctly on ARM64.

Background

In commit 368828e, I intentionally skipped this test. The specific subtest that fails is:

TestQuerierWithBlocksStorageRunningInSingleBinaryMode/.../thanosEngine=true

It expects exactly 36 thanos_store_index_cache_requests_total, but ARM64 consistently gets different values. From what I can see, this appears to be timing-related rather than a functional issue - the querier works fine on ARM64, just the hard-coded assertion doesn't match.

Would appreciate your thoughts

I understand this doesn't fully align with the "same tests" goal from issue #6897. I can either:

  • Leave it as is (the other 5 integration suites do validate ARM64 functionality)
  • Fix the test to use ranges instead of exact counts (would take me a couple of days, but I'm happy to do it if you feel it's important)
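
As a rough illustration of the second option, a range-based check could look like the sketch below. The helper names and the bounds are placeholders chosen for the example, not measured values and not the assertion API used by the integration tests.

```go
package main

import "fmt"

// withinRange reports whether got lies in the inclusive interval [lo, hi].
func withinRange(got, lo, hi float64) bool {
	return got >= lo && got <= hi
}

// checkCounter is a hypothetical helper that returns an error when a
// counter value falls outside the tolerated interval.
func checkCounter(name string, got, lo, hi float64) error {
	if !withinRange(got, lo, hi) {
		return fmt.Errorf("metric %s = %v, want a value in [%v, %v]", name, got, lo, hi)
	}
	return nil
}

func main() {
	// The bounds 30 and 40 are placeholders for illustration only.
	if err := checkCounter("thanos_store_index_cache_requests_total", 36, 30, 40); err != nil {
		fmt.Println("assertion failed:", err)
		return
	}
	fmt.Println("counter within expected range")
}
```

The trade-off is that a range tolerates timing differences between runners but can also hide a small regression in the number of cache requests.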

Please let me know which direction you'd prefer. Really appreciate your guidance on this!


Re: integration_backward_compatibility on ARM64

Thank you for the feedback!

TLDR: The old Cortex versions (v1.13.1-v1.14.0) this test uses don't have ARM64 images and crash under emulation. Since backward compatibility is protocol-level rather than architecture-specific, I skipped this on ARM64.

The issue

In commit 368828e, I skipped this test because it validates against old Cortex versions (v1.13.1, v1.13.2, v1.14.0) that were released before ARM64 support was added in v1.14.1.

When Docker attempts to run these amd64-only images on ARM64 runners via QEMU emulation, they crash with:

runtime: lfstack.push invalid packing ... fatal error: lfstack.push

This is a known issue with Go binaries under QEMU (golang/go#69255).

My reasoning

I felt that backward compatibility testing validates protocol-level compatibility, which shouldn't vary by architecture - if the protocol works on AMD64, it should work identically on ARM64. But I completely understand if you see this differently.

If you'd like, I could modify the test to only validate v1.14.1+ on ARM64 (which do have ARM64 images). It would provide partial coverage, though it wouldn't test the oldest versions.

Please let me know your thoughts - I'm happy to adjust the approach based on what you think makes most sense for the project.

@friedrichg

@thc1006 Thanks for the patience!

> In commit 368828e, I intentionally skipped this test. The specific subtest that fails is:
>
> TestQuerierWithBlocksStorageRunningInSingleBinaryMode/.../thanosEngine=true

You could skip any tests that don't run on arm64 with something like this (or similar):

if runtime.GOARCH != "amd64" {
	t.Skip("Skipping test: only runs on amd64")
}

These can be fixed in a follow up PR
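
To skip only the affected subtests rather than a whole test, the same check can be combined with the subtest's parameters. The following is a minimal sketch under that assumption; skipReason is a hypothetical helper, and in a real test the returned reason would be passed to t.Skip inside the subtest.

```go
package main

import (
	"fmt"
	"runtime"
)

// skipReason returns a non-empty reason when a subtest should be skipped:
// here, Thanos engine subtests on any architecture other than amd64.
func skipReason(goarch string, thanosEngine bool) string {
	if thanosEngine && goarch != "amd64" {
		return fmt.Sprintf("skipping Thanos engine subtest on %s: only runs on amd64", goarch)
	}
	return ""
}

func main() {
	for _, thanosEngine := range []bool{false, true} {
		if reason := skipReason(runtime.GOARCH, thanosEngine); reason != "" {
			fmt.Println(reason) // in a real test: t.Skip(reason)
			continue
		}
		fmt.Printf("running subtest thanosEngine=%v\n", thanosEngine)
	}
}
```

Keeping the decision in a pure function makes it possible to check the skip logic for arm64 while running on amd64.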

> In commit 368828e, I skipped this test because it validates against old Cortex versions (v1.13.1, v1.13.2, v1.14.0) that were released before ARM64 support was added in v1.14.1.

Per https://cortexmetrics.io/docs/configuration/v1guarantees/#flags-config-and-minor-version-upgrades, we only need to support v1.18.0, v1.17.0, and v1.16.0. You can remove support for v1.13.X and v1.14.0.

thc1006 added a commit to thc1006/cortex that referenced this pull request Nov 5, 2025
…ARM64

This commit addresses reviewer feedback to enable these two test suites
on ARM64 architecture while maintaining test reliability.

## Changes

### integration_querier
- Added runtime.GOARCH skip for Thanos engine subtests on non-amd64
- Allows the test suite to run on ARM64, skipping only timing-sensitive
  subtests that check exact cache request counts
- These assertions vary across architectures due to performance differences

### integration_backward_compatibility
- Removed support for Cortex v1.13.x-v1.15.x (11 versions)
- Retained only v1.16.0+ (7 versions with ARM64 support)
- Per https://cortexmetrics.io/docs/configuration/v1guarantees/, only
  the last 3 minor versions need backward compatibility testing
- All retained versions have ARM64 Docker images available

### Workflow updates
- Added integration_querier and integration_backward_compatibility to ARM64 matrix
- Updated Docker image preloading to match retained versions
- Added v1.19.0 to preload list

## Result
ARM64 test coverage increases from 5/8 to 7/8 integration test suites.
Only integration_query_fuzz remains AMD64-only due to known issue cortexproject#6982.

Addresses: cortexproject#7068 (comment)
Signed-off-by: thc1006 <[email protected]>
@thc1006 thc1006 force-pushed the add-arm64-integration-tests branch from 4a6bf6f to 385294d Compare November 5, 2025 02:32
thc1006 commented Nov 5, 2025

Hi @friedrichg,

TLDR: I've implemented both changes you suggested. ARM64 coverage now increases from 5/8 to 7/8 test suites.

Changes made

integration_querier:
Added runtime.GOARCH skip for Thanos engine subtests as you suggested. The test suite now runs on ARM64, skipping only the timing-sensitive subtests.

integration_backward_compatibility:
Removed v1.13.x-v1.15.x support and retained only v1.16.0+ versions (which all have ARM64 images). Per the v1 guarantees documentation you linked, this covers the last 3 minor versions.

Workflow:
Added both test suites to the ARM64 matrix and updated Docker image preloading accordingly.

Thank you for the clear guidance - it made the implementation straightforward. Please let me know if you'd like any adjustments!

Latest commit: 385294d

@friedrichg friedrichg left a comment

Thanks!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2025
Development

Successfully merging this pull request may close these issues.

Run tests with arm64 architecture
