Monitoring_Tools_on_Graviton.md (+3 −3)
@@ -108,7 +108,7 @@ One can collect hardware events/ counters for an application, on a specific CPU,
More details on how to use the Linux perf utility on AWS Graviton processors are available [here](https://github.com/aws/aws-graviton-getting-started/blob/main/optimizing.md#profiling-the-code).
## Summary: Utilities on AWS Graviton vs. Intel x86 architectures
-|Processor |x86 |Graviton2,3 |
+|Processor |x86 |Graviton2, 3, and 4|
|--- |--- |--- |
|CPU frequency listing |*lscpu, /proc/cpuinfo, dmidecode*|*dmidecode*|
|*turbostat* support |Yes |No |
@@ -117,12 +117,12 @@ More details on how to use Linux perf utility on AWS Graviton processors is avai
Utilities such as *lmbench* are available [here](http://lmbench.sourceforge.net/) and can be built for AWS Graviton processors to obtain latency and bandwidth stats.

**Notes**:

**1.** The ARM Linux kernel community has decided not to put CPU frequency in _/proc/cpuinfo_, which can be read by tools such as _lscpu_ or directly.

-**2.** On AWS Graviton 2/3 processors, Turbo isn’t supported. So, utilities such as ‘turbostat’ aren’t supported/ relevant for Arm architecture (and not on AWS Graviton processor either). Also, tools such as *[i7z](https://code.google.com/archive/p/i7z/)* for discovering CPU frequency, turbo, sockets and other information are only supported on Intel architecture/ processors. Intel *MLC* is a memory latency checker utility that is only supported on Intel processors.
+**2.** On AWS Graviton processors, Turbo isn’t supported, so utilities such as ‘turbostat’ aren’t supported or relevant. Also, tools such as *[i7z](https://code.google.com/archive/p/i7z/)* for discovering CPU frequency, turbo, sockets, and other information are only supported on Intel processors. Intel *MLC* is a memory latency checker utility that is only supported on Intel processors.
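As a quick illustration of the table and notes above, a hedged way to read the CPU speed on a Graviton instance with *dmidecode*, since the frequency is not exposed in _/proc/cpuinfo_:

```bash
# Sketch: read the advertised processor speed on a Graviton instance.
# dmidecode requires root; the exact fields reported can vary by platform.
sudo dmidecode -t processor | grep -i speed
```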
-GCC's `-moutline-atomics` flag produces a binary that runs on both Graviton and
-Graviton2. Supporting both platforms with the same binary comes at a small
+GCC's `-moutline-atomics` flag produces a binary that runs on both Graviton1 and later
+Gravitons with LSE support. Supporting both platforms with the same binary comes at a small
extra cost: one load and one branch. To check that an application
has been compiled with `-moutline-atomics`, the `nm` command line utility displays
the name of functions and global variables in an application binary. The boolean
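A minimal, hedged way to run that check (the binary name is illustrative; `__aarch64_have_lse_atomics` is assumed here as the run-time flag used by GCC's outline-atomics helpers):

```bash
# Sketch: confirm a binary carries the outline-atomics run-time check symbol.
nm ./my_app | grep __aarch64_have_lse_atomics
```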
@@ -152,16 +155,16 @@ if (feof(stdin)) {
}
```

-### Using Graviton2 Arm instructions to speed-up Machine Learning
+### Using Arm instructions to speed-up Machine Learning

-Graviton2 processors been optimized for performance and power efficient machine learning by enabling [Arm dot-product instructions](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions) commonly used for Machine Learning (quantized) inference workloads, and enabling [Half precision floating point - \_float16](https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-intrinsics) to double the number of operations per second, reducing the memory footprint compared to single precision floating point (\_float32), while still enjoying large dynamic range.
+Graviton2 and later processors have been optimized for performant and power-efficient machine learning by enabling [Arm dot-product instructions](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions), commonly used for (quantized) Machine Learning inference workloads, and by enabling [Half precision floating point - \_float16](https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-intrinsics) to double the number of operations per second and reduce the memory footprint compared to single precision floating point (\_float32), while still enjoying a large dynamic range.

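A quick, hedged check that these features are exposed on the instance you are using (the flag names are the standard Linux `/proc/cpuinfo` feature names for dot-product and half-precision support):

```bash
# Sketch: print the ML-related CPU feature flags if the processor reports them.
grep -o -E 'asimddp|asimdhp|fphp' /proc/cpuinfo | sort -u
```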
### Using SVE

The scalable vector extensions (SVE) require both a new enough tool-chain to
auto-vectorize to SVE (GCC 11+, LLVM 14+) and a 4.15+ kernel that supports SVE.
One notable exception is that Amazon Linux 2 with a 4.14 kernel doesn't support SVE;
-please upgrade to a 5.4+ AL2 kernel.
+please upgrade to a 5.4+ AL2 kernel. Graviton3 and Graviton4 support SVE; earlier Gravitons do not.

### Using Arm instructions to speed-up common code sequences
The Arm instruction set includes instructions that can be used to speed up common
dpdk_spdk.md (+5 −5)
@@ -1,18 +1,18 @@
-# DPDK, SPDK, ISA-L supports Graviton2
+# DPDK, SPDK, ISA-L support Graviton

-Graviton2 is optimized for data path functions like networking and storage. Users of [DPDK](https://github.com/dpdk/dpdk) and [SPDK](https://github.com/spdk/spdk) can download and compile natively on Graviton2 following the normal installation guidelines from the respective repositories linked above.
+Graviton2 and later CPUs are optimized for data path functions like networking and storage. Users of [DPDK](https://github.com/dpdk/dpdk) and [SPDK](https://github.com/spdk/spdk) can download and compile natively on Graviton following the normal installation guidelines from the respective repositories linked above.

**NOTE**: *Though DPDK precompiled packages are available from Ubuntu, we recommend building them from source.*

-SPDK relies often on [ISA-L](https://github.com/intel/isa-l) which is already optimized for Arm64 and the CPU cores in Graviton2.
+SPDK often relies on [ISA-L](https://github.com/intel/isa-l), which is already optimized for Arm64 and the CPU cores in Graviton2 and later processors.

## Compile DPDK from source

[DPDK official guidelines](https://doc.dpdk.org/guides/linux_gsg/build_dpdk.html) require using *meson* and *ninja* to build from source code.

-A native compilation of DPDK on top of Graviton2 will generate optimized code that take advantage of the CRC and Crypto instructions in Graviton2 cpu cores.
+A native compilation of DPDK on top of Graviton will generate optimized code that takes advantage of the CRC and Crypto instructions in Graviton2 and later CPU cores.

**NOTE**: Some of the installation steps call "python", which may not be a valid command in a modern Linux distribution; you may need to install *python-is-python3* to resolve this.

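A hedged sketch of the native build flow described above (package names follow Ubuntu conventions and the clone location is illustrative; adjust for your distribution):

```bash
# Sketch: native DPDK build on a Graviton instance with meson/ninja.
sudo apt-get install -y build-essential python3-pyelftools libnuma-dev meson ninja-build
git clone https://github.com/dpdk/dpdk.git
cd dpdk
meson setup build    # native build; the toolchain targets the host Arm cores and their CRC/crypto support
ninja -C build
sudo ninja -C build install
```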
@@ -35,5 +35,5 @@ Some application, written with the x86 architecture in mind, set the active dpdk
## Known issues

-***testpmd:** The flowgen function of testpmd does not work correctly when compiled with GCC 9 and above. It generates IP packets with wrong checksum which are dropped when transmitted between AWS instances (including Graviton2). This is a known issue and there is a [patch](https://patches.dpdk.org/patch/84772/) that fixes it.
+***testpmd:** The flowgen function of testpmd does not work correctly when compiled with GCC 9 and above. It generates IP packets with wrong checksums, which are dropped when transmitted between AWS instances (including Graviton). This is a known issue and there is a [patch](https://patches.dpdk.org/patch/84772/) that fixes it.
managed_services.md (+5 −5)
@@ -5,7 +5,7 @@ Note: You can always find the latest Graviton announcements via these [What's Ne
Service | Status | Resources |
:-: | :-: | --- |
[AWS App Mesh](https://aws.amazon.com/app-mesh/) | GA | What's New: [AWS App Mesh now supports ARM64-based Envoy Images](https://aws.amazon.com/about-aws/whats-new/2021/11/aws-app-mesh-arm64-envoy-images/) |
-[Amazon Aurora](https://aws.amazon.com/rds/aurora/) | GA | What's New: [Achieve up to 35% better price/performance with Amazon Aurora using new Graviton2 instances](https://aws.amazon.com/about-aws/whats-new/2021/03/achieve-up-to-35-percent-better-price-performance-with-amazon-aurora-using-new-graviton2-instances/)<br>Related blog: [Key considerations in moving to Graviton2 for Amazon RDS and Amazon Aurora databases](https://aws.amazon.com/blogs/database/key-considerations-in-moving-to-graviton2-for-amazon-rds-and-amazon-aurora-databases/)<br>For supported instance types and database engine versions see [Aurora DB Instances](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.DBInstanceClass.html) |
+[Amazon Aurora](https://aws.amazon.com/rds/aurora/) | GA | What's New: [Amazon Aurora MySQL and PostgreSQL support for Graviton3 based R7g instance family](https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-aurora-mysql-postgresql-graviton3-based-r7g-instance-family/), [Achieve up to 35% better price/performance with Amazon Aurora using new Graviton2 instances](https://aws.amazon.com/about-aws/whats-new/2021/03/achieve-up-to-35-percent-better-price-performance-with-amazon-aurora-using-new-graviton2-instances/)<br>Related blog: [Key considerations in moving to Graviton2 for Amazon RDS and Amazon Aurora databases](https://aws.amazon.com/blogs/database/key-considerations-in-moving-to-graviton2-for-amazon-rds-and-amazon-aurora-databases/)<br>For supported instance types and database engine versions see [Aurora DB Instances](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.DBInstanceClass.html) |
[Amazon EC2 Auto Scaling](https://aws.amazon.com/ec2/autoscaling/) | GA | What's New: [Amazon EC2 Auto Scaling announces support for multiple launch templates for Auto Scaling groups](https://aws.amazon.com/about-aws/whats-new/2020/11/amazon-ec2-auto-scaling-announces-support-for-multiple-launch-templates-for-auto-scaling-groups/)<br>Associated blog: [Supporting AWS Graviton2 and x86 instance types in the same Auto Scaling group](https://aws.amazon.com/blogs/compute/supporting-aws-graviton2-and-x86-instance-types-in-the-same-auto-scaling-group/)
[AWS Batch](https://aws.amazon.com/batch/) | GA | Blog: [Target cross-platform Go builds with AWS CodeBuild Batch builds](https://aws.amazon.com/blogs/devops/target-cross-platform-go-builds-with-aws-codebuild-batch-builds/) |
[AWS CodeBuild](https://aws.amazon.com/codebuild/) | GA | What's New: [AWS CodeBuild supports Arm-based workloads using AWS Graviton2](https://aws.amazon.com/about-aws/whats-new/2021/02/aws-codebuild-supports-arm-based-workloads-using-aws-graviton2/) |
@@ -15,8 +15,8 @@ Service | Status | Resources |
[Amazon ECS](https://aws.amazon.com/ecs/) | GA | [Amazon ECS-optimized AMIs](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html) |
[Amazon EKS](https://aws.amazon.com/eks/) | GA | What's New: [Amazon EKS support for Arm-based instances powered by AWS Graviton is now generally available](https://aws.amazon.com/about-aws/whats-new/2020/08/amazon-eks-support-for-arm-based-instances-powered-by-aws-graviton-now-generally-available/)<br>Launch Blog: [Amazon EKS on AWS Graviton2 generally available: considerations on multi-architecture apps](https://aws.amazon.com/blogs/containers/eks-on-graviton-generally-available/) |
-[Amazon ElastiCache](https://aws.amazon.com/elasticache/) | GA | What's New: [Amazon ElastiCache now supports M6g and R6g Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-elasticache-now-supports-m6g-and-r6g-graviton2-based-instances/)<br> What's New: [Amazon ElastiCache now supports T4g Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-elasticache-supports-t4g-graviton2-based-instances/) |
-[Amazon EMR](https://aws.amazon.com/emr/) | GA | What's New: [Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-emr-now-provides-up-to-30-lower-cost-and-up-to-15-improved-performance/)<br>Launch Blog: [Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances](https://aws.amazon.com/blogs/big-data/amazon-emr-now-provides-up-to-30-lower-cost-and-up-to-15-improved-performance-for-spark-workloads-on-graviton2-based-instances/) |
+[Amazon ElastiCache](https://aws.amazon.com/elasticache/) | GA | What's New: [Amazon ElastiCache now supports M7g and R7g Graviton3-based nodes](https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-elasticache-m7g-r7g-graviton-3-nodes/), [Amazon ElastiCache now supports M6g and R6g Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-elasticache-now-supports-m6g-and-r6g-graviton2-based-instances/)<br> What's New: [Amazon ElastiCache now supports T4g Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-elasticache-supports-t4g-graviton2-based-instances/) |
+[Amazon EMR](https://aws.amazon.com/emr/) | GA | What's New: [Amazon EMR now supports Amazon EC2 C7g (Graviton3) instances](https://aws.amazon.com/about-aws/whats-new/2023/03/amazon-emr-amazon-ec2-c7g-graviton3-instances/), [Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances](https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-emr-now-provides-up-to-30-lower-cost-and-up-to-15-improved-performance/)<br>Launch Blog: [Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances](https://aws.amazon.com/blogs/big-data/amazon-emr-now-provides-up-to-30-lower-cost-and-up-to-15-improved-performance-for-spark-workloads-on-graviton2-based-instances/) |
[Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) | GA | What's New: [Announcing AWS Graviton2 support for Amazon EMR Serverless - Get up to 35% better price-performance for your serverless Spark and Hive workload](https://aws.amazon.com/about-aws/whats-new/2022/11/aws-graviton2-emr-serverless-35-percent-price-performance-spark-hive-workloads/) |
[AWS Fargate](https://aws.amazon.com/fargate/) | GA | Launch Blog: [Announcing AWS Graviton2 Support for AWS Fargate – Get up to 40% Better Price-Performance for Your Serverless Containers](https://aws.amazon.com/blogs/aws/announcing-aws-graviton2-support-for-aws-fargate-get-up-to-40-better-price-performance-for-your-serverless-containers/) |
[Amazon Gamelift](https://aws.amazon.com/gamelift/) | GA | Launch Blog: [Now available: New Asia Pacific (Osaka) region and Graviton2 support for Amazon GameLift](https://aws.amazon.com/blogs/gametech/now-available-new-asia-pacific-osaka-region-and-graviton2-support-for-amazon-gamelift/)<br>Addition of Graviton3: [Announcing Amazon GameLift support for instances powered by AWS Graviton3 processors](https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-gamelift-instances-aws-graviton-3-processors/)|
@@ -25,5 +25,5 @@ Service | Status | Resources |
[Amazon Managed Streaming for Apache Kafka (MSK)](https://aws.amazon.com/msk/) | GA | What's New: [Amazon MSK now supports Graviton3-based M7g instances for new provisioned clusters](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-msk-graviton3-m7g-instances-provisioned-clusters/) |
[Amazon Neptune](https://aws.amazon.com/neptune/) | GA | What's New: [Announcing AWS Graviton2-based instances for Amazon Neptune](https://aws.amazon.com/about-aws/whats-new/2021/11/aws-graviton2-based-instances-amazon-neptune/) |
[Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/) | GA | What's New: [Amazon Elasticsearch Service now offers AWS Graviton2 (M6g, C6g, R6g, and R6gd) instances](https://aws.amazon.com/about-aws/whats-new/2021/05/amazon-elasticsearch-service-offers-aws-graviton2-m6g-c6g-r6g-r6gd-instances/)<br>Related blog: [Increase Amazon Elasticsearch Service performance by upgrading to Graviton2](https://aws.amazon.com/blogs/big-data/increase-amazon-elasticsearch-service-performance-by-upgrading-to-graviton2/)|
-[Amazon RDS](https://aws.amazon.com/rds/) | GA | What's New: [Achieve up to 52% better price/performance with Amazon RDS using new Graviton2 instances](https://aws.amazon.com/about-aws/whats-new/2020/10/achieve-up-to-52-percent-better-price-performance-with-amazon-rds-using-new-graviton2-instances/)<br>Launch Blog: [New – Amazon RDS on Graviton2 Processors](https://aws.amazon.com/blogs/aws/new-amazon-rds-on-graviton2-processors/)<br>Related blog: [Key considerations in moving to Graviton2 for Amazon RDS and Amazon Aurora databases](https://aws.amazon.com/blogs/database/key-considerations-in-moving-to-graviton2-for-amazon-rds-and-amazon-aurora-databases/)<br>For supported instance types and database engine versions see [RDS DB Instances](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html) |
-[Amazon SageMaker](https://aws.amazon.com/pm/sagemaker/) | GA | What's New: [Amazon SageMaker adds eight new Graviton-based instances for model deployment](https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-sagemaker-adds-new-graviton-based-instances-model-deployment/) <br> Related blog: [Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/run-machine-learning-inference-workloads-on-aws-graviton-based-instances-with-amazon-sagemaker/)|
+[Amazon RDS](https://aws.amazon.com/rds/) | GA | What's New: [Amazon RDS now supports M7g and R7g database instances](https://aws.amazon.com/about-aws/whats-new/2023/04/amazon-rds-m7g-r7g-database-instances/), [Achieve up to 52% better price/performance with Amazon RDS using new Graviton2 instances](https://aws.amazon.com/about-aws/whats-new/2020/10/achieve-up-to-52-percent-better-price-performance-with-amazon-rds-using-new-graviton2-instances/)<br>Launch Blog: [New – Amazon RDS on Graviton2 Processors](https://aws.amazon.com/blogs/aws/new-amazon-rds-on-graviton2-processors/)<br>Related blog: [Key considerations in moving to Graviton2 for Amazon RDS and Amazon Aurora databases](https://aws.amazon.com/blogs/database/key-considerations-in-moving-to-graviton2-for-amazon-rds-and-amazon-aurora-databases/)<br>For supported instance types and database engine versions see [RDS DB Instances](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html) |
+[Amazon SageMaker](https://aws.amazon.com/pm/sagemaker/) | GA | What's New: [Amazon SageMaker adds eight new Graviton-based instances for model deployment](https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-sagemaker-adds-new-graviton-based-instances-model-deployment/) <br> Related blog: [Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/run-machine-learning-inference-workloads-on-aws-graviton-based-instances-with-amazon-sagemaker/), [Reduce Amazon SageMaker inference cost with AWS Graviton](https://aws.amazon.com/blogs/machine-learning/reduce-amazon-sagemaker-inference-cost-with-aws-graviton/)|
While TSO allows reads to occur out-of-order with writes and a processor to
@@ -54,16 +54,16 @@ is corresponding Arm code there too. If not, that might be something to improve.
We welcome suggestions by opening an issue in this repo.
### Lock/Synchronization intensive workload
-Graviton2 supports the Arm Large Scale Extensions (LSE). LSE based locking and synchronization
-is an order of magnitude faster for highly contended locks with high core counts (e.g. 64 with Graviton2).
+Graviton2 processors and later support the Arm Large System Extensions (LSE). LSE-based locking and synchronization
+is an order of magnitude faster for highly contended locks with high core counts (e.g. up to 192 cores on Graviton4).
For workloads that have highly contended locks, compiling with `-march=armv8.2-a` will enable LSE based atomics and can substantially increase performance. However, this will prevent the code
from running on an Arm v8.0 system such as AWS Graviton-based EC2 A1 instances.
With GCC 10 and newer an option `-moutline-atomics` will not inline atomics and
detect at run time the correct type of atomic to use. This is slightly worse
performing than `-march=armv8.2-a` but does retain backwards compatibility.
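To make the trade-off concrete, a hedged sketch of the two build options discussed above (source and output names are illustrative):

```bash
# Sketch: portable binary, atomic implementation selected at run time (GCC 10+).
gcc -O3 -moutline-atomics -o app_portable app.c

# Sketch: LSE atomics inlined; runs on Graviton2 and later but not on Armv8.0 (e.g. A1).
gcc -O3 -march=armv8.2-a -o app_lse app.c
```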
### Network intensive workloads
-In some workloads, the packet processing capability of Graviton2 is both faster and
+In some workloads, the packet processing capability of Graviton is both faster and
lower-latency than other platforms, which reduces the natural “coalescing”
capability of Linux kernel and increases the interrupt rate.
Depending on the workload it might make sense to enable adaptive RX interrupts
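One hedged way to do that with the ENA driver (the interface name is illustrative; confirm your driver and kernel support adaptive coalescing):

```bash
# Sketch: enable adaptive RX interrupt moderation and confirm the setting took effect.
sudo ethtool -C eth0 adaptive-rx on
ethtool -c eth0
```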
@@ -72,12 +72,29 @@ Depending on the workload it might make sense to enable adaptive RX interrupts
## Profiling the code
If you aren't getting the performance you expect, one of the best ways to understand what is
going on in the system is to compare profiles of execution and understand where the CPUs are
-spending time. This will frequently point to a hot function that could be optimized. A crutch
+spending time. This will frequently point to a hot function or sub-system that could be optimized. A crutch
is comparing a profile between a system that is performing well and one that isn't to see the
relative difference in execution time. Feel free to open an issue in this
GitHub repo for advice or help.
-Install the Linux perf tool:
+Using the [AWS APerf](https://github.com/aws/aperf) tool:
+```bash
+# Graviton
+wget -qO- https://github.com/aws/aperf/releases/download/v0.1.10-alpha/aperf-v0.1.10-alpha-aarch64.tar.gz | tar -xvz -C /target/directory
+
+# x86
+wget -qO- https://github.com/aws/aperf/releases/download/v0.1.10-alpha/aperf-v0.1.10-alpha-x86_64.tar.gz | tar -xvz -C /target/directory
+
+## Record a profile and generate a report
+cd /target/directory/
+./aperf record -r <RUN_NAME> -i <INTERVAL_NUMBER> -p <COLLECTION_PERIOD>
Redhat Enterprise Linux | 8.2 or later | Yes | 64KB | [MarketPlace](https://aws.amazon.com/marketplace/pp/B07T2NH46P) | Yes |
-~~Redhat Enterprise Linux~~ | ~~7.x~~ | ~~No~~ | ~~64KB~~ | ~~[MarketPlace](https://aws.amazon.com/marketplace/pp/B07KTFV2S8)~~ | | Supported on A1 instances but not on Graviton2 based ones
+~~Redhat Enterprise Linux~~ | ~~7.x~~ | ~~No~~ | ~~64KB~~ | ~~[MarketPlace](https://aws.amazon.com/marketplace/pp/B07KTFV2S8)~~ | | Supported on A1 instances but not on Graviton2 and later instances
AlmaLinux | 8.4 or later | Yes | 64KB | [AMIs](https://wiki.almalinux.org/cloud/AWS.html) | Yes |
Alpine Linux | 3.12.7 or later | Yes (*) | 4KB | [AMIs](https://www.alpinelinux.org/cloud/) | | (*) LSE enablement checked in version 3.14 |
CentOS | 8.2.2004 or later | No | 64KB | [AMIs](https://wiki.centos.org/Cloud/AWS#Images) | Yes | |
CentOS Stream | 8 | No (*) | 64KB (*) | [Downloads](https://www.centos.org/centos-stream/) | |(*) details to be confirmed once AMI's are available|
-~~CentOS~~ | ~~7.x~~ | ~~No~~ | ~~64KB~~ | ~~[AMIs](https://wiki.centos.org/Cloud/AWS#Images)~~ | | Supported on A1 instances but not on Graviton2 based ones
+~~CentOS~~ | ~~7.x~~ | ~~No~~ | ~~64KB~~ | ~~[AMIs](https://wiki.centos.org/Cloud/AWS#Images)~~ | | Supported on A1 instances but not on Graviton2 and later instances
Debian | 10 | [Planned for Debian 11](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956418) | 4KB | [Community](https://wiki.debian.org/Cloud/AmazonEC2Image/Buster) or [MarketPlace](https://aws.amazon.com/marketplace/pp/B085HGTX5J) | Yes, as of Debian 10.7 (2020-12-07) |
FreeBSD | 12.1 or later | No | 4KB | [Community](https://www.freebsd.org/releases/12.1R/announce.html) or [MarketPlace](https://aws.amazon.com/marketplace/pp/B081NF7BY7) | No | Device hotplug and API shutdown don't work
perfrunbook/README.md (+8 −3)
@@ -6,11 +6,11 @@ This document is a reference for software developers who want to benchmark, debu
This document covers many topics, including how to benchmark, how to debug performance, and recommended optimizations. It is not meant to be read beginning-to-end. Instead, view it as a collection of checklists and best known practices to apply when working with Graviton instances that go progressively deeper into analyzing the system. Please see the FAQ below to direct you towards the most relevant set of checklists and tools depending on your specific situation.
-If after following these guides there is still an issue you cannot resolve with regards to performance on Graviton2, please do not hesitate to raise an issue on the [AWS-Graviton-Getting-Started](https://github.com/aws/aws-graviton-getting-started/issues) guide or contact us at [ec2-arm-dev-feedback@amazon.com](mailto:ec2-arm-dev-feedback@amazon.com). If there is something missing in this guide, please raise an issue or better, post a pull-request.
+If after following these guides there is still an issue you cannot resolve with regard to performance on Graviton-based instances, please do not hesitate to raise an issue on the [AWS-Graviton-Getting-Started](https://github.com/aws/aws-graviton-getting-started/issues) guide or contact us at [ec2-arm-dev-feedback@amazon.com](mailto:ec2-arm-dev-feedback@amazon.com). If there is something missing in this guide, please raise an issue or, better, post a pull-request.
## Pre-requisites
-To assist with some of the tasks listed in this runbook, we have created some helper-scripts for some of the tasks the checklists describe. The helper-scripts assume the test instances are running an up-to-date AL2 or Ubuntu 20.04 LTS distribution and the user can run the scripts using `sudo`. Follow the steps below to obtain and install the utilities on your test systems:
+To assist with the tasks listed in this runbook, we have created helper-scripts for some of the tasks the checklists describe. The helper-scripts assume the test instances are running an up-to-date AL2, AL2023, or Ubuntu 20.04 LTS/22.04 LTS distribution and that the user can run the scripts using `sudo`. Follow the steps below to obtain and install the utilities on your test systems:
```bash
# Clone the repository onto your systems-under-test and any load-generation instances
# All scripts expect to run from the utilities directory
```
+## APerf for performance analysis
+
+There is also a new tool aimed at helping move workloads over to Graviton called [APerf](https://github.com/aws/aperf). It bundles many of the capabilities of the individual tools present in this
+runbook and provides a better presentation. It is highly recommended to download this tool and use it to gather most of the same information in one test-run.
+
## Sections
1. [Introduction to Benchmarking](./intro_to_benchmarking.md)
***I benchmarked my service and performance on Graviton is slower compared to my current x86 based fleet, where do I start to root cause why?**
Begin by verifying software dependencies and verifying the configuration of your Graviton and x86 testing environments to check that no major differences are present in the testing environment. Performance differences may be due to differences in environment and not due to the hardware. Refer to the below chart for a step-by-step flow through this runbook to help root cause the performance regression:

-***What are the recommended optimizations to try with Graviton2?**
+***What are the recommended optimizations to try with Graviton?**
Refer to [Section 6](./optimization_recommendation.md) for our recommendations on how to make your application run faster on Graviton.
***I investigated every optimization in this guide and still cannot find the root-cause, what do I do next?**
Please contact us at [ec2-arm-dev-feedback@amazon.com](mailto:ec2-arm-dev-feedback@amazon.com) or talk with your AWS account team representative to get additional help.
perfrunbook/appendix.md (+14 −8)
@@ -4,13 +4,17 @@
This Appendix contains additional information for engineers that want to go deeper on a particular topic, such as using different PMU counters to understand how the code is executing on the hardware, discussion on load generators, and additional tools to help with code observability.
-## Useful Graviton2 PMU Counters and ratios
+## Useful Graviton PMU Events and ratios
-The following list of counter ratios has been curated to list counters useful for performance debugging. The more extensive list of counters is contained in the following references:
+The following list of counter ratios has been curated to list events useful for performance debugging. The more extensive list of counters is contained in the following references:
perfrunbook/configuring_your_sut.md (+9 −9)
@@ -35,7 +35,7 @@ If you have more than one SUT, first verify there are no major differences in se
%> uname -r
4.14.219-161.340.amzn2.x86_64
-# Example output on Graviton2 SUT
+# Example output on Graviton SUT
%> uname -r
5.10.50-45.132.amzn2.aarch64
@@ -76,9 +76,9 @@ If you have more than one SUT, first verify there are no major differences in se
## Check for missing binary dependencies
-Libraries for Python or Java can link in binary shared objects to provide enhanced performance. The absence of these shared object dependencies does not prevent the application from running on Graviton2, but the CPU will be forced to use a slow code-path instead of the optimized paths. Use the checklist below to verify the same shared objects are available on all platforms.
+Libraries for Python or Java can link in binary shared objects to provide enhanced performance. The absence of these shared object dependencies does not prevent the application from running on Graviton, but the CPU will be forced to use a slow code-path instead of the optimized paths. Use the checklist below to verify the same shared objects are available on all platforms.
-1. JVM based languages — Check for the presence of binary shared objects in the installed JARs and compare between Graviton2 and x86.
+1. JVM based languages — Check for the presence of binary shared objects in the installed JARs and compare between Graviton and x86 (see the sketch below).
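A hedged example of that check (the JAR name is illustrative):

```bash
# Sketch: list native shared objects bundled in a JAR and look for aarch64 variants.
unzip -l mylib.jar | grep -E '\.so'
```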
@@ -130,14 +130,14 @@ Libraries for Python or Java can link in binary shared objects to provide enhanc
## Check native application build system and code
-For native compiled components of your application, proper compile flags are essential to make sure Graviton2’s hardware features are being fully taken advantage of. Follow the below checklist:
+For natively compiled components of your application, proper compile flags are essential to make sure Graviton’s hardware features are fully taken advantage of. Follow the checklist below:
1. Verify equivalent code optimizations are being made for Graviton as well as x86. For example with C/C++ code built with GCC, make sure if builds use `-O3` for x86, that Graviton builds also use that optimization and not some basic debug setting like just `-g`.
2. Confirm when building for Graviton that **one of the following flags** is added to the compile line for GCC/LLVM 12+ to ensure Large System Extension instructions are used, when possible, to speed up atomic operations.
-1. Use `-moutline-atomics` for code that must run on Graviton1 and Graviton2
-2. Use `-march=armv8.2a -mcpu=neoverse-n1` for code that will run on Graviton2 and other modern Arm platforms
+1. Use `-moutline-atomics` for code that must run on all Graviton platforms
+2. Use `-march=armv8.2a -mcpu=neoverse-n1` for code that will run on Graviton2 or later and other modern Arm platforms
3. When building natively for Rust, ensure that `RUSTFLAGS` is set to **one of the following flags** (see the sketch after this list):
-1. `export RUSTFLAGS="-Ctarget-features=+lse"` for code that will run on Graviton2 and earlier platforms that support LSE (Large System Extension) instructions.
+1. `export RUSTFLAGS="-Ctarget-features=+lse"` for code that will run on Graviton2 and later, and on other Arm platforms that support LSE (Large System Extension) instructions.
2. `export RUSTFLAGS="-Ctarget-cpu=neoverse-n1"` for code that will only run on Graviton2 and later platforms.
4. Check for the existence of assembly that is optimized for x86 but has no optimized equivalent for Graviton. For help with porting optimized assembly routines, see [Section 6](./optimization_recommendation.md).
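A hedged sketch of the Rust option from item 3 (the crate layout is illustrative; pick the `RUSTFLAGS` value that matches the oldest platform you must support):

```bash
# Sketch: build a Rust crate with LSE atomics enabled for Graviton2 and later.
export RUSTFLAGS="-Ctarget-cpu=neoverse-n1"
cargo build --release
```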
perfrunbook/debug_code_perf.md (+1 −2)
@@ -53,7 +53,7 @@ You may see a small single-digit percent increase in overhead with pseudo-NMI en
## Off-cpu profiling
-If Graviton2 is consuming less CPU-time than expected, it is useful to find call-stacks that are putting *threads* to sleep via the OS. Lock contention, IO Bottlenecks, OS scheduler issues can all lead to cases where performance is lower, but the CPU is not being fully utilized. The method to look for what might be causing more off-cpu time is the same as with looking for functions consuming more on-cpu time: generate a flamegraph and compare. In this case, the differences are more subtle to look for as small differences can mean large swings in performance as more thread sleeps can induce milli-seconds of wasted execution time.
+If Graviton is consuming less CPU-time than expected, it is useful to find call-stacks that are putting *threads* to sleep via the OS. Lock contention, IO bottlenecks, and OS scheduler issues can all lead to cases where performance is lower but the CPU is not fully utilized. The method for finding what might be causing more off-cpu time is the same as looking for functions consuming more on-cpu time: generate a flamegraph and compare. In this case, the differences are more subtle to spot, as small differences can mean large swings in performance: more thread sleeps can induce milliseconds of wasted execution time.
1. Verify native (i.e. C/C++/Rust) code is built with `-fno-omit-frame-pointer`
2. Verify java code is started with `-XX:+PreserveFramePointer -agentpath:/path/to/libperf-jvmti.so`
@@ -109,4 +109,3 @@ In our `capture_flamegraphs.sh` helper script, we use `perf record` to gather tr
1. Use `-e instructions` to generate a flame-graph of the functions that use the most instructions on average to identify a compiler or code optimization opportunity.
2. Use `-e cache-misses` to generate a flame-graph of functions that miss the L1 cache the most to indicate if changing to a more efficient data-structure might be necessary.
3. Use `-e branch-misses` to generate a flame-graph of functions that cause the CPU to mis-speculate. This may identify regions with heavy use of conditionals, or conditionals that are data-dependent and may be a candidate for refactoring.
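As a hedged illustration of capturing one of these profiles outside the helper script (assumes Brendan Gregg's FlameGraph scripts are cloned locally; the event and duration are illustrative):

```bash
# Sketch: system-wide profile on one event, folded into a flamegraph SVG.
sudo perf record -a -g -e branch-misses -- sleep 60
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > branch-misses.svg
```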
perfrunbook/debug_hw_perf.md (+2 −2)
@@ -16,7 +16,7 @@ There are hundreds of events available to monitor in a server CPU today which is
## How to Collect PMU counters
-A limited subset of PMU events for the CPU are available on Graviton \*6g, \*7g sizes <16xl, we recommend using a 16xl for experiments needing PMU events to get access to all of them. On 5th and 6th generation x86 instances use a single socket instance is needed to have access to the CPU PMU events: >c5.9xl, >\*5.12xl, >\*6i.16xl, >c5a.12xl, and >\*6a.24xl. On 7th generation x86 instances *7a and *7i, all sizes get access to a limited number of CPU PMU events, just like on Graviton instances, and full socket or larger instances (>\*7\*.24xl) get access to all PMU events.
+A limited subset of PMU events for the CPU is available on Graviton \*6g and \*7g sizes below 16xl; we recommend using a 16xl for experiments that need access to all PMU events. On Graviton \*8g, sizes >24xl have access to all the CPU PMU events. On 5th and 6th generation x86 instances, a single-socket instance is needed to have access to the CPU PMU events: >c5.9xl, >\*5.12xl, >\*6i.16xl, >c5a.12xl, and >\*6a.24xl. On 7th generation x86 instances *7a and *7i, all sizes get access to a limited number of CPU PMU events, just like on Graviton instances, and full socket or larger instances (>\*7\*.24xl) get access to all PMU events.
To measure the standard CPU PMU events, do the following:
@@ -121,7 +121,7 @@ To measure the standard CPU PMU events, do the following:
This checklist describes the top-down method to debug whether the hardware is under-performing and what part is underperforming. The checklist describes event ratios to check that are included in the helper-script. All ratios are in terms of either misses-per-1000(kilo)-instruction or per-1000(kilo)-cycles. This checklist aims to help guide whether a hardware slow down is coming from the front-end of the processor or the backend of the processor and then what particular part. The front-end of the processor is responsible for fetching and supplying the instructions. The back-end is responsible for executing the instructions provided by the front-end as fast as possible. A bottleneck in either part will cause stalls and a decrease in performance. After determining where the bottleneck may lie, you can proceed to [Section 6](./optimization_recommendation.md) to read suggested optimizations to mitigate the problem.
-1. Start by measuring `ipc` (Instructions per cycle) on each instance-type. A higher IPC is better. A lower number for `ipc` on Graviton2 compared to x86 indicates *that* there is a performance problem. At this point, proceed to attempt to root cause where the lower IPC bottleneck is coming from by collecting frontend and backend stall metrics.
+1. Start by measuring `ipc` (Instructions per cycle) on each instance-type. A higher IPC is better. A lower number for `ipc` on Graviton compared to x86 indicates *that* there is a performance problem. At this point, proceed to root cause where the lower IPC bottleneck is coming from by collecting frontend and backend stall metrics.
2. Next, measure `stall_frontend_pkc` and `stall_backend_pkc` (pkc = per kilo cycle) and determine which is higher. If stalls in the frontend are higher, it indicates the part of the CPU responsible for predicting and fetching the next instructions to execute is causing slow-downs. If stalls in the backend are higher, it indicates the machinery that executes the instructions and reads data from memory is causing slow-downs.
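A hedged starting point for steps 1 and 2 without the helper scripts, using the generic perf event aliases (the duration is illustrative):

```bash
# Sketch: sample IPC plus frontend/backend stall cycles system-wide for 30 seconds.
sudo perf stat -a -e instructions,cycles,stalled-cycles-frontend,stalled-cycles-backend -- sleep 30
```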
-2. If seeing bursts, verify this is expected behavior for your load generator. Bursts can cause performance degradation on Graviton2 if each new connection has to do an RSA signing operation for TLS connection establishment.
+2. If seeing bursts, verify this is expected behavior for your load generator. Bursts can cause performance degradation for each new connection, especially if it has to do an RSA signing operation for TLS connection establishment.
3. Check on SUT for hot connections (connections that are more heavily used than others) by running: `watch netstat -t`
4. The example below shows the use of `netstat -t` to watch TCP connections with one being hot as indicated by its non-zero `Send-Q` value while all other connections have a value of 0. This can lead to one core being saturated by network processing on the SUT, bottlenecking the rest of the system.
```bash
@@ -117,11 +117,10 @@ When running Java applications, monitor for differences in behavior using JFR (J
3. The image below shows JMC’s GC pane, showing pause times, heap size and references remaining after each collection.

4. The same information can be gathered by enabling GC logging and then processing the log output. Enter `-Xlog:gc*,gc+age=trace,gc+ref=debug,gc+ergo=trace` on the Java command line and re-start your application.
-5. If longer GC pauses are seen, this could be happening because objects are living longer on Graviton2 and the GC has to scan them. To help debug this gather an off-cpu profile ([see Section 5.b](./debug_code_perf.md)) to look for threads that are sleeping more often and potentially causing heap objects to live longer.
+5. If longer GC pauses are seen, this could be happening because objects are living longer on Graviton and the GC has to scan them. To help debug this, gather an off-cpu profile ([see Section 5.b](./debug_code_perf.md)) to look for threads that are sleeping more often and potentially causing heap objects to live longer.
6. Check for debug flags that are still enabled but should be disabled, such as: `-XX:-OmitStackTraceInFastThrow` which logs and generates stack traces for all exceptions, even if they are not fatal exceptions.
-7. Check there are no major differences in JVM ergonomics between Graviton2 and x86, run:
+7. Check there are no major differences in JVM ergonomics between Graviton and x86, run:
```bash
%> java -XX:+PrintFlagsFinal -version
-# Capture output from x86 and Graviton2 and then diff the files
+# Capture output from x86 and Graviton and then diff the files
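# A hedged sketch of that comparison (file names are illustrative):
%> java -XX:+PrintFlagsFinal -version > graviton_flags.txt   # repeat on the x86 SUT to produce x86_flags.txt
%> diff x86_flags.txt graviton_flags.txt                     # look for heap, GC, and thread ergonomics differences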
-When designing an experiment to benchmark Graviton2 against another instance type, it is key to remember the below 2 guiding principles:
+When designing an experiment to benchmark Graviton-based instances against another instance type, it is key to remember the two guiding principles below:
1. Always define a specific question to answer with your benchmark
2. Control your variables and unknowns within the benchmark environment
-This section describes multiple different optimization suggestions to try on Graviton2 instances to attain higher performance for your service. Each sub-section defines some optimization recommendations that can help improve performance if you see a particular signature after measuring the performance using the previous checklists.
+This section describes optimization suggestions to try on Graviton-based instances to attain higher performance for your service. Each sub-section defines optimization recommendations that can help improve performance if you see a particular signature after measuring performance using the previous checklists.
## Optimizing for large instruction footprint
@@ -35,6 +35,8 @@ allocating huge-pages.
2. For additional information on the vector instructions used on Graviton
3. Disable Receive Packet Steering (RPS) to avoid contention and extra IPIs.
-1. `cat /sys/class/net/ethN/queues/rx-N/rps_cpus` and verify they are set to `0`. In general RPS is not needed on Graviton2.
+1. `cat /sys/class/net/ethN/queues/rx-N/rps_cpus` and verify they are set to `0`. In general RPS is not needed on Graviton2 and newer.
2. You can try using RPS if your situation is unique. Read the [documentation on RPS](https://www.kernel.org/doc/Documentation/networking/scaling.txt) to understand further how it might help. Also refer to [Optimizing network intensive workloads on Amazon EC2 A1 Instances](https://aws.amazon.com/blogs/compute/optimizing-network-intensive-workloads-on-amazon-ec2-a1-instances/) for concrete examples.
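A hedged one-liner to run that check across all RX queues (the interface name is illustrative):

```bash
# Sketch: print the RPS CPU mask for every RX queue of eth0; each should be 0.
for f in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo "$f: $(cat "$f")"; done
```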
## Metal instance IO optimizations
-1. If on Graviton2 metal instances, try disabling the System MMU (Memory Management Unit) to speed up IO handling:
+1. If on Graviton2 and newer metal instances, try disabling the System MMU (Memory Management Unit) to speed up IO handling: