- Please see https://llvm.org/docs/AMDGPUUsage.html for up-to-date ABI information
Version 0.40 (March 2016)
- Introduction
- Finalizer, Code Object, Executable and Loader
- Kernel dispatch
- Hardware registers setup
- Initial kernel register state
- Kernel prolog code
- Global/Readonly/Kernarg segments
- Scratch memory swizzling
- Flat addressing
- Flat scratch
- M0 Register
- Dynamic call stack
- Memory model
- Instruction set architecture
- AMD Kernel Code
- PCIe Gen3 Atomic Operations
- AMD Queue
- Signals
- Debugtrap
- References
This specification defines the application binary interface (ABI) provided by the AMD implementation of the HSA runtime for AMD GPU architecture agents. The AMD GPU architecture is a family of GPU agents which differ in machine code encoding and functionality.
Finalizer, Code Object, Executable and Loader are defined in the "HSA Programmer Reference Manual Specification". The AMD Code Object uses the ELF format. In this document, Finalizer refers to any compiler producing a code object, including kernel machine code.
The HSA Architected Queuing Language (AQL) defines a user space memory interface, an AQL Queue, to an agent that can be used to control the dispatch of kernels, using AQL Packets, in an agent independent way. All AQL packets are 64 bytes and are defined in "HSA Platform System Architecture Specification". The packet processor of a kernel agent is responsible for detecting and dispatching kernels from the AQL Queues associated with it. For AMD GPUs the packet processor is implemented by the Command Processor (CP).
The AMD HSA runtime allocates the AQL Queue object. It uses the AMD Kernel Fusion Driver (KFD) to initialize and register the AQL Queue with CP. Refer to "AMD Queue" for more information.
A kernel dispatch is initiated with the following sequence, defined in the "HSA System Architecture Specification" (it may occur on a CPU host agent from a host program, or on an HSA kernel agent such as a GPU from another kernel); a host-side sketch in C is shown after the CP steps below:
- A pointer to an AQL Queue for the kernel agent on which the kernel is to be executed is obtained.
- A pointer to the amd_kernel_code_t object of the kernel to execute is obtained. It must be for a kernel that was loaded on the kernel agent with which the AQL Queue is associated.
- Space is allocated for the kernel arguments using the HSA runtime allocator for a memory region with the kernarg property for the kernel agent that will execute the kernel, and the values of the kernel arguments are assigned. This memory corresponds to the backing memory for the kernarg segment within the kernel being called. Its layout is defined in "HSA Programmer Reference Manual Specification". For AMD the kernel execution directly uses the backing memory for the kernarg segment as the kernarg segment.
- A queue operation is used to reserve space in the AQL queue for the packet.
- The packet contents are set up, including information about the actual dispatch, such as grid and work-group size, together with information from the code object about the kernel, such as segment sizes.
- The packet is assigned to the packet processor by changing the format field from INVALID to KERNEL_DISPATCH. An atomic memory operation must be used.
- A doorbell signal for the queue is signaled to notify the packet processor.
At some point, CP performs the actual kernel execution:
- CP detects a packet on AQL queue.
- CP executes micro-code for setting up the GPU and wavefronts for a kernel dispatch.
- CP ensures that when a wavefront starts executing the kernel machine code, the scalar general purpose registers (SGPR) and vector general purpose registers (VGPR) are set up based on flags in amd_kernel_code_t (see "Initial kernel register state").
- When a wavefront starts executing the kernel machine code, the prolog (see "Kernel prolog code") sets up the machine state as necessary.
- When the kernel dispatch has completed execution, CP signals the completion signal specified in the kernel dispatch packet if not 0.
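The host-side steps above can be illustrated with the HSA runtime C API. The following is a minimal sketch, assuming a 1D dispatch; the helper name and the use of a GCC __atomic builtin for the header store are illustrative assumptions, and real code must also check for queue overflow and set the private/group segment sizes from the code object.

```c
#include <stdint.h>
#include <string.h>
#include <hsa/hsa.h>   /* HSA runtime C API; header path may vary by installation */

/* Hypothetical helper: enqueue one 1D kernel dispatch on an existing queue.
 * code_handle is the kernel object handle (which points at the kernel's
 * amd_kernel_code_t); kernarg points to already-populated kernarg memory. */
static void enqueue_dispatch(hsa_queue_t *queue, uint64_t code_handle,
                             void *kernarg, hsa_signal_t completion,
                             uint32_t grid_x, uint16_t workgroup_x) {
  /* Reserve space in the AQL queue for the packet. */
  uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
  hsa_kernel_dispatch_packet_t *pkt =
      (hsa_kernel_dispatch_packet_t *)queue->base_address +
      (index & (queue->size - 1));

  /* Set up the packet contents; the header is written last. */
  memset((uint8_t *)pkt + 4, 0, sizeof(*pkt) - 4);
  pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  /* 1D grid */
  pkt->workgroup_size_x = workgroup_x;
  pkt->workgroup_size_y = 1;
  pkt->workgroup_size_z = 1;
  pkt->grid_size_x = grid_x;
  pkt->grid_size_y = 1;
  pkt->grid_size_z = 1;
  pkt->kernel_object = code_handle;
  pkt->kernarg_address = kernarg;
  pkt->completion_signal = completion;
  /* Real code also sets private_segment_size / group_segment_size from the
   * code object and checks that the queue is not full. */

  /* Atomically change the packet format from INVALID to KERNEL_DISPATCH. */
  uint16_t header =
      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
  __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);

  /* Signal the doorbell to notify the packet processor. */
  hsa_signal_store_relaxed(queue->doorbell_signal, (hsa_signal_value_t)index);
}
```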
The SH_MEM_CONFIG register is set up as follows:
- DEFAULT_MTYPE = 1 (MTYPE_NC)
- ALIGNMENT_MODE = 3 (SH_MEM_ALIGNMENT_MODE_UNALIGNED)
- PTR32 = 1 in 32-bit mode and 0 in 64-bit mode
Prior to the start of every wavefront execution, CP/SPI sets up the register state based on the enable_sgpr_* and enable_vgpr_* flags in the amd_kernel_code_t object:
- SGPRs before the Work-Group Ids are set by CP using the 16 User Data registers.
- Work-group Id registers X, Y, Z are set by SPI which supports any combination including none.
- Scratch Wave Offset is also set by SPI, which is why its value cannot be added into the Flat Scratch Offset value (which would avoid the Finalizer-generated prolog having to do the add).
- The VGPRs are set by SPI which only supports specifying either (X), (X, Y) or (X, Y, Z).
SGPR register numbers used for enabled registers are dense starting at SGPR0: the first enabled register is SGPR0, the next enabled register is SGPR1, etc.; disabled registers do not have an SGPR number. Because of hardware constraints, the initial SGPRs comprise up to 16 User SGPRs that are set up by CP and apply to all waves of the grid. It is possible to specify more than 16 User SGPRs using the enable_sgpr_* bit fields, in which case only the first 16 are actually initialized. These are then immediately followed by the System SGPRs that are set up by ADC/SPI and can have different values for each wave of the grid dispatch.
The number of enabled registers must match the value in compute_pgm_rsrc2.user_sgpr (the total count of SGPR user data registers enabled). The enable_sgpr_grid_workgroup_count_* flags are currently not implemented in CP and should always be 0.
The following table defines SGPR registers that can be enabled and their order.
SGPR Order | Number of Registers | Name | Description |
---|---|---|---|
First | 4 | Private Segment Buffer (enable_sgpr_private_segment_buffer) | V# that can be used, together with Scratch Wave Offset as an offset, to access the Private/Spill/Arg segments using a segment address. CP uses the value from amd_queue_t.scratch_resource_descriptor. |
then | 2 | Dispatch Ptr (enable_sgpr_dispatch_ptr) | 64 bit address of AQL dispatch packet for kernel actually executing. |
then | 2 | Queue Ptr (enable_sgpr_queue_ptr) | 64 bit address of amd_queue_t object for AQL queue on which the dispatch packet was queued. |
then | 2 | Kernarg Segment Ptr (enable_sgpr_kernarg_segment_ptr) | 64 bit address of Kernarg segment. This is directly copied from the kernarg_address in the kernel dispatch packet. Having CP load it once avoids loading it at the beginning of every wavefront. |
then | 2 | Dispatch Id (enable_sgpr_dispatch_id) | 64 bit Dispatch ID of the dispatch packet being executed. |
then | 2 | Flat Scratch Init (enable_sgpr_flat_scratch_init) | Value used for FLAT_SCRATCH register initialization. Refer to Flat scratch for more information. |
then | 1 | Private Segment Size (enable_sgpr_private_segment_size) | The 32 bit byte size of a single work-item's scratch memory allocation. This is the value from the kernel dispatch packet Private Segment Byte Size rounded up by CP to a multiple of DWORD. Having CP load it once avoids loading it at the beginning of every wavefront. Not used for GFX7/GFX8 since it is the same value as the second SGPR of Flat Scratch Init. |
then | 1 | Grid Work-Group Count X (enable_sgpr_grid_workgroup_count_X) | 32 bit count of the number of work-groups in the X dimension for the grid being executed. Computed from the fields in the kernel dispatch packet as ((grid_size.x + workgroup_size.x - 1) / workgroup_size.x). |
then | 1 | Grid Work-Group Count Y (enable_sgpr_grid_workgroup_count_Y && less than 16 previous SGPRs) | 32 bit count of the number of work-groups in the Y dimension for the grid being executed. Computed from the fields in the kernel dispatch packet as ((grid_size.y + workgroup_size.y - 1) / workgroup_size.y). Only initialized if <16 previous SGPRs initialized. |
then | 1 | Grid Work-Group Count Z (enable_sgpr_grid_workgroup_count_Z && less than 16 previous SGPRs) | 32 bit count of the number of work-groups in the Z dimension for the grid being executed. Computed from the fields in the kernel dispatch packet as ((grid_size.z + workgroup_size.z - 1) / workgroup_size.z). Only initialized if <16 previous SGPRs initialized. |
then | 1 | Work-Group Id X (enable_sgpr_workgroup_id_X) | 32 bit work group id in X dimension of grid for wavefront. Always present. |
then | 1 | Work-Group Id Y (enable_sgpr_workgroup_id_Y) | 32 bit work group id in Y dimension of grid for wavefront. |
then | 1 | Work-Group Id Z (enable_sgpr_workgroup_id_Z) | 32 bit work group id in Z dimension of grid for wavefront. If present then Work-group Id Y will also be present. |
then | 1 | Work-Group Info (enable_sgpr_workgroup_info) | {first_wave, 14b0000, ordered_append_term[10:0], threadgroup_size_in_waves[5:0]} |
then | 1 | Private Segment Wave Byte Offset (enable_sgpr_private_segment_wave_byte_offset) | 32 bit byte offset from the scratch base of the queue executing the kernel dispatch. Must be used as an offset with the Private/Spill/Arg segment address when using the Scratch Segment Buffer. It must be added to Flat Scratch Offset if setting up FLAT SCRATCH for flat addressing. |
VGPR register numbers used for enabled registers are dense starting at VGPR0: the first enabled register is VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a VGPR number.
The following table defines VGPR registers that can be enabled and their order.
VGPR Order | Number of Registers | Name | Description |
---|---|---|---|
First | 1 | Work-Item Id X (Always initialized) | 32 bit work item id in X dimension of work-group for wavefront lane. |
then | 1 | Work-Item Id Y (enable_vgpr_workitem_id > 0) | 32 bit work item id in Y dimension of work-group for wavefront lane. |
then | 1 | Work-Item Id Z (enable_vgpr_workitem_id > 1) | 32 bit work item id in Z dimension of work-group for wavefront lane. |
For certain features, the kernel is expected to perform initialization actions, normally done in the kernel prolog. This is only needed if the kernel uses those features.
Global segment can be accessed either using flat or buffer operations. Buffer operations cannot be used for large machine model for GFX7 and later as V# support for 64 bit addressing is not available.
If buffer operations are used then the Global Buffer used to access Global/Readonly/Kernarg (combined) segments using a segment address is not passed into the kernel code by CP since its base address is always 0. The prolog code initializes 4 SGPRs with a V# that has the following properties, and then uses that in the buffer instructions:
- base address of 0
- no swizzle
- ATC: 1 if IOMMU present (such as APU)
- MTYPE set to support memory coherence specified in amd_kernel_code_t.global_memory_coherence
If buffer operations are used to access the Kernarg segment, the Kernarg address must be added. It is available in the dispatch packet (kernarg_address field) or as the Kernarg Segment Ptr SGPR. Alternatively, scalar loads can be used if the kernarg offset is uniform, as the kernarg segment is constant for the duration of the kernel dispatch execution.
For GFX9, global segment can be accessed with new GLOBAL_* instructions.
Scratch memory may be used for the private/spill/stack segment. Hardware will interleave (swizzle) scratch accesses of each lane of a wavefront by the interleave (swizzle) element size to ensure each work-item gets a distinct memory location. The interleave element size must be 2, 4, 8 or 16. The value used must match the value with which the runtime configures the GPU flat scratch (SH_STATIC_MEM_CONFIG.ELEMENT_SIZE).
For GFX8 and earlier, all load and store operations done to scratch buffer must not exceed this size. For example, if the element size is 4 (32-bits or dword) and a 64-bit value must be loaded, it must be split into two 32-bit loads. This ensures that the interleaving will get the work-item specific dword for both halves of the 64-bit value. If it just did a 64-bit load then it would get one dword which belonged to its own work-item, but the second dword would belong to the adjacent lane work-item since the interleaving is in dwords.
The AMD HSA Runtime Finalizer uses an element size of 4.
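A conceptual C sketch of the swizzled address computation, assuming a 64-lane wavefront (the helper name and the exact formula are illustrative, not the hardware specification):

```c
#include <stdint.h>

/* Maps a work-item-private byte offset to a byte address in the wave's scratch
 * backing memory under element-size swizzling. A 64-bit value at a 4-byte
 * element size spans two elements whose backing locations are 64*4 bytes
 * apart, which is why GFX8 and earlier must split such accesses. */
static uint64_t swizzled_scratch_address(uint64_t wave_scratch_base,
                                         uint32_t lane_id,         /* 0..63 */
                                         uint32_t private_offset,  /* bytes */
                                         uint32_t element_size) {  /* 2, 4, 8 or 16 */
  const uint32_t lanes = 64;
  uint32_t element_index   = private_offset / element_size;
  uint32_t byte_in_element = private_offset % element_size;
  return wave_scratch_base +
         ((uint64_t)element_index * lanes + lane_id) * element_size +
         byte_in_element;
}
```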
Flat addresses can be used in FLAT instructions and can access global, private (scratch) and group (LDS) memory.
Flat access to scratch requires hardware aperture setup and setup in kernel prologue (see Flat scratch).
For GFX7/GFX8, flat access to lds requires hardware aperture setup and M0 register setup (see M0 register).
Address operations for group/private segment may use fields from amd_queue_t, the address of which can be obtained with Queue Ptr SGPR.
To obtain null address value for a segment (nullptr HSAIL instruction),
- For global, readonly and flat segment use value 0.
- For group, private and kernarg segments, use value -1 (32-bit).
To convert segment address to flat address (stof HSAIL instruction),
- For global segment, use the same value.
- For kernarg segment, add Kernarg Segment Ptr. For small model, this is a 32-bit add. For large model, this is 32-bit add to 64-bit base address.
- For group segment,
- for large model, combine group_segment_aperture_base_hi (upper half) and segment address (lower half),
- for small model, add group_segment_aperture_base_hi and segment address.
- For private/spill/arg segment,
- for large model, combine private_segment_aperture_base_hi (upper half) and segment address (lower half),
- for small model, add private_segment_aperture_base_hi and segment address.
- If the flat address may be null, the kernarg, group and private/spill/arg segment machine code must have an additional sequence (use V_CMP and V_CNDMASK); a sketch of the group-segment case follows this list.
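As an illustration, a minimal C sketch of stof for the group segment in the large machine model (assuming the aperture value is read from amd_queue_t via the Queue Ptr SGPR, and that -1 is the null segment address):

```c
#include <stdint.h>

/* Group segment address -> flat address, large (64-bit) machine model. */
static uint64_t group_segment_to_flat(uint32_t group_addr,
                                      uint32_t group_segment_aperture_base_hi) {
  if (group_addr == 0xFFFFFFFFu)   /* null group segment address */
    return 0;                      /* null flat address */
  /* Aperture base forms the upper half, segment address the lower half. */
  return ((uint64_t)group_segment_aperture_base_hi << 32) | group_addr;
}
```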
To convert flat address to segment address (ftos HSAIL instruction),
- For global segment, use the same value.
- For kernarg segment, subtract Kernarg Segment Ptr. For small model, this is a 32-bit subtract. For large model, this is 32-bit subtract from lower half of the 64-bit flat address.
- For group segment,
- for large model, use low half of the flat address,
- for small model, subtract group_segment_aperture_base_hi.
- For private/spill/arg segment,
- for large model, use low half of the flat address,
- for small model, subtract private_segment_aperture_base_hi.
- If the segment address may be null, the kernarg, group and private/spill/arg segment machine code must have an additional sequence (use V_CMP and V_CNDMASK).
To determine if given flat address lies within a segment (segmentp HSAIL instruction),
- For global segment, check that address does not lie in group/private segments
- For group segment, check if address lies in group segment aperture
- for large model, check that upper half of 64-bit address == group_segment_aperture_base_hi,
- for small model, check that most significant 16 bits of 32-bit address (address & ~0xFFFF) == group_segment_aperture_base_hi.
- For private segment, check if address lies in private segment aperture
- for large model, check that upper half of 64-bit address == private_segment_aperture_base_hi,
- for small model, check that most significant 16 bits of 32-bit address (address & ~0xFFFF) == private_segment_aperture_base_hi.
- If the flat address may be null, the machine code must have an additional sequence (use V_CMP); a sketch of the group-segment check follows this list.
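For example, a sketch of the group-segment check in the large machine model (illustrative helper; the null check corresponds to the V_CMP sequence mentioned above):

```c
#include <stdint.h>

/* Returns non-zero if a non-null flat address lies in the group segment
 * aperture, large (64-bit) machine model. */
static int flat_is_group_segment(uint64_t flat_addr,
                                 uint32_t group_segment_aperture_base_hi) {
  if (flat_addr == 0)
    return 0;
  return (uint32_t)(flat_addr >> 32) == group_segment_aperture_base_hi;
}
```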
If the kernel may use flat operations to access scratch memory, the prolog code must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI, i.e. SGPRn-4/SGPRn-3).
For GFX7/GFX8, initialization uses Flat Scratch Init and Scratch Wave Offset sgpr registers (see Initial kernel register state):
- The low word of Flat Scratch Init is a 32 bit byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID to the base of memory for scratch for the queue executing the kernel dispatch. This is the lower 32 bits of amd_queue_t.scratch_backing_memory_location and is the same offset used in computing the Scratch Segment Buffer base address. The prolog must add the value of Scratch Wave Offset to it, shift right by 8 (the offset is in 256-byte units) and move it to FLAT_SCRATCH_HI for use as the FLAT SCRATCH BASE in flat memory instructions (see the sketch at the end of this subsection).
- The second word of Flat Scratch Init is the 32 bit byte size of a single work-item's scratch memory usage. This is directly loaded from the kernel dispatch packet Private Segment Byte Size and rounded up to a multiple of DWORD. Having CP load it once avoids loading it at the beginning of every wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH SIZE.
For GFX9, Flat Scratch Init contains the 64-bit address of the scratch backing memory. The initialization sequence for FLAT_SCRATCH does a 64-bit add of Flat Scratch Init and Scratch Wave Offset.
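A minimal C sketch of the GFX7/GFX8 computation described above (the helper and struct names are illustrative; the exact FLAT_SCRATCH_LO/HI register assignment should be confirmed against the AMDGPUUsage documentation referenced at the top of this document):

```c
#include <stdint.h>

typedef struct {
  uint32_t flat_scratch_base_256b;  /* wave scratch offset from SH_HIDDEN_PRIVATE_BASE_VIMID, in 256-byte units */
  uint32_t flat_scratch_size;       /* per-work-item scratch size in bytes */
} flat_scratch_regs_t;

/* flat_scratch_init_lo/hi are the two words of the Flat Scratch Init SGPR
 * pair; scratch_wave_offset is the Scratch Wave Offset SGPR (GFX7/GFX8). */
static flat_scratch_regs_t compute_flat_scratch(uint32_t flat_scratch_init_lo,
                                                uint32_t flat_scratch_init_hi,
                                                uint32_t scratch_wave_offset) {
  flat_scratch_regs_t r;
  r.flat_scratch_base_256b = (flat_scratch_init_lo + scratch_wave_offset) >> 8;
  r.flat_scratch_size      = flat_scratch_init_hi;
  return r;
}
```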
For GFX7/GFX8, the M0 register must be initialized with the total LDS size if the kernel may access LDS via DS or flat operations. The total LDS size is available in the dispatch packet. For M0, it is also possible to use the maximum possible LDS size for the given target.
In certain cases, the Finalizer cannot compute the total private segment size at compile time. This can happen if calls are implemented using a call stack and recursion, alloca, or calls to indirect functions are present. In this case, the workitem_private_segment_byte_size field in the code object only specifies the statically known private segment size. When performing the actual kernel dispatch, the private_segment_size_bytes field in the dispatch packet will contain the static private segment size plus additional space for the call stack.
A memory model describes the interactions of threads through memory and their shared use of the data. Many modern programming languages implement a memory model. This section describes the mapping of common memory model constructs onto AMD GPU architecture.
Throughout this section, definitions and constraints from the "HSA Platform System Architecture Specification 1.0" are used as a reference, although similar notions exist elsewhere (for example, in C99 or C++11).
The following memory scopes are defined:
- Work-item (wi)
- Wavefront (wave)
- Work-group (wg)
- Agent (agent)
- System (system)
The following memory orders are defined:
- scacq: sequentially consistent acquire
- screl: sequentially consistent release
- scar: sequentially consistent acquire and release
- rlx: relaxed
The following memory operations are defined:
- Ordinary Load/Store (non-synchronizing operations)
- Atomic Load/Atomic Store (synchronizing operations)
- Atomic RMW (Read-Modify-Write: add, sub, max, min, and, or, xor, wrapinc, wrapdec, exch, cas) (synchronizing operations)
- Memory Fence (synchronizing operation)
Sometimes derived notation is used. For example, agent+ means agent and system scopes, wg- means work-group, wavefront and work-item scopes.
In the following sections, a combination of memory segment, operation, order and scope is assigned a machine code sequence. Note that if s_waitcnt vmcnt(0) is used to enforce completion of earlier memory operations in the same work-item, it can be omitted if that is also enforced using some other mechanism or proven by the compiler (for example, if there are no preceding synchronizing memory operations). Similarly, if s_waitcnt vmcnt(0) is used to enforce completion of this memory operation before the following memory operations, it can sometimes be omitted (for example, if there are no following synchronizing memory operations).
For a flat memory operation, if it may affect either global or group segment, group constraints must be applied to flat operations as well.
For global segment, the following machine code instructions may be used (see Global/Readonly/Kernarg segments):
- Ordinary Load/Store: BUFFER_LOAD/BUFFER_STORE or FLAT_LOAD/FLAT_STORE
- Atomic Load/Store: BUFFER_LOAD/BUFFER_STORE or FLAT_LOAD/FLAT_STORE
- Atomic RMW: BUFFER_ATOMIC or FLAT_ATOMIC
Operation | Memory order | Memory scope | Machine code sequence |
---|---|---|---|
Ordinary Load | - | - | load with glc=0 |
Atomic Load | rlx,scacq | wg- | load with glc=0 |
Atomic Load | rlx | agent+ | load with glc=1 |
Atomic Load | scacq | agent+ | load with glc=1; s_waitcnt vmcnt(0); buffer_wbinvl1_vol |
Ordinary Store | - | - | store with glc=0 |
Atomic Store | rlx,screl | wg- | store with glc=0 |
Atomic Store | rlx | agent+ | store with glc=0 |
Atomic Store | screl | agent+ | s_waitcnt vmcnt(0); store with glc=0; s_waitcnt vmcnt(0) |
Atomic RMW | rlx,scacq, screl, scar | wg- | atomic |
Atomic RMW | rlx | agent+ | atomic |
Atomic RMW | scacq | agent+ | atomic; s_waitcnt vmcnt(0); buffer_wbinvl1_vol |
Atomic RMW | screl | agent+ | s_waitcnt vmcnt(0); atomic |
Atomic RMW | scar | agent+ | s_waitcnt vmcnt(0); atomic; s_waitcnt vmcnt(0); buffer_wbinvl1_vol |
For group segment, the following machine code instructions are used:
- Ordinary Load/Store: DS_READ/DS_WRITE
- Atomic Load/Store: DS_READ/DS_WRITE
- Atomic RMW: DS_ADD, DS_SUB, DS_MAX, DS_MIN, DS_AND, DS_OR, DS_XOR, DS_INC, DS_DEC, DS_WRXCHG, DS_CMPST (and corresponding RTN variants)
AMD LDS hardware is sequentially consistent. This means that it is not necessary to use lgkmcnt to enforce ordering in single work-item for group segment synchronization. s_waitcnt lgkmcnt(0) should still be used to enforce data dependencies, for example, after a load into a register and before use of that register (same applies to non-synchronizing operations).
The current model (and HSA) requires that global and group segments are coherent. This is why synchronizing group segment operations and memfence also use s_waitcnt vmcnt(0).
Operation | Memory order | Memory scope | Machine code sequence |
---|---|---|---|
Ordinary Load | - | - | load |
Atomic Load | rlx | wg- | load |
Atomic Load | scacq | wg- | s_waitcnt vmcnt(0); load; buffer_wbinvl1_vol |
Ordinary Store | - | - | store |
Atomic Store | rlx | wg- | store |
Atomic Store | screl | wg- | s_waitcnt vmcnt(0); store |
Atomic RMW | scacq | wg- | s_waitcnt vmcnt(0); atomic; buffer_wbinvl1_vol |
Atomic RMW | screl | wg- | s_waitcnt vmcnt(0); atomic |
Atomic RMW | scar | wg- | s_waitcnt vmcnt(0); atomic; buffer_wbinvl1_vol |
Memory fence is currently applied to all segments (cross-segment synchronization). In machine code, a memory fence does not have a separate instruction and maps to s_waitcnt and buffer_wbinvl1_vol instructions. In addition, a memory fence must not be moved in machine code with respect to other synchronizing operations. In the following table, 'memfence' refers to the conceptual memory fence location.
Operation | Memory order | Memory scope | Machine code sequence |
---|---|---|---|
Memory Fence | scacq,screl,scar | wg- | memfence (no additional constraints) |
Memory Fence | scacq | agent+ | memfence; s_waitcnt 0; buffer_wbinvl1_vol |
Memory Fence | screl | agent+ | s_waitcnt 0; memfence |
Memory Fence | scar | agent+ | memfence; s_waitcnt 0; buffer_wbinvl1_vol |
The AMDGPU ISA identification specifies the instruction set architecture and capabilities used by the machine code. It consists of several fields:
- Vendor ("AMD")
- Architecture ("AMDGPU")
- Major (GFXIP), minor and stepping versions
These fields may be combined to form one defining string, for example, "AMD:AMDGPU:8:0:1".
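For example, the defining string can be assembled from its components (a trivial sketch; the helper name is illustrative):

```c
#include <stdio.h>

static void format_isa_name(char *buf, size_t len,
                            unsigned major, unsigned minor, unsigned stepping) {
  /* Produces e.g. "AMD:AMDGPU:8:0:1". */
  snprintf(buf, len, "AMD:AMDGPU:%u:%u:%u", major, minor, stepping);
}
```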
Vendor | Architecture | Major | Minor | Stepping | Comments | Products |
---|---|---|---|---|---|---|
AMD | AMDGPU | 7 | 0 | 0 | Legacy, GFX7, 1/16 double FP | A10-7400 series APU |
AMD | AMDGPU | 7 | 0 | 1 | GFX7, 1/2 double FP | FirePro W8100, W9100, S9150, S9170; Radeon R9 290, R9 290x, R390, R390x |
AMD | AMDGPU | 8 | 0 | 0 | Legacy, GFX8, SPI register limitation, -XNACK | FirePro S7150, S7100, W7100; Radeon R285, R9 380, R9 385; Mobile FirePro M7170 |
AMD | AMDGPU | 8 | 0 | 1 | GFX8, XNACK enabled | A10-8700 series APU |
AMD | AMDGPU | 8 | 0 | 2 | GFX8, SPI register limitation, XNACK disabled, PCIe Gen3 atomics | FirePro S7150, S7100, W7100; Radeon R285, R9 380, R9 385; Mobile FirePro M7170 |
AMD | AMDGPU | 8 | 0 | 3 | GFX8, XNACK disabled, PCIe Gen3 atomics | Radeon R9 Nano, R9 Fury, R9 FuryX, Pro Duo, RX 460, RX 470, RX 480; FirePro S9300x2 |
AMD | AMDGPU | 8 | 0 | 4 | GFX8, -XNACK | Legacy, Radeon R9 Nano, R9 Fury, R9 FuryX, Pro Duo, RX 460, RX 470, RX 480; FirePro S9300x2 |
AMD | AMDGPU | 9 | 0 | 0 | GFX9, -XNACK | |
AMD | AMDGPU | 9 | 0 | 1 | GFX9, +XNACK |
The AMD Kernel Code object is used by the AMD GPU CP to set up the hardware to execute a kernel dispatch. It consists of the metadata needed to initiate the execution of a kernel, including the entry point address of the machine code that implements the kernel.
Bits | Size | Field Name | Description |
---|---|---|---|
31:0 | 4 bytes | amd_code_version_major | The AMD major version. Must be the value AMD_KERNEL_CODE_VERSION_MAJOR. Major versions are not backwards compatible. |
63:32 | 4 bytes | amd_code_version_minor | The AMD minor version. Must be the value AMD_KERNEL_CODE_VERSION_MINOR. Minor versions with the same major version must be backward compatible. |
79:64 | 2 bytes | amd_machine_kind | Machine kind. |
95:80 | 2 bytes | amd_machine_version_major | Instruction set architecture: major |
111:96 | 2 bytes | amd_machine_version_minor | Instruction set architecture: minor |
127:112 | 2 bytes | amd_machine_version_stepping | Instruction set architecture: stepping |
191:128 | 8 bytes | kernel_code_entry_byte_offset | Byte offset (possibly negative) from start of amd_kernel_code_t object to kernel's entry point instruction. The actual code for the kernel is required to be 256 byte aligned to match hardware requirements (SQ cache line is 16; entry point config register only holds bits 47:8 of the address). The Finalizer should endeavor to allocate all kernel machine code in contiguous memory pages so that a device pre-fetcher will tend to only pre-fetch Kernel Code objects, improving cache performance. The AMD HSA Runtime Finalizer generates position independent code (PIC) to avoid using relocation records and give the runtime more flexibility in copying code to discrete GPU device memory. |
255:192 | 8 bytes | kernel_code_prefetch_byte_offset | Range of bytes to consider prefetching expressed as a signed offset and unsigned size. The (possibly negative) offset is from the start of amd_kernel_code_t object. Set both to 0 if no prefetch information is available. |
319:256 | 8 bytes | kernel_code_prefetch_byte_size | |
383:320 | 8 bytes | max_scratch_backing_memory_byte_size | Number of bytes of scratch backing memory required for full occupancy of target chip. This takes into account the number of bytes of scratch per work-item, the wavefront size, the maximum number of wavefronts per CU, and the number of CUs. This is an upper limit on scratch. If the grid being dispatched is small it may only need less than this. If the kernel uses no scratch, or the Finalizer has not computed this value, it must be 0. |
415:384 | 4 bytes | compute_pgm_rsrc1 | Compute Shader (CS) program settings 1 amd_compute_pgm_rsrc1 |
447:416 | 4 bytes | compute_pgm_rsrc2 | Compute Shader (CS) program settings 2 amd_compute_pgm_rsrc2 |
448 | 1 bit | enable_sgpr_private_segment_buffer | Enable the setup of Private Segment Buffer |
449 | 1 bit | enable_sgpr_dispatch_ptr | Enable the setup of Dispatch Ptr |
450 | 1 bit | enable_sgpr_queue_ptr | Enable the setup of Queue Ptr |
451 | 1 bit | enable_sgpr_kernarg_segment_ptr | Enable the setup of Kernarg Segment Ptr |
452 | 1 bit | enable_sgpr_dispatch_id | Enable the setup of Dispatch Id |
453 | 1 bit | enable_sgpr_flat_scratch_init | Enable the setup of Flat Scratch Init |
454 | 1 bit | enable_sgpr_private_segment_size | Enable the setup of Private Segment Size |
455 | 1 bit | enable_sgpr_grid_workgroup_count_X | Enable the setup of Grid Work-Group Count X |
456 | 1 bit | enable_sgpr_grid_workgroup_count_Y | Enable the setup of Grid Work-Group Count Y |
457 | 1 bit | enable_sgpr_grid_workgroup_count_Z | Enable the setup of Grid Work-Group Count Z |
463:458 | 6 bits | reserved | Reserved. Must be 0. |
464 | 1 bit | enable_ordered_append_gds | Control wave ID base counter for GDS ordered-append. Used to set COMPUTE_DISPATCH_INITIATOR.ORDERED_APPEND_ENBL. |
466:465 | 2 bits | private_element_size | Interleave (swizzle) element size in bytes. |
467 | 1 bit | is_ptr64 | 1 if global memory addresses are 64 bits, otherwise 0. Must match SH_MEM_CONFIG.PTR32 (GFX7), SH_MEM_CONFIG.ADDRESS_MODE (GFX8+). |
468 | 1 bit | is_dynamic_call_stack | Indicates if the generated machine code is using dynamic call stack. |
469 | 1 bit | is_debug_enabled | Indicates if the generated machine code includes code required by the debugger. |
470 | 1 bit | is_xnack_enabled | Indicates if the generated machine code uses conservative XNACK register allocation. |
479:471 | 9 bits | reserved | Reserved. Must be 0. |
511:480 | 4 bytes | workitem_private_segment_byte_size | The amount of memory required for the static combined private, spill and arg segments for a work-item in bytes. |
543:512 | 4 bytes | workgroup_group_segment_byte_size | The amount of group segment memory required by a work-group in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched. |
575:544 | 4 bytes | gds_segment_byte_size | Number of bytes of GDS required by the kernel dispatch. Must be 0 if not using GDS. |
639:576 | 8 bytes | kernarg_segment_byte_size | The size in bytes of the kernarg segment that holds the values of the arguments to the kernel. This could be used by CP to prefetch the kernarg segment pointed to by the kernel dispatch packet. |
671:640 | 4 bytes | workgroup_fbarrier_count | Number of fbarrier's used in the kernel and all functions it calls. If the implementation uses group memory to allocate the fbarriers then that amount must already be included in the workgroup_group_segment_byte_size total. |
687:672 | 2 bytes | wavefront_sgpr_count | Number of scalar registers used by a wavefront. This includes the special SGPRs for VCC, Flat Scratch (Base, Size) and XNACK (for GFX8 (VI)+). It does not include the 16 SGPRs added if a trap handler is enabled. Must match compute_pgm_rsrc1.sgprs used to set COMPUTE_PGM_RSRC1.SGPRS. |
703:688 | 2 bytes | workitem_vgpr_count | Number of vector registers used by each work-item. Must match compute_pgm_rsrc1.vgprs used to set COMPUTE_PGM_RSRC1.VGPRS. |
719:704 | 2 bytes | reserved_vgpr_first | If reserved_vgpr_count is 0 then must be 0. Otherwise, this is the first fixed VGPR number reserved. |
735:720 | 2 bytes | reserved_vgpr_count | The number of consecutive VGPRs reserved by the client. If is_debug_supported then this count includes VGPRs reserved for debugger use. |
751:736 | 2 bytes | reserved_sgpr_first | If reserved_sgpr_count is 0 then must be 0. Otherwise, this is the first fixed SGPR number reserved. |
767:752 | 2 bytes | reserved_sgpr_count | The number of consecutive SGPRs reserved by the client. If is_debug_supported then this count includes SGPRs reserved for debugger use. |
783:768 | 2 bytes | debug_wavefront_private_segment_offset_sgpr | If is_debug_supported is 0 then must be 0. Otherwise, this is the fixed SGPR number used to hold the wave scratch offset for the entire kernel execution, or uint16_t(-1) if the register is not used or not known. |
799:784 | 2 bytes | debug_private_segment_buffer_sgpr | If is_debug_supported is 0 then must be 0. Otherwise, this is the fixed SGPR number of the first of 4 SGPRs used to hold the scratch V# used for the entire kernel execution, or uint16_t(-1) if the registers are not used or not known. |
807:800 | 1 byte | kernarg_segment_alignment | The maximum byte alignment of variables used by the kernel in the specified memory segment. Expressed as a power of two as defined in Table 37. Must be at least HSA_POWERTWO_16. |
815:808 | 1 byte | group_segment_alignment | |
823:816 | 1 byte | private_segment_alignment | |
831:824 | 1 byte | wavefront_size | Wavefront size expressed as a power of two. Must be a power of 2 in range 1..256 inclusive. Used to support a runtime query that obtains the wavefront size, which may be used by the application to allocate dynamic group memory and set the dispatch work-group size. |
863:832 | 4 bytes | call_convention | Call convention used to produce the machine code for the kernel. This specifies the function call convention ABI used for indirect functions. If the application specified that the Finalizer should select the call convention, then this value must be the value selected, not the -1 specified to the Finalizer. If the code object does not support indirect functions, then the value must be 0xffffffff. |
960:864 | 12 bytes | reserved | Reserved. Must be 0. |
1023:960 | 8 bytes | runtime_loader_kernel_symbol | A pointer to the loaded kernel symbol. This field must be 0 when amd_kernel_code_t is created. The HSA Runtime loader initializes this field once the code object is loaded to reference the loader symbol for the kernel. This field is used to allow the debugger to locate the debug information for the kernel. The definition of the loaded kernel symbol is located in hsa/runtime/executable.hpp. |
2047:1024 | 128 bytes | control_directive | Control directives for this kernel used in generating the machine code. The values are intended to reflect the constraints that the code actually requires to correctly execute, not the values that were actually specified at finalize time. If the finalizer chooses to ignore a control directive, and not generate constrained code, then the control directive should not be marked as enabled. |
2048 | Total size 256 bytes. |
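The first few fields of the table map onto a C structure as follows (a partial sketch only; the authoritative definition ships with the HSA runtime and LLVM headers):

```c
#include <stdint.h>

/* Partial view of the start of amd_kernel_code_t, per the byte offsets above. */
typedef struct {
  uint32_t amd_code_version_major;               /* bytes  0..3  */
  uint32_t amd_code_version_minor;               /* bytes  4..7  */
  uint16_t amd_machine_kind;                     /* bytes  8..9  */
  uint16_t amd_machine_version_major;            /* bytes 10..11 */
  uint16_t amd_machine_version_minor;            /* bytes 12..13 */
  uint16_t amd_machine_version_stepping;         /* bytes 14..15 */
  int64_t  kernel_code_entry_byte_offset;        /* bytes 16..23, may be negative */
  int64_t  kernel_code_prefetch_byte_offset;     /* bytes 24..31 */
  uint64_t kernel_code_prefetch_byte_size;       /* bytes 32..39 */
  uint64_t max_scratch_backing_memory_byte_size; /* bytes 40..47 */
  uint32_t compute_pgm_rsrc1;                    /* bytes 48..51 */
  uint32_t compute_pgm_rsrc2;                    /* bytes 52..55 */
  /* ... enable_* flag bits, segment sizes, register counts, alignments,
   *     call_convention, runtime_loader_kernel_symbol, control_directive ... */
} amd_kernel_code_prefix_t;
```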
The fields of amd_compute_pgm_rsrc1 are used by CP to set up COMPUTE_PGM_RSRC1.
Bits | Size | Field Name | Description |
---|---|---|---|
5:0 | 6 bits | granulated_workitem_vgpr_count | Granulated number of vector registers used by each work-item minus 1 (i.e. if granulated number of vector registers is 2, then 1 is stored in this field). Granularity is device specific. |
9:6 | 4 bits | granulated_wavefront_sgpr_count | Granulated number of scalar registers used by a wavefront minus 1 (i.e. if the granulated number of scalar registers is 4, then 3 is stored in this field). Granularity is device specific. This includes the special SGPRs for VCC, Flat Scratch (Base, and Size) and XNACK (for GFX8 (VI)+). It does not include the 16 SGPRs added if a trap handler is enabled. |
11:10 | 2 bits | priority | Drives spi_priority in spi_sq newWave cmd. |
13:12 | 2 bits | float_mode_round_32 | Wavefront initial float round mode for single precision floats (32 bit). |
15:14 | 2 bits | float_mode_round_16_64 | Wavefront initial float round mode for double/half precision floats (64/16 bit). |
17:16 | 2 bits | float_mode_denorm_32 | Wavefront initial denorm mode for single precision floats (32 bit). |
19:18 | 2 bits | float_mode_denorm_16_64 | Wavefront initial denorm mode for double/half precision floats (64/16 bit). |
20 | 1 bit | priv | Drives priv in spi_sq newWave cmd. This field is set to 0 by the Finalizer and must be filled in by CP. |
21 | 1 bit | enable_dx10_clamp | Wavefront starts execution with DX10 clamp mode enabled. Used by the vector ALU to force DX-10 style treatment of NaN's (when set, clamp NaN to zero, otherwise pass NaN through). Used by CP to set up COMPUTE_PGM_RSRC1.DX10_CLAMP. |
22 | 1 bit | debug_mode | Drives debug in spi_sq newWave cmd. This field is set to 0 by the Finalizer and must be filled in by CP. |
23 | 1 bit | enable_ieee_mode | Wavefront starts execution with IEEE mode enabled. Floating point opcodes that support exception flag gathering will quiet and propagate signaling-NaN inputs per IEEE 754-2008. Min_dx10 and max_dx10 become IEEE 754-2008 compliant due to signaling-NaN propagation and quieting. Used by CP to set up COMPUTE_PGM_RSRC1.IEEE_MODE. |
24 | 1 bit | bulky | Only one such work-group is allowed to be active on any given Compute Unit (CU). This field is set to 0 by the Finalizer and must be filled in by CP. |
25 | 1 bit | cdbg_user | This field is set to 0 by the Finalizer and must be filled in by CP. |
31:26 | 6 bits | reserved | Reserved. Must be 0. |
32 | Total size 4 bytes. |
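As an illustration of the bit layout, a few of these fields can be packed with shifts and masks (a sketch only; macro names are illustrative and bit positions are taken from the table above):

```c
#include <stdint.h>

#define RSRC1_GRANULATED_WORKITEM_VGPR_COUNT(v)  (((uint32_t)(v) & 0x3Fu) << 0)  /* bits 5:0   */
#define RSRC1_GRANULATED_WAVEFRONT_SGPR_COUNT(v) (((uint32_t)(v) & 0x0Fu) << 6)  /* bits 9:6   */
#define RSRC1_FLOAT_MODE_ROUND_32(v)             (((uint32_t)(v) & 0x03u) << 12) /* bits 13:12 */
#define RSRC1_ENABLE_IEEE_MODE                   ((uint32_t)1 << 23)             /* bit 23     */

/* Example: 2 VGPR granules and 4 SGPR granules (stored minus 1), IEEE mode on. */
static const uint32_t example_rsrc1 =
    RSRC1_GRANULATED_WORKITEM_VGPR_COUNT(2 - 1) |
    RSRC1_GRANULATED_WAVEFRONT_SGPR_COUNT(4 - 1) |
    RSRC1_ENABLE_IEEE_MODE;
```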
The fields of amd_compute_pgm_rsrc2 are used by CP to set up COMPUTE_PGM_RSRC2.
Bits | Size | Field Name | Description |
---|---|---|---|
0 | 1 bit | enable_sgpr_private_segment_wave_byte_offset | Enable the setup of the SGPR wave scratch offset system register (see 2.9.8). Used by CP to set up COMPUTE_PGM_RSRC2.SCRATCH_EN. |
5:1 | 5 bits | user_sgpr_count | The total number of SGPR user data registers requested. This number must match the number of user data registers enabled. |
6 | 1 bit | enable_trap_handler | Code contains a TRAP instruction which requires a trap handler to be enabled. Used by CP to set up COMPUTE_PGM_RSRC2.TRAP_PRESENT. Note that CP should set COMPUTE_PGM_RSRC2.TRAP_PRESENT if either this field is 1 or if amd_queue.enable_trap_handler is 1 for the queue executing the kernel dispatch. |
7 | 1 bit | enable_sgpr_workgroup_id_x | Enable the setup of Work-Group Id X. Also used by CP to set up COMPUTE_PGM_RSRC2.TGID_X_EN. |
8 | 1 bit | enable_sgpr_workgroup_id_y | Enable the setup of Work-Group Id Y. Also used by CP to set up COMPUTE_PGM_RSRC2.TGID_Y_EN. |
9 | 1 bit | enable_sgpr_workgroup_id_z | Enable the setup of Work-Group Id Z. Also used by CP to set up COMPUTE_PGM_RSRC2.TGID_Z_EN. |
10 | 1 bit | enable_sgpr_workgroup_info | Enable the setup of Work-Group Info. |
12:11 | 2 bits | enable_vgpr_workitem_id | Enable the setup of Work-Item Id X, Y, Z. Also used by CP to set up COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT. |
13 | 1 bit | enable_exception_address_watch | Wavefront starts execution with specified exceptions enabled. Used by CP to set up COMPUTE_PGM_RSRC2.EXCP_EN_MSB (composed from following bits). Address Watch - TC (L1) has witnessed a thread access an "address of interest". |
14 | 1 bit | enable_exception_memory_violation | Memory Violation - a memory violation has occurred for this wave from L1 or LDS (write-to-read-only-memory, mis-aligned atomic, LDS address out of range, illegal address, etc.). |
23:15 | 9 bits | granulated_lds_size | Amount of group segment (LDS) to allocate for each work-group. Granularity is device specific. CP should use the rounded value from the dispatch packet, not this value, as the dispatch may contain dynamically allocated group segment memory. This field is set to 0 by the Finalizer and CP will write directly to COMPUTE_PGM_RSRC2.LDS_SIZE. |
24 | 1 bit | enable_exception_ieee_754_fp_invalid_operation | Enable IEEE 754 FP Invalid Operation exception at start of wavefront execution. enable_exception flags are used by CP to set up COMPUTE_PGM_RSRC2.EXCP_EN (set from bits 0..6), EXCP_EN_MSB (set from bits 7..8). |
25 | 1 bit | enable_exception_fp_denormal_source | Enable FP Denormal exception at start of wavefront execution. |
26 | 1 bit | enable_exception_ieee_754_fp_division_by_zero | Enable IEEE 754 FP Division by Zero exception at start of wavefront execution. |
27 | 1 bit | enable_exception_ieee_754_fp_overflow | Enable IEEE 754 FP Overflow exception at start of wavefront execution. |
28 | 1 bit | enable_exception_ieee_754_fp_underflow | Enable IEEE 754 FP Underflow exception at start of wavefront execution. |
29 | 1 bit | enable_exception_ieee_754_fp_inexact | Enable IEEE 754 FP Inexact exception at start of wavefront execution. |
30 | 1 bit | enable_exception_int_divide_by_zero | Enable Integer Division by Zero (rcp_iflag_f32 instruction only) exception at start of wavefront execution. |
31 | 1 bit | reserved | Reserved. Must be 0. |
32 | Total size 4 bytes. |
Enumeration Name | Value | Description |
---|---|---|
AMD_MACHINE_KIND_UNDEFINED | 0 | Machine kind is undefined. |
AMD_MACHINE_KIND_AMDGPU | 1 | Machine kind is AMD GPU. Corresponds to AMD GPU ISA architecture of AMDGPU. |
Enumeration Name | Value | Description |
---|---|---|
AMD_FLOAT_ROUND_MODE_NEAR_EVEN | 0 | Round Ties To Even |
AMD_FLOAT_ROUND_MODE_PLUS_INFINITY | 1 | Round Toward +infinity |
AMD_FLOAT_ROUND_MODE_MINUS_INFINITY | 2 | Round Toward -infinity |
AMD_FLOAT_ROUND_MODE_ZERO | 3 | Round Toward 0 |
Enumeration Name | Value | Description |
---|---|---|
AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST | 0 | Flush Source and Destination Denorms |
AMD_FLOAT_DENORM_MODE_FLUSH_DST | 1 | Flush Output Denorms |
AMD_FLOAT_DENORM_MODE_FLUSH_SRC | 2 | Flush Source Denorms |
AMD_FLOAT_DENORM_MODE_FLUSH_NONE | 3 | No Flush |
PCI Express Gen3 defines 3 PCIe transactions, each of which carries out a specific Atomic Operation:
- FetchAdd (Fetch and Add)
- Swap (Unconditional Swap)
- CAS (Compare and Swap)
For compute capabilities supporting PCIe Gen3 atomics, system scope atomic operations use the following sequences:
- Atomic Load/Store: FLAT_LOAD_DWORD* / FLAT_STORE_DWORD* / TLP MRd / MWr
- Atomic add: FLAT_ATOMIC_ADD / TLP FetchAdd
- Atomic sub: FLAT_ATOMIC_ADD + negate/ TLP FetchAdd
- Atomic swap: FLAT_ATOMIC_SWAP / TLP Swap
- Atomic compare-and-swap: FLAT_ATOMIC_CMPSWAP / TLP CAS
- Other Atomic RMW operations: (max, min, and, or, xor, wrapinc, wrapdec): CAS loop
PCIe Gen3 atomics are only supported on certain hardware configurations, for example, Haswell-based systems.
HSA Queue Object is defined in "HSA Platform System Architecture Specification". AMD HSA Queue handle is a pointer to amd_queue_t.
The AMD HSA Runtime implementation uses the AMD Queue object (amd_queue_t) to implement AQL queues. It begins with the HSA Queue object, and then has additional information contiguously afterwards that is AMD device specific. The AMD device specific information is accessible by the AMD HSA Runtime, CP and kernel machine code.
The AMD Queue object must be allocated on 64 byte alignment. This allows CP microcode to fetch fields using cache line addresses. The entire AMD Queue object must not span a 4GiB boundary. This allows CP to save a few instructions when calculating the base address of amd_queue_t from &(amd_queue_t.read_dispatch_id) and amd_queue_t.read_dispatch_id_field_base_byte_offset.
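For example, the base address recovery described above amounts to a single subtraction (a sketch; field names are as defined in the amd_queue_t table below, the helper name is illustrative):

```c
#include <stdint.h>

/* Recovers the hsa_queue_t/amd_queue_t base from the address of the
 * read_dispatch_id field and the stored byte offset of that field. */
static void *queue_base_from_read_dispatch_id(
    uint64_t *read_dispatch_id_addr,
    uint32_t read_dispatch_id_field_base_byte_offset) {
  return (uint8_t *)read_dispatch_id_addr -
         read_dispatch_id_field_base_byte_offset;
}
```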
For GFX8 and earlier systems, only HSA Queue type SINGLE is supported.
Bits | Size | Name | Description |
---|---|---|---|
319:0 | 40 bytes | hsa_queue | HSA Queue object |
447:320 | 16 bytes | Unused. Allows hsa_queue_t to expand but still keeps write_dispatch_id, which is written by the producer (often the host CPU), in the same cache line. Must be 0. | |
511:448 | 8 bytes | write_dispatch_id | 64-bit index of the next packet to be allocated by application or user-level runtime. Initialized to 0 at queue creation time. |
512 | Start of cache line for fields accessed by kernel machine code. | |
543:512 | 4 bytes | group_segment_aperture_base_hi | For HSA64, the most significant 32 bits of the 64 bit group segment flat address aperture base. This is the same value as {SH_MEM_BASES:PRIVATE_BASE[15:13], 29b0}. For HSA32, the 32 bits of the 32 bit group segment flat address aperture. This is the same value as {SH_MEM_BASES:SHARED_BASE[15:0], 16b0}. |
575:544 | 4 bytes | private_segment_aperture_base_hi | For HSA64, the most significant 32 bits of the 64 bit private segment flat address aperture base. This is the same value as {SH_MEM_BASES:PRIVATE_BASE[15:13], 28b0, 1b1} For HSA32, the 32 bits of the 32 bit private segment flat address aperture base. This is the same value as {SH_MEM_BASES:PRIVATE_BASE[15:0], 16b0}. |
607:576 | 4 bytes | max_cu_id | The number of compute units on the agent to which the queue is associated. |
639:608 | 4 bytes | max_wave_id | The number of wavefronts that can be executed on a single compute unit of the device to which the queue is associated. |
703:640 | 8 bytes | max_legacy_doorbell_dispatch_id_plus_1 | For AMD_SIGNAL_KIND_LEGACY_DOORBELL, maximum value of write_dispatch_id signaled for the queue. This value is always 64-bit and never decreases. |
735:704 | 4 bytes | legacy_doorbell_lock | For AMD_SIGNAL_KIND_LEGACY_DOORBELL, atomic variable used to protect critical section which updates the doorbell related fields. Initialized to 0, and set to 1 to lock the critical section. |
1023:736 | 36 bytes | Padding to next cache line. Unused and must be 0. | |
1024 | Start of cache line for fields accessed by the packet processor (CP micro code). | ||
1087:1024 | 8 bytes | read_dispatch_id | 64-bit index of the next packet to be consumed by compute unit hardware. Initialized to 0 at queue creation time. |
1119:1088 | 4 bytes | read_dispatch_id_field_base_byte_offset | Byte offset from the base of hsa_queue_t to the read_dispatch_id field. The CP microcode uses this and CP_HQD_PQ_RPTR_REPORT_ADDR[_HI] to calculate the base address of hsa_queue_t when amd_kernel_code_t.enable_sgpr_dispatch_ptr is set. This field must immediately follow read_dispatch_id. This allows the layout above the read_dispatch_id field to change, and still be able to get the base of the hsa_queue_t, which is needed to return if amd_kernel_code_t.enable_sgpr_queue_ptr is requested. These fields are defined by HSA Foundation and so could change. CP only uses fields below read_dispatch_id which are defined by AMD. |
1536 | Start of next cache line for fields not accessed under normal conditions by the packet processor (CP micro code). These are kept in a single cache line to minimize memory accesses performed by CP micro code. | ||
2048 | Total size 256 bytes. |
A queue has an associated set of high-level operations defined in "HSA Runtime Specification" (API functions in host code) and "HSA Programmer Reference Manual Specification" (kernel code).
The following is an informal description of the AMD implementation of the queue operations (all use memory scope system; the memory order applies); a snippet mapping them to the HSA runtime C API follows this list:
- Load Queue Write Index: Atomic load of the write_dispatch_id field
- Store Queue Write Index: Atomic store of the write_dispatch_id field
- Load Queue Read Index: Atomic load of the read_dispatch_id field
- Store Queue Read Index: Atomic store of read_dispatch_id field
- Add Queue Write Index: Atomic add of write_dispatch_id field
- Compare-And-Swap Queue Write Index: Atomic CAS of write_dispatch_id field
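These operations map directly onto the HSA runtime C API; for example (relaxed memory order shown, acquire/release variants exist as well):

```c
#include <hsa/hsa.h>   /* header path may vary by installation */

static void queue_index_examples(hsa_queue_t *queue) {
  uint64_t w = hsa_queue_load_write_index_relaxed(queue);       /* Load Queue Write Index  */
  hsa_queue_store_write_index_relaxed(queue, w);                /* Store Queue Write Index */
  uint64_t r = hsa_queue_load_read_index_relaxed(queue);        /* Load Queue Read Index   */
  hsa_queue_store_read_index_relaxed(queue, r);                 /* Store Queue Read Index  */
  uint64_t prev = hsa_queue_add_write_index_relaxed(queue, 1);  /* Add Queue Write Index   */
  hsa_queue_cas_write_index_relaxed(queue, prev + 1, prev + 2); /* CAS Queue Write Index   */
}
```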
Signal handle is 8 bytes. AMD signal handle is a pointer to AMD Signal Object (amd_signal_t).
The following operations are defined on HSA Signals:
- Signal Load
- Read the current value of the signal
- Optional acquire semantics on the signal value
- Signal Wait on a condition
- Blocks the thread until the requested condition on the signal value is observed
- Condition: equals, not-equals, greater, greater-equals, lesser, lesser-equals
- Optional acquire semantics on the signal value
- Returns the value of the signal that caused it to wake
- Signal Store
- Optional release semantics on the signal value
- Signal Read-Modify-Write Atomics (add, sub, increment, decrement, min, max, and, or, xor, exch, cas)
- These happen immediately and atomically
- Optional acquire-release semantics on the signal value
ID | Name | Description |
---|---|---|
0 | AMD_SIGNAL_KIND_INVALID | An invalid signal. |
1 | AMD_SIGNAL_KIND_USER | A regular signal |
-1 | AMD_SIGNAL_KIND_DOORBELL | Doorbell signal with hardware support |
-2 | AMD_SIGNAL_KIND_LEGACY_DOORBELL | Doorbell signal with hardware support, legacy (GFX7/GFX8) |
An AMD Signal object must always be 64 byte aligned to ensure it cannot span a page boundary. This is required by CP microcode which optimizes access to the structure by only doing a single SUA (System Uniform Address) translation when accessing signal fields. This optimization is used in GFX8.
Bits | Size | Name | Description |
---|---|---|---|
63:0 | 8 bytes | kind | Signal kind |
127:64 | 8 bytes | value | For AMD_SIGNAL_KIND_USER: signal payload value. In small machine model only the lower 32 bits is used, in large machine model all 64 bits are used. |
127:64 | 8 bytes | legacy_hardware_doorbell_ptr | For AMD_SIGNAL_KIND_LEGACY_DOORBELL: pointer to the doorbell IOMMU memory (write-only). Used for hardware notification in Signal Store. |
127:64 | 8 bytes | hardware_doorbell_ptr | For AMD_SIGNAL_KIND_DOORBELL: pointer to the doorbell IOMMU memory (write-only). Used for hardware notification in Signal Store. |
191:128 | 8 bytes | event_mailbox_ptr | For AMD_SIGNAL_KIND_USER: mailbox address for event notification in Signal operations. |
223:192 | 4 bytes | event_id | For AMD_SIGNAL_KIND_USER: event id for event notification in Signal operations. |
255:224 | 4 bytes | Padding. Must be 0. | |
319:256 | 8 bytes | start_ts | Start of the AQL packet timestamp, when profiled. |
383:320 | 8 bytes | end_ts | End of the AQL packet timestamp, when profiled. |
447:384 | 8 bytes | queue_ptr | For AMD_SIGNAL_KIND_*DOORBELL: the address of the associated amd_queue_t, otherwise reserved and must be 0. |
511:448 | 8 bytes | Padding to 64 byte size. Must be 0. | |
512 | Total size 64 bytes |
Since the signal kind is determined by the kind field of amd_signal_t, the instruction sequence for a signal operation must branch on the signal kind.
The following is an informal description of the signal operations:
- For AMD_SIGNAL_KIND_USER kind:
- Signal Load uses atomic load from value field of corresponding amd_signal_t (memory order applies, memory scope system).
- Signal Wait
- Uses poll loop on signal value.
- The s_sleep ISA instruction provides a hint to the SQ not to schedule the wave for a specified time.
- s_memtime/s_memrealtime instruction is used to measure time (as signal wait is required to time out in reasonable time interval even if condition is not met).
- Signal Store/Signal Atomic uses the following sequence:
- Corresponding atomic operation on signal value (memory scope system, memory order applies).
- Load mailbox address from event_mailbox_ptr field.
- If mailbox address is not zero:
- load event id from event_id field.
- atomic store of event id to mailbox address (memory scope system, memory order release).
- s_sendmsg with argument equal to lower 8 bits of event_id.
- For AMD_SIGNAL_KIND_LEGACY_DOORBELL:
- Signal Store uses the following sequence:
- Load queue address from queue_ptr field
- Acquire spinlock protecting the legacy doorbell of the queue.
- Load address of the spinlock from legacy_doorbell_lock field of amd_queue_t.
- Compare-and-swap atomic loop, previous value 0, value to set 1 (memory order acquire, memory scope system).
- The s_sleep ISA instruction provides a hint to the SQ not to schedule the wave for a specified time.
- Use value+1 as the next packet index and the initial value for the legacy dispatch id. GFX7/GFX8 hardware expects the packet index to point beyond the last packet to be processed.
- Atomic store of next packet index (value+1) to max_legacy_doorbell_dispatch_id_plus_1 field (memory order relaxed, memory scope system).
- For small machine model:
- legacy_dispatch_id = min(write_dispatch_id, read_dispatch_id + hsa_queue.size)
- For GFX7:
- Load queue size from hsa_queue.size field of amd_queue_t.
- Wrap the packet index to a point within the ring buffer (the ring buffer size is twice the size of the HSA queue); see the sketch after these signal operation descriptions.
- Convert legacy_dispatch_id to DWORD count by multiplying by 64/4 = 16.
- legacy_dispatch_id = (legacy_dispatch_id & ((hsa_queue.size << 1)-1)) << 4;
- Store legacy dispatch id to the hardware MMIO doorbell.
- Address of the doorbell is in legacy_hardware_doorbell_ptr field of amd_signal_t.
- Release spinlock protecting the legacy doorbell of the queue. Atomic store of value 0.
- Signal Load/Signal Wait/Signal Read-Modify-Write Atomics are not supported. Instruction sequence for these operations and this signal kind is empty.
- For AMD_SIGNAL_KIND_DOORBELL:
- Signal Store uses the following sequence:
- Atomic store of value to the hardware MMIO doorbell.
- Signal Load/Signal Wait/Signal Read-Modify-Write Atomics are not supported. Instruction sequence for these operations and this signal kind is empty.
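The GFX7 wrap-and-convert step above can be expressed as a small C sketch (assuming hsa_queue.size is a power of two; the helper name is illustrative):

```c
#include <stdint.h>

/* Computes the value written to the GFX7 legacy MMIO doorbell from the next
 * packet index (value+1) and the HSA queue size. */
static uint64_t gfx7_legacy_dispatch_id(uint64_t next_packet_index,
                                        uint32_t queue_size) {
  /* Wrap within the ring buffer (twice the queue size), then convert the
   * 64-byte packet index to a DWORD count (64 / 4 = 16, i.e. << 4). */
  return (next_packet_index & (((uint64_t)queue_size << 1) - 1)) << 4;
}
```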
Debugtrap halts execution of the wavefront and generates a debug exception. For more information, refer to the "HSA Programmer Reference Manual Specification". debugtrap accepts a 32-bit unsigned value as an argument.
The following is a description of debugtrap sequence:
- v0 contains 32-bit argument of debugtrap
- s[0:1] contains Queue Ptr for the dispatch
- s_trap 0x1