mikerabat edited this page Sep 19, 2018 · 1 revision

How to use AVX in Delphi

So at some point I needed to decide whether to stick with SSE, since those are the latest assembler opcodes Delphi knows, or to go further and check whether there is a chance to include AVX and FMA CPU capabilities in the library and make it even faster.

And all of that should be included in the sources, with hardly any external tools...

One option was to use an external assembler like NASM, but I found it easier to use a simpler workaround: I used FPC, since my library should also work with this compiler toolchain.

So... the steps to develop AVX/FMA opcodes in the Delphi world were:

  • Create a unit testing project for Lazarus/CodeTyphon that does the same as the Delphi project
  • Create the AVX/FMA functions using FPC.
  • Disassemble the object files
  • For each AVX/FMA opcode I needed to automatically find the equivalent "db" statement so I was able to use it in Delphi

Porting the library to FPC

There are a few differences between the FPC and the Delphi compiler - generally the Delphi compiler is way more relaxed about warnings and syntax. One can ease the "pain" a bit by defining

{$IFDEF FPC} {$MODE DELPHI} {$ENDIF}

at the top of a unit. In this mode:

  • A function pointer can be assigned without the @ operator.
  • Parameter naming is less strict. Outside this mode a method parameter cannot have the same name as a property of the class.
  • A forward declaration does not have to be repeated exactly the same by the implementation of a function/procedure. In particular, you can omit the parameters when implementing the function or procedure.

There are a few more but they were not relevant for this project.
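Two of the relaxations listed above (assigning a function pointer without @ and omitting the parameter list in the implementation) can be seen in a small sketch. The program and the names TUnaryFunc and Twice are made up for illustration and are not part of the library; the sketch assumes a console build with either FPC or Delphi.

```pascal
program ModeDelphiDemo;

{$IFDEF FPC} {$MODE DELPHI} {$ENDIF}
{$IFDEF MSWINDOWS} {$APPTYPE CONSOLE} {$ENDIF}

type
  // hypothetical function pointer type, only used for this demo
  TUnaryFunc = function(x : double) : double;

// the forward declaration carries the full signature
function Twice(x : double) : double; forward;

// Delphi mode: the parameter list may be omitted in the implementation
function Twice;
begin
  Result := 2*x;
end;

var f : TUnaryFunc;
begin
  // Delphi mode: no @ operator is needed to assign the function pointer
  f := Twice;
  WriteLn(f(21):0:1);
end.
```

Without the {$MODE DELPHI} line FPC falls back to its default mode, where the assignment would have to be written as f := @Twice and the implementation would have to repeat the parameter list.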

It was also necessary to create a new unit testing project that is compatible with the Delphi 2010 DUnit framework. Luckily CodeTyphon offers a similar project type - FPCUnit Test Application. A few modifications were needed though:

{$IFDEF FPC} testregistry {$ELSE} {$IFDEF FMX}DUnitX.TestFramework {$ELSE}TestFramework {$ENDIF} {$ENDIF}

The line above includes the proper units for all three unit test frameworks that are now supported:

  • DUnit from Delphi 2007 and above
  • DUnitX for FMX
  • FPCUnit

FMX actually needs an additional attribute that decorates the class: {$IFDEF FMX} [TestFixture] {$ENDIF}. This ensures the class is included in the unit tests.

All the classes are derived from the same base class TBaseMatrixTestCase. This class evens out some differences between the unit test suites so we can use the same methods in all cases.

The header of this class looks like:

{$IFNDEF FPC}
{$IFDEF MSWINDOWS}
Windows,
{$ENDIF}
TestFramework,
{$ELSE}
fpcunit, testregistry,
{$IFDEF MSWINDOWS} Windows, {$ELSE} lcltype, {$ENDIF}
{$ENDIF}
Classes, SysUtils, Types, Matrix, MatrixConst;

type
TBaseMatrixTestCase = class(TTestCase)
protected
 function BaseDataPath : string;

 {$IFDEF FPC}
 procedure CheckEqualsMem(p1, p2 : PByte; memSize : integer; const msg : string);
 procedure Status(const msg : string);
 {$ENDIF}

 procedure FillMatrix(mtxSize : integer; out x, y : TDoubleDynArray; out p1, p2 : PDouble);
 procedure AllocAlignedMtx(mtxSize : integer; out pX : PDouble; out pMem : PByte); overload;
 procedure AllocAlignedMtx(mtxWidth, mtxHeight : integer; out pX : PDouble; out pMem : PByte; out LineWidthAligned : TASMNativeInt); overload;
 procedure FillAlignedMtx(mtxSize : integer; out pX : PDouble; out pMem : PByte); overload;
 procedure FillAlignedMtx(mtxWidth, mtxHeight : integer; out pX : PDouble; out pMem : PByte; out LineWidthAligned : TASMNativeInt); overload;
 procedure FillUnalignedMtx(mtxWidth, mtxHeight : integer; out pX : PDouble; out pMem : PByte; out LineWidthAligned : TASMNativeInt);
 procedure WriteMatlabData(const fileName : string; const data : Array of double; width : integer);
 procedure TryClearCache;
 function WriteMtx(const data : Array of Double; width : integer; prec : integer = 3) : string; overload;
 function WriteMtx(data : PDouble; LineWidth : TASMNativeInt; width : integer; height : integer; prec : integer = 3) : string; overload;
 function WriteMtx(mtx : TDoubleMatrix; prec : integer = 3) : string; overload;
 function CheckMtx(const data1 : Array of Double; const data2 : Array of double; w : integer = -1; h : integer = -1; const epsilon : double = 1e-4) : boolean;
 function CheckMtxIdx(const data1 : Array of Double; const data2 : Array of double; out idx : integer; w : integer = -1; h : integer = -1; const epsilon : double = 1e-4) : boolean; overload;
 function CheckMtxIdx(data1, data2 : PDouble; const mtxSize : integer; out idx : integer; const epsilon : double = 1e-4) : boolean; overload;
 function CheckMtxIdx(data1, data2 : PDouble; LineWidth1, LineWidth2 : TASMNativeInt; mtxWidth, mtxHeight : integer; out idx : integer; const epsilon : double = 1e-4) : boolean; overload;
 function WriteMtxDyn(const data : TDoubleDynArray; width : integer; prec : integer = 3) : string; overload;

end;

Finally, in the initialization section of the units the unit test classes have to be registered with the framework. An example is:

{$IFNDEF FMX}
RegisterTest(TestMatrixOperations{$IFNDEF FPC}.Suite{$ENDIF});
RegisterTest(TASMMatrixOperations{$IFNDEF FPC}.Suite{$ENDIF});
RegisterTest(TASMatrixBlockSizeSetup{$IFNDEF FPC}.Suite{$ENDIF});
{$ENDIF}

Assembler programming in FPC

Differences in ASM

As I mentioned earlier, FPC is stricter in handling asm statements than Delphi is. The differences are most visible in x64 assembler, so the following section refers to x64 only:

Delphi only allows full assembler routines here: the function declaration is not followed by a "begin end;" block but by an "asm end;" block that contains bare assembler instructions only. In a first try I enclosed the asm statements in "begin end;" blocks in the FPC units, until I finally figured out that one can use the "assembler;" procedure qualifier (similar to a calling convention qualifier), which allows writing pure assembler routines.

FPC uses stricter variable handling in assembler routines: if you have a function with many parameters that requires a stack frame and you reference e.g. the first parameter in the routine, FPC always uses the stack frame copy of the parameter, whereas Delphi "optimizes" the variable and uses the register instead. This is especially interesting in x64, where the ABI defines that the first 4 (or 6 on Unix) params are passed in registers - on Windows these are RCX, RDX, R8 and R9. So if you have a function like this:

procedure ASMMatrixAbsAlignedEvenW(Dest : PDouble; const LineWidth, Width, Height : TASMNativeInt);

the params are passed as RCX=Dest, RDX=LineWidth, R8=Width, R9=Height + the params on the stack if stack frames are selected.

A line like:

mov r10, width;

is translated in two different ways by the two compiler suites: standard Delphi produces "mov r10, r8" whereas FPC produces "mov r10, [rbp + $10]".

There is also a difference in addressing global constants between the two in x64. In Delphi one can reference a global constant like cMinusOne implicitly, but on x64 the reference actually needs to be RIP-relative, and in FPC this has to be written out.

This results in code like this:

const cSignBits : Array[0..1] of int64 = ($7FFFFFFFFFFFFFFF, $7FFFFFFFFFFFFFFF);

movupd xmm0, [rip + cSignBits];

instead of straightforward code like:

movupd xmm0, cSignBits;

Differences in OS

One major problem in the development of the package was compatibility between different operating systems, especially the big calling convention difference between Windows x64 and Unix (System V) based x64 systems. The Windows calling convention passes the first integer parameters in 4 registers (RCX, RDX, R8, R9), while the Unix one (called System V AMD64 ABI) uses 6 registers (RDI, RSI, RDX, RCX, R8, R9). So there are two x64 worlds - Windows and System V (Linux, Unix, FreeBSD, macOS...). Note that floating point parameters are passed in XMM registers: XMM0 to XMM3 on Windows and XMM0 to XMM7 on System V.

So... my simple approach to bring both worlds together was to write the assembler routines only for Win64 (the main platform actually) and to fix the register order so it is right on the other platforms too. This means that in most cases we need two additional variables on the stack, and the original parameters need to be renamed.

This means that (most of the time) the following assembler prolog is executed on Unix based systems:

procedure AVXMatrixTransposeAligned(dest : PDouble; const destLineWidth : TASMNativeInt;
    mt : PDouble; const LineWidth : TASMNativeInt; {$ifdef UNIX}unixWidth{$ELSE}width{$endif}, 
   {$ifdef UNIX}unixHeight{$ELSE}height{$endif} : TASMNativeInt); {$IFDEF FPC}assembler;{$ENDIF}
var iR12, iR13, iR14, iR15, iRSI, iRDI : TASMNativeInt;
{$ifdef UNIX}
    width : TASMNativeInt;
    height : TASMNativeInt;
{$ENDIF} 
asm
   {$IFDEF UNIX}
   mov width, r8;
   mov height, r9;
   mov r8, rdx;
   mov r9, rcx;
   mov rcx, rdi;
   mov rdx, rsi;
   {$ENDIF}

  // normal function body 

The parameter renaming is done by an ifdef, and the additional stack variables get the original names. All in all this costs a few more cycles, which shouldn't matter since the assembler function that follows normally runs far longer. Please note that these variables must not be referenced in AVX statements: their addresses differ between the platforms, and a db statement can only encode a constant reference, whereas the assembler itself can distinguish between the two.
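The order of the moves in the prolog matters, because R8, R9, RDX and RCX are both sources and targets. As a sanity check, the sequence can be simulated in plain Pascal without any assembler. This is only a model: the record TRegs and the program name are made up for illustration, and the values 1 to 6 simply stand for the six parameters in System V order.

```pascal
program SysVToWin64Prolog;

{$IFDEF FPC} {$MODE DELPHI} {$ENDIF}
{$IFDEF MSWINDOWS} {$APPTYPE CONSOLE} {$ENDIF}

type
  // hypothetical model of the integer argument registers involved
  TRegs = record
    rdi, rsi, rdx, rcx, r8, r9 : Int64;
  end;

var regs : TRegs;
    width, height : Int64;   // stand-ins for the two stack locals
begin
  // System V entry state: parameters 1..6 arrive in rdi, rsi, rdx, rcx, r8, r9
  regs.rdi := 1; regs.rsi := 2; regs.rdx := 3;
  regs.rcx := 4; regs.r8 := 5; regs.r9 := 6;

  // same order as in the prolog shown above
  width := regs.r8;        // mov width, r8
  height := regs.r9;       // mov height, r9
  regs.r8 := regs.rdx;     // mov r8, rdx
  regs.r9 := regs.rcx;     // mov r9, rcx
  regs.rcx := regs.rdi;    // mov rcx, rdi
  regs.rdx := regs.rsi;    // mov rdx, rsi

  // Win64 layout expected by the function body:
  // rcx, rdx, r8, r9 hold parameters 1..4, the locals hold 5 and 6
  WriteLn(regs.rcx, ' ', regs.rdx, ' ', regs.r8, ' ', regs.r9, ' ',
          width, ' ', height);
end.
```

The program prints 1 2 3 4 5 6. If the last two moves were placed before the moves into r8 and r9, parameters 3 and 4 would be overwritten before they are copied.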

Some special things to consider

The calling conventions also define which XMM registers are non-volatile and therefore need to be preserved (on Win64 these are XMM6 to XMM15). So diving into the AVX world I had the following problem: as far as I know there is no push operation defined for the XMM registers. For the SSE world I could define

procedure TestASM(a, b,c, d : PDouble); 
var dXMM7 : Array[0..1] of double;
asm
   // preserve
   movupd dXMM7, xmm7;
   ...      
   // restore
   movupd xmm7, dXMM7;
end;

But the equivalent in the AVX world doesn't work any more. I had trouble reaching the correct address on Unix based systems (due to the static "db" instructions), so I needed to handle the stack myself:

// reserve memory on the stack for dymm8 to dymm11
sub esp, 128;

// init values:
{$IFDEF FPC}vxorpd ymm0, ymm0, ymm0;{$ELSE}db $C5,$FD,$57,$C0;{$ENDIF} 
{$IFDEF FPC}vmovupd [esp + 00], ymm0;{$ELSE}db $C5,$FD,$11,$04,$24;{$ENDIF} 
{$IFDEF FPC}vmovupd [esp + 32], ymm0;{$ELSE}db $C5,$FD,$11,$44,$24,$20;{$ENDIF} 
{$IFDEF FPC}vmovupd [esp + 64], ymm0;{$ELSE}db $C5,$FD,$11,$44,$24,$40;{$ENDIF} 
{$IFDEF FPC}vmovupd [esp + 96], ymm0;{$ELSE}db $C5,$FD,$11,$44,$24,$60;{$ENDIF} 

There are also a few AVX opcodes like vbroadcastsd that I would love to use register to register, but that form is only implemented in AVX2, so I decided to work around the problem by introducing a stack variable like this:

var tmp : double;
asm
   // instead one could write in AVX2
   // vbroadcastsd ymm2, xmm0;   

   lea eax, tmp;
   {$IFDEF FPC}vmovsd [eax], xmm0;{$ELSE}db $C5,$FB,$11,$00;{$ENDIF} 
   {$IFDEF FPC}vbroadcastsd ymm2, [eax];    {$ELSE}db $C4,$E2,$7D,$19,$10;{$ENDIF} 

Last but not least there is one major difference between operating systems that resulted in many access violations and that I only became aware of through a side note and help from others: on macOS you cannot easily reference global variables defined in other units in assembler... Duh... So what I needed to do was to create local copies of some of the used global constants (like cMinusOne, cMaxDouble, ...) in the units where I needed them. This helped to work around this nasty problem.

Using AVX in Delphi

So far the sections above describe how to use AVX/FMA opcodes in FPC - but we want to use them in Delphi too. There are two possibilities for this: use an external assembler like NASM, or use the generic "db" statements to emulate the missing capabilities. I decided on the latter since I already had the code in FPC.

The first step in that task is to disassemble the object files, which was done with Agner's object file converter tool from http://www.agner.org/optimize/ - put the project executable in the "AVXPrecompiled" folder. Check the "processPasFiles.bat". Ensure that the "MathUtilTestLinux.ppr" project is compiled with CodeTyphon's 32 and 64 bit compilers including the debug information - this results in object files which can be disassembled with Agner's tool. Make sure the "processPasFiles.bat" file points to the right output directory, otherwise please correct it (another instance of "works on my computer ;)").

Build the AVXPortToDelphi.dpr project and then let the batch file do the magic. The AVX Delphi project tries to match the assembler statements with the ones from the disassembler and puts the {$IFDEF}s in place as shown in the last code snippet.

Note that the parser is very simple but of course can be used in other projects too ;)
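To illustrate only the output format the conversion aims for, here is a minimal sketch that wraps one FPC statement and its opcode bytes into the conditional line. It is not the code of AVXPortToDelphi: the function names BytesToDb and WrapAvxLine are hypothetical, and the byte sequence is the one from the vxorpd line shown earlier on this page.

```pascal
program DbLineDemo;

{$IFDEF FPC} {$MODE DELPHI} {$ENDIF}
{$IFDEF MSWINDOWS} {$APPTYPE CONSOLE} {$ENDIF}

uses SysUtils;

// Formats opcode bytes as a Delphi "db" statement, e.g. db $C5,$FD,$57,$C0;
function BytesToDb(const opBytes : array of Byte) : string;
var i : integer;
begin
  Result := 'db ';
  for i := 0 to High(opBytes) do
  begin
    if i > 0 then
      Result := Result + ',';
    Result := Result + '$' + IntToHex(opBytes[i], 2);
  end;
  Result := Result + ';';
end;

// Wraps the FPC statement and its byte encoding into one conditional line
function WrapAvxLine(const asmStmt : string; const opBytes : array of Byte) : string;
begin
  Result := '{$IFDEF FPC}' + asmStmt + '{$ELSE}' + BytesToDb(opBytes) + '{$ENDIF}';
end;

begin
  WriteLn(WrapAvxLine('vxorpd ymm0, ymm0, ymm0;', [$C5, $FD, $57, $C0]));
end.
```

The hard part in practice is not this formatting step but matching each source statement to the right bytes in the disassembly, which the sketch leaves out entirely.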
