I was trying to figure out how to squeeze some more speed out of some calculations, which involved the following bit of math.

The language of choice was C# with Visual Studio 2013 for a 64-bit process.

One idea was to try to use SSE to improve the throughput of the computation. Unfortunately, the .NET framework currently (as of .NET 4.2) does not generate SSE instructions as part of its JIT compiler. A workaround would be to build a DLL with the math function exported, and call that exported function from C# using PInvoke. The Visual Studio compiler (2012 and above) is an excellent optimizing compiler with the ability to auto-vectorize loops. But then I would have to add a C DLL as a dependency of my C# application: one more file to keep track of, just for a single calculation.
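For context, the PInvoke route would look something like the following sketch of the C# side. The DLL name and export are made up for illustration, not taken from an actual project:

using System;
using System.Runtime.InteropServices;

internal static class NativeMath
{
   // Assumes a native DLL built separately, exporting something like:
   //   extern "C" __declspec(dllexport) double SumOfSquares(const int* input, size_t size);
   [DllImport("MathHelpers.dll", CallingConvention = CallingConvention.Cdecl)]
   internal static extern double SumOfSquares(int[] input, UIntPtr size);
}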

But what if I could take the assembly instructions generated for the C code and embed that directly into my C# application?

I got a little obsessed with this idea, so I started investigating it.

I needed to learn some x64 assembly. One way to start would be to write up some C code and look at what the compiler generates.

Learning assembly with a compiler’s help

64-bit assembly is a wide topic by itself. I had worked with a bit of x86 assembly back in my university days, but that’s about it. There is only one calling convention for 64-bit Windows code, and it resembles the old x86 fastcall convention. You can read more about it here, but the gist of it is that the first four parameters are passed in registers (rcx, rdx, r8, and r9 for integers and pointers; xmm0 through xmm3 for floating-point values), and any remaining parameters are passed on the stack.

So I started writing bits of test code and looking at the generated assembly. All the samples I show were tested using Visual Studio 2013 Update 5.

Here is a simple example, just to see how the calling convention works. Test1() takes an int, a double, and a pointer to each of those types.

#include <stdlib.h>    // calloc, free
#include <stdio.h>     // printf
#include <tchar.h>     // _TCHAR

void Test1(int one, double two, int* three, double* four)
{
    *three = one + 1;
    *four = two + one;
}

int main(int argc, _TCHAR* argv[])
{
    int one = 1;
    double two = 2.0;

    int* three = (int*)calloc(1, sizeof(int));
    *three = 3;
    double* four = (double*)calloc(1, sizeof(double));
    *four = 4; 

    printf("one: %x two: %x three %x *three: %x four: %x *four: %x\n", (unsigned long)one,
        (unsigned long)two, (unsigned long)three, *three, (unsigned long)four, *four);

    Test1(one, two, three, four);

    printf("one: %x two: %x three %x *three: %x four: %x *four: %x\n", (unsigned long)one,
        (unsigned long)two, (unsigned long)three, *three, (unsigned long)four, *four);

    free(three);
    free(four);

    return 0;
} 

This is a useless little program, but it will tell me how values and pointers are passed around and set up when calling a function.

Here is what the x64 Debug version looks like.

void Test1(int one, double two, int* three, double* four)
     8: {
000000013F321020 4C 89 4C 24 20       mov         qword ptr [rsp+20h],r9        // double* four
000000013F321025 4C 89 44 24 18       mov         qword ptr [rsp+18h],r8        // int* three
000000013F32102A F2 0F 11 4C 24 10    movsd       mmword ptr [rsp+10h],xmm1     // double two
000000013F321030 89 4C 24 08          mov         dword ptr [rsp+8],ecx         // int one
000000013F321034 57                   push        rdi                           // non-volatile register
     9:     *three = one + 1;
000000013F321035 8B 44 24 10          mov         eax,dword ptr [rsp+10h]       // load up one
000000013F321039 FF C0                inc         eax                           // and add 1 to it
000000013F32103B 48 8B 4C 24 20       mov         rcx,qword ptr [rsp+20h]       
000000013F321040 89 01                mov         dword ptr [rcx],eax           // move that value to three
    10:     *four = two + one;
000000013F321042 F2 0F 2A 44 24 10    cvtsi2sd    xmm0,dword ptr [rsp+10h]      // convert int to double. 
                                                                                // Again, this is not optimized
000000013F321048 F2 0F 10 4C 24 18    movsd       xmm1,mmword ptr [rsp+18h]     // load two
000000013F32104E F2 0F 58 C8          addsd       xmm1,xmm0                     // add them both
000000013F321052 0F 28 C1             movaps      xmm0,xmm1                     // move result to xmm0
000000013F321055 48 8B 44 24 28       mov         rax,qword ptr [rsp+28h]       // set the result to *four          
000000013F32105A F2 0F 11 00          movsd       mmword ptr [rax],xmm0  
    11: }
000000013F32105E 5F                   pop         rdi                           // restore register
000000013F32105F C3   

The pointers are passed in via the r8 and r9 registers, the 32-bit int via ecx, and the double via the xmm1 register.

Let’s take a look at what the optimized version looks like. I had to modify the function prototype a bit, marking it noinline, to keep the compiler from inlining it.

Here is what you get:

_declspec(noinline) void Test1(int one, double two, int* three, double* four) 
     9:     *three = one + 1;
    10:     *four = two + one;
000000013FEC1000 F2 0F 58 0D F8 11 00 00    addsd       xmm1,mmword ptr [3FEC2200h]  
000000013FEC1008 41 C7 00 02 00 00 00       mov         dword ptr [r8],2  
000000013FEC100F F2 41 0F 11 09             movsd       mmword ptr [r9],xmm1  
    11: }
000000013FEC1014 C3                         ret    

Small, compact, and to the point, which is what you would expect from an optimizing compiler. This is the result of whole program optimization: the values are read directly from known memory locations, and most of the work has been optimized away. This is not what I need.

If I do want to generate assembly code for use from C#, I will have to work from a modified version of the non-optimized code: one that avoids hard-coded memory references and instead picks its data up from the registers set up by the calling convention.

Maybe I should look at something similar to my original problem, which is manipulating arrays of data.

How about something like this:

_declspec(noinline) double SumOfSquares(int* input, size_t size)
{
    double temp = 0;
    for (int i = 0; i < size; i++)
    {
        temp += (input[i] * input[i]);
    }
    return temp;
} 

Here is a debug build’s output.

 _declspec(noinline) double SumOfSquares(int* input, size_t size)
    14: {
000000013FEB3410 48 89 54 24 10       mov         qword ptr [temp],rdx      // spill parameters to their home
000000013FEB3415 48 89 4C 24 08       mov         qword ptr [rsp+8],rcx     // space, for easier debugging
000000013FEB341A 57                   push        rdi                       
000000013FEB341B 48 83 EC 20          sub         rsp,20h                   // space for local variables 
000000013F22341F 48 8B FC             mov         rdi,rsp  
000000013F223422 B9 08 00 00 00       mov         ecx,8  
000000013F223427 B8 CC CC CC CC       mov         eax,0CCCCCCCCh  
000000013F22342C F3 AB                rep stos    dword ptr [rdi]           // fill the new stack space with 0xCC guard bytes 
000000013F22342E 48 8B 4C 24 30       mov         rcx,qword ptr [rsp+30h]   // input
    15:     double temp = 0;
000000013F223433 0F 57 C0             xorps       xmm0,xmm0                 // clear 
000000013F223436 F2 0F 11 44 24 10    movsd       mmword ptr [rsp+10h],xmm0 // temp = 0 
    16:     for (int i = 0; i < size; i++)
000000013F22343C C7 44 24 18 00 00 00 00 mov         dword ptr [rsp+18h],0  // i= 0
000000013F223444 EB 0A                jmp         000000013F223450  
000000013F223446 8B 44 24 18          mov         eax,dword ptr [rsp+18h]  
000000013F22344A FF C0                inc         eax  
000000013F22344C 89 44 24 18          mov         dword ptr [rsp+18h],eax  
000000013F223450 48 63 44 24 18       movsxd      rax,dword ptr [rsp+18h]   // rax = i 
000000013F223455 48 3B 44 24 38       cmp         rax,qword ptr [rsp+38h]   // cmp i with size 
000000013F22345A 73 35                jae         000000013F223491          // jmp if i >= size 
    17:     {
    18:         temp += (input[i] * input[i]);
000000013F22345C 48 63 44 24 18       movsxd      rax,dword ptr [rsp+18h]   // rax = i 
000000013F223461 48 63 4C 24 18       movsxd      rcx,dword ptr [rsp+18h]   // rcx = i 
000000013F223466 48 8B 54 24 30       mov         rdx,qword ptr [rsp+30h]   // rdx = input 
000000013F22346B 4C 8B 44 24 30       mov         r8,qword ptr [rsp+30h]    // r8 = input 
000000013F223470 8B 04 82             mov         eax,dword ptr [rdx+rax*4] // input[i] 
000000013F223473 41 0F AF 04 88       imul        eax,dword ptr [r8+rcx*4]  // res =input[i] * input[i] 
000000013F223478 F2 0F 2A C0          cvtsi2sd    xmm0,eax                  // xmm0 = (double)res 
000000013F22347C F2 0F 10 4C 24 10    movsd       xmm1,mmword ptr [rsp+10h] // xmm1 = temp 
000000013F223482 F2 0F 58 C8          addsd       xmm1,xmm0                 // xmm += res 
000000013F223486 0F 28 C1             movaps      xmm0,xmm1  
000000013F223489 F2 0F 11 44 24 10    movsd       mmword ptr [rsp+10h],xmm0 // save res to temp
    19:     }
000000013F22348F EB B5                jmp         000000013F223446          // repeat the loop 
    20:     return temp;
000000013F223491 F2 0F 10 44 24 10    movsd       xmm0,mmword ptr [rsp+10h] // mov temp to return value 
    21: }
000000013F223497 48 83 C4 20          add         rsp,20h  
000000013F22349B 5F                   pop         rdi  
000000013F22349C C3                   ret  

I got a lot of help from this article about understanding optimized x64 assembly. There are several things in this code that would make our lives easier if we ever had to debug it. For example, copies of the function parameters are saved on the stack, and the extra stack space for temporary variables is initialized with guard bytes (0xCCCCCCCC).

Let’s repeat the process now, but for the release version. I had to make some changes to the function to prevent it from being inlined. The release build uses /Ox (full optimization). And since the purpose of this exercise is to speed things up, I want to make sure my loop gets vectorized, so I turned on the compiler’s vectorization report to verify this and also switched the floating-point model to /fp:fast.
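For reference, the compiler switches involved look roughly like this when building from the command line (the file name is just a placeholder):

cl /Ox /fp:fast /Qvec-report:2 /c sumofsquares.cpp

/Qvec-report:2 is the switch that makes the compiler report which loops it did (or did not) vectorize. With that in place, here is the release build’s output.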

_declspec(noinline) double SumOfSquares(int* input, size_t size)
    14: {
000000013F881000 48 89 5C 24 08       mov         qword ptr [rsp+8],rbx    // save non-volatile rbx on stack
000000013F881005 57                   push        rdi                      // non-volatile register 
000000013F881006 48 83 EC 20          sub         rsp,20h                  // make space for temp vars 
000000013F88100A 48 8B DA             mov         rbx,rdx  
000000013F88100D 48 8B F9             mov         rdi,rcx  
    15:     // add a rand() call here to prevent inlining
    16:     double temp = rand();
000000013F881010 FF 15 0A 11 00 00    call        qword ptr [3F882120h]     // ignore this
    17:     temp = 0;
    18:     for (int i = 0; i < size; i++)
000000013F881016 33 D2                xor         edx,edx                   // clear the result of rand()
000000013F881018 0F 57 D2             xorps       xmm2,xmm2                 // clear 
000000013F88101B 44 8B C2             mov         r8d,edx  
000000013F88101E 48 83 FB 04          cmp         rbx,4                     // cmp size and 4 
000000013F881022 72 62                jb          000000013F881086          // jmp if size < 4 
000000013F881024 83 3D F1 1F 00 00 02 cmp         dword ptr [__isa_available],2
000000013F88102B 7C 59                jl          000000013F881086          // jmp if not supported, which is true in
                                                                            // my case 
000000013F88102D 48 8B C3             mov         rax,rbx  
000000013F881030 48 8B CB             mov         rcx,rbx  
.....   // SSE 4.2 code snipped since I can't run it
000000013F881082 F2 0F 58 D0          addsd       xmm2,xmm0  
000000013F881086 49 63 C0             movsxd      rax,r8d                   // jmp here if SSE4.2
                                                                            // is not supported 
                                                                            // r8d = i, rax = i
000000013F881089 48 3B C3             cmp         rax,rbx                   // compare size and i 
000000013F88108C 73 32                jae         000000013F8810C0          // jmp if above 
000000013F88108E 48 8D 0C 87          lea         rcx,[rdi+rax*4]           // rcx = &input[i] 
000000013F881092 0F 1F 40 00          nop         dword ptr [rax]           // not sure about the 
    17:     temp = 0;                                                       // purpose of all
    18:     for (int i = 0; i < size; i++)                                  // these nops
000000013F881096 66 66 0F 1F 84 00 00 00 00 00 nop         word ptr [rax+rax+00000000h]  
000000013F8810A0 8B 01                mov         eax,dword ptr [rcx]       // eax = input[i]
000000013F8810A2 41 FF C0             inc         r8d                       // i++ 
000000013F8810A5 48 83 C1 04          add         rcx,4  
    19:     {
    20:         temp += (input[i] * input[i]);
000000013F8810A9 0F AF C0             imul        eax,eax                   // signed int multiply
000000013F8810AC 66 0F 6E C8          movd        xmm1,eax                  // cache result 
000000013F8810B0 49 63 C0             movsxd      rax,r8d                   // rax = current i 
000000013F8810B3 F3 0F E6 C9          cvtdq2pd    xmm1,xmm1                 // convert to double 
000000013F8810B7 F2 0F 58 D1          addsd       xmm2,xmm1                 // save to temp 
000000013F8810BB 48 3B C3             cmp         rax,rbx  
000000013F8810BE 72 E0                jb          000000013F8810A0  
    21:     }
    22:     return temp;
000000013F8810C0 0F 28 C2             movaps      xmm0,xmm2                 // save temp result 
    23: }
000000013F8810C3 48 8B 5C 24 30       mov         rbx,qword ptr [rsp+30h]   // restore rbx 
000000013F8810C8 48 83 C4 20          add         rsp,20h                   // clear up stack 
000000013F8810CC 5F                   pop         rdi  
000000013F8810CD C3                   ret  

__isa_available checks whether SSE4.2 is supported. My current CPU does not support SSE4.2, so the code path skips all that goodness. If I had made temp an int instead of a double, the compiler would have generated SSE2 integer arithmetic instructions. I did confirm this, but let’s continue with what we have so far.

Generate assembly byte code

Based on what I read earlier, I whipped up the following bit of code to calculate the sum of squares.

# compile using gcc
.intel_syntax noprefix
.data
.text
.global main
main:
# double sum(int* input, size_t size)
#	rcx = input
#	rdx = size
push rdi                     # save non-volatile register
mov rbx, rdx                 # rbx = size
mov rdi, rcx                 # rdi = input
xor edx, edx                 # zero edx
xorps xmm2, xmm2             # temp = 0
mov r8d, edx                 # i = 0
movsxd rax, r8d
lea rcx, [rdi+rax*4]         # rcx = &input[i]
@calcbegin:
mov eax, dword ptr [rcx]     # eax = input[i]
inc r8d                      # i++
add rcx, 4                   # advance to next element
imul eax, eax                # input[i] * input[i]
movd xmm1, eax
movsxd rax, r8d
cvtdq2pd xmm1, xmm1          # convert the product to double
addsd xmm2, xmm1             # temp += product
cmp rax, rbx                 # loop while i < size
jb @calcbegin
movaps xmm0, xmm2            # return temp in xmm0
pop rdi
ret    

There is no SSE vectorization here, but it should work. Now all I need to do is assemble this and generate the byte code. I used gcc for this: generate the .o file, then dump its disassembly into a text file.

gcc -g -c simplelooptest.s -m64 && objdump -d -M intel simplelooptest.o > test.out

Manually parsing the disassembly for the byte code is a pain, so I wrote up a quick F# script that does it for me. You can see the script here. Just give it the full path to test.out and it prints out the byte array to the debugger.
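If you’d rather not use F#, the same idea in C# looks roughly like the sketch below. It relies on the objdump output format of address, tab, hex bytes, tab, mnemonic, which can vary a bit between binutils versions, so treat it as a starting point rather than a robust parser:

using System;
using System.Globalization;
using System.IO;
using System.Linq;

internal static class ByteCodeDumper
{
   // Pull the hex-byte column out of "objdump -d -M intel" output and
   // print it as a C# byte array initializer.
   public static void Dump(string pathToTestOut)
   {
      var bytes = File.ReadLines(pathToTestOut)
         .Select(line => line.Split('\t'))                 // address \t hex bytes \t mnemonic
         .Where(fields => fields.Length >= 2)
         .SelectMany(fields => fields[1].Trim().Split(' '))
         .Where(token => token.Length == 2)
         .Select(token => byte.Parse(token, NumberStyles.HexNumber));

      Console.WriteLine("byte[] data = { "
         + string.Join(", ", bytes.Select(b => "0x" + b.ToString("x2")))
         + ", };");
   }
}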

How do you actually use the byte code in C#? You need to allocate memory with VirtualAlloc, copy the byte array into it, and mark it executable with VirtualProtect. Then use Marshal.GetDelegateForFunctionPointer() to wrap that pointer in a delegate with the correct method signature, which lets you call the instructions like a normal method.

I wrote a simple console application that shows this.

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

namespace ConsoleTest
{
   internal class Program
   {
      #region Native stuff

      [Flags]
      public enum AllocationType : uint
      {
         COMMIT = 0x1000,
         RESERVE = 0x2000,
         RESET = 0x80000,
         LARGE_PAGES = 0x20000000,
         PHYSICAL = 0x400000,
         TOP_DOWN = 0x100000,
         WRITE_WATCH = 0x200000
      }

      [Flags]
      public enum MemoryProtection : uint
      {
         EXECUTE = 0x10,
         EXECUTE_READ = 0x20,
         EXECUTE_READWRITE = 0x40,
         EXECUTE_WRITECOPY = 0x80,
         NOACCESS = 0x01,
         READONLY = 0x02,
         READWRITE = 0x04,
         WRITECOPY = 0x08,
         GUARD_Modifierflag = 0x100,
         NOCACHE_Modifierflag = 0x200,
         WRITECOMBINE_Modifierflag = 0x400
      }

      public enum MemoryFreeType : uint
      {
         MEM_DECOMMIT = 0x4000,
         MEM_RELEASE = 0x8000
      }

      [DllImport("kernel32.dll", SetLastError = true)]
      private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
         AllocationType flAllocationType, MemoryProtection flProtect);

      [DllImport("kernel32.dll", SetLastError = true)]
      private static extern bool VirtualProtect(IntPtr lpAddress, uint dwSize,
         MemoryProtection flNewProtect, out MemoryProtection lpflOldProtect);

      [DllImport("kernel32.dll", SetLastError = true)]
      private static extern bool VirtualFree(IntPtr lpAddress, UIntPtr dwSize,
         MemoryFreeType dwFreeType);

      #endregion

      private delegate double SumOfSquaresDelegate(int[] input, int size);

      /// <summary>
      /// Create the buffer for executable memory and return a ptr to it
      /// </summary>
      /// <param name="data">The byte code</param>
      /// <returns>The ptr to executable memory</returns>
      private static IntPtr SetupBuffer(byte[] data)
      {
         var codeBuffer = VirtualAlloc(IntPtr.Zero, new UIntPtr((uint)data.Length),
            AllocationType.COMMIT | AllocationType.RESERVE,
            MemoryProtection.READWRITE);

         Marshal.Copy(data, 0, codeBuffer, data.Length);
         MemoryProtection oldProtection;
         VirtualProtect(codeBuffer, (uint)data.Length, MemoryProtection.EXECUTE_READ, out oldProtection);
         return codeBuffer;
      }

      /// <summary>
      /// Free memory allocated by VirtualAlloc
      /// </summary>
      /// <param name="ptr">ptr to VirtualAlloc memory</param>
      private static void FreeBuffer(IntPtr ptr)
      {
         VirtualFree(ptr, UIntPtr.Zero, MemoryFreeType.MEM_RELEASE);
      }

      private static void Main(string[] args)
      {
         BenchmarkSumOfSquares();
         Console.ReadLine();
         return;
      }

      public static void BenchmarkSumOfSquares()
      {
        // The byte code for the bit of assembly listed above.
         byte[] data = { 0x57, 0x48, 0x89, 0xd3, 0x48, 0x89, 0xcf, 0x31, 0xd2, 0x0f, 0x57, 0xd2, 0x41, 0x89, 0xd0, 0x49, 0x63, 0xc0, 0x48, 0x8d, 0x0c, 0x87, 0x8b, 0x01, 0x41, 0xff, 0xc0, 0x48, 0x83, 0xc1, 0x04, 0x0f, 0xaf, 0xc0, 0x66, 0x0f, 0x6e, 0xc8, 0x49, 0x63, 0xc0, 0xf3, 0x0f, 0xe6, 0xc9, 0xf2, 0x0f, 0x58, 0xd1, 0x48, 0x39, 0xd8, 0x72, 0xe0, 0x0f, 0x28, 0xc2, 0x5f, 0xc3, 0x90, 0x90, 0x90, 0x90, 0x90, };

         var buffer = SetupBuffer(data);
         var sumSquaresDelegate = (SumOfSquaresDelegate)Marshal.GetDelegateForFunctionPointer(buffer, typeof(SumOfSquaresDelegate));
         const int size = 50000;
         int[] input = new int[size];
         var rand = new Random(13);
         for (int i = 0; i < size; i++)
         {
            input[i] = rand.Next(10);
         }

         double res = 0, managedRes = 0;

         Stopwatch sw = new Stopwatch();
         sw.Start();
         for (int i = 0; i < 10000; i++)
         {
            res = sumSquaresDelegate(input, size);
         }
         sw.Stop();
         Console.WriteLine("assembly time: {0}", sw.Elapsed.TotalMilliseconds);

         sw.Restart();
         for (int i = 0; i < 10000; i++)
         {
            managedRes = SumOfSquaresManaged(input);
         }
         sw.Stop();
         Console.WriteLine("managed time: {0}", sw.Elapsed.TotalMilliseconds);

         FreeBuffer(buffer);
         Console.WriteLine("sum of squares managed: {0}", managedRes);
         Console.WriteLine("sum of squares assembly: {0}", res);
      }

      /// <summary>
      /// C# version of sum of squares
      /// </summary>
      /// <param name="input">input data</param>
      /// <returns>sum of squares</returns> 
      public static double SumOfSquaresManaged(int[] input)
      {
         double temp = 0;
         foreach (var item in input)
         {
            temp += (item * item);
         }
         return temp;
      }
   }
}

This console app also benchmarks the byte code against its C# equivalent. Notice I’m not using unsafe C# here; I probably should have, since that would generate code closer to what the byte code looks like. It runs the two versions of the sum of squares a bunch of times and measures how long each one took.
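For comparison, an unsafe version of the managed loop would look something like the following sketch (it needs /unsafe enabled on the project, and I haven’t benchmarked it here):

public static unsafe double SumOfSquaresUnsafe(int[] input)
{
   double temp = 0;
   fixed (int* p = input)
   {
      for (int i = 0; i < input.Length; i++)
      {
         // pointer access, closer in spirit to the hand-written assembly above
         temp += (p[i] * p[i]);
      }
   }
   return temp;
}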

Here is the output of the console app for a release build.

assembly time: 1349.824
managed time: 3829.5137
sum of squares managed: 1432310
sum of squares assembly: 1432310

Looks like the assembly version is about 3X faster than the plain C# version. I didn’t expect there to be such a huge improvement. I wouldn’t recommend using this approach for production code. It would be very tedious to maintain this manually, and I would trust the .NET JIT to improve the code performance over time without me having to rewrite the C# code. This was purely an academic exercise on my part.

The next thing I’ll try is using SSE2 intrinsics and embedding the byte code for that. But that’s for another post.