I was trying to figure out how to squeeze some more speed out of some calculations, which involved the following bit of math.
The language of choice was C# with Visual Studio 2013 for a 64-bit process.
One idea was to try to use SSE to improve the throughput of the computation. Unfortunately, the .NET Framework currently (as of .NET 4.2) does not generate SSE instructions as part of its JIT compilation. A workaround would be to make a DLL with the math function exported, and call that exported function from C# using P/Invoke. The Visual Studio compiler (2012 and above) is an excellent optimizing compiler with the ability to auto-vectorize loops. But then I would have to add a C DLL as a dependency of my C# application. One more file to keep track of, just for a single calculation.
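For illustration, the P/Invoke route would look something like this (FastMath.dll and its export are hypothetical names, not code from this project):

```csharp
using System.Runtime.InteropServices;

static class FastMath
{
    // Hypothetical export from a native DLL built with the VS C compiler,
    // e.g. double sum_of_squares(const double* data, int length);
    [DllImport("FastMath.dll", CallingConvention = CallingConvention.Cdecl,
               EntryPoint = "sum_of_squares")]
    public static extern double SumOfSquares(double[] data, int length);
}
```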
But what if I could take the assembly instructions generated for the C code and embed that directly into my C# application?
I got a little obsessed with this idea, so I started investigating it.
I needed to learn some x64 assembly. One way to start would be to write up some C code and look at what the compiler generates.
Learning assembly with a compiler's help
64-bit assembly is a broad topic by itself. I had worked with a bit of x86 assembly back in my university days, but that's about it. There is only one calling convention for 64-bit Windows code, similar to how fastcall worked with x86 assembly. You can read more about it here, but the gist of it is that the first four parameters are passed in registers (rcx, rdx, r8, and r9 for integers and pointers; XMM0–XMM3 for floating-point values), and the rest of the function's parameters go on the stack frame.
So I started writing bits of test code and looking at the generated assembly. All the samples I show were tested using Visual Studio 2013 Update 5.
Here is a simple example, just to see how the calling convention works. Test1()
takes an int, a double, and a pointer to each of those types.
|
|
This is a useless little program, but it will tell me how values and pointers are passed around and set up when calling a function.
Here is what the x64 Debug version looks like.
|
|
The pointers are passed in via the r8 and r9 registers, the 32-bit int via ecx, and the double via the XMM1 register.
Let's take a look at what the optimized version looks like.
I had to modify the code a little to fight inlining, by specifying noinline on the function prototype.
Here is what you get:
|
|
Small, compact, and to the point, which is what you would expect from an optimizing compiler. This is the result of whole program optimization: the values are read from memory locations directly, and most of the operation has been optimized out. This is not what I need.
If I did want to generate assembly code for use from C#, I would have to use a modified version of the non-optimized code, so that it avoids hard-coded memory references for its data and instead loads data from registers.
Maybe I should look at something similar to my original problem, which is manipulating arrays of data.
How about something like this:
|
|
Here is a debug build’s output.
|
|
I got a lot of help from this article about understanding optimized x64 assembly. There are several things in this code that would make our lives easier if we ever had to debug it. For example, copies of the function parameters are saved on the stack, and guard bytes (0xCCCCCCCC) are used when initializing the extra stack space for temporary variables.
Let's repeat the process now, but for the release version. I had to make some changes to the function to prevent it from being inlined. The release version is built with /Ox (full optimization) turned on. And since the purpose of this exercise is to speed things up, I want to make sure my loop gets vectorized, so I turned on the vectorization report (/Qvec-report) to verify this, and also switched the floating-point model to /fp:fast.
|
|
__isa_available
checks to see if SSE4.2 is supported. My current CPU does not support SSE4.2, so the code path skips all that goodness. If I had made temp
an int
instead of a double, the compiler would have generated SSE2 integer arithmetic instructions.
I actually did confirm this, but let's continue with what we have so far.
Generate assembly byte code
Based on what I read earlier, I whipped up the following bit of code to calculate the sum of squares.
|
|
There is no SSE here, but it should work. Now all I need to do is assemble this and generate the byte code. I used gcc for this: I generate the .o file, and dump its disassembly into a text file.
|
|
Manually parsing the disassembly for the byte code is a pain, so I wrote up a quick F# script that does it for me.
You can see the script here.
Just give it the full path to test.out
and it prints out the byte array to the debugger.
How do you actually use the byte code in C#?
You need to allocate executable memory space using VirtualAlloc
and copy the byte array there.
Then use Marshal.GetDelegateForFunctionPointer()
to actually execute the instructions with the
correct method signature.
I wrote a simple console application that shows this.
|
|
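That pattern can be sketched roughly like this (the delegate signature and names are my own, not the exact code from the console app; a real version should also switch the page to PAGE_EXECUTE_READ after copying and release the memory with VirtualFree when done):

```csharp
using System;
using System.Runtime.InteropServices;

static class ByteCodeLoader
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    const uint MEM_COMMIT  = 0x1000;
    const uint MEM_RESERVE = 0x2000;
    const uint PAGE_EXECUTE_READWRITE = 0x40;

    // Hypothetical signature for the sum-of-squares routine:
    // a pointer to the doubles and a length in, the sum out.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate double SumOfSquaresDelegate(IntPtr data, int length);

    public static SumOfSquaresDelegate Load(byte[] byteCode)
    {
        // Reserve and commit a block of executable memory...
        IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)byteCode.Length,
                                  MEM_COMMIT | MEM_RESERVE,
                                  PAGE_EXECUTE_READWRITE);
        if (mem == IntPtr.Zero)
            throw new OutOfMemoryException("VirtualAlloc failed");

        // ...copy the instruction bytes into it...
        Marshal.Copy(byteCode, 0, mem, byteCode.Length);

        // ...and wrap the raw pointer in a callable delegate.
        return (SumOfSquaresDelegate)Marshal.GetDelegateForFunctionPointer(
            mem, typeof(SumOfSquaresDelegate));
    }
}
```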
This console app also benchmarks the byte code against its C# equivalent. Notice I'm not using unsafe C# here; I probably should have, since that would generate code closer to what the byte code looks like. It runs the two versions of the sum of squares a bunch of times and measures how long each run took.
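For reference, the plain C# version on the managed side is just a straightforward loop like this (my reconstruction, not the exact benchmark code):

```csharp
static double SumOfSquares(double[] data)
{
    // Accumulate data[i] * data[i] over the whole array.
    double sum = 0.0;
    for (int i = 0; i < data.Length; i++)
        sum += data[i] * data[i];
    return sum;
}
```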
Here is the output of the console app for a release build.
|
|
Looks like the assembly version is about 3X faster than the plain C# version. I didn’t expect there to be such a huge improvement. I wouldn’t recommend using this approach for production code. It would be very tedious to maintain this manually, and I would trust the .NET JIT to improve the code performance over time without me having to rewrite the C# code. This was purely an academic exercise on my part.
Next thing I’ll try is to use SSE2 intrinsics and embed the byte code for that. But that’s for another post.