Intrinsics in C#

Can your CPU calculate faster than you thought? Can .NET take advantage of features that used to be mostly the domain of C/C++ to create even faster applications? Is Python really the only good choice for data crunching?

If you want to learn about some not-that-popular but extremely useful goodies available from .NET Core 3.0 onwards, in the shape of SIMD intrinsic functions, please keep reading!

For the past half year I've mostly been working on my autonomous rover/tank project, which I'll present in the near future, after hitting the next milestone with it. Unfortunately I fried the Meadow board that is its main brain for the second time, so I'm waiting until I have the funds to order a new one from the US, and even then it will probably take at least two weeks. To not lose time I decided to tackle another task that will be part of this project: to make it really autonomous, it needs to be aware of its surroundings. For that I decided to use stereo vision, and since the solution I'm planning to use doesn't require the Meadow, I can work on it even now. For the details there will be another post, but let's get to today's main topic:

Single instruction, multiple data (SIMD) in C#

Me being me, I decided to build my image processor myself, from scratch. Yes, I know OpenCV exists, but where's the fun in that? I've spent the past few days researching different image processing algorithms and getting general knowledge of the topic, and one thing is obvious even without that: image processing is extremely computation heavy. And I mean extremely. The ESP32-CAM modules I'm planning to use output 2 MPx images. Let's assume a simple 3×3 convolution kernel applied to all 3 channels, outputting one channel: we have 9 values to multiply for each pixel for each channel, and then we need to sum all of them. 2 million * 3 * 9 = 54 million multiplications, then roughly 3 * 9 additions per pixel, so another ~54 million additions. Over a hundred million operations for a simple 2 MPx picture and a simple 3×3 kernel! Who's ready to calculate that by hand? 🙂
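To make that concrete, here is roughly what the per-pixel inner loop of such a convolution looks like (just a sketch: image, kernel, output and the x/y coordinates are hypothetical names, and border handling is ignored):

// 3x3 convolution for a single output pixel on one channel:
// 9 multiplications plus 9 accumulating additions per pixel per channel.
float sum = 0f;
for (int ky = -1; ky <= 1; ky++)
{
    for (int kx = -1; kx <= 1; kx++)
    {
        sum += image[y + ky, x + kx] * kernel[ky + 1, kx + 1];
    }
}
output[y, x] = sum;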

And for more advanced operations the workload only grows from there. It's a case where every speedup has a real effect on the output. We're not talking about a webpage loading 10 ms faster here; we're in the multiple-seconds-per-calculation realm. And the difference between 1 and 10 frames per second is extreme.

In my previous computation-heavy project, an n-body problem simulator, I already tried to tackle this problem, as it required calculating the forces between each pair of bodies in each step, so the work grew quadratically with the number of objects. At least there were more variables to play with: for example, setting a bigger time step would allow for faster calculation, but due to the use of the Euler or Runge-Kutta 4th order methods the simulation wasn't perfect, and with too big a time step strange artifacts could happen. For example, the Moon could shoot away from the Earth, which is obviously not quite what we observe outside our windows!

There I did the best I could at the time. I tried multiple approaches and in the end was most successful with both using Parallel.For and manually creating Threads and dividing the workload between them. As it ran in real time it also required some other machinery, like a Barrier to synchronize calculations between threads, but in the end I was able to get over 5 times faster calculations on an 8-logical-core CPU (i7 7700K) with a high number of objects. Parallel.For started to beat serial calculation from around 100 objects, and my own implementation was gaining speed with as few as 10 objects! I was really happy then, but I couldn't squeeze out anything more. I tried different algorithms, like moving from Euler, which I knew and had used for as long as I can remember, to RK4, which was completely new to me, but that was more learning and fun than real results.

I knew one solution that would definitely help, using CUDA, but that was something I didn't want to use then and I'm still putting it away for the future. It's obviously a great tool and I'll have to learn it one day, as it fits perfectly with what I love most in coding, but just not yet. Other than that I was done; I couldn't optimize it further, so I played around with it and left it.

But today I know more and I can code better things, so this wasn't enough anymore. Is there anything that could help, short of moving to the GPU?

YES!

I must admit I had known about AVX for years, but somehow I never dove deeper into the SIMD concept. I just knew it was something that should be faster, but lowers CPU clocks, and that not many applications use it. That's it. Also, as far as I can see in the docs, System.Runtime.Intrinsics.X86 was released with .NET Core 3.0, so barely a year and a half ago; there has also been the older Vector<T> implementation in System.Numerics for a few years now, but to be honest, even today when searching for tutorials on the topic there just isn't much out there. I personally first ran into it just a few months ago, when researching ML.NET for future use, as it automatically uses different instruction sets (AVX, SSE or just standard calculations, depending on the platform), but at the time I just glanced over it.

Now, on the other hand, I wanted as much speed as I could get, so I decided to research the topic. And even though I'm still crawling in it like a newborn, I can already see it is a GOOD idea. A VERY good one.

What is SIMD? AVX? SSE? WTF?

For those who don't know: SIMD is a special kind of instruction that allows performing calculations on multiple numbers at once, hence the name: Single Instruction, Multiple Data. If you're not an "IT guy", just imagine a clerk with a calculator. He reads one number, puts it in the calculator, selects an operation, reads another, puts it in, gets the result and writes it down. And then again with the next pair. And the next, and the next… He only calculates one thing at a time.

You can put multiple clerks in one room. Put in 8 and you can have the final results 8 times faster, ignoring the fact that they may need to swap data between themselves, etc. This is just straight parallel computing: with multiple cores you can divide the workload between them and get results faster. As I said above, with 8 logical cores in my MiniSolaris project I was able to get at least a 5x speedup; there are of course some things that just can't be divided that way, sometimes one clerk (core/thread) takes more time and the others have to wait for him doing nothing, and so on, so it's not a perfect world. But we are still bound by the fact that one clerk can do only one thing at any given time.

But… can't we really do things another way? There must be something.

Imagine that each clerk has, let's say, four pairs of hands and four calculators, and can do each operation on four numbers at once. That would be an interesting view! But what does that give us?

In the time he used to input one number, he now inputs 4. So in the end, in the same time, he does 4 times more calculations. It must be cheaper than employing 4 people! Although trying to find eight-handed clerks might be a bit hard, and making them surgically would probably cost even more…

But back to the topic: imagine the processor can do exactly the same thing! Instead of calculating only one number at a time, we can load multiple ones and get multiple results in the same time!

For example, my CPU, an i7 7700K, supports the AVX2 instruction set. It means it has special 256-bit registers: placeholders that can be filled with multiple numbers. Just imagine your tax declaration, with its rows of single-character cells: in some fields you are supposed to put only one thing, usually a number, but in others you can put, for example, multiple words, like a company name, as long as they fit within the available cells, right?

Using SIMD instructions the CPU does exactly that: in one placeholder that can take up to 256 bits of data, it puts multiple numbers! For example, one byte, a number between 0 and 255, takes exactly 8 bits in computer memory. 256/8 = 32, so we can put as many as 32 such numbers in there at once! "Only 255?" you might say… Surprise, surprise: a typical image you view on your screen consists of 3 channels (R, G and B) for each pixel, each one being exactly that, a value between 0 and 255! Even though it doesn't seem like a lot, it allows you to experience the whole world in so many colors straight through your phone!
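In .NET these registers are represented by the Vector256<T> type, and you can quickly check how many values of a given type fit into one (a minimal sketch):

using System;
using System.Runtime.Intrinsics;

// How many values fit into one 256-bit register, depending on the element type:
Console.WriteLine(Vector256<byte>.Count);   // 32 - 8-bit values
Console.WriteLine(Vector256<short>.Count);  // 16 - 16-bit values
Console.WriteLine(Vector256<float>.Count);  // 8  - 32-bit floats
Console.WriteLine(Vector256<double>.Count); // 4  - 64-bit doubles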

After we put our data in those special registers, we can use special CPU instructions that are aware of the data type inside them. For 8-bit values we will use instructions for exactly that type of data, so the result of one calculation won't overflow into the neighboring number. And here we get to the place where things start to be more complicated.
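As a tiny illustration of that idea (the Zero vectors here are just stand-ins for real data), the element type selects which instruction runs on the very same 256-bit register:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Same register size, different instruction depending on the element type:
Vector256<byte> bytes = Avx2.Add(Vector256<byte>.Zero, Vector256<byte>.Zero);    // 32 separate 8-bit additions
Vector256<float> floats = Avx.Add(Vector256<float>.Zero, Vector256<float>.Zero); // 8 separate float additions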

Intrinsics in C#

As I've already mentioned, in the .NET world this is a fairly new concept. Just as an example, although I have access to Pluralsight through my work, I wasn't able to find anything on the topic. Not one course. Only some basics in a C++ course. Even looking at YouTube there just isn't much, and a lot of the videos that do exist show the older way of using Vector<T>. It is easier to use: it checks hardware support on its own and chooses the best option possible (code using the Intrinsics namespace just won't work if you run it on a PC that doesn't support it). But from what I found on the internet it seems to be slower (although don't take my word for it, I haven't compared them myself), and in situations like this I simply prefer to have full control to get the best results possible. I'm also planning to use it on exactly one PC, mine, so I can write code exactly for it; and if in the future I wanted more universal code, it is easy to write a few versions, check hardware support with the IsSupported property and select the one that will work.
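A dispatch along these lines is what I have in mind (just a sketch: SumAvx2, SumSse and SumScalar are hypothetical helpers, not code from this post):

using System.Runtime.Intrinsics.X86;

static float[,] Sum(float[,] a, float[,] b)
{
    if (Avx2.IsSupported) return SumAvx2(a, b); // 256-bit path
    if (Sse.IsSupported) return SumSse(a, b);   // 128-bit fallback
    return SumScalar(a, b);                     // plain loops everywhere else
}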

As I mentioned above, C and C++ have obviously supported this for as long as it has existed. But if someone decides on C# instead of those languages, there must be a reason for that. And that doesn't mean we can't wish to take advantage of the best things from over there, right?

So let's check what is available to us. As I mentioned above, there are different versions of SIMD instruction sets, the older ones usually using smaller registers (fewer numbers at once) and offering fewer instructions, but I'll be using specifically the AVX2 set, as it's the best I can get on my hardware and should be supported by most modern CPUs anyway. There is also the newer AVX-512 set using 512-bit registers (64 bytes at once!) but… for now I'm still happy with my PC!

Here we can see all the instructions available in AVX2; also, since this class derives directly from the earlier Avx class, which also uses 256-bit registers, we can use that instruction set too (for example, the Store method can be found in Avx). There's a lot, right? But wait, what are those strange codes, with no explanation of what they do?

Don't you worry, and go here! This is the official list of all SIMD instructions available on Intel CPUs. We can select only the subsets interesting to us; for example, selecting Load, AVX and AVX2 will show the list of instructions available for loading different data types into 256-bit registers. As you click each one you get an explanation of what the instruction does and sample C++ code. But what does that have to do with C#?

Just compare the names there and in the Microsoft documentation. As you can see, they are the same. So you can approach it both ways: find an interesting instruction on the Intel website and then find the corresponding C# method wrapping it, or, if you want an explanation of what a C# method does, just search for its name in the Intel list.
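A few examples of how the names line up for the float variants, as listed in the Microsoft docs:

// Intel intrinsic  -> .NET wrapper
// _mm256_loadu_ps  -> Avx.LoadVector256(float*)
// _mm256_add_ps    -> Avx.Add(Vector256<float>, Vector256<float>)
// _mm256_storeu_ps -> Avx.Store(float*, Vector256<float>)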

Ok, but how well does it perform?

I knew it must be fast. But just to check for myself I wrote a small console app in .NET 5, as it fully supports intrinsics. The whole code can be found here. It's rough, but it does what it is supposed to do, and it's just a quick, one-afternoon project, so please don't complain!

To test, I made 6 different functions, each performing multiple arithmetic operations on two arrays and storing the results in a third one. At the time of writing, in each iteration, for each cell, I was doing 8 rounds of addition, subtraction, multiplication and division, just to simulate the heavy work that would be done on images. The count field specifies how many times each method will be invoked in each test run; runs specifies how many runs we want. Then I save the times from each run of each method and average them to get the final result.
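Roughly, the harness works like this (a sketch of the idea, not the exact code from the repo; runs, count, a and b come from the test configuration):

using System.Diagnostics;
using System.Linq;

long[] ticks = new long[runs];
var sw = new Stopwatch();
for (int r = 0; r < runs; r++)
{
    sw.Restart();
    for (int c = 0; c < count; c++)
    {
        SumNormal(a, b); // swap in each of the 6 methods here
    }
    sw.Stop();
    ticks[r] = sw.ElapsedTicks;
}
double mean = ticks.Average(); // averaged per method to get the final result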

Just one warning: using a high count will lower the speed of the Thread and Thread + AVX methods due to the way I'm handling threads. Each time the method is invoked, multiple threads are created, and this takes some time. In my use case, where I'll be spending a lot of time on one image, it is perfectly fine and the overhead of creating new threads doesn't hurt performance; but a high count with a small resolution simulates many small pictures, and then that way of handling threads makes them slower by multiple orders of magnitude! That could probably be avoided by simply using ThreadPool threads, as in the sketch below, but again: I'm simulating one big chunk of data to be calculated, not multiple small ones.
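Something like this would reuse pooled threads instead of creating fresh ones on each call (hypothetical: ProcessRows is a made-up helper standing in for the per-thread work):

using System.Threading.Tasks;

var tasks = new Task[threadCount];
for (int k = 0; k < threadCount; k++)
{
    int index = k; // capture a stable copy for the lambda
    tasks[k] = Task.Run(() => ProcessRows(index)); // Task.Run queues work on the ThreadPool
}
Task.WaitAll(tasks);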

Let’s quickly check what each function does:

Normal:

private static float[,] SumNormal(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    float[,] result = new float[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            result[i, j] = a[i, j] + b[i, j];
            //(...)
        }
    }
    return result;
}

It's as easy as it gets: just iterating over each row, then each column, then performing the calculations on the selected cell.

Parallel:

private static float[,] SumParallel(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    float[,] result = new float[height, width];
    System.Threading.Tasks.Parallel.For(0, height, (i) =>
    {
        for (int j = 0; j < width; j++)
        {
            result[i, j] = a[i, j] + b[i, j];
            //(...)
        }
    });
    return result;
}

Very similar to the previous one, except that instead of the outer for loop I'm using Parallel.For; that way the runtime automatically manages dividing the work between threads and uses all cores to perform the calculations.

Threaded:

private static float[,] SumThreads(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    float[,] result = new float[height, width];
    int threadCount = Environment.ProcessorCount;
    int singleHeight = (int)Math.Ceiling(height / (float)threadCount);

    Thread[] threads = new Thread[threadCount];
    for (int k = 0; k < threadCount; k++)
    {
        int index = k; // capture a copy - the lambda must not share the loop variable
        threads[k] = new Thread(() =>
        {
            int startHeight = index * singleHeight;
            // Clamp so the last thread doesn't run past the array
            // when height is not divisible by threadCount
            int endHeight = Math.Min(startHeight + singleHeight, height);

            for (int i = startHeight; i < endHeight; i++)
            {
                for (int j = 0; j < width; j++)
                {
                    result[i, j] = a[i, j] + b[i, j];
                    //(...)
                }
            }
        });
    }
    foreach (Thread t in threads)
    {
        t.Start();
    }
    foreach (Thread t in threads)
    {
        t.Join();
    }
    return result;
}

This one seems more complicated; it's based on my own idea of dividing the workload instead of relying on .NET. It reads the number of available logical cores using Environment.ProcessorCount (8 in my case), then creates that many Threads so the workload can be divided between them. Each thread is assigned its part of the array, in my case 1/8 of the rows, and calculates results only for its part of the data, the same way as the normal, serial version (note the clamp on endHeight, so the last thread doesn't run past the end when the height doesn't divide evenly). At the end I'm using Join() so the main thread waits for all of them to finish their job. As each one operates only on its own data set and there is no data shared between them, nor are they changing the same parts of memory, there is no need for locks or any other synchronization; I can simply let them run to the end without any interference between them.

Ok, now we're getting into the more interesting stuff. The next 3 methods are technically the same as the 3 above, but take advantage of intrinsics. Let's see:

Normal with AVX:

private static float[,] SumAVX(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    int length = height * width;
    float[,] result = new float[height, width];
    unsafe
    {
        fixed (float* pA = a, pB = b, pR = result)
        {
            // Note: assumes length is a multiple of 8 (8 floats per 256-bit register);
            // otherwise the last iteration would read/write past the arrays.
            for (int i = 0; i < length; i += 8)
            {
                Avx2.Store(pR + i, Avx2.Add(Avx2.LoadVector256(pA + i), Avx2.LoadVector256(pB + i)));
                //(...)
            }
        }
    }
    return result;
}

As you can see, it's a bit different. The first thing we have to do is allow unsafe code in the compiler; fortunately, the fastest way is to just use the unsafe keyword and then Ctrl + . – Visual Studio will offer to enable it for you. The other way is to change it in the project properties.

Why unsafe? We need to use pointers. POINTERS.

Yeah, if someone had told me half a year ago that I'd be using pointers, I would probably have laughed and said there's a reason I decided on a managed language like C# instead of C or C++. But in the end, at least at the level used here, it's very easy.

The fixed keyword instructs the runtime to keep the specified data in one place in memory so we can use pointers (otherwise the garbage collector might move it around); then float* pA etc. hold the memory address of the first value in each array. As the pointer "knows" it points to the float type, incrementing it by 1 gives the address of the next float in line. And because memory is linear, a 2D array is stored just like a 1D array with the rows laid out one after another. So, for example, for the array:

int[,] array = {
    { 5, 6, 7 },
    { 4, 3, 2 },
    { 1, 8, 9 } };

In memory it is stored as 5, 6, 7, 4, 3, 2, 1, 8, 9. If we create a pointer with fixed (int* pA = array), it will point to the number 5. If we increment it by 1, it will point to the number 6. If we want the address of the number 3, we need a pointer to the second row, second column. How? The row index times the column count gives us the first number in the selected row, then we add the column index and voila! Here 1*3+1 = 4. As we can check manually, the number 3 is fifth, so in IT terms it's at index 4 (with the first number being the zeroth). That way I can iterate over the array as in the normal method and, having both coordinates, easily compute the memory position of the selected value.
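Putting that into code (a minimal sketch, using the array from the example above):

unsafe
{
    fixed (int* pA = array)
    {
        int width = array.GetLength(1);    // 3 columns per row
        int three = *(pA + 1 * width + 1); // row 1, column 1 -> the number 3
    }
}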

Then the whole magic happens here:

Avx2.Store(pR + i, Avx2.Add(Avx2.LoadVector256(pA + i), Avx2.LoadVector256(pB + i)));

The Avx2.LoadVector256() instruction just takes a 256-bit chunk from memory and loads it into one of the AVX registers. That's how I pull in the data from the two input arrays.

Then Avx2.Add sums the data in both registers. Because the pointer I specified points to float values, the compiler "knows" the data inside is floats, so it uses the appropriate SIMD instruction to sum them as 8 separate float values – like a clerk with 16 hands clicking "=" on 8 calculators at once!

Then the Avx2.Store() instruction copies the 256-bit data chunk back into memory at the result array's address. It doesn't need to know those are floats. It's like copying your company name from one tax form to another: you just need to transfer all the characters, and only when reading will you separate them into single words. Same here: when we access this array to read one value, we will just use the part we need.

Simple? Hard? I think that once you get how it works, it's not that hard. Of course, it requires keeping the data laid out in a specific way to get any gains: if you tried to load each value separately from a different location, you would lose most of the benefit.

This can be a bit counter-intuitive for "typical" .NET programming. For example, in my MiniSolaris I used the typical OOP approach where each object was its own instance of its class, with all its properties stored in it. In that case the data is not contiguous: in memory, after two position values you get maybe the mass, then the name, etc., and only later do we get to the next object.

For SIMD we would probably need to keep the data in dedicated arrays: for example, an array of coordinates, an array of masses, etc. Then they could easily be loaded in chunks for faster calculation with SIMD instructions, without fetching each value from a different address. And each object could, for example, keep a reference to its coordinates in those arrays.
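Something along these lines (a sketch of the idea, not actual MiniSolaris code):

// AoS - the typical OOP layout: each body's fields are scattered through the heap.
class Body
{
    public double X, Y, Mass;
    public string Name;
}

// SoA - a SIMD-friendly layout: one contiguous array per field,
// so 4 doubles (or 8 floats) can be loaded with a single instruction.
class Bodies
{
    public double[] X, Y, Mass;

    public Bodies(int n)
    {
        X = new double[n];
        Y = new double[n];
        Mass = new double[n];
    }
}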

Ok, let's check the remaining methods, although they are similar in principle to the ones above:

Parallel with AVX:

private static float[,] SumParallelAVX(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    float[,] result = new float[height, width];
    unsafe
    {
        System.Threading.Tasks.Parallel.For(0, height, (i) =>
        {
            fixed (float* pA = a, pB = b, pR = result)
            {
                // Note: assumes width is a multiple of 8 (8 floats per 256-bit register)
                for (int j = 0; j < width; j += 8)
                {
                    Avx2.Store(pR + i * width + j, Avx2.Add(Avx2.LoadVector256(pA + i * width + j), Avx2.LoadVector256(pB + i * width + j)));
                    //(...)
                }
            }
        });
    }
    return result;
}

This one looks like a child of the Parallel and AVX versions – and it is exactly that. I'm iterating over the rows with Parallel.For and then using a regular for loop over the selected row, calculating with AVX2 as above.

One thing I couldn't find a way around: fixed can't be used outside the anonymous function with the pointers then used inside it, so I had to pin the arrays on each invocation. But looking at the results, this isn't hurting performance.

And the final Threaded + AVX version:

private static float[,] SumThreadsAVX(float[,] a, float[,] b)
{
    int height = a.GetLength(0), width = a.GetLength(1);
    float[,] result = new float[height, width];
    int threadCount = Environment.ProcessorCount;
    int singleHeight = (int)Math.Ceiling(height / (float)threadCount);

    Thread[] threads = new Thread[threadCount];
    for (int k = 0; k < threadCount; k++)
    {
        int index = k; // capture a copy - the lambda must not share the loop variable
        threads[k] = new Thread(() =>
        {
            unsafe
            {
                fixed (float* pA = a, pB = b, pR = result)
                {
                    int startRow = index * singleHeight;
                    // Clamp so the last thread doesn't run past the array
                    // when height is not divisible by threadCount
                    int rows = Math.Min(singleHeight, height - startRow);
                    int localLength = rows * width; // assumed to be a multiple of 8
                    int shift = startRow * width;
                    float* pAlocal = pA + shift;
                    float* pBlocal = pB + shift;
                    float* pRlocal = pR + shift;
                    for (int j = 0; j < localLength; j += 8)
                    {
                        Avx2.Store(pRlocal + j, Avx2.Add(Avx2.LoadVector256(pAlocal + j), Avx2.LoadVector256(pBlocal + j)));
                        //(...)
                    }
                }
            }
        });
    }
    foreach (Thread t in threads)
    {
        t.Start();
    }
    foreach (Thread t in threads)
    {
        t.Join();
    }
    return result;
}

It uses the same logic as the non-AVX version; the only difference is that each thread gets a pointer to the beginning of its own chunk of data, based on the number of rows and CPUs available, and then performs its calculations in series.

Results?

Ok, so what were the results? As this is fresh code I haven't done extensive testing and result analysis yet, but here is what I learned:

  • The Normal version was, obviously, the slowest one.
  • AVX without parallelism was always faster than the Normal version.
  • With only one operation inside the loops, Parallel + AVX and Threaded + AVX were usually 3-5 times faster than the Normal one (using the Release configuration with code optimization; in Debug the normal code was even twice as slow) and were the fastest. Parallel and Threaded were somewhere in between, gaining speed from using all cores but still slower without the AVX speedup.
  • The Parallel and Threaded pairs, with and without AVX, usually performed similarly, sometimes one faster, sometimes the other. But there was the issue mentioned above with creating too many threads.
  • When calculating a lot of small arrays, like 40×40 px, the threaded versions became many, many times slower than even the normal one – but that was expected.

Knowing all of that, I decided to push it to its limits. I lowered the count per run to just one, set up a bigger array, and put multiple instructions in each loop to simulate a heavy workload on each cell – and my, oh my, that was better than I could ever imagine! How much faster? You'll see in a second. One more observation: the threaded versions were here a bit faster than the parallel ones in every test – exactly what I observed in MiniSolaris when I "invented" that way of sharing data (although I'm sure people were using it like this before, I know, but no – I hadn't found it on StackOverflow!).

So for the last test I wanted to go for the maximum: 32 operations each time, a 10000×10000 array. The results… just see the picture below!

The columns of numbers are just the results in ticks for each run, but the most important parts are the values in Mean and Speedup: mean is the average of all runs for each method, and speedup compares it to the normal method's time.

First: the Threaded + AVX method is over 17% faster than the Parallel one. So it looks like adding a bit of my own code can be beneficial in the end. Why are the non-AVX ones so different? Not sure… But anyway…

Second, and most important: the Threaded + AVX method is SIXTY-SEVEN TIMES FASTER than the normal one! Sixty-seven times! This is an EXTREME difference! A thing that would run at 1 frame per second could become faster than a 60 Hz monitor can show! I never suspected the difference could be THAT high! Even in theory, 8 threads times 8 numbers at once gives only 64 times more calculations, and this is even more than that! And of course this is with code optimization on, so the compiler did its magic to make the normal method faster too.

Summary

Obviously, real-life usage can show much smaller gains. It heavily depends on the use case: what will be calculated, how big the pictures are, how many and what type of operations. But I think this at least clearly shows that there are definite gains to be had from correctly using AVX instructions.

So is it worth spending a few dozen minutes getting your head around pointer basics and using them with intrinsics? Definitely! Obviously it won't be beneficial in every use case; to be honest, for most things it would just be a burden, as the code is harder to read and maintain. If an application is fast enough with standard ways of coding, or even with parallelism, there is no point. If your app just gets some data from a database, it doesn't matter how fast you can calculate if you have to wait for the data anyway!

But in those rare cases where spending a little more time on coding (and maybe documenting it for others to understand) will pay off? Where you need to perform A LOT of calculations? Then it's definitely the way to go! At least one of the possible ways – CUDA, I'm looking at you! (Although to use CUDA you need an appropriate graphics card, while SIMD in at least the SSE2 flavor should be available in every desktop CPU currently in use, and soon we should see ARM support too!)
