Optimizing Code in assembler
Previously we covered optimizing your code directly in the DSP Code component. Flowstone allows you to go even deeper and edit your code in assembler. How to use assembler was covered in the previous four-part article “How to use Assembler?”. I recommend reading at least the first two parts – they already cover some assembler optimization tips (some of which may reappear in this article too). You do not need to have mastered assembler to make use of most of these basic tips – even for a complete beginner these examples will be easy to follow.
So let’s come to the actual tips/steps:
0. Optimize your code in Code Component first
Optimizations done in the DSP code component carry down to assembly. In fact, if you ported inefficient code into assembly, you would probably end up doing the same optimizations as in the Code component. However, it would take you significantly longer because of assembler's poor readability – in fact, you might even miss a few things here and there, and those may have a bigger impact on CPU load than the optimizations you did make. Do yourself a favor – double-check that your DSP code is as optimal as possible. This naturally doesn't apply if you write the algorithm in assembler from scratch – that's why I numbered this as step zero.
1. Use registers instead of variables whenever possible
Reading data from memory is always more expensive than using data that is already in a register. Values in registers are already present in your processor and it can access them with a 0 to 2 cycle penalty, while reading data from memory takes 3 cycles in the best case and several hundred cycles in the worst case. So the first thing you should do when optimizing in assembler is to store constants in free xmm registers and then use those in the code. This makes the code less readable to the human eye, so make use of comments – note that register xmm6 now holds the value of “att”, and mark that every time you use the value, as seen in the example below. Also replace all your temporary variables with registers. In the example below, the variables “normin” and “c” are completely unneeded – you may simply use registers instead. The variables “att” and “out” are loaded from memory several times – we may load them into registers once and then use those instead. Variable “rel” is read only once, so there is nothing we can improve there.
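As an illustration of the idea (the instructions and register choices here are hypothetical, not the article's exact example), replacing repeated memory reads of “att” with a register might look like this in Flowstone-style ASM:

```asm
// before: "att" is read from memory on every use
movaps xmm0,in;
mulps xmm0,att;     // memory read
mulps xmm1,att;     // another memory read

// after: load "att" into a free register once
movaps xmm6,att;    // xmm6 = att (note it in a comment!)
movaps xmm0,in;
mulps xmm0,xmm6;    // att
mulps xmm1,xmm6;    // att
```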
2. Remove unnecessary moves
This naturally leads to several other optimizations. As you can see in the first 3 lines of stage2, we use register xmm0 in the computations and then pass the value to register xmm7. This was perfectly valid in the original code, where we passed the value to “normin”. Now, however, it is just an additional unneeded step. We may simply replace all occurrences of xmm0 in that part with xmm7 and delete the last movaps. The same situation occurs several times in the code, since we replaced some variables with registers.
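A minimal sketch of this pattern (hypothetical instructions, not the original stage2 code): instead of computing in xmm0 and copying the result into xmm7, compute directly in xmm7:

```asm
// before – result computed in xmm0, then copied:
movaps xmm0,in;
mulps xmm0,xmm6;    // att
addps xmm0,xmm1;
movaps xmm7,xmm0;   // leftover copy (used to be "normin")

// after – xmm0 replaced with xmm7, last movaps deleted:
movaps xmm7,in;
mulps xmm7,xmm6;    // att
addps xmm7,xmm1;
```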
3. Use registers destructively

There is another step we may take to reduce excessive moving of data. Normally, when we use a variable, we first have to load it into a register and use that copy. Remember, we replaced variables with registers – so the last time we use a “variable”, we may do so destructively: instead of copying the data from the old register to a new register, we simply use the original register. You can see this in the last segment of the code – we make a copy of “out” (now called xmm4); however, this is the last time the value is used in the code, so we may skip making the copy (by deleting the “movaps xmm0,xmm4”) and use xmm4 instead of xmm0 from that point on. The same situation occurs just a few lines above – we use “att” (xmm6) for the last time. Instead of making a copy, we pop it straight in (notice that this also changes which register the value of “c” ends up in – we have to make the appropriate changes further on).
Now let’s have a closer look at the last few lines of the code. You may see that the code subtracts xmm7 from xmm4, then makes a copy of xmm4 and uses the copy further on. The same happens just two lines lower with the xmm1 register. Why? The reason is simple – the compiler is built in a way that lets it compile fairly complicated formulas by storing the results of brackets, multiplications and divisions as copies, and then reusing the rest of the registers in other parts of the formula. This naturally comes at the price of unnecessary moves in the assembler code when simpler formulas are used. We may apply the same logic as before – use the results destructively and remove the copy-making steps.
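The destructive-use idea can be sketched like this (hypothetical registers, not the article's exact lines):

```asm
// before – a copy is made even though xmm4 ("out") is used for the last time:
subps xmm4,xmm7;
movaps xmm0,xmm4;   // unneeded copy
mulps xmm0,xmm2;

// after – use xmm4 destructively; the result now lives in xmm4,
// so later code must be adjusted accordingly:
subps xmm4,xmm7;
mulps xmm4,xmm2;
```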
Now stage(2) of this example is optimal – there is nothing more we can do to improve it any further.
4. Make use of assembler features not present in DSP Code
The only thing that stands out in the previous example is the pre-calculation of “abs” – the binary mask for fast absolute value calculation mentioned in the DSP code optimization article. In assembly, you may declare variables as integers, and making an integer with the desired binary structure is much easier than with a float. In fact, you may completely delete stage(0) and declare abs as “int abs=2147483647;”. That makes the Envelope Follower in the example fully optimal. However, there are several other features that couldn’t be shown in the above example. Working with integers is a powerful feature of assembler compared to DSP code, especially in handling arrays.
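A minimal sketch of the integer mask trick (the streamin/streamout names here are illustrative, not taken from the article's example):

```asm
streamin in;
streamout out;
int abs=2147483647;  // 0x7FFFFFFF – all bits set except the float sign bit

movaps xmm0,in;
andps xmm0,abs;      // fast absolute value: clears the sign bit of all 4 channels
movaps out,xmm0;
```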
5. Delete unnecessary code in SSE-incompatible instructions
Some functions are not supported by SSE and must be executed by old-fashioned FPU instructions. In order to process 4 channels (which a single SSE instruction does at once), the compiler has to emit them 4 times. This is pretty much the only way to go when the code is used in poly and mono4. However, in mono (and similarly in mono4 when some channels are not used) all you need is to process one channel – the computation of the other channels is just a waste of CPU. Consider for example the “rndint(a)” instruction:
You can see that first the value of “smIntVarTemp” is loaded into channel (0), rounded to an integer and stored back in “smIntVarTemp” – and the same happens another 3 times for the rest of the channels. Apart from the optimizations obvious from the previous examples (the “smIntVarTemp” variable is unneeded – we may load the value directly from “in” and store it directly to “out”; however, this is not a universal rule – most of the time the variable is in fact unavoidable), we may easily delete the computation of channels we do not need. This improves CPU load significantly, since FPU instructions are more demanding on the CPU than SSE instructions. This may seem counterintuitive, but the FPU uses an 80-bit float format, so a conversion is needed and computations on 80-bit numbers are naturally more complicated, while in SSE the 4 operations are done in parallel, so they cost the CPU about as much as one.
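The shape of the compiled code looks roughly like this (a simplified sketch – not Flowstone's verbatim output):

```asm
// channel (0) – the only one needed in mono:
fld in;          // load channel (0) into the FPU
frndint;         // round to nearest integer
fstp out;        // store channel (0)

// channels (1)-(3): three more near-identical blocks follow –
// in mono these can simply be deleted
```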
The same rule may be applied to moves from/to arrays, trigonometric functions, logarithms and the power function. In fact, it will have a bigger impact on CPU than all the playing with registers explained before. It is also very easy to do even if you know nothing about ASM – just find the pieces of code that clearly repeat 4 times and delete the unneeded channels’ computations.
6. Consider using approximations and special-case scenarios that are implementable with SSE
I deliberately chose rndint(a) for the previous example for two reasons: 1. it has the simplest code of the non-SSE functions, and 2. it in fact has an SSE implementation. You may ask yourself why this isn’t the default form of rndint. The reason is compatibility – the cvt operations are an SSE2 feature, which is not present in older CPUs (quite a common case 10 years ago when the Code Component was introduced). When exporting your application you have an option to “include support for SSE and SSE2”, which in fact means your code will be compiled in 2 versions: one as you write it, which is SSE2 compatible, and one with the SSE2 instructions replaced by their older FPU implementations (which in the example below take more CPU than the standard rndint) so it can run on machines with no SSE2 support.
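For reference, a hedged sketch of the SSE2 rounding idea (the rounding mode follows the MXCSR default, round-to-nearest):

```asm
movaps xmm0,in;
cvtps2dq xmm0,xmm0;  // float -> int32 for all 4 channels at once (SSE2)
cvtdq2ps xmm0,xmm0;  // int32 -> float
movaps out,xmm0;
```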
Another similar optimization can be done in reading and writing arrays, but only if you know the same index will be used for all 4 channels:
The original implementation didn’t even fit my screen, while the SSE version takes 11 lines. It is even faster than the “ripped” 1-channel version of the original implementation, so consider it when you write modules for mono. Writing to an array may be implemented in a similar way. There is another thing to consider when working with arrays – the integer format. Indexes for arrays must be in int format, so it is convenient to actually calculate them as integers rather than calculating them as floats and converting (the conversion takes a lot of CPU, and integer instructions are actually even cheaper than float ones). Read “How to use Assembler? part 3: ALU, FPU and array management” for more information on how array management and integer instructions work, and how and when to implement them optimally.
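The core of the shared-index read can be sketched like this (the addressing syntax here is illustrative – check part 3 of the assembler series for Flowstone's exact array syntax):

```asm
cvtss2si eax,xmm0;         // index (identical in all 4 channels) as an integer
movss xmm1,[array+eax*4];  // read a single float from the array
shufps xmm1,xmm1,0;        // broadcast it to all 4 channels
```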
Also, as mentioned in the DSP code article, consider using approximations instead of the stock instructions – there I suggested using MartinVicanek’s Stream Math Functions and splitting your code into multiple components. In assembler you can copy-paste the approximation code into your ASM code and adjust it to work with yours. This allows you to keep everything in a single ASM component (though it does not necessarily make it more readable, since ASM is quite a mess in general). Also, don’t forget to leave credits when you do this with 3rd party algorithms ;-).
Another thing to mention is the rcpps xmm0,xmm0; instruction, which calculates the reciprocal of a number. It is similar to divps xmm0,xmm1; except that the numerator is always 1 (it calculates only the reciprocal instead of a full division), it yields only 12-bit precision (as opposed to the usual 24-bit precision of divps), and it takes only about half the CPU load of divps. Consider using it when precision is not so much of an issue.
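A sketch comparing the two (assuming “one” is a float variable holding 1.0 – the variable name is hypothetical):

```asm
// full division: ~24-bit precision
movaps xmm0,one;     // xmm0 = 1.0 in all channels
divps xmm0,xmm1;     // xmm0 = 1/xmm1

// fast approximate reciprocal: ~12-bit precision, roughly half the cost
rcpps xmm0,xmm1;     // xmm0 ≈ 1/xmm1
```

If 12 bits is not enough, one Newton–Raphson refinement step (x' = x·(2 − a·x)) roughly doubles the number of accurate bits and is often still cheaper than divps.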
That covers the most basic tips for optimizing code in assembler in Flowstone. Of course there are several other things you can do to improve the code. Some of them apply only to assembler, others also to the Code component, and some even to the entire schematic. They will be covered in the next article, since they require some deeper knowledge about processors and may in fact be processor-specific. One of those things is the CPU cost of individual instructions – the answer to the question “how much CPU does this instruction use?” is not so straightforward. Stay tuned for more…