Optimizing DSP code (in DSP code component)
One of the fields that benefits most from computers is, without a doubt, digital signal processing. Manipulating images, image sequences (video), and sound, both in real time and offline (rendering, printing, etc.), is possibly the most common use of modern computers, in industry and culture alike. In fact, in many cases digital manipulation has replaced its analog counterparts to a considerable degree, if not completely (anyone still using VHS tapes?).
From the very beginning, a crucial aspect of DSP has been CPU load, because you generally need to process many thousands of samples or pixels per second, which in real-time processing is the limiting factor. Luckily, everyone involved is doing a great job of improving computation speed and pushing the limits further and further. Hardware manufacturers build faster and more efficient CPUs every day, adding cores for parallel processing and 64-bit support for more efficient data handling, to mention just the most significant recent changes. The next step is to use the resources a computer provides as efficiently as possible. That part is in the hands of the programmer, which is most likely you, reading this article.
In Flowstone we can manipulate streamed audio data, which, together with the simple interface Flowstone offers, lets us create VST and VSTi plugins as well as standalone EXEs. For VST plugins, CPU load is a real concern: although it is hard to overload the CPU with a single instance, an effect is expected to run in multiple instances and alongside other plugins. In that case every CPU cycle counts. Of course, many things are out of our hands, since the FS workflow has its limitations, but there are a number of things we should consider when optimizing our schematic's performance.
SSE support in FS streams
Flowstone uses the SSE subset of CPU instructions to calculate streams. This allows it to calculate 4 parallel channels of audio at the cost of one, so FS is very efficient by default. In fact, each group of 4 voices in a poly stream runs the very same code. For mono, take advantage of the pack and unpack primitives to use mono4. Mono4 and mono use exactly the same code, so prefer mono4 whenever possible.
The DSP Code component
Having the ability to write small sections of processing in old-fashioned code is, in my opinion, the no. 1 feature of Flowstone. Although the generated assembly code (and therefore the CPU cost) is not much different from weaving a net of stream-based prims, it is definitely much more compact and easier to follow and debug. In fact, it provides an almost optimal compromise between feature-rich complexity and easy-to-use simplicity.
There are several things you should consider when writing/optimizing your code:
1. Move unnecessary code to green or ruby
Many times, calculations do not have to be done at sample rate because they change only from time to time. A very common example is a filter: it is much more efficient to precalculate the coefficients in green or ruby and provide them to the code as input variables. Use this tip whenever possible.
There are situations where this is not possible, for example when the cutoff frequency of your filter is modulated by an LFO. In such cases, consider using hop in your code. Although hop may introduce “stairs” (abrupt, regular value changes), they might be tolerable. Finding the right compromise between “stair-ness” and CPU savings requires testing, but consider this: hop(4) reduces the CPU cost of the hopped code almost 4x, and the “stairs” will be only 4 samples long, which is practically smooth.
2. Some instructions are cheaper than others
Here is a list of code instructions, along with their very rough relative CPU costs:
The table clearly shows that sometimes more is less. Here are some tips that follow directly from the table above:
Instead of dividing by a constant, multiply by its reciprocal. Likewise, when you need to divide several terms by the same value, first calculate the reciprocal, save it in a temporary variable, and multiply the terms by it. Multiplication is about 10x cheaper than division.
A common way to calculate logarithms of a custom base is logX(a) = log(a)/log(X). When the base of the logarithm is a constant, it is far more efficient to replace the “/ log(X)” part with a precalculated constant and, again, use multiplication.
This rule may be applied much more widely. Combine it with precalculating values in green whenever possible. Also consider using stage(0) for creating lookup tables and constants that are channel/voice specific. Stage(0) is calculated first, and only on the first sample, so you may consider it part of the declaration.
Avoid the pow(a,b) function whenever possible; it is the most CPU-costly function in the Code component. When you need to raise a value to an integer power, use multiplication instead.
Avoid using arrays if possible. Loading data from an array is much more CPU-costly than loading from a variable. The reason might not be obvious, so here it is: the Code component uses SSE, so it can load and store a variable for all channels at once. An array, however, may use a different index on each channel, so it cannot make use of SSE and must load/store data channel by channel. Also, indexing works in the integer domain while all values in DSP code are floats, so a conversion is needed too. That's why loading/storing data via arrays is more than 5x more costly than using plain variables.
Use a simple trick to get a cheaper abs(x): create a bitwise mask in stage(0) that removes the sign bit when applied via the & operation.
MartinVicanek created a set of approximated/ASM-optimized function modules, which include trigonometric, exponential, and logarithmic functions. You really should consider using them instead of, or in conjunction with, the Code component.
3. Precision vs. speed
Sometimes the chosen algorithm affects both the speed and the precision of your code. A floating point number is, in general, a number of the form sign * mantissa * (base^exponent). You are probably familiar with notation like -1.548554 * 10^6 (often called scientific notation). This is in fact floating point notation: the mantissa (1.548554) is a number in the range <1,10) with a fixed number of digits, and the exponent (6) specifies the position of the decimal point. Computers use floating point numbers in base 2, where the first bit is the sign, 8 bits are the exponent, and 23 bits are the mantissa. Such a number is then reconstructed as ((-1)^sign)*(1+mantissa)*(2^exponent), where the mantissa is in the <0,1) range. This translates to a decimal representation with ~7-digit precision. Regardless of the base, the logic stays the same: you may represent both very small and very big numbers with the same relative precision (relative to the order of the number). However, you may suffer from rounding errors when you add numbers of greatly different orders of magnitude, because the result is rounded back to ~7 digits.
A good example of the tradeoff between precision and speed is the biquad filter. It has two direct implementations, called Direct form I and Direct form II. Both use 5 multiplications and 4 additions, but Direct form I needs four storage spaces while Direct form II needs only two. So Direct form II runs faster, right? Yes, but it also suffers from bigger rounding errors, because smaller values get added to bigger ones than in Direct form I. This increases rounding noise, reduces the precision of the frequency and phase response, and also increases the risk of instability.
Another area where the precision/speed tradeoff comes into play is the approximation of functions. As seen above, some functions take enormous CPU time to compute. It is often useful to replace them with approximations such as Taylor series, especially if the input values are expected to lie in a certain range. In fact, your computer uses approximations and iterative methods to calculate many complicated operations (division, for example), so do not be fooled by the word “approximation”: sometimes the precision is comparable or even superior to the stock Code component implementations (the above-mentioned MartinVicanek's Stream Math Functions are a great example of that). For this purpose you may consider splitting your code among multiple interconnected Code components and using the assembly-optimized/approximated functions as separate modules. Although it makes the code harder to read, the CPU savings may be considerable.
4. Avoid denormals (also called subnormals, de-normals, denormal numbers, …)
Denormals are numbers so close to zero that they don't follow the normal floating point format (they are special values). To process denormals, your CPU has to switch into a special mode, and computations with them take enormously long. You should definitely avoid their occurrence in your schematic. Cyto covered this topic perfectly two years ago on the SynthMaker forum. Here is a link to Cyto's excellent Guide to Subnormals. Anything I could write here would just be copy-pasting his text and code, so make sure you have a look at it.
And that's it! Following these few guidelines, you should be able to reduce the CPU load of your schematic considerably. There are a few other, more advanced tips that can push your CPU load even lower, but they require quite a bit of knowledge about assembly and processor operation. After you have optimized your code in the DSP Code component, you may go one step further and optimize the code in assembler. A guide to that will be covered in the near future. So have fun optimizing!