Advanced DSP optimization tips
In previous articles we covered the basics of optimizing your code, both in the DSP code component and in assembler. However, there are a number of other tips that can (to some degree) be applied to both, or even to an entire schematic. To squeeze out the most CPU power, you should be aware of them when choosing and writing an algorithm, rather than only optimizing it after it is finished. These tips have rather complicated explanations that require some knowledge of how modern processors work – this will be covered too. A document will be released with this article that lists all Flowstone-related ASM instructions, with their descriptions, CPU costs and examples of use. So let’s get straight to the tips:
1. Be aware of your cache memory
What is cache memory? Your processor constantly analyzes which parts of RAM you use. It recognizes patterns in reading and writing, predicts which variables are likely to be used, and loads (“prefetches”) them into cache memory (also called “near memory”, because it is closer to the processor – in fact, it is part of the processor). Reading data from cache memory is much cheaper than reading directly from RAM. In the best case, reading from cache takes about 3 cycles, while reading from main RAM may take several hundred CPU cycles (so you can see that a single cache miss can cost more CPU than your entire code). Cache memory has several levels: lower levels are closer to the processor and can be accessed faster, but are smaller; higher levels are the opposite. Luckily, variables in Flowstone streams are already arranged in a way that maximizes cache efficiency. The level-1 data cache is around 8 kB (this depends on the processor), so it can store roughly 2000 float numbers.
To reduce the risk of a cache miss (reading a value that was not stored in cache beforehand), you can make several optimizations, which also apply to the Code component:
1. Avoid using big arrays. Naturally, a full wavetable of several thousand samples can’t be loaded into cache memory whole. Attempting to read and write data to it may result in frequent cache misses, especially if the index changes irregularly (which complicates prefetching). CacheMiss.fsm – this schematic shows this clearly in practice: when the index changes in a regular pattern, the CPU load is almost half of what it is with random index changes (note that in both modes the code is completely identical – nothing gets bypassed when flicking the switch). The riskiest thing is to use external mems by address. I believe this is one of the main reasons why the FS developers opted to copy part of a mem into an array, instead of using the mem address directly, when you use the “memin” input.
2. Recycle your temporary variables. This applies mainly to the Code component. When you reuse a previously used variable, it is more likely to be in cache than some other, previously unused variable. Also remember that you can use output variables as temporary variables – all that matters is that they contain the correct value after all the code has executed; what happens to them in between is up to you. In ASM you can reduce the likelihood of a cache miss simply by using registers for temporary variables. However, with complicated algorithms you are likely to run out of registers, so using variables becomes unavoidable.
2. Be aware of pipelining and code prefetching
Modern processors can execute multiple operations in parallel. A processor usually consists of multiple independent circuits (units), each responsible for a different type of calculation. A typical processor has one or more ALUs (arithmetic logic units – they handle integer and logical operations), a floating-point adder, a floating-point multiplier (yes, multiplication and addition can happen in parallel), a memory load unit and a memory store unit (yes, processors can load and store data in parallel, and while doing other calculations). Each of these units can also pipeline instructions, meaning it can start a new instruction before the last one has finished. The time it takes to finish a calculation is called latency. The number of instructions of the same kind that can be started each cycle is called throughput. In documentation you will commonly see reciprocal throughput (the smaller it is, the more instructions can be pipelined). These values are naturally processor specific – you can find the most common latencies and throughputs in the attached document, or use Google to find them for a given processor.
There is an obvious limitation to this parallel processing and pipelining – the instructions must be independent. If the next operation uses the output of the previous one as an input, it is naturally impossible to process them in parallel.
To maximize the use of this parallel processing, your computer prefetches the code (in a similar way to how it prefetches memory) – it loads a bigger segment of code (usually around 20–40 opcodes, depending on the architecture and opcode complexity) and may opt to execute the instructions out of order when it recognizes an opportunity for parallel processing (multiple “dependency chains”). This can easily be demonstrated. DependancyChain.fsm – this schematic has 3 almost identical pieces of code; they all calculate the number 122 by adding the number “1” 121 times. They all do so in two dependency chains – one starts with movaps xmm0,F1; and continues with a chain of 60 addps xmm0,F1; (so xmm0 serves as an accumulator). The second does the same on register xmm1. At the end, xmm0 and xmm1 are added and stored in the output. Each of these is an individual dependency chain – the next addps xmm0,F1; requires the previous addps xmm0,F1; to be finished, because it needs the new value of xmm0. However, addps xmm1,F1; may be pipelined with addps xmm0,F1; because they are independent.
How do the codes of the 3 examples differ? In the length of the segments of addps xmm0,F1; and addps xmm1,F1; instructions. The first example has 60 xmm0 additions followed by 60 xmm1 additions, while the second has alternating segments of 10 xmm0 and 10 xmm1 additions. The last one alternates between xmm0 and xmm1 on every instruction. So what?
Your processor’s code cache is roughly 20 instructions long. When it loads the first code segment in example 1, it sees no opportunity for pipelining (all it sees is one long dependency chain), so it slowly calculates the additions one by one. In example 2, it analyzes the segment of code and recognizes two individual dependency chains – it can pipeline the xmm0 chain with the xmm1 chain and execute them partially in parallel. In example 3, this is so obvious that your computer would have to be stupid not to see the opportunity. On my machine, examples 2 and 3 take only 80% CPU compared to example 1. So I saved about 20% simply by changing the order of a few lines.
So here is the rule of thumb: avoid long dependency chains, where the next instruction uses the result of the previous one. Try to give your CPU the possibility to calculate things in parallel. A typical example in the code component is the use of brackets:
The CPU savings are always at least a few percent. This depends on the instruction (some instructions have a wider pipeline than others) and also on the actual code (sometimes you can’t reach the pipeline limit simply because your code can’t be arranged that way). In ASM you have even more power, because you can manually arrange the code into multiple clearly parallel dependency chains. Long dependency chains can often be found in loops with accumulators, or in iterative algorithms. Although there is no way to avoid them entirely, you can make use of pipelining by having multiple accumulators in the loop. This is, however, rarely a problem, because the bottleneck in loops is usually other slow instructions. I was not able to produce a loop where the dependency chain was the bottleneck (looping itself is quite a complicated task for the processor, and when array reads/writes or CPU-heavy instructions take place, the dependency chain is clearly not the issue), but with unrolled loops this is likely to happen. Both previous examples show this.
3. Code prefetching in loops and branches is complicated
When code is linear, your processor prefetches it quite easily – simply by reading instructions ahead. However, when your schematic uses conditional jumps (by which loops and “if else” branches are implemented in ASM), your processor has to guess which code should be prefetched. It has to do this even before the jump condition is calculated – so it really has to guess. If the guess is wrong, the processor has to discard the prefetched code (along with all the calculations it attempted to do ahead of time) and load the correct branch. This prevents pipelining and out-of-order execution, and may increase the CPU load significantly. The processor uses statistics and pattern recognition to make the correct guess. To ease its job, use branches that switch between “if”, “else” or “case” in regular patterns. A good example of this is hop() in the code component. It is an “if” branch that is executed once and then skipped n times, where n is a constant. Not only is the pattern regular, the code is also mostly skipped, so it is convenient for the processor to assume it will be skipped. A similar thing happens in loops with regular or constant loop sizes.
The rule of thumb is: if you have “if else” branching code, write the most frequently taken code first, because that part is usually prefetched when your processor is uncertain which branch to pick. When the branch selection is irregular and about 50:50, write the most CPU-consuming code first, for the same reason.
How branch guessing affects CPU is demonstrated by the schematic BranchPrediction.fsm. The code in both branches is identical, so the CPU load should be the same no matter which branch is prefetched. The code uses two algorithms to switch between the branches. The first is a regular ramp in the 0–99 range (where values below 50 pick one branch, and values above or equal pick the second) – a very predictable pattern that lets your processor guess very easily. The second algorithm is a pseudo-random number generator (also with a 0–99 range and the same switching rule). This pattern is random, so your processor can really only guess randomly, i.e. a 50% chance of a code prefetch miss. The “switch” input is used to switch between the two. Both are calculated, but the output of only one of them is actually used (so the switching between them should have no effect on CPU by itself – this can be demonstrated by deleting all jump instructions from the code).
4. Be aware of data bypass delays
Now, as I’ve mentioned, your processor consists of multiple units, each responsible for processing data in a specific way/format. A problem arises when you need to pass data from one unit to another – the most common case being when you need to perform an integer operation on a float, or vice versa. To pass data from one unit to another, your processor introduces a data bypass delay – an additional 0–2 CPU cycles for the operation. This delay varies depending on where the data is sent from and to (for example, passing data to the store unit, which writes data to RAM, usually costs zero extra cycles from all units) and is also very processor specific. In Flowstone we are very limited in which operations we can do. There is actually one particular pair of operations that are functionally identical but executed on different units – “andps” (floating-point bitwise AND) and “pand” (integer bitwise AND). Using “pand” after a floating-point operation, or vice versa, may introduce the data bypass delay on most processors. DataBypassDelay.fsm – this schematic clearly shows this (maybe not so clearly on some processors).
The rule of thumb here is: if possible, use float instructions after float instructions, and int instructions after int instructions, within the same dependency chain.
OK, and that is all the Flowstone-streams-relevant stuff I’ve found. Hopefully it will help you squeeze maximum performance out of your Flowstone schematics…