How to use Assembler? part 2: Basics and SSE
This article comes in 4 parts and covers everything you will need to know about programming in assembler in Flowstone 3.0.5 and lower (including Synthmaker). This is part 2, which covers the basics of assembler and the SSE instruction set.
To use assembler in Flowstone, insert an assembler component into your schematic (it is in the toolbox). It looks and works much like the code component. The code component has a text output, which can be used to extract the assembly counterpart of the current DSP code. You can simply copy and paste that into your assembler component and the code should work as is (this is a common starting point for assembler coding, and especially for optimizing your DSP code further). It is also important to mention that the Flowstone assembler supports only a subset of assembly instructions.
Flowstone takes great advantage of SSE instructions. SSE stands for Streaming SIMD Extensions, where SIMD means Single Instruction, Multiple Data. Modern processors can use SSE to process four 32-bit values in a single instruction, which can give you massive CPU savings (up to 4 times); the mono4 processing and poly streams in FS take advantage of that. All variables (including inputs and outputs) in both code and assembler components are SSE variables, which means each is an array of four 32-bit values = 128 bits = 16 bytes. For SSE instructions a whole SSE variable counts as a single value, although it contains 4 individual numbers. Your processor also has 8 SSE registers named xmm0,…,xmm7, which are also 128 bits wide and can each hold a full SSE variable. In code you may think of them as “temporary” variables: they are reused/reset in each component.
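To make the "four 32-bit values = 16 bytes" layout concrete, here is a minimal Python sketch. It is only a model of the data layout, not Flowstone code, and the helper name `addps` is borrowed from the SSE opcode purely for illustration.

```python
import struct

# One "SSE variable": four 32-bit floats stored back to back.
channels = (1.0, 2.0, 3.0, 4.0)
sse_variable = struct.pack("<4f", *channels)

size_in_bytes = len(sse_variable)   # 4 channels * 4 bytes each
size_in_bits = size_in_bytes * 8

def addps(a, b):
    """Model of a packed add: a single call updates all four channels."""
    xa = struct.unpack("<4f", a)
    xb = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(xa, xb)))

result = struct.unpack("<4f", addps(sse_variable, sse_variable))
print(size_in_bytes, size_in_bits, result)
```

The point of the model is that the whole 16-byte block is treated as one operand, while the arithmetic still happens independently in each of the four channels.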
Keywords are marked purple, variables are green and registers are blue. Floating point constants are gold and integer constants are dark red. Comments and some Flowstone-specific commands are black.
Inputs and outputs are declared exactly as in the Code component. There is nothing to add to that.
Variables are also declared in a similar way:
In assembler you may declare both floating point numbers and integers. Unlike in the Code component, in assembler a variable must have a defined initial value, even if it is zero (in the code component you may write “float x;” and the value is assumed to be 0). Also note that in assembler, when you use a variable that was not declared (usually because of a typo or forgetting to declare constants), the code will not be marked as an error, but it will very likely crash. So always double check and triple check. With SSE instructions you can’t use constants within the code, so you always have to declare them as variables. A common practice is to adopt default names for values. I use F1=1; F2=2; F05=0.5; Fn6=-6; F4e7=40000000; etc.
Arrays are defined in the same way as in the code component. Note, however, that they are arrays of SSE variables, so an array of size (3) will hold 3 SSE variables = 12 32-bit values (of any format).
Stages are defined simply by typing “stage*;” (where * is the stage number). This command is used by Flowstone, not the assembler itself so it is marked black. Everything after that command will be executed in that stage until the next stage*; command is found. With no stages defined, the code is considered stage(2) like in the code component.
Now finally let’s get to some coding. The following DSP code operations work fully in SSE, so you should see only SSE opcodes when they are translated to assembly: a=b; a+b; a-b; a*b; a/b; sqrt(a); a&b (bitwise AND); a|b (bitwise OR).
As you can see, SSE operations start with a keyword describing the operation (movaps = move aligned packed single-precision values, mulps = multiply packed single-precision values, etc.). This is followed by the names of the operands (either variables or registers which hold the values) and sometimes a constant which further describes the operation (not seen in the example above).
The first operand is the destination operand, the one that will hold the result. This operand must be a register, with one notable exception: the movaps operation can write data from a register to RAM, as you can see in the last line (however, it cannot read from and write to RAM at the same time). The second operand is the source operand, which is not modified but whose value is used in the calculation.
So we can possibly rewrite the assembly back into pseudo-code:
As you can see, the line movaps xmm2,xmm1; (xmm2=xmm1), for example, is not needed: the code makes a copy of xmm1 although it could easily have used xmm1 in the rest of the code instead. This is the kind of mistake compilers make when compiling, because they do not “see” this kind of redundancy. Also, as I’ve mentioned above, using a variable is slower than using a register (the processor has to “call” the variable and wait until it arrives from RAM), yet the code calls the variable “in” several times. We can rearrange the instructions slightly to remove these unnecessary calls:
This code is slightly more efficient than the original one. The DSP compiler is incapable of producing this kind of code by itself, so the only way to get it is to write or edit the assembly manually. In some cases you may reduce the CPU load down to 20% of the original this way. This is true of compilers in general: some are better than others, but none is perfect. For relatively simple operations executed at high rates (which is what DSP, digital signal processing, is) this makes a big difference.
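The next notable operation in assembly is cmpps (compare packed single-precision floats). It has both destination and source operands, which are the values to be compared, and it also takes an integer constant which defines the mode. Here is the list of modes:
0: equal (==)
1: less than (<)
2: less than or equal (<=)
3: unordered (at least one operand is NaN)
4: not equal (!=)
5: not less than (>=)
6: not less than or equal (>)
7: ordered (neither operand is NaN)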
The result of cmpps is, for each channel, a 32-bit mask with either all bits off, which is “false” (equal to zero in both float and int format), or all bits on, which is “true” (equal to QNaN in floating point and -1 in int format. A good tip for initializing a variable with a “true” value: int true=-1;). Don’t bother remembering the modes: you can always create a Code component, write the respective code there and extract the assembly from the text output (at least that’s how I do it), or have a look at this image.
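As a rough illustration of the mask values described above, the following Python sketch models cmpps channel by channel. The mode numbers follow the standard x86 definition of the instruction; the function names are hypothetical helpers for this example, not Flowstone functions.

```python
import math
import struct

# Comparison modes 0..7 as defined for the x86 CMPPS instruction.
CMPPS_MODES = {
    0: lambda a, b: a == b,                              # equal
    1: lambda a, b: a < b,                               # less than
    2: lambda a, b: a <= b,                              # less or equal
    3: lambda a, b: math.isnan(a) or math.isnan(b),      # unordered
    4: lambda a, b: a != b,                              # not equal
    5: lambda a, b: not (a < b),                         # not less than
    6: lambda a, b: not (a <= b),                        # not less or equal
    7: lambda a, b: not (math.isnan(a) or math.isnan(b)),  # ordered
}

def cmpps(dest, src, mode):
    """Return one 32-bit mask per channel: all bits on or all bits off."""
    test = CMPPS_MODES[mode]
    return [0xFFFFFFFF if test(a, b) else 0 for a, b in zip(dest, src)]

def as_int(mask):
    """Reinterpret the 32 mask bits as a signed integer."""
    return struct.unpack("<i", struct.pack("<I", mask))[0]

def as_float(mask):
    """Reinterpret the 32 mask bits as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", mask))[0]

masks = cmpps([1.0, 5.0, 3.0, -2.0], [2.0, 5.0, 1.0, 0.0], 1)  # mode 1: a < b
print([hex(m) for m in masks])
print(as_int(masks[0]), as_float(masks[0]), as_float(masks[1]))
```

Reading the "true" mask as an integer gives -1 and reading it as a float gives NaN, which matches the `int true=-1;` tip.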
These bitmasks are then used by bitwise operations. They all work in a similar way: they go bit by bit and perform the respective operation on the bits of the operands. The following opcodes are supported by the FS assembler:
andps xmm0,xmm1; bitwise logical AND. DSP code version is “&”
orps xmm0,xmm1; bitwise logical OR. DSP code version is “|”
andnps xmm0,xmm1; bitwise logical AND NOT (inverts the destination, then ANDs it with the source). This operation has a syntax coloring bug: it is displayed as faulty code, but it compiles normally. I recommend not using it unless you really need it.
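One common use of these three opcodes together is picking, per channel, one of two values according to a mask. The Python sketch below models that pattern on raw 32-bit patterns. It assumes the standard x86 behaviour of andnps (invert the destination, then AND with the source); the function names simply mirror the opcodes for readability.

```python
import struct

def bits(x):
    """Bit pattern of a single-precision float as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def flt(b):
    """Single-precision float that has the given 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def andps(dest, src):
    return [d & s for d, s in zip(dest, src)]

def orps(dest, src):
    return [d | s for d, s in zip(dest, src)]

def andnps(dest, src):
    # Assumption: standard x86 semantics, (NOT dest) AND src.
    return [(~d & 0xFFFFFFFF) & s for d, s in zip(dest, src)]

mask = [0xFFFFFFFF, 0, 0xFFFFFFFF, 0]            # e.g. a cmpps result
a = [bits(x) for x in (1.0, 2.0, 3.0, 4.0)]      # used where mask is "true"
b = [bits(x) for x in (10.0, 20.0, 30.0, 40.0)]  # used where mask is "false"

selected = orps(andps(mask, a), andnps(mask, b))
values = [flt(v) for v in selected]
print(values)
```

Because every channel of the mask is either all ones or all zeros, the AND keeps or clears a whole value, and the final OR merges the two halves without any branching.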
A great feature of using these operations in assembly is that you may create bitwise masks with a custom binary structure. For example, you may implement the abs() operation by applying AND with a bitmask that has the first (most significant) bit off, which happens to be where the sign of the number is stored in floats. To create such bitmasks you may use this cool assember-maker’s toolkit, which also shows this particular example in practice.
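The abs() trick can be checked outside Flowstone with a small Python model. This is only a sketch of the bit manipulation on one channel; `ABS_MASK` and `abs_by_mask` are names made up for this example.

```python
import math
import struct

# All bits on except the most significant one (the float sign bit).
ABS_MASK = 0x7FFFFFFF

def abs_by_mask(x):
    """Clear the sign bit of a single-precision float with a bitwise AND."""
    pattern = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", pattern & ABS_MASK))[0]

print(ABS_MASK)                 # the same mask written as a decimal integer
print(abs_by_mask(-1.5), abs_by_mask(2.25), abs_by_mask(-0.0))
```

Presumably the mask can be declared as an integer variable in the same way as the `int true=-1;` tip, using the decimal value 2147483647, but that is an assumption and worth verifying in your Flowstone version.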
Then there are the operations minps and maxps. They have full DSP code counterparts.
maxps/minps xmm0,xmm1; (compares source and destination operands and outputs the greater/smaller value into destination operand).
There are also some opcodes in assembler that do not have counterparts in the DSP code component and may prove to be very useful:
cvtps2dq xmm0,xmm0; (convert packed single floats to packed doubleword integers by rounding)
cvtdq2ps xmm0,xmm0; (convert packed doubleword integers to packed single floats)
The DSP code rndint() function is implemented by chaining these two. The use of integers will be shown in the next section (array management).
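The chaining can be modelled in a few lines of Python. This sketch assumes the processor is in its default rounding mode (round to nearest, ties to even), which happens to be the same tie rule Python's built-in `round()` uses; it ignores values outside the 32-bit integer range.

```python
def cvtps2dq(channels):
    """Model: float -> integer, rounding to nearest with ties to even."""
    return [round(v) for v in channels]

def cvtdq2ps(channels):
    """Model: integer -> float."""
    return [float(v) for v in channels]

def rndint(channels):
    """Round each channel by converting to integer and back to float."""
    return cvtdq2ps(cvtps2dq(channels))

as_integers = cvtps2dq([0.5, 1.5, 2.5, -2.7])
rounded = rndint([0.5, 1.5, 2.5, -2.7])
print(as_integers, rounded)
```

Note the tie cases: under this assumption 0.5 rounds to 0 and 2.5 rounds to 2, which can be surprising if you expect ties to always round up.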
paddd xmm0,variable; (packed add doubleword integers) This opcode is an exception in the FS assembler: it can take only a variable as its source operand. The other operations can take either a variable or a register as the source, or in some cases a register only.
shlld xmm0,control int; (Shift Left Logical Packed Data) Performs a left logical shift on each of the four SSE channels in a register; this corresponds to the standard x86 pslld instruction. The control int may be anywhere from 0 to 32. This operation may be used to multiply an integer by a given power of two.
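The "multiply by a power of two" use can be sketched in Python as follows. The model works on unsigned 32-bit patterns and follows the standard x86 behaviour for packed doubleword shifts, where bits shifted past bit 31 are lost and a count above 31 clears the channel; `shift_left` is a made-up helper name.

```python
def shift_left(channels, count):
    """Model of a packed 32-bit logical left shift on four channels."""
    if count > 31:
        return [0 for _ in channels]
    return [(c << count) & 0xFFFFFFFF for c in channels]

times_four = shift_left([1, 3, 5, 7], 2)           # shift by 2 = multiply by 4
overflowed = shift_left([0x80000000, 1, 2, 3], 1)  # the top bit falls off
cleared = shift_left([1, 2, 3, 4], 32)
print(times_four, overflowed, cleared)
```

The overflow case shows the limit of the trick: the result equals the multiplication only as long as it still fits into 32 bits.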
shufps xmm0,xmm1,control int; (shuffle packed single floats) Now this operation is really neat. It lets you change the order of SSE channels in registers. Output channels 0 and 1 are picked from the destination register, and output channels 2 and 3 from the source register. Which input channel ends up in each output channel is specified by a control byte, two bits per output channel. You may easily create specific control bytes with assember-maker’s toolkit.
This operation may be implemented in Flowstone using the pack and unpack primitives; however, the shufps operation is MUCH cheaper on CPU and can also be used in the middle of your code.
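To see how a control byte is put together, here is a small Python model of shufps. It follows the standard x86 definition of the instruction, in which the two low output channels are picked from the destination operand and the two high output channels from the source operand, with two control bits per output channel. `make_control` is a hypothetical helper, not part of Flowstone or the toolkit.

```python
def make_control(sel0, sel1, sel2, sel3):
    """Pack four 2-bit channel selectors (each 0..3) into one control byte."""
    return sel0 | (sel1 << 2) | (sel2 << 4) | (sel3 << 6)

def shufps(dest, src, control):
    """Model of shufps: low two outputs from dest, high two from src."""
    return [
        dest[control & 3],
        dest[(control >> 2) & 3],
        src[(control >> 4) & 3],
        src[(control >> 6) & 3],
    ]

x = [10.0, 11.0, 12.0, 13.0]

reverse = make_control(3, 2, 1, 0)    # 27
identity = make_control(0, 1, 2, 3)   # 228
broadcast0 = make_control(0, 0, 0, 0) # 0

# Using the same register as both operands shuffles it "in place".
print(reverse, shufps(x, x, reverse))
print(identity, shufps(x, x, identity))
print(broadcast0, shufps(x, x, broadcast0))
```

When both operands are the same register, any permutation or broadcast of its four channels can be expressed with a single control byte.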
That basically covers all of the SSE instructions in assembly. A big advantage of these instructions is that they do not produce crashes when the algorithm is faulty. Next time we will look at older parts of the CPU, which allow us to do things that are not possible with SSE.