How to use Assembler? part 3: ALU, FPU and array management
This article will come in 4 parts and will cover everything you will ever need to know about programming in assembler in Flowstone 3.0.5 and lower (including Synthmaker). This is part 3 and we will have a look at the ALU and FPU parts of the CPU, which let us manage individual 32bit variables, including managing arrays.
So, we have covered the basics of SSE in assembler, but there are many things that are not supported by Flowstone assembler opcodes (or by SSE in general). These things must be done in older parts of the CPU known as the ALU (arithmetic logic unit) and the FPU (floating-point unit). Unlike SSE, which is relatively safe (in terms of crashes), the ALU and FPU may crash FS if you force them to do impossible/unsupported operations. ALWAYS EDIT YOUR ASSEMBLER CODE WITH YOUR SOUNDCARD OFF AND THE ASSEMBLER COMPONENT DISCONNECTED FROM ITS OUTPUTS! SAVE YOUR SCHEMATIC EVERY TIME BEFORE CONNECTING IT!
The ALU has multiple registers that you can use in Flowstone. All of them are 32bit wide and each has its own special purpose, unlike the xmm registers in SSE, which are all basically the same.
eax – this is a general purpose register. It is used to move 32bit values and can also hold an address for loading things from RAM. It is also used for indexing in arrays.
ebx – is also multipurpose like eax. You have to use a little trick to use it (we will get to that later) otherwise the code will crash.
ecx – this register holds the sample index (an integer that increases by 1 on each new sample). This register is “read only” in FS. It is used in the stock “hop” operation in the Code component.
ebp – base pointer – holds the address of the base of the memory stack.
esp – stack pointer – holds the address of the top of the stack.
What is the memory stack, you ask? Remember, when code is compiled your operating system reserves a space in RAM where it can store its variables. Each variable is then called by the processor by its address (position in the RAM). ebp holds the address of the bottom-most variable in the stack. All variables have their address ebp+positive integer. These integers are represented by variable names in programming languages. The top of the stack is pointed to by esp. You may use push reg; to store the content of a 32bit register onto the stack. On x86 the stack grows towards lower addresses, so this decreases esp by 4. pop reg; does the opposite – it pops the value from the top of the stack into the register and esp is increased by 4. Program parts expect esp to have the same value before computations are done and after they are done. That means you have to push and pop the same number of times or the code crashes FS!!!
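The push/pop bookkeeping can be sketched with a tiny model. This is plain Python, not Flowstone assembler; the class name Stack32 and the starting address are made up for illustration. It only shows that the last value pushed is the first one popped, and that esp returns to its starting value when pushes and pops are balanced.

```python
# Toy model of a 32-bit x86 stack (illustration only, not Flowstone code).
class Stack32:
    def __init__(self, top=0x1000):
        self.esp = top      # hypothetical starting address of the stack top
        self.mem = {}       # address -> stored 32-bit value

    def push(self, value):
        self.esp -= 4       # x86 stacks grow towards lower addresses
        self.mem[self.esp] = value & 0xFFFFFFFF

    def pop(self):
        value = self.mem.pop(self.esp)
        self.esp += 4
        return value


stack = Stack32()
esp_before = stack.esp
stack.push(123)           # like: push eax;
stack.push(456)           # like: push ebx;
second = stack.pop()      # like: pop ebx;  (the last pushed value comes out first)
first = stack.pop()       # like: pop eax;
balanced = stack.esp == esp_before
```

If one of the two pops were left out, esp would end up 4 lower than esp_before, which is exactly the situation that crashes FS.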
Why is this useful? The ALU lets you access a 32bit value (either float or int) of any variable, in any channel. And together with the FPU and SSE it lets you read and write data from/to arrays and individual channels of SSE variables. Note that whenever you read/write data in RAM the data must be aligned. Also, operations done by the ALU are basically CPU free (modern CPUs with multithreading and pipelining may perform multiple of these operations at once, or start new ones before the old ones are finished).
Before we can make use of the ALU in FS we must learn something about the FPU too. The FPU is a part of the processor that does operations in floating point format. It has 8 registers which are 80bit floats (1bit sign, 15bit exponent, 64bit mantissa – they can losslessly hold practically any value, including all 32bit integers). These registers work as a stack in LIFO fashion (Last In First Out). They are named st(*) (where * is a number from 0 to 7), and st(0) is always the top of the stack. FPU instructions contain several operations that are not supported by SSE, at the price that you must process each channel separately (and it may take a hell of a lot more CPU 'cos, well… 80bit precision).
Following image shows the most common FPU operations and also on the left you may observe the contents of the FPU stack:
You may find more FPU operations supported by FS here (along with all other supported opcodes; the FPU ones always start with “f”). Also notice that we handle SSE variables as arrays: in the last two lines we store data to channels 0 and 1 of the variables respectively. And a useful tip – to pop into a specific st() register you may use fstp st(*); (the * is a number from 0 to 7). It copies st(0) into st(*) and then pops the stack, so the old value of st(*) is discarded. This allows you to remove values from the middle of the stack, although it does not let you store these values anywhere.
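To see what happens to the FPU stack, here is a minimal model in plain Python (not Flowstone assembler). The class FpuStack and its method names are hypothetical; only the behaviour of fld, fstp and fstp st(*) is modelled, and the model raises an error on a ninth push where real hardware would signal a stack fault instead.

```python
# Toy model of the x87 register stack (illustration only).
class FpuStack:
    def __init__(self):
        self.st = []                      # st[0] is the top of the stack

    def fld(self, value):
        if len(self.st) == 8:             # only 8 registers exist
            raise OverflowError("FPU stack is full")
        self.st.insert(0, float(value))   # new value becomes st(0)

    def fstp(self):
        return self.st.pop(0)             # store st(0) somewhere and pop

    def fstp_st(self, i):
        self.st[i] = self.st[0]           # copy st(0) into st(i) ...
        self.st.pop(0)                    # ... then pop the stack


fpu = FpuStack()
fpu.fld(1.0)
fpu.fld(2.0)
fpu.fld(3.0)              # stack is now [3.0, 2.0, 1.0]
fpu.fstp_st(1)            # discards the 2.0 from the middle
after_drop = list(fpu.st)
top = fpu.fstp()          # the last value loaded comes out first
```

Note that fstp st(0); in this model simply drops the top value without storing it.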
Now let’s finally get to the array management. When loading SSE variables into the FPU you handle them as an array of 4 32bit values. With actual arrays things are more complicated. You must specify where in the array the variable is using the eax register. Because an address specifies the position of a byte and our 32bit values are 4 bytes wide, the index must increment in 4byte steps per one value. If you try reading data from a position unaligned to the real value boundaries FS will crash!!! We may convert an integer index to an address based index by multiplying it by 4. Or more easily we may do this by a logical shift left by 2 (using operation shl eax,2; or pslld xmm0,2; ). Note that shr shifts right and would divide by 4 instead.
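The index arithmetic is easy to check outside of Flowstone. The following is a small Python sketch (the function names are made up for illustration) showing that a shift left by 2 is the same as multiplying by 4, and that every offset produced this way sits on a 4-byte boundary.

```python
def byte_offset(index):
    """Convert a float index into a byte offset: shift left by 2 = multiply by 4."""
    return index << 2


def is_aligned(offset, size=4):
    """True when the offset sits on a boundary of `size` bytes."""
    return offset % size == 0


offsets = [byte_offset(i) for i in range(4)]   # offsets of the first 4 floats
wrong = 8 >> 2                                 # a right shift divides instead
```

An offset such as 13 fails is_aligned, and that is the kind of address that would crash FS.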
Here is the code for a simple delay. The first part of the code cycles “c” in the 0-44099 range. This value is used to write the input into an array called mem. Then “delay” is subtracted from “c“, the result is wrapped into the 0-44099 range and used to extract a value from the array and store it into the output. This will naturally work only in mono, because only channel 0 of in and out is processed.
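The logic of this delay can be sketched in Python as a model of the algorithm described above. It is not the assembler from the image; the class name MonoDelay is hypothetical, and delay is assumed to be an integer between 0 and size-1.

```python
# Model of the mono delay: a write index that cycles through the buffer
# and a read index that trails it by `delay` samples.
class MonoDelay:
    def __init__(self, size=44100):
        self.size = size
        self.mem = [0.0] * size
        self.c = 0

    def process(self, sample, delay):
        self.c += 1                   # advance the write index
        if self.c >= self.size:       # wrap it into 0..size-1
            self.c -= self.size
        self.mem[self.c] = sample     # write the input into mem
        r = self.c - delay            # read position = c - delay
        if r < 0:                     # wrap it into 0..size-1
            r += self.size
        return self.mem[r]


d = MonoDelay(size=8)                 # tiny buffer so the wrap is visible
outputs = [d.process(float(x), 3) for x in range(1, 11)]
```

With a delay of 3 the first three outputs are silence and the input then reappears three samples late, including across the point where the index wraps.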
To make this delay work in mono4 and streams we have to rearrange a few things. As mentioned before, arrays are arrays of SSE variables. An array of size 7 can hold 7 SSE variables = 7*4 normal 32bit floats. So we increase the array size to 44100 and we will use each channel of the SSE variable to hold values for that channel. The index calculations will stay the same, however after converting them to integer, we will use pslld xmm0,4; to multiply the index by 16 (SSE variables are aligned to 16 bytes). With this setup index*16+0 refers to channel (0), index*16+4 refers to channel (1), index*16+8 to channel (2) and index*16+12 to channel (3). Now we cannot use the movd command – we must store the index in a temporary variable and read it channel by channel using mov eax,Temp[*]; respectively. The code will now look something like this – very similar to what the Code component would output (note the code is written in two assembler components to fit one image; it should be in the same one):
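The addressing scheme can be tried out on a flat byte buffer. This Python sketch only models the memory layout (index*16 + 4*channel). The names address, store and load are hypothetical, and the buffer is kept tiny instead of 44100 slots.

```python
import struct

SLOT_BYTES = 16     # one SSE variable = 4 floats * 4 bytes
N_SLOTS = 8         # hypothetical small array for the demo

mem = bytearray(N_SLOTS * SLOT_BYTES)


def address(index, channel):
    """Byte offset of one channel: (index << 4) is index*16, plus 4 per channel."""
    return (index << 4) + 4 * channel


def store(index, channel, value):
    struct.pack_into("<f", mem, address(index, channel), value)


def load(index, channel):
    return struct.unpack_from("<f", mem, address(index, channel))[0]


store(2, 0, 0.5)      # channel 0 of slot 2
store(2, 3, -1.25)    # channel 3 of the same slot
store(5, 1, 2.0)      # channel 1 of slot 5
```

Each channel lands in its own 4-byte lane of the 16-byte slot, so writing one channel never disturbs the other three.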
As you can see, we are adding the channel offset 3 times (for channels 1, 2 and 3 respectively). Now we may make use of stages. Stage(0) is executed first and only once (in the first sample). We can make an SSE variable (for example called offsets) that will hold all 4 offset values and we may add these offsets all at once using paddd. Because we can’t declare values in individual channels of SSE variables we will do so in stage(0). We will then add these offsets after each pslld before storing the indexes in Temp.
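As a rough model of this trick, the two SSE integer operations can be imitated with 4-element Python lists. The helper functions below only mimic what pslld and paddd do to each 32bit lane; the example indices are invented.

```python
def pslld(vec, bits):
    """Shift every 32-bit lane left by `bits` (like pslld xmm0,bits)."""
    return [(v << bits) & 0xFFFFFFFF for v in vec]


def paddd(a, b):
    """Add two vectors lane by lane with 32-bit wrap-around (like paddd)."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]


OFFSETS = [0, 4, 8, 12]     # filled in once, the way stage(0) would do it

indices = [10, 7, 7, 3]     # hypothetical per-channel read indices
temp = paddd(pslld(indices, 4), OFFSETS)   # byte offsets for all 4 channels
```

One shift and one add replace the three separate per-channel additions.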
Also you may notice that in the writing section (the part that writes “in” into “mem”) the index for all 4 channels is the same (apart from the offsets). SSE operation movaps can also store values in arrays. However, keep in mind that movaps operates on SSE variables which are 16bytes wide. That naturally means the index must increment in multiples of 16. Now we can write all 4 channels into the mem at the same time using a single instruction, instead of chaining multiple mov and fld/fstp operations.
Still remember, I’ve split the code into two parts just to squeeze them into one image without scrolling – to make it work you need to type it into one assembler.
Now we have covered basically every way to read and write data in arrays. Now is a good opportunity to show some very cool assembler optimizations – making use of bitwise masks and integers. If we use a buffer with a size of 2^N, in binary representation the size will look like this: 2=bin(00000010); 4=bin(00000100); 8=bin(00001000) etc. You surely noticed the pattern. Let’s use 8 as an example. Indexes of the values in a buffer of size 8 will be in the 0-7 range. All these values fit into 3bit notation (7=bin(00000111)). How do we force the program to use only the lowest 3 bits of the 32bit value and make all other bits zero? The answer: bitwise AND operations with custom bitmasks. The cool thing about them is that they work in both directions. Integer -1 has a binary representation with all bits on (-1=bin(…1111111)) – when you apply a bitmask with the lowest 3 bits true it wraps around to 7, and 8=bin(00001000) wraps around to 0. So with buffer sizes 2^N you may wrap indexes using bitwise AND with the mask (2^N)-1.
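The wrapping rule is easy to verify in Python, whose integers behave like two's complement numbers under bitwise AND. The function name wrap is made up for this sketch.

```python
def wrap(index, size):
    """Wrap an integer index into 0..size-1 using AND with the mask size-1."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return index & (size - 1)


below = wrap(-1, 8)    # one step below 0 lands on the last slot
above = wrap(8, 8)     # one step past the end lands on slot 0
table = [wrap(i, 8) for i in range(-3, 11)]
```

The mask gives the same result as a modulo for both negative and positive indexes, but it only works when the size is a power of two, which is why wrap rejects any other size.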
Note that in Flowstone we can only ADD integers in assembly (opcodes for subtraction are unfortunately not supported), however you may add negative constants. In our original delay the index increases by 1 every sample, therefore the delay output lies at position (index-delay). If we reverse the indexing and always decrease it by 1 (=increase by -1) we may add delay to retrieve the delayed value. Now we can implement the wrapping to avoid many comparisons, additions and conversions (the conversions especially take a lot of CPU cycles).
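Put together, the reversed index and the bitmask give a delay that needs only additions and ANDs. The following is a Python model of that idea, not the assembler from the image. The class MaskedDelay is hypothetical and the buffer size is assumed to be 2^bits.

```python
# Delay with a downward-counting index and power-of-two wrapping.
class MaskedDelay:
    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1          # (2^N)-1
        self.mem = [0.0] * (1 << bits)       # buffer of 2^N samples
        self.index = 0

    def process(self, sample, delay):
        self.index = (self.index + -1) & self.mask       # add -1, then wrap
        self.mem[self.index] = sample                    # write the input
        return self.mem[(self.index + delay) & self.mask]  # add delay, wrap


d = MaskedDelay(bits=3)                      # 8 samples, so the wrap is visible
outputs = [d.process(float(x), 3) for x in range(1, 11)]
```

Because the index counts down, older samples sit at higher positions, so the delayed value is found by adding delay instead of subtracting it. The default of 16 bits gives 65536 samples, the smallest power of two that still covers the 44100 samples used earlier.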
Now you can see how flexible Assembler is compared to the Code component – this is in no way possible to write in DSP code. Nevertheless, function-wise you can still write code in the DSP Code component that works just as well as this one (at the cost of higher CPU).
Next time we will finally have a look at things that are completely impossible in Code component – proper code branching, looping and hopping. Also with some bonus stuff…