How to use Assembler? part 3: ALU, FPU and array management


This article will come in 4 parts and will cover everything you will ever need to know about programming in assembler in Flowstone 3.0.5 and lower (including Synthmaker). This is part 3, and we will have a look at the ALU and FPU parts of the CPU, which let us manage individual 32-bit variables, including arrays.

So, we have covered the basics of SSE in assembler, but there are many things that are not supported by the Flowstone assembler opcodes (or by SSE in general). These things must be done in older parts of the CPU known as the ALU (arithmetic and logic unit) and the FPU (floating point unit). Unlike SSE, which is relatively safe (in terms of crashes), the ALU and FPU may crash FS if you force them to do impossible/unsupported operations. ALWAYS EDIT YOUR ASSEMBLER CODE WITH YOUR SOUNDCARD OFF AND THE ASSEMBLER COMPONENT DISCONNECTED FROM ITS OUTPUTS! SAVE YOUR SCHEMATIC EVERY TIME BEFORE CONNECTING IT!

The ALU has multiple registers that you can use in Flowstone. All of them are 32 bits wide and each has its own special purpose, unlike the xmm registers in SSE, which are all basically the same.

eax – this is a general purpose register. It is used to move 32-bit values and can also hold an address for loading things from RAM. It is also used for indexing into arrays.

ebx – also a multipurpose register like eax. You have to use a little trick to use it (we will get to that later), otherwise the code will crash.

ecx – this register holds the sample index (on each new sample it increases by 1 – it is an integer). This register is "read only" in FS. It is used in the stock "hop" operation in the Code component.

ebp – base pointer – holds the address of the base of the memory stack.

esp – stack pointer – holds the address of the top of the stack.

What is the memory stack, you ask? Remember, when code is compiled your operating system reserves a space in RAM where it can store its variables. Each variable is then referenced by the processor by its address (position in RAM). ebp holds the base address of the current stack frame, and the variables are addressed relative to it (ebp plus or minus some offset) – in programming languages these offsets are hidden behind variable names. The top of the stack is pointed to by esp. You may use push reg; to store the content of a 32-bit register onto the stack, and pop reg; does the opposite – it pops the value from the top of the stack back into the register. Each push or pop moves esp by 4 (on x86 the stack grows downwards, so push actually decreases esp and pop increases it). The surrounding program expects esp to have the same value before your computations as after them. That means you have to push and pop the same number of times, or the code crashes FS!!!
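For illustration, here is a minimal hypothetical sketch of keeping the stack balanced – this push/pop save-and-restore is presumably also the "little trick" that makes ebx safe to use (assuming mov works in both the memory-to-register and register-to-memory direction):

push ebx;          // park the old content of ebx on the stack
mov ebx,in[0];     // ebx is now free to use as a scratch register
mov out[0],ebx;    // copy the raw 32 bits of in(0) into out(0)
pop ebx;           // restore ebx – pushes and pops are balanced again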

RAM

 

Why is this useful? The ALU lets you access the 32-bit value (either float or int) of any variable, in any channel. Together with the FPU and SSE it lets you read and write data from/to arrays and individual channels of SSE variables. Note that whenever you read/write data in RAM the data must be aligned. Also, operations done by the ALU are basically CPU-free (modern CPUs with pipelining and superscalar execution may perform several of these operations at once, or start a new one before the old ones have finished).
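As a tiny hypothetical illustration of that direct 32-bit access – copying individual channels without touching SSE or the FPU at all (again assuming mov works in both directions):

mov eax,in[0];     // load the raw 32 bits of channel 0 of "in" into eax
mov out[0],eax;    // store them into channel 0 of "out" – no conversion of any kind
mov eax,in[3];     // the same works for any other channel…
mov out[3],eax;    // …as long as the 4-byte alignment is respected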

Before we can make use of the ALU in FS we must learn something about the FPU too. The FPU is the part of the processor that does operations in floating point format. It has 8 registers which are 80-bit floats (1 sign bit, 15-bit exponent, 64-bit mantissa – they can hold practically any value losslessly, including 32-bit integers). These registers work as a stack in LIFO fashion (Last In, First Out). They are named st(*) (where * is a number from 0 to 7). The FPU instruction set contains several operations that are not supported by SSE, at the price that you must process each channel separately (and it may take a hell of a lot more CPU, cos' well… 80-bit precision).

The following image shows the most common FPU operations; on the left you may observe the contents of the FPU stack:

FPU

 

You may find more FPU operations supported by FS here (along with all other supported opcodes – the FPU ones always start with "f"). Also notice that we handle SSE variables as arrays: in the last two lines we store data to channels 0 and 1 of the variable respectively. And a useful tip – fstp st(*); (where * is a number from 0 to 7) copies st(0) into st(*) and then pops the stack, so fstp st(0); is a quick way to discard the top value and keep the stack balanced, although it does not let you store that value anywhere.
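A minimal sketch (not taken from the image above) of how the stack behaves when you load and store per channel:

fld in[0];      // push in(0) – it becomes st(0)
fld in[1];      // push in(1) – it is st(0) now, in(0) has moved down to st(1)
fstp out[1];    // pop st(0) (= in(1)) into out(1)
fstp out[0];    // pop the remaining value (= in(0)) into out(0)
// the FPU stack is empty again – always leave it the way you found it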

Now let's finally get to array management. When loading SSE variables into the FPU you handle them as an array of four 32-bit values. With actual arrays things are more complicated: you must specify where in the array the value sits using the eax register. Because an address specifies the position of a byte and our 32-bit values are 4 bytes wide, the index must increase in 4-byte steps per value. If you try reading data from a position not aligned to the real value boundaries, FS will crash!!! We may convert an integer index to an address-based index by multiplying it by 4, or more easily by a logical shift left by 2 (using the operation shl eax,2; or pslld xmm0,2;).
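A small hypothetical example of that conversion, assuming the integer element index is already sitting in eax and the array is called mem (the same name the delay below uses):

shl eax,2;       // element index * 4 = byte offset (each 32-bit value is 4 bytes wide)
fld mem[eax];    // load the float stored at that (aligned) offset
fstp out[0];     // pop it into output channel 0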

Here is the code for a simple delay. The first part of the code cycles "c" in the 0–44099 range. This value is used to write the input into an array called mem. Then "delay" is subtracted from "c", wrapped into the 0–44099 range, and the result is used to extract a value from the array and store it into the output. This will naturally work only in mono, because only in[0] and out[0] are processed.

simple delay
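The counter and the wrapping arithmetic are in the image above; stripped down to just the array traffic, the mono core is roughly the following (writeIdx and readIdx are hypothetical names for the two already wrapped and already multiplied-by-4 indexes):

mov eax,writeIdx[0];        // byte offset of the current write position
fld in[0];                  // push the new input sample
fstp mem[eax];              // pop it into the array at the write position
mov eax,readIdx[0];         // byte offset of (write position - delay), wrapped
fld mem[eax];               // fetch the delayed sample
fstp out[0];                // pop it into the mono output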

To make this delay work in mono4 and streams we have to rearrange a few things. As mentioned before, arrays are arrays of SSE variables: an array of size 7 can hold 7 SSE variables = 7*4 normal 32-bit floats. So we increase the array size to 44100 and use each channel of the SSE variable to hold the values for that channel. The index calculations stay the same, however after converting them to integer we will use pslld xmm0,4; to multiply the index by 16 (SSE variables are aligned to 16 bytes). With this setup index*16 + 0 refers to channel (0), index*16 + 4 refers to channel (1), etc. Now we cannot use the movd command – we must store the indexes in a temporary variable and read them channel by channel using mov eax,Temp[*]; respectively. The code will now look something like this – very similar to what the Code component would output (note the code is split across two assembler components only to fit into one image; it should all go into the same one):

streamdelay
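A hypothetical sketch of the reading section just described – xmm1 is assumed to already hold each channel's read index multiplied by 16:

movaps Temp,xmm1;                                          // park the byte offsets so eax can pick them up
mov eax,Temp[0];              fld mem[eax]; fstp out[0];   // channel 0: offset +0
mov eax,Temp[1]; add eax,4;   fld mem[eax]; fstp out[1];   // channel 1: offset +4
mov eax,Temp[2]; add eax,8;   fld mem[eax]; fstp out[2];   // channel 2: offset +8
mov eax,Temp[3]; add eax,12;  fld mem[eax]; fstp out[3];   // channel 3: offset +12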

As you can see, we are adding the channel offset 3 times (for channels 1, 2 and 3 respectively). Now we may make use of stages. Stage(0) is executed first and only once (on the first sample). We can make an SSE variable (for example called offsets) that holds all 4 offset values, and then we may add all of the offsets at once using paddd. Because we can't declare values in individual channels of SSE variables, we fill them in stage(0). We then add these offsets after each pslld, before storing the indexes in Temp.
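A hypothetical sketch of that improvement (offsets and Temp are assumed to be declared as SSE variables, and mov is assumed to accept both integer constants and memory destinations):

// in stage(0) – runs only once, on the first sample:
mov eax,0;  mov offsets[0],eax;   // channel 0 -> +0 bytes
mov eax,4;  mov offsets[1],eax;   // channel 1 -> +4 bytes
mov eax,8;  mov offsets[2],eax;   // channel 2 -> +8 bytes
mov eax,12; mov offsets[3],eax;   // channel 3 -> +12 bytes

// per sample – xmm1 holds the integer element index in all four channels:
pslld xmm1,4;        // index * 16 (one SSE variable is 16 bytes wide)
paddd xmm1,offsets;  // add 0/4/8/12 so every channel points at its own float
movaps Temp,xmm1;    // then read channel by channel with mov eax,Temp[*];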

Also you may notice that in the writing section (the part that writes "in" into "mem") the index is the same for all 4 channels (apart from the offsets). The SSE operation movaps can also store values in arrays. However, keep in mind that movaps operates on SSE variables, which are 16 bytes wide – that naturally means the index must increase in multiples of 16. Now we can write all 4 channels into mem at the same time using a single instruction, instead of chaining multiple mov and fld/fstp operations.
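The writing section then shrinks to roughly this (a sketch; eax must hold a 16-byte aligned byte offset, i.e. the write index * 16 without any channel offset):

movaps xmm0,in;         // grab all four input channels at once
movaps mem[eax],xmm0;   // store them as one aligned 16-byte block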

streamdelay2

Still remember, I’ve split the code into two parts just to squeeze them into one image without scrolling – to make it work you need to type it into one assembler.

Now we have covered basically every way to read and write data in arrays, so this is a good opportunity to show a very cool assembler optimization – making use of bitwise masks and integers. If we use a buffer whose size is a power of two (2^N), the binary representation of the size looks like this: 2 = bin(00000010); 4 = bin(00000100); 8 = bin(00001000), etc. You surely noticed the pattern. Let's use 8 as an example. The indexes of the values in a buffer of size 8 are in the 0–7 range, and all of them fit into 3-bit notation (7 = bin(00000111)). How do we force the program to use only the first 3 bits of a 32-bit value and make all other bits zero? Answer: a bitwise AND operation with a custom bitmask. The cool thing about it is that it works in both directions: the integer -1 has a binary representation with all bits on (-1 = bin(…1111111)), so when you apply a mask with only the first 3 bits set it wraps up to 7. So with buffer sizes of 2^N you may wrap the index using a bitwise AND with the mask (2^N)-1.

intLoop
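A hypothetical sketch of such a wrapping counter for a buffer of 8 values (the index is kept as a raw integer in channel 0 of a variable called idx):

mov eax,idx[0];   // load the current integer index
add eax,1;        // advance by one
and eax,7;        // the mask (2^3)-1 keeps only the lowest 3 bits, so 8 wraps back to 0
mov idx[0],eax;   // store it back – the same mask also wraps -1 up to 7 when counting backwards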

Note that in Flowstone we can only ADD integers in assembler (opcodes for subtraction are unfortunately not supported), however you may add negative constants. In our original delay the index increases by 1 every sample, therefore the delayed output lies at position (index - delay). If we reverse the indexing and always decrease it by 1 (= add -1), we may simply add the delay to retrieve the delayed value. Together with the mask wrapping this removes a lot of comparisons, additions and conversions (the conversions in particular cost a lot of CPU cycles).

delayOpt
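Roughly, for a single channel and a hypothetical buffer of 65536 floats (so the byte offsets wrap with the mask 65536*4-1 = 262143), the idea looks like this – writeIdx and delayBytes are hypothetical variables holding the write position and the delay, both already converted to byte offsets:

mov eax,writeIdx[0];        // current write position as a byte offset
add eax,-4;                 // step backwards by one float per sample
and eax,262143;             // wrap the byte offset inside the 2^N sized buffer
mov writeIdx[0],eax;        // remember it for the next sample
fld in[0]; fstp mem[eax];   // write the new input sample
add eax,delayBytes[0];      // jump forward by the delay (a multiple of 4, not larger than the buffer)
and eax,262143;             // wrap again
fld mem[eax]; fstp out[0];  // read the delayed sample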

Now you can see how flexible assembler is compared to the Code component – there is no way to write this in DSP code. Nevertheless, function-wise you can still write code in the DSP Code component that works just as well as this one (at the cost of higher CPU).

Next time we will finally have a look at things that are completely impossible in Code component – proper code branching, looping and hopping. Also with some bonus stuff…

 

 

kohugaly

About the author:

Hello Synthmakers and Flowstoners! I'm a 22-year-old pharmacy student. You may know me from the FS and SM forums as KG_is_back. I've been using Flowstone since the times it was Synthmaker2. Whenever I'm not fiddling on a double-bass in some caves, I'm lurking on the Flowstone forum, ready to help if I can.

10 Comments

  1. Exo - October 16, 2014 - 11:02 am

    This is my favourite part so far :) SSE has always had a lot of discussion over on the forums but this part hasn’t. So thanks this has taught me a few things I didn’t know.

    • kohugaly - October 16, 2014 - 4:10 pm

      Yes, it's a shame it wasn't discussed more, but on the other hand I'm not surprised – there is very little relevant documentation on this part… I mean, the only information we have on this from Outsim/DSPr is that it's x86 assembler, plus a list of opcodes in the manual with no description of the opcodes at all (a sparse description is on the WIKI though). For example there is no info on how exactly the FPU operations affect the stack – it took me an hour of testing and crashing FS to find out which operations work on which registers in the stack and what they push and pop. I'm thinking of creating a list of opcodes with their full description along with clear examples. I'd also like to add average CPU load to the list, which is available nowhere else. Perhaps martinvicanek could help me with this.

      • Exo - October 16, 2014 - 4:50 pm

        Yes, it hasn't been documented well; a proper list with descriptions and examples would be great. I could set up a page somewhere on the site or maybe just a PDF? Although I think a page would be better if there are multiple people editing it.

        • kohugaly - October 16, 2014 - 4:57 pm

          Yes, I think a page would be the better solution – it does not really fit as a blog article. Anyway, imagine the annoying situation of DSPr completely revising the DSP code component and assembler just after we finally documented it :-D

  2. martinvicanek  - October 16, 2014 - 8:03 pm

    Some serious stuff in there, KG! Comparing your delay example to the one in Trogz Toolz I can find lots of similarities, except that you write all 4 SSE channels in one go. Great article, thanks for sharing!

    • kohugaly - October 17, 2014 - 12:57 am

      With limited resources and technology you are very likely to develop a similar product. Reminds me of Concorde… at the same time the Russians developed the Tu-144, which turned out to be almost completely identical to Concorde… except it crashed during an airshow when the pilot attempted to do a backflip (which by common sense is a very bad idea to try with a 100-ton supersonic aircraft xD)

  3. martinvicanek  - October 18, 2014 - 4:10 pm

    It occurred to me that if you can write all 4 SSE channels in one step, like

    movaps mem[eax],xmm0;

    then it should also be possible to read all 4 channels in one step, like

    movaps xmm1,mem[eax];

    so the costly fld and fstp instructions could be avoided. Indeed, this works – however, this returns values for the same index for all SSE channels. Not what you want. :-(

    So I thought maybe 4 such reads and a little bit of shuffling and masking might still be faster than fld and fstp. This turned out to be very much the case! On my machine the following replacement cuts the cycles by almost a factor 3! :-O :-)

    /*
    paddd xmm1,offsets;
    movaps Temp,xmm1;
    mov eax,Temp[0]; fld mem[eax]; fstp out[0];
    mov eax,Temp[1]; fld mem[eax]; fstp out[1];
    mov eax,Temp[2]; fld mem[eax]; fstp out[2];
    mov eax,Temp[3]; fld mem[eax]; fstp out[3];
    */

    movd eax,xmm1; movaps xmm3,mem[eax];
    shufps xmm1,xmm1,57; // 0123 -> 1230 (1 is first)
    movd eax,xmm1; movaps xmm4,mem[eax];
    shufps xmm1,xmm1,57; // 1230 -> 2301 (2 is first)
    movd eax,xmm1; movaps xmm5,mem[eax];
    shufps xmm1,xmm1,57; // 2301 -> 3012 (3 is first)
    movd eax,xmm1; movaps xmm6,mem[eax];
    shufps xmm3,xmm4,68; // 1xxx,x2xx -> 1xx2 (x = don’t care)
    shufps xmm5,xmm6,238; // xx3x,xxx4 -> 3xx4
    shufps xmm3,xmm5,204; // 1xx2,3xx4 -> 1234

    So I wonder if this is a fast read access to arrays in general, could we use it for lookup tables, wave table oscillators, etc?

    • kohugaly - October 18, 2014 - 5:54 pm

      The delay above is supposed to be poly compatible. For tasks like latency compensation, where you delay all channels by the same amount, movaps xmm?,array[eax]; is naturally the choice.

      Well, this never occurred to me… I mean that 4x movaps xmm?,array[eax]; would be faster than 4x fld [eax]; This is a game changer. For lookup tables you would have to nullify the first 2 binary digits of the index, use that for movaps xmm0,array[eax],
      and then use those first two binary digits to pick one of 4 shufps.
      It would look something like this for one channel:
      movd eax,index;
      and eax,-4; //-4 has lowest two bits off
      movaps xmm0,table[eax];
      movd eax,index;
      and eax,3; //only lowest two bits on;
      cmp eax,0;
      jnz chan1;
      shufps xmm0,xmm0,0; //this doesn't have to be there at all
      chan1:
      cmp eax,1;
      jnz chan2;
      shufps xmm0,xmm0,1; //moves ch(1) to ch(0)
      chan2:
      cmp eax,2;
      jnz chan3;
      shufps xmm0,xmm0,2; //moves ch(2) to ch(0)
      chan3:
      cmp eax,3;
      jnz chan4;
      shufps xmm0,xmm0,3; //moves ch(3) to ch(0)
      chan4:
      //now xmm0 contains the value in channel(0)
      //to read 4 channels the above part would have to be 4x and outputs would have to be shufps-ed.

      The problem might be when the table's size is not N*16 bytes – the last values would be difficult to read and might crash.

      • martinvicanek  - October 21, 2014 - 8:35 pm

        Ah, now I get it. Too bad the third shufps parameter can’t be a variable, it would be much easier.
        Anyway, you can trade memory space for speed: if you create a 16-byte aligned array where data is written only to the first SSE channel, say, then reading is much easier and doesn't require nearly as much shufpsing. Yes, you are wasting 75% of the RAM, but that is only a problem for very large arrays.

        • kohugaly - October 22, 2014 - 7:30 pm

          Now the question is: "is it really worth it?" Because you are not only wasting 75% of the RAM – you are also creating a massive LAG spike on the first sample to copy all the stuff over. That makes it completely useless for samplers with long sound files in them (copying the whole wave with the FPU would take more CPU than reading just the needed part sample by sample with the FPU, cos' you know, you usually don't play the full file). From what I can tell, you can't pre-compute the SSE array since mems aren't SSE compatible. Even so, it would increase the loading time of the software quite a bit.

          EDIT: what I mean is, in synths and samplers the lag spike on the first sample is what causes the glitches – when your synth overloads your CPU on each new note, it is not usable for live performance. The CPU load on the rest of the samples is not that groundbreaking, since offline "rendering" is not so much of an issue.
