Sensor smoothing and optimised maths on the Arduino

As part of my attempt to load as much as possible onto an Arduino, I added a LIS302DL accelerometer into the mix and used the Y axis output to drive a strip of LEDs via a couple of TLC5916 LED drivers. I was sampling the accelerometer every 10msec and it jittered a little, so I wanted to add some smoothing to the sensor values. There are all sorts of fancy smoothing algorithms available, but a simple Exponential moving average worked just fine, and was easy to calculate:

    // choose a weighting factor W between 0 .. 1, then
    // current average = (current sensor value * W) + (last average * (1 - W))
    static uint16_t avg = 0;
    uint8_t val = getAccelValue();
    avg = val * 0.1 + avg * 0.9;

The actual sensor value is a signed 8-bit integer and the normal 1G reading (i.e. when it's being simply tilted and not waved around) ranges from around -55 to +55, so I capped the value at +/-50 and added 50 to it to get a value between 0 and 100 that I could then feed into the exponential moving average calculation before mapping the value onto the LED strip, with 'horizontal' resulting in the middle LED on the strip being illuminated, and the lit LED moving from side to side as the strip is tilted.

All well and good, the maths is simple and everything works fine. However the AVR doesn't have floating-point hardware support, so I wanted to see if I could improve the performance of the calculation above. Using integer arithmetic is the most obvious way, but that can't be done directly as there are fractional factors used in the calculation. However there's a standard way of doing this which is to use Fixed-point arithmetic. To do this, all the numbers involved are multiplied by a scaling factor to in effect move the decimal point to the right. An example would be doing financial calculations in pence rather than pounds. Addition/subtraction of scaled numbers works as normal as does multiplication/division of a scaled by an unscaled number. The operations that need special handling are multiplication/division of two scaled numbers. Here's why: if we have two numbers a and b that are scaled by S and we multiply them we get aS * bS. That simplifies to abS2, so to keep the scaling correct we need to divide by the scale factor after multiplication so we end up with abS. Division is the inverse, we have to multiply by S to keep things straight. Finally, as we are dealing with binary numbers, if we make the scale factor a power of two we can implement the scale factor calculations as efficient bit shifts rather than having to do multiplications and divisions.

There is in fact limited support for fixed-point math in AVR gcc but as far as I can tell it's only available as a patch, which would mean having to recompile my AVR toolchain. However it's simple enough to implement fixed-point yourself, so that's what I did. The first and important thing is to choose the scale factor, being careful to pick a value that doesn't result in overflow at any point in the calculation. In this case, I knew the range of sensor values was going to be between 0 and 100. If I choose a scale factor of 25 the calculation above in fixed-point form would therefore be:

    static uint16_t avg = 0;
    uint8_t val = getAccelValue();
    avg = ((val << 5) / 10) + (avg * 9 / 10);
    uint8_t currVal = (avg + 16) >> 5;

So the maximum intermediate value would be 100 * 25 * 9 or 28800 which is well within the bounds of a 16-bit unsigned integer. A couple of things to note: the scaling/unscaling is done by simply shifting by 5 bits left or right as appropriate, and the 0.1 and 0.9 multiplications of the floating-point version become / 10 and * 9 / 10 respectively. Finally, to get the unscaled average value we need to do the equivalent of adding 0.5 and rounding before unscaling the value which is what the addition of 16 is for - 25 / 2 = 16. Next I wrote a small program that fed a range of values into both the floating and fixed point versions and compared the results. In all cases, the difference between the floating and fixed point versions was no more than +-1, which was perfectly adequate. Because the sensor values are being sampled at 100Hz the fixed-point exponential moving average converges to the same value as the floating-point value within only a few samples even when there is a difference.

The next step was to actually measure how long the floating-point and fixed-point calculations took. To do that, I wrote a little stand-alone program that fed the values 0 .. 100 into the calculation, and then repeated that 1000 times. I then recorded the start and end time in microseconds around the loop, which gave me the time to execute 100,000 of the calculations. I also timed an empty loop containing just a NOP instruction so I could subtract the loop overhead from the calculation. The CPU is running at 16MHz and each NOP takes 1 cycle, so I can subtract the time for the NOPs from the total overhead. Without the NOPs there's a good chance the gcc optimiser will figure out that the loop does nothing and optimise it away entirely, defeating the purpose of the overhead calculation.

    // Overhead.
    uint32_t start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (int8_t val = 0; val <= 100; val++) {
            _NOP();
        }
    }
    uint32_t ohead = micros() - start - (1000L * 100 / 16);
    printf_P(PSTR("overhead %.2f\n"), ohead / 100000.0);

    // Floating point 1:10.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (int8_t val = 0; val <= 100; val++) {
            v1 = val * 0.1 + v1 * 0.9;
        }
    }
    printf_P(PSTR("floating point 1:10 %.2f\n"),
      (micros() - start - ohead) / 100000.0);avr200

    // Fixed point 1:10.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v2 = ((val << 5) / 10) + (v2 * 9 / 10);
            v3 = (v2 + 16) >> 5;
        }
    }
    printf_P(PSTR("fixed point 1:10 %.2f\n"),
      (micros() - start - ohead) / 100000.0);

OK, let's run that and look at the results, the CPU clock rate is 16MHz and the timings are in microseconds per calculation:

overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48

Wait a minute - the fixed-point calculation is slower than the floating-point version! How on earth can that be? Hmmm. Well, the floating and fixed-point versions aren't exactly equivalent. The floating-point version uses two multiplies, the fixed-point version uses two divisions and a multiply. Let's rewrite the floating-point version and make it exactly equivalent to the fixed-point version:

    // Floating point 1:10 division.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v1 = val / 10.0 + v1 * 9.0 / 10.0;
        }
    }
    printf_P(PSTR("floating point 1:10 division %.2f\n"),
      (micros() - start - ohead) / 100000.0);

Now the results are:

overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48
floating point 1:10 division 79.39

OK, that at least explains what the problem is - it's the division steps in the calculation. Whilst the AVR has 8-bit hardware multiply instructions, it has no hardware division. It turns out that division is harder and therefore and slower to implement than multiplication, both in hardware and software. Multiplication can be done by simple repeated addition, division requires test subtractions and comparisons. There's a good Atmel application note AVR200 that gives some comparative timings for multiplication and division implemented entirely in software:

ApplicationCode Size (Words)Execution Time (Cycles)
8 x 8 = 16 bit unsigned (Code Optimized)958
8 x 8 = 16 bit unsigned (Speed Optimized)3434
8 x 8 = 16 bit signed (Code Optimized)1073
16 x 16 = 32 bit unsigned (Code Optimized)14153
16 x 16 = 32 bit unsigned (Speed Optimized)105105
16 x 16 = 32 bit signed (Code Optimized)16218
8 / 8 = 8 + 8 bit unsigned (Code Optimized)1497
8 / 8 = 8 + 8 bit unsigned (Speed Optimized)6658
8 / 8 = 8 + 8 bit signed (Code Optimized)22103
16 / 16 = 16 + 16 bit unsigned (Code Optimized)19243
16 / 16 = 16 + 16 bit unsigned (Speed Optimized)196173
16 / 16 = 16 + 16 bit signed (Code Optimized)39255

So even without hardware division support there's a significant difference in the speed/space trade-offs between multiplication and division. Add in to that the fact that the AVR has hardware multiply but no hardware divide and it's not really surprising that there's a big performance difference between multiplication and division. The question is, is there anything we can do about it?

Well, it turns out there is. We avoided division in the fixed-point scaling operations by using bit shifts, we can do so with the exponential moving average calculation as well by choosing a decay factor that is a power of two. I chose 24 as the value, but I then needed to recheck that the intermediate results wouldn't overflow. As above, the biggest term in the calculation is 100 * 25 * 15 which is 48000, so we are OK. To keep the comparisons fair I modified the floating-point version to use the same decay factor as the fixed-point version, giving:

    // Floating point 1:16 multiplication.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v1 = val * 0.0625 + v1 * 0.9375;
        }
    }
    printf_P(PSTR("floating point 1:16 multiplication %.2f\n"),
      (micros() - start - ohead) / 100000.0);

    // Fixed-point multiplication & shift.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v2 = (val << 1) + ((v2 * 15) >> 4);
            v3 = (v2 + 16) >> 5;
        }
    }
    printf_P(PSTR("fixed point 1:16 multiplication & shift %.2f\n"),
      (micros() - start - ohead) / 100000.0);

Note the scaling of val by 25 followed by division by 24 is the same as shifting left by one bit, and the division of v2 by 16 is implemented by shifting it left 4 bits. OK, what timings do we get now?

floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99

OK, that looks much better, the fixed-point version is now 5.7x faster than the floating-point version, which is the sort of speed up we were looking for. But before we declare victory, is there any more juice to be squeezed out? Well, it turns out there is. Firstly, C always promotes integer operands in arithmetic operations to ints, which are 16 bits on the AVR, so our 16x8 bit calculation becomes a 16x16 bit one. Also, looking at the assembler output reveals that the bit shifts are all implemented as loops - the AVR can only shift a register one bit at a time. We can probably get some benefit by implementing our own 16x8 bit multiplication and unrolling the bit shift loops - time to pull out the AVR assembler documentation! Here's the code:

    // C equivalent
    // v1 = (val << 1) + ((v1 * 15) >> 4);
    // v2 = (v1 + 16) >> 5;
    asm(
    "; v1 *= 15\n"
    "       ldi   r16, 15\n"
    "       mov   r17, %B[v1]\n"
    "       mul   %A[v1], r16\n"
    "       movw  %[v1], r0\n"
    "       mulsu r17, r16\n"
    "       add   %B[v1], r0\n"
    "; v1 >>= 4\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "; val <<= 1\n"
    "       mov   r0, %[val]\n"
    "       lsl   r0\n"
    "; v1 += val\n"
    "       clr   r1\n"
    "       add   %A[v1], r0\n"
    "       adc   %B[v1], r1\n"
    "; v2 = v1\n"
    "       movw  %[v2], %[v1]\n"
    "; v2 += 16\n"
    "       ldi   r16, 16\n"
    "       add   %A[v2], r16\n"
    "       adc   %B[v2], r1\n"
    "; v2  >>= 5\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    : [v1] "+a" (v4), [v2] "=a" (v5)
    : [val] "r" (val)
    : "r16", "r17"
    )

The only slightly tricksy bit is implementing a 16x8 bit multiplication using the 8x8 bit hardware multiplier on the AVR. In principle it's no different to the way you were probably taught to do long multiplication in primary school, with some tweaks to take advantage of the fact that we know the result of the multiplication will always fit in 16 bits rather than 24. If you want more details on how this works I can recommend this blog post and the Atmel AVR201 application note "Using the AVR hardware Multiplier" for more information.

OK, what are the timings for the assembler version?

floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99
fixed point assembler 3.09

So the assembler version is 1.6x faster than the fixed-point C version, and nearly 10x faster than the floating-point version. Finally, what conclusions can we draw from all this? Well, the following ones leap out to me:

  • Division on the AVR is slow, whether it be floating-point or integer. Avoid it wherever you can.
  • If you have to use division, it may well be faster to use floating-point and multiply by the reciprocal of the numbers you were dividing by.
  • Fixed-point math is fairly easy to do and can yield significant performance benefits, as long as you avoid those pesky divisions. If you can't it may not offer much benefit over multiply-only floating-point.
  • Hand-coded assembler will help if you need to squeeze out every last cycle, but in absolute terms the speed-up you'll get by using C fixed-point multiply-and-shift-only will probably be sufficient.
Categories : AVR, Tech


Re: Sensor smoothing and optimised maths on the Arduino

My mind, it is blown. This is _exactly_ the kind of math-fiddling I used to do from time to time when number-crunching was unavoidable - including the part about writing custom 8x16 or 16x24 type math routines just to speed things up a bit more (then of course pondering how many test cases are enough to "prove" your shiny new math routine is in fact bug-free). The following is one of my favorites, a rather amusing example - illustrates dealing with a temperature reading coming in in 1/32 fractions of a degree, with an actual accuracy of tenths, which one would prefer to obtain:

    - The incoming 14-bit value from the sensor is in "0.03125-th" degrees Celsius;
    - "14-bit value * 0.03125" is in degrees Celsius, but obviously has a fractional part;
    - "14-bit value * 0.03125 * 10" is in TENTHS of degrees - just as we need it - therefore we can discard any remaining fractional part;
    - 0.03125 is the inverse of 32, therefore "14-bit value / 32 * 10" is the same as the above, in tenths of degrees;
    - "x / 32 * 10" = "x / 32 / 8 * 10 * 8" = "x / 256 * 80", therefore we can multiply by 80 then drop the LSB byte, and still get tenths of degrees;
    - The incoming 14-bit value comes in as the UPPER 14-bit of a 16-bit number, therefore it's ALREADY multiplied by 4 (if we zero out the 2 extra bits);
    - Therefore, "(16-bit number & 0xFFFC) * 20" is STILL in tenths of degrees after we drop the result's LSB byte, and that is a rather simple operation;

Who said math isn't fun, huh? :)

Re: Sensor smoothing and optimised maths on the Arduino

Nice example, clearly explained, thanks :-) There's a useful paper about reciprocal multiplication as a substitute for division here

.

Re: Sensor smoothing and optimised maths on the Arduino

Did you try this only on the Arduino ?

The avr gcc compiler can do a lot more than the default Arduino options. The newest avr gcc compiler (I tested the newest one with Ubuntu 12.10 alpha) has changed some optimizations.

I'm wandering if the getAccelValue() can be optimized ? If the is the default analogRead(), you can get a lot faster code by writing your own avr code.

P.S.: I did my optimizations mostly for the smallest program size, not for speed.

Re: Sensor smoothing and optimised maths on the Arduino

Whilst this was done on an Arduino board I didn't use any of the Arduino software whatsoever - see this earlier post for an explanation of why I don't use any of the Arduino libraries. So no, I wasn't using the Arduino library AnalogRead function, I was directly setting up the ADC and using interrupts to process the ADC completion events. I explain a little bit more about this is this other post, where I talk about performance monitoring on the AVR using a low-cost logic analyser.

Re: Sensor smoothing and optimised maths on the Arduino

I read your other posts after I wrote my remarks.

So you are way ahead of me.

I've seen your make file: http://bleaklow.com/files/2010/Makefile.master

But it seems just the normal optimizations. There is still plenty to try. The OPT_FLAGS still uses the '-Os' which is slow, and there are many more (-fwhole-program, -frename-registers, and so on).

Re: Sensor smoothing and optimised maths on the Arduino

You can optimize for speed or size, or some mix of the two, and on AVR size is usually most important, which is why I've leaned towards that side of things. The actual optimiser flags I use are

-Os -funsigned-char -funsigned-bitfields -fpack-struct -fshort-enums -ffunction-sections -fdata-sections -Wl,--gc-sections,--relax -fno-inline-small-functions -fno-tree-scev-cprop -fno-exceptions -ffreestanding -mcall-prologues

based on information from the gcc documentation and here.  I'm sure fiddling with the flags some more would speed things up a bit, but you'd never get the optimiser to give as good result as the hand-coded 16bit x 8bit multiply, because of C's integer arithmetic promotion rules, and because gcc can't know the range of the input value is restricted to 0-100.

Re: Sensor smoothing and optimised maths on the Arduino

You're right.

Using optimizations for speed might also speed up the calling and returning of functions, together with your hand-coded math, will probably improve it even more.

Re: Sensor smoothing and optimised maths on the Arduino

instead of v2*15, v2 << 4 - v2

better yet, don't save ave, save the sum and do a >>4 for the value.

http://github.com/tz1 has minigpsd and webgpsd that have NO floating point, and qrduino with the same although the qr code algorithms have things like mod 3.

conquer without dividing.

Re: Sensor smoothing and optimised maths on the Arduino

I haven't worked out the exact assembler but shifting 4 places and adding as an alternative to multiplication by 15 is unlikely to be much faster. That's because the AVR can only shift 1 bit left or right at a time, and in this case it's a 16bit value so each of the 4 bit shifts requires 2 cycles, then there's the addition to do and of course the necessary register load/saves. The AVR MUL instruction only takes 2 cycles, although of course it only does 8bit x 8bit multiplications, so you need 2 of them and an addition.

Re: Sensor smoothing and optimised maths on the Arduino

I did something similar back in the autumn. Polling analog temperature sensors once a second needed smoothing and I did not want to loose accuracy or take up Sram with floating point numbers so I chose to use intigers and multiply by 16 rather than 10 using the base formula:

Av n = Val n + (Av n-1)*15/16

It becomes 16Av n = 16Val n + 15(Av n-1) where I want to use (16AV n) further on in the programm to preserve accuracy

so using 16Av n as the variable in the code

becomes 16Av n =  (Val n << 4) + (16Av n-1)- ((16Av n-1)  >> 4)

It seems to work pretty well