# Sensor smoothing and optimised maths on the Arduino

As part of my attempt to load as much as possible onto an Arduino, I added a LIS302DL accelerometer into the mix and used the Y axis output to drive a strip of LEDs via a couple of TLC5916 LED drivers. I was sampling the accelerometer every 10msec and it jittered a little, so I wanted to add some smoothing to the sensor values. There are all sorts of fancy smoothing algorithms available, but a simple Exponential moving average worked just fine, and was easy to calculate:

```c
// choose a weighting factor W between 0 .. 1, then
// current average = (current sensor value * W) + (last average * (1 - W))
static uint16_t avg = 0;

uint8_t val = getAccelValue();
avg = val * 0.1 + avg * 0.9;
```

The actual sensor value is a signed 8-bit integer and the normal 1G reading (i.e. when it's being simply tilted and not waved around) ranges from around -55 to +55, so I capped the value at +/-50 and added 50 to it to get a value between 0 and 100 that I could then feed into the exponential moving average calculation before mapping the value onto the LED strip, with 'horizontal' resulting in the middle LED on the strip being illuminated, and the lit LED moving from side to side as the strip is tilted.

All well and good, the maths is simple and everything works fine. However the AVR doesn't have floating-point hardware support, so I wanted to see if I could improve the performance of the calculation above. Using integer arithmetic is the most obvious way, but that can't be done directly as there are fractional factors in the calculation. There is however a standard technique for this: fixed-point arithmetic. All the numbers involved are multiplied by a scaling factor, in effect moving the decimal point to the right - an example would be doing financial calculations in pence rather than pounds. Addition and subtraction of scaled numbers work as normal, as does multiplication or division of a scaled number by an unscaled one. The operations that need special handling are multiplication and division of two scaled numbers. Here's why: if we have two numbers `a` and `b` that are scaled by `S` and we multiply them, we get `aS * bS`. That simplifies to `abS^2`, so to keep the scaling correct we need to divide by the scale factor after the multiplication, ending up with `abS`. Division is the inverse: we have to multiply by `S` to keep things straight. Finally, as we are dealing with binary numbers, if we make the scale factor a power of two we can implement the scaling operations as efficient bit shifts rather than having to do multiplications and divisions.

There is in fact limited support for fixed-point maths in AVR gcc, but as far as I can tell it's only available as a patch, which would mean having to recompile my AVR toolchain. However it's simple enough to implement fixed-point yourself, so that's what I did. The first and most important thing is to choose the scale factor, being careful to pick a value that doesn't result in overflow at any point in the calculation. In this case I knew the range of sensor values was going to be between 0 and 100. Choosing a scale factor of 2^5, the calculation above in fixed-point form becomes:

```c
static uint16_t avg = 0;

uint8_t val = getAccelValue();
avg = ((val << 5) / 10) + (avg * 9 / 10);
uint8_t currVal = (avg + 16) >> 5;
```

So the maximum intermediate value would be `100 * 2^5 * 9`, or 28800, which is well within the bounds of a 16-bit unsigned integer. A couple of things to note: the scaling and unscaling are done by simply shifting 5 bits left or right as appropriate, and the 0.1 and 0.9 multiplications of the floating-point version become `/ 10` and `* 9 / 10` respectively. Finally, to get the unscaled average value we need to do the equivalent of adding 0.5 and rounding before unscaling, which is what the addition of 16 is for: `2^5 / 2 = 16`.

Next I wrote a small program that fed a range of values into both the floating and fixed-point versions and compared the results. In all cases the difference between the two was no more than ±1, which was perfectly adequate. Because the sensor values are sampled at 100Hz, the fixed-point exponential moving average converges to the same value as the floating-point one within only a few samples even when there is a difference.

The next step was to actually measure how long the floating-point and fixed-point calculations took. To do that, I wrote a little stand-alone program that fed the values 0 .. 100 into the calculation, and then repeated that 1000 times. I recorded the start and end time in microseconds around the loop, which gave me the time to execute 100,000 of the calculations. I also timed an empty loop containing just a NOP instruction so I could subtract the loop overhead from the calculation times. The CPU runs at 16MHz and each NOP takes one cycle, so the time spent in the NOPs themselves can be subtracted too. Without the NOPs there's a good chance the gcc optimiser would figure out that the loop does nothing and optimise it away entirely, defeating the purpose of the overhead measurement.

```c
// Overhead.
uint32_t start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (int8_t val = 0; val <= 100; val++) {
        _NOP();
    }
}
uint32_t ohead = micros() - start - (1000L * 100 / 16);
printf_P(PSTR("overhead %.2f\n"), ohead / 100000.0);

// Floating point 1:10.
start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (int8_t val = 0; val <= 100; val++) {
        v1 = val * 0.1 + v1 * 0.9;
    }
}
printf_P(PSTR("floating point 1:10 %.2f\n"), (micros() - start - ohead) / 100000.0);

// Fixed point 1:10.
start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (uint8_t val = 0; val <= 100; val++) {
        v2 = ((val << 5) / 10) + (v2 * 9 / 10);
        v3 = (v2 + 16) >> 5;
    }
}
printf_P(PSTR("fixed point 1:10 %.2f\n"), (micros() - start - ohead) / 100000.0);
```

OK, let's run that and look at the results. The CPU clock rate is 16MHz and the timings are in microseconds per calculation:

```
overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48
```

Wait a minute - the fixed-point calculation is **slower** than the floating-point version! How on earth can that be? Hmmm. Well, the floating and fixed-point versions aren't exactly equivalent. The floating-point version uses two multiplies, the fixed-point version uses two divisions and a multiply. Let's rewrite the floating-point version and make it exactly equivalent to the fixed-point version:

```c
// Floating point 1:10 division.
start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (uint8_t val = 0; val <= 100; val++) {
        v1 = val / 10.0 + v1 * 9.0 / 10.0;
    }
}
printf_P(PSTR("floating point 1:10 division %.2f\n"), (micros() - start - ohead) / 100000.0);
```

Now the results are:

```
overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48
floating point 1:10 division 79.39
```

OK, that at least explains what the problem is - it's the division steps in the calculation. Whilst the AVR has 8-bit hardware multiply instructions, it has no hardware division. It turns out that division is harder, and therefore slower, to implement than multiplication, both in hardware and software: multiplication can be done by simple repeated addition, whereas division requires test subtractions and comparisons. There's a good Atmel application note, AVR200, that gives some comparative timings for multiplication and division implemented entirely in software:

Application | Code Size (Words) | Execution Time (Cycles) |
---|---|---|
8 x 8 = 16 bit unsigned (Code Optimized) | 9 | 58 |
8 x 8 = 16 bit unsigned (Speed Optimized) | 34 | 34 |
8 x 8 = 16 bit signed (Code Optimized) | 10 | 73 |
16 x 16 = 32 bit unsigned (Code Optimized) | 14 | 153 |
16 x 16 = 32 bit unsigned (Speed Optimized) | 105 | 105 |
16 x 16 = 32 bit signed (Code Optimized) | 16 | 218 |
8 / 8 = 8 + 8 bit unsigned (Code Optimized) | 14 | 97 |
8 / 8 = 8 + 8 bit unsigned (Speed Optimized) | 66 | 58 |
8 / 8 = 8 + 8 bit signed (Code Optimized) | 22 | 103 |
16 / 16 = 16 + 16 bit unsigned (Code Optimized) | 19 | 243 |
16 / 16 = 16 + 16 bit unsigned (Speed Optimized) | 196 | 173 |
16 / 16 = 16 + 16 bit signed (Code Optimized) | 39 | 255 |

So even without hardware division support there's a significant difference in the speed/space trade-offs between multiplication and division. Add to that the fact that the AVR has hardware multiply but no hardware divide, and it's not really surprising that there's a big performance difference between the two. The question is, is there anything we can do about it?

Well, it turns out there is. We avoided division in the fixed-point scaling operations by using bit shifts, and we can do the same in the exponential moving average calculation by choosing a decay factor that is a power of two. I chose 2^4 (i.e. a weight of 1/16), but then needed to recheck that the intermediate results wouldn't overflow. As above, the biggest term in the calculation is `100 * 2^5 * 15`, which is 48000, so we are OK. To keep the comparisons fair I modified the floating-point version to use the same decay factor as the fixed-point version, giving:

```c
// Floating point 1:16 multiplication.
start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (uint8_t val = 0; val <= 100; val++) {
        v1 = val * 0.0625 + v1 * 0.9375;
    }
}
printf_P(PSTR("floating point 1:16 multiplication %.2f\n"), (micros() - start - ohead) / 100000.0);

// Fixed-point multiplication & shift.
start = micros();
for (uint16_t i = 1000; i != 0; i--) {
    for (uint8_t val = 0; val <= 100; val++) {
        v2 = (val << 1) + ((v2 * 15) >> 4);
        v3 = (v2 + 16) >> 5;
    }
}
printf_P(PSTR("fixed point 1:16 multiplication & shift %.2f\n"), (micros() - start - ohead) / 100000.0);
```

Note that scaling `val` by 2^5 and then dividing by 2^4 is the same as shifting it left by one bit, and the division of `v2 * 15` by 2^4 is implemented by shifting it right 4 bits. OK, what timings do we get now?

```
floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99
```

OK, that looks **much** better, the fixed-point version is now 5.7x faster than the floating-point version, which is the sort of speed-up we were looking for. But before we declare victory, is there any more juice to be squeezed out? Well, it turns out there is. Firstly, C promotes integer operands narrower than an `int` to `int` in arithmetic operations, and `int` is 16 bits on the AVR, so our 16x8-bit calculation becomes a 16x16-bit one. Also, looking at the assembler output reveals that the bit shifts are all implemented as loops - the AVR can only shift a register one bit at a time. We can probably get some benefit by implementing our own 16x8-bit multiplication and unrolling the bit-shift loops - time to pull out the AVR assembler documentation! Here's the code:

```c
// C equivalent:
// v1 = (val << 1) + ((v1 * 15) >> 4);
// v2 = (v1 + 16) >> 5;
asm(
    "; v1 *= 15\n"
    "  ldi r16, 15\n"
    "  mov r17, %B[v1]\n"
    "  mul %A[v1], r16\n"
    "  movw %[v1], r0\n"
    "  mulsu r17, r16\n"
    "  add %B[v1], r0\n"
    "; v1 >>= 4\n"
    "  lsr %B[v1]\n"
    "  ror %A[v1]\n"
    "  lsr %B[v1]\n"
    "  ror %A[v1]\n"
    "  lsr %B[v1]\n"
    "  ror %A[v1]\n"
    "  lsr %B[v1]\n"
    "  ror %A[v1]\n"
    "; val <<= 1\n"
    "  mov r0, %[val]\n"
    "  lsl r0\n"
    "; v1 += val\n"
    "  clr r1\n"
    "  add %A[v1], r0\n"
    "  adc %B[v1], r1\n"
    "; v2 = v1\n"
    "  movw %[v2], %[v1]\n"
    "; v2 += 16\n"
    "  ldi r16, 16\n"
    "  add %A[v2], r16\n"
    "  adc %B[v2], r1\n"
    "; v2 >>= 5\n"
    "  lsr %B[v2]\n"
    "  ror %A[v2]\n"
    "  lsr %B[v2]\n"
    "  ror %A[v2]\n"
    "  lsr %B[v2]\n"
    "  ror %A[v2]\n"
    "  lsr %B[v2]\n"
    "  ror %A[v2]\n"
    "  lsr %B[v2]\n"
    "  ror %A[v2]\n"
    : [v1] "+a" (v4), [v2] "=a" (v5)
    : [val] "r" (val)
    : "r16", "r17");
```

The only slightly tricksy bit is implementing a 16x8-bit multiplication using the 8x8-bit hardware multiplier on the AVR. In principle it's no different to the way you were probably taught to do long multiplication in primary school, with some tweaks to take advantage of the fact that we know the result will always fit in 16 bits rather than 24. If you want more details on how this works, I can recommend this blog post and the Atmel AVR201 application note "Using the AVR Hardware Multiplier".

OK, what are the timings for the assembler version?

```
floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99
fixed point assembler 3.09
```

So the assembler version is 1.6x faster than the fixed-point C version, and nearly 10x faster than the floating-point version. Finally, what conclusions can we draw from all this? Well, the following ones leap out to me:

- Division on the AVR is **slow**, whether it be floating-point or integer. Avoid it wherever you can.
- If you **have** to use division, it may well be faster to use floating-point and multiply by the reciprocal of the numbers you were dividing by.
- Fixed-point maths is fairly easy to do and can yield significant performance benefits, as long as you avoid those pesky divisions. If you can't, it may not offer much benefit over multiply-only floating-point.
- Hand-coded assembler will help if you need to squeeze out every last cycle, but in absolute terms the speed-up you'll get from C fixed-point multiply-and-shift will probably be sufficient.