Sensor smoothing and optimised maths on the Arduino

As part of my attempt to load as much as possible onto an Arduino, I added a LIS302DL accelerometer into the mix and used the Y axis output to drive a strip of LEDs via a couple of TLC5916 LED drivers. I was sampling the accelerometer every 10msec and it jittered a little, so I wanted to add some smoothing to the sensor values. There are all sorts of fancy smoothing algorithms available, but a simple Exponential moving average worked just fine, and was easy to calculate:

    // choose a weighting factor W between 0 .. 1, then
    // current average = (current sensor value * W) + (last average * (1 - W))
    static uint16_t avg = 0;
    uint8_t val = getAccelValue();
    avg = val * 0.1 + avg * 0.9;

The actual sensor value is a signed 8-bit integer and the normal 1G reading (i.e. when it's being simply tilted and not waved around) ranges from around -55 to +55, so I capped the value at +/-50 and added 50 to it to get a value between 0 and 100 that I could then feed into the exponential moving average calculation before mapping the value onto the LED strip, with 'horizontal' resulting in the middle LED on the strip being illuminated, and the lit LED moving from side to side as the strip is tilted.
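The clamping and offsetting described above can be sketched as a pair of small helpers. This is only an illustration under stated assumptions: the names normaliseAccel and ledIndex are made up for this example, and the 16-LED strip length is assumed from the two 8-output TLC5916 drivers.

```c
#include <stdint.h>

#define NUM_LEDS 16  /* assumption: two 8-output TLC5916 drivers */

/* Clamp a raw signed reading to +/-50 and shift it into the range 0..100. */
static uint8_t normaliseAccel(int8_t raw)
{
    if (raw > 50)  raw = 50;
    if (raw < -50) raw = -50;
    return (uint8_t)(raw + 50);
}

/* Map a smoothed value in 0..100 onto an LED index in 0..NUM_LEDS-1. */
static uint8_t ledIndex(uint8_t value)
{
    return (uint8_t)(((uint16_t)value * NUM_LEDS) / 101);
}
```

Dividing by 101 rather than 100 keeps the top value (100) inside the last LED slot instead of producing an out-of-range index.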

All well and good, the maths is simple and everything works fine. However the AVR doesn't have floating-point hardware support, so I wanted to see if I could improve the performance of the calculation above. Using integer arithmetic is the most obvious way, but that can't be done directly as there are fractional factors used in the calculation. However there's a standard way of doing this, which is to use fixed-point arithmetic. To do this, all the numbers involved are multiplied by a scaling factor to in effect move the decimal point to the right. An example would be doing financial calculations in pence rather than pounds. Addition/subtraction of scaled numbers works as normal, as does multiplication/division of a scaled number by an unscaled number. The operations that need special handling are multiplication/division of two scaled numbers. Here's why: if we have two numbers a and b that are scaled by S and we multiply them we get aS * bS. That simplifies to abS^2, so to keep the scaling correct we need to divide by the scale factor after multiplication so we end up with abS. Division is the inverse: aS / bS gives a/b with the scale factor cancelled out, so we have to multiply by S to keep things straight. Finally, as we are dealing with binary numbers, if we make the scale factor a power of two we can implement the scale factor calculations as efficient bit shifts rather than having to do multiplications and divisions.
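To make the multiplication and division rules concrete, here is a minimal sketch of fixed-point helpers with a power-of-two scale factor. The names fxFromInt, fxMul and fxDiv are hypothetical, and the wider 32-bit intermediates are a choice made for this example rather than anything the AVR code needs.

```c
#include <stdint.h>

#define FX_SHIFT 5   /* scale factor S = 2^5 = 32 */

/* Convert an unscaled integer to fixed point (multiply by S). */
static uint16_t fxFromInt(uint16_t n)
{
    return (uint16_t)(n << FX_SHIFT);
}

/* aS * bS = abS^2, so shift right once to get back to abS. */
static uint16_t fxMul(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> FX_SHIFT);
}

/* aS / bS = a/b with the scale lost, so scale the dividend by S first. */
static uint16_t fxDiv(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a << FX_SHIFT) / b);
}
```

Addition and subtraction need no helper at all: fxFromInt(3) + fxFromInt(4) is already the fixed-point form of 7.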

There is in fact limited support for fixed-point math in AVR gcc but as far as I can tell it's only available as a patch, which would mean having to recompile my AVR toolchain. However it's simple enough to implement fixed-point yourself, so that's what I did. The first and most important thing is to choose the scale factor, being careful to pick a value that doesn't result in overflow at any point in the calculation. In this case, I knew the range of sensor values was going to be between 0 and 100. If I choose a scale factor of 2^5 (i.e. 32) the calculation above in fixed-point form would therefore be:

    static uint16_t avg = 0;
    uint8_t val = getAccelValue();
    avg = ((val << 5) / 10) + (avg * 9 / 10);
    uint8_t currVal = (avg + 16) >> 5;

So the maximum intermediate value would be 100 * 2^5 * 9 or 28800 which is well within the bounds of a 16-bit unsigned integer. A couple of things to note: the scaling/unscaling is done by simply shifting by 5 bits left or right as appropriate, and the 0.1 and 0.9 multiplications of the floating-point version become / 10 and * 9 / 10 respectively. Finally, to get the unscaled average value we need to do the equivalent of adding 0.5 and rounding before unscaling the value, which is what the addition of 16 is for: 2^5 / 2 = 16. Next I wrote a small program that fed a range of values into both the floating and fixed point versions and compared the results. In all cases, the difference between the floating and fixed point versions was no more than ±1, which was perfectly adequate. Because the sensor values are being sampled at 100Hz the fixed-point exponential moving average converges to the same value as the floating-point value within only a few samples even when there is a difference.
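As a rough check on the rounding behaviour, here is a minimal host-side sketch (not the original comparison program) that runs the 1:10 fixed-point average with a constant input until it settles. The function names emaStep, emaUnscale and emaSettle are made up for this example.

```c
#include <stdint.h>

/* One step of the 1:10 fixed-point average; avg is scaled by 2^5. */
static uint16_t emaStep(uint16_t avg, uint8_t val)
{
    return (uint16_t)((((uint16_t)val << 5) / 10) + ((avg * 9u) / 10));
}

/* Round to nearest (add half the scale factor) and remove the scaling. */
static uint8_t emaUnscale(uint16_t avg)
{
    return (uint8_t)((avg + 16) >> 5);
}

/* Feed a constant input for n samples and return the smoothed output. */
static uint8_t emaSettle(uint8_t val, uint16_t n)
{
    uint16_t avg = 0;
    while (n--) {
        avg = emaStep(avg, val);
    }
    return emaUnscale(avg);
}
```

Because of the truncating divisions the scaled average settles slightly below the ideal value (3191 rather than 3200 for an input of 100), but the rounding in emaUnscale still brings the result back to 100.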

The next step was to actually measure how long the floating-point and fixed-point calculations took. To do that, I wrote a little stand-alone program that fed the values 0 .. 100 into the calculation, and then repeated that 1000 times. I then recorded the start and end time in microseconds around the loop, which gave me the time to execute 100,000 of the calculations. I also timed an empty loop containing just a NOP instruction so I could subtract the loop overhead from the calculation. The CPU is running at 16MHz and each NOP takes 1 cycle, so I can subtract the time for the NOPs from the total overhead. Without the NOPs there's a good chance the gcc optimiser will figure out that the loop does nothing and optimise it away entirely, defeating the purpose of the overhead calculation.

    // Overhead.
    uint32_t start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (int8_t val = 0; val <= 100; val++) {
            asm volatile("nop");
        }
    }
    // Subtract the NOPs themselves: 1 cycle each at 16 cycles per microsecond.
    uint32_t ohead = micros() - start - (1000L * 100 / 16);
    printf_P(PSTR("overhead %.2f\n"), ohead / 100000.0);

    // Floating point 1:10.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (int8_t val = 0; val <= 100; val++) {
            v1 = val * 0.1 + v1 * 0.9;
        }
    }
    printf_P(PSTR("floating point 1:10 %.2f\n"),
      (micros() - start - ohead) / 100000.0);

    // Fixed point 1:10.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v2 = ((val << 5) / 10) + (v2 * 9 / 10);
            v3 = (v2 + 16) >> 5;
        }
    }
    printf_P(PSTR("fixed point 1:10 %.2f\n"),
      (micros() - start - ohead) / 100000.0);

OK, let's run that and look at the results, the CPU clock rate is 16MHz and the timings are in microseconds per calculation:

overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48

Wait a minute - the fixed-point calculation is slower than the floating-point version! How on earth can that be? Hmmm. Well, the floating and fixed-point versions aren't exactly equivalent. The floating-point version uses two multiplies, the fixed-point version uses two divisions and a multiply. Let's rewrite the floating-point version and make it exactly equivalent to the fixed-point version:

    // Floating point 1:10 division.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v1 = val / 10.0 + v1 * 9.0 / 10.0;
        }
    }
    printf_P(PSTR("floating point 1:10 division %.2f\n"),
      (micros() - start - ohead) / 100000.0);

Now the results are:

overhead 0.20
floating point 1:10 27.95
fixed point 1:10 32.48
floating point 1:10 division 79.39

OK, that at least explains what the problem is - it's the division steps in the calculation. Whilst the AVR has 8-bit hardware multiply instructions, it has no hardware division. It turns out that division is harder and therefore slower to implement than multiplication, both in hardware and software. Multiplication can be done by simple repeated addition, whereas division requires test subtractions and comparisons. There's a good Atmel application note, AVR200, that gives some comparative timings for multiplication and division implemented entirely in software:

Application                                        Code Size (Words)  Execution Time (Cycles)
8 x 8 = 16 bit unsigned (Code Optimized)           9                  58
8 x 8 = 16 bit unsigned (Speed Optimized)          34                 34
8 x 8 = 16 bit signed (Code Optimized)             10                 73
16 x 16 = 32 bit unsigned (Code Optimized)         14                 153
16 x 16 = 32 bit unsigned (Speed Optimized)        105                105
16 x 16 = 32 bit signed (Code Optimized)           16                 218
8 / 8 = 8 + 8 bit unsigned (Code Optimized)        14                 97
8 / 8 = 8 + 8 bit unsigned (Speed Optimized)       66                 58
8 / 8 = 8 + 8 bit signed (Code Optimized)          22                 103
16 / 16 = 16 + 16 bit unsigned (Code Optimized)    19                 243
16 / 16 = 16 + 16 bit unsigned (Speed Optimized)   196                173
16 / 16 = 16 + 16 bit signed (Code Optimized)      39                 255
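To see where the extra work in the division rows comes from, here is a minimal C sketch of the two classic software algorithms: shift-and-add multiplication and restoring division. These are textbook versions written for illustration, not the AVR200 routines, and the function names are made up.

```c
#include <stdint.h>

/* 8 x 8 -> 16 bit unsigned multiply: add a shifted copy of a for every
 * set bit of b. */
static uint16_t mulShiftAdd(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    uint16_t addend = a;
    while (b != 0) {
        if (b & 1) {
            result = (uint16_t)(result + addend);
        }
        addend = (uint16_t)(addend << 1);
        b >>= 1;
    }
    return result;
}

/* 8 / 8 -> 8 bit quotient + 8 bit remainder, restoring division: every
 * step shifts in one dividend bit, then compares against the divisor and
 * subtracts only if it fits. Assumes divisor != 0. */
static uint8_t divRestoring(uint8_t dividend, uint8_t divisor, uint8_t *remainder)
{
    uint8_t quotient = 0;
    uint16_t rem = 0;
    for (uint8_t bit = 0; bit < 8; bit++) {
        rem = (uint16_t)((rem << 1) | ((dividend >> 7) & 1));
        dividend = (uint8_t)(dividend << 1);
        quotient = (uint8_t)(quotient << 1);
        if (rem >= divisor) {
            rem = (uint16_t)(rem - divisor);
            quotient |= 1;
        }
    }
    *remainder = (uint8_t)rem;
    return quotient;
}
```

The multiply can stop as soon as the remaining bits of b are zero, whereas the divide always needs all eight shift, compare and conditional-subtract steps.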

So even without hardware division support there's a significant difference in the speed/space trade-offs between multiplication and division. Add in to that the fact that the AVR has hardware multiply but no hardware divide and it's not really surprising that there's a big performance difference between multiplication and division. The question is, is there anything we can do about it?

Well, it turns out there is. We avoided division in the fixed-point scaling operations by using bit shifts, and we can do so with the exponential moving average calculation as well by choosing a decay factor that is a power of two. I chose 2^4 (i.e. 16) as the value, but I then needed to recheck that the intermediate results wouldn't overflow. As above, the biggest term in the calculation is 100 * 2^5 * 15 which is 48000, so we are OK. To keep the comparisons fair I modified the floating-point version to use the same decay factor as the fixed-point version, giving:

    // Floating point 1:16 multiplication.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v1 = val * 0.0625 + v1 * 0.9375;
        }
    }
    printf_P(PSTR("floating point 1:16 multiplication %.2f\n"),
      (micros() - start - ohead) / 100000.0);

    // Fixed-point multiplication & shift.
    start = micros();
    for (uint16_t i = 1000; i != 0; i--) {
        for (uint8_t val = 0; val <= 100; val++) {
            v2 = (val << 1) + ((v2 * 15) >> 4);
            v3 = (v2 + 16) >> 5;
        }
    }
    printf_P(PSTR("fixed point 1:16 multiplication & shift %.2f\n"),
      (micros() - start - ohead) / 100000.0);

Note that scaling val by 2^5 and then dividing by 2^4 is the same as shifting left by one bit, and the division of v2 * 15 by 16 is implemented by shifting it right 4 bits. OK, what timings do we get now?
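Before looking at the numbers, the equivalence between the written-out 1:16 calculation and the shift-only form can be checked on a desktop machine. This is a minimal sketch with made-up function names, covering the value ranges used in this post (val in 0..100, scaled average in 0..3200).

```c
#include <stdint.h>

/* 1:16 average written out in full: scale val by 2^5, weight it by 1/16,
 * and weight the previous average by 15/16. */
static uint16_t emaStepDivide(uint16_t avg, uint8_t val)
{
    return (uint16_t)((((uint16_t)val << 5) / 16) + ((avg * 15u) / 16));
}

/* The same step using only a multiply and shifts. */
static uint16_t emaStepShift(uint16_t avg, uint8_t val)
{
    return (uint16_t)(((uint16_t)val << 1) + ((avg * 15u) >> 4));
}

/* Returns 1 if both forms agree for every val in 0..100 and avg in 0..3200. */
static int emaFormsAgree(void)
{
    for (uint16_t avg = 0; avg <= 3200; avg++) {
        for (uint8_t val = 0; val <= 100; val++) {
            if (emaStepDivide(avg, val) != emaStepShift(avg, val)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The two forms agree exactly because val * 32 is always divisible by 16, and an unsigned right shift by 4 is the same operation as an unsigned division by 16.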

floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99

OK, that looks much better, the fixed-point version is now 5.7x faster than the floating-point version, which is the sort of speed up we were looking for. But before we declare victory, is there any more juice to be squeezed out? Well, it turns out there is. Firstly, C always promotes integer operands in arithmetic operations to ints, which are 16 bits on the AVR, so our 16x8 bit calculation becomes a 16x16 bit one. Also, looking at the assembler output reveals that the bit shifts are all implemented as loops - the AVR can only shift a register one bit at a time. We can probably get some benefit by implementing our own 16x8 bit multiplication and unrolling the bit shift loops - time to pull out the AVR assembler documentation! Here's the code:

    // C equivalent
    // v1 = (val << 1) + ((v1 * 15) >> 4);
    // v2 = (v1 + 16) >> 5;
    asm volatile (
    "; v1 *= 15\n"
    "       ldi   r16, 15\n"
    "       mov   r17, %B[v1]\n"
    "       mul   %A[v1], r16\n"
    "       movw  %[v1], r0\n"
    "       mulsu r17, r16\n"
    "       add   %B[v1], r0\n"
    "; v1 >>= 4\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "       lsr   %B[v1]\n"
    "       ror   %A[v1]\n"
    "; val <<= 1\n"
    "       mov   r0, %[val]\n"
    "       lsl   r0\n"
    "; v1 += val\n"
    "       clr   r1\n"
    "       add   %A[v1], r0\n"
    "       adc   %B[v1], r1\n"
    "; v2 = v1\n"
    "       movw  %[v2], %[v1]\n"
    "; v2 += 16\n"
    "       ldi   r16, 16\n"
    "       add   %A[v2], r16\n"
    "       adc   %B[v2], r1\n"
    "; v2  >>= 5\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    "       lsr   %B[v2]\n"
    "       ror   %A[v2]\n"
    : [v1] "+a" (v4), [v2] "=a" (v5)
    : [val] "r" (val)
    : "r16", "r17"
    );

The only slightly tricksy bit is implementing a 16x8 bit multiplication using the 8x8 bit hardware multiplier on the AVR. In principle it's no different to the way you were probably taught to do long multiplication in primary school, with some tweaks to take advantage of the fact that we know the result of the multiplication will always fit in 16 bits rather than 24. If you want more details on how this works I can recommend this blog post and the Atmel AVR201 application note "Using the AVR hardware Multiplier".
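The long-multiplication idea can be written out in plain C as a rough model of what the assembler does. This is a sketch only: mul8x8 is a hypothetical stand-in for the hardware MUL instruction, and mul16x8 is a made-up name.

```c
#include <stdint.h>

/* Stand-in for the AVR MUL instruction: 8 x 8 -> 16 bit unsigned product. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t)(a * b);
}

/* 16 x 8 multiply keeping only the low 16 bits of the result, built from
 * two 8 x 8 partial products. */
static uint16_t mul16x8(uint16_t v, uint8_t k)
{
    uint8_t lo = (uint8_t)(v & 0xFF);
    uint8_t hi = (uint8_t)(v >> 8);

    /* Low byte times k: the full 16-bit product is kept. */
    uint16_t result = mul8x8(lo, k);

    /* High byte times k: only its low byte lands inside a 16-bit result,
     * and it is added into the high byte of the running total. */
    uint8_t hiPart = (uint8_t)(mul8x8(hi, k) & 0xFF);
    result = (uint16_t)(result + ((uint16_t)hiPart << 8));

    return result;
}
```

The high byte of the second partial product would belong to bits 16..23 of a 24-bit result, which is why it can simply be dropped when the answer is known to fit in 16 bits.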

OK, what are the timings for the assembler version?

floating point 1:16 multiplication 28.43
fixed point 1:16 multiplication & shift 4.99
fixed point assembler 3.09

So the assembler version is 1.6x faster than the fixed-point C version, and nearly 10x faster than the floating-point version. Finally, what conclusions can we draw from all this? Well, the following ones leap out to me:

  • Division on the AVR is slow, whether it be floating-point or integer. Avoid it wherever you can.
  • If you have to use division, it may well be faster to use floating-point and multiply by the reciprocal of the numbers you were dividing by.
  • Fixed-point math is fairly easy to do and can yield significant performance benefits, as long as you avoid those pesky divisions. If you can't it may not offer much benefit over multiply-only floating-point.
  • Hand-coded assembler will help if you need to squeeze out every last cycle, but in absolute terms the speed-up you'll get by using C fixed-point multiply-and-shift-only will probably be sufficient.
Categories : Tech, AVR

AVR performance monitoring using the OpenBench Logic Sniffer

Now I've ditched the Arduino platform I wondered just how much 'juice' I could squeeze out of a 16MHz ATmega328P, as used in the Arduino Duemilanove. According to Atmel it should be capable of around 16 MIPS, which puts it ahead of a 1985-vintage Intel 80386DX which could do about 11 MIPS at 33MHz, albeit the 80386 is a 32-bit processor.

Accordingly, I set about sticking as many things from my parts box as I could onto my Duemilanove board. I ended up with:

  • A piezo sensor, signal preconditioned with this circuit and being sampled with the ADC every msec. The sampled signal was further processed to do peak detection, trigger level quantization, hysteresis and double/triple beat detection.
  • Bit-banging a 16-LED chain driven by two daisy-chained TLC5916 LED drivers, animating the LEDs based on the piezo trigger events.
  • Reading (X,Y,Z) information from a LIS302DL accelerometer every 10 msec, over SPI running at 4MHz. The Y axis data was a bit noisy so I implemented an exponential moving average to smooth the values. I did this using fixed-point arithmetic rather than floating point - on the AVR, floating point arithmetic has to be done in software so it's expensive. I'll describe how I did this in a later post.
  • Bit-banging another 16-LED chain as above to display the output of the accelerometer as a 'moving dot' that goes back and forth as the accelerometer is tilted.
  • Doing continuous fades of an RGB LED, 8 bits per colour. The fades were implemented entirely in software at an effective PWM clock rate of 31kHz, although I used a more efficient implementation than 'classic' PWM - I'll cover how I did that in a subsequent post.
  • A serial monitor to allow various parameters to be adjusted, e.g. piezo trigger levels, and allowing debugging output to be enabled and disabled. The baud rate was set to 115 Kbaud.

All this was implemented with a mix of interrupts and a variant of my task manager. Even with the maximum level of debugging enabled, everything continued working smoothly. However I had no real idea of how much headroom the CPU still had left, so I thought about how I might instrument the application to find out.

My first thought was to sample the value of timer0 which I'm using as the system clock. The issue there is that the overhead of housekeeping the performance counters would dwarf some of the shorter routines, particularly the interrupt service routines. Hmm...

I still had three spare pins and toggling a pin is about the simplest thing you can do, typically taking only 1 clock cycle. All I needed was a way of capturing the pin state changes, which was simple to do with my OpenBench Logic Sniffer, a low-cost open source logic analyser with a Java GUI that means it runs on Solaris, my development platform. I therefore used the three pins as follows:

  1. Toggled on just before the task manager calls a Task's canRun() method and off after the method completes. The canRun() method is used to poll the Tasks to find out the first one which is currently runnable.
  2. Toggled on just before the task manager calls a Task's run() method and off after the method completes. The run() method implements whatever functionality the task requires.
  3. Toggled on at the start of every interrupt service routine and off at the end.

Instrumenting the canRun() and run() methods was easy, I just needed to add a few lines to the task manager to toggle the pins at the appropriate point. Interrupts proved to be somewhat of a challenge however. The problem is that gcc automatically adds prologue and epilogue code to all ISRs to save and restore the registers used in the ISR. For short ISRs, this prologue and epilogue code makes up the bulk of the ISR, so it's important to capture how long it takes, yet if we simply add the pin toggle code to the start and end of the ISR body we'll miss most of the time spent in the ISR, as that's spent executing the compiler-added prologue and epilogue code. Plus having to hand-edit all the ISRs to add the pin toggle instrumentation would be a pain, to say the least.

I therefore resorted to a bit of a hack. I redefined the ISR macro to use my own prologue and epilogue code, where the first and last things that were done were to toggle the appropriate IO pin. Because I didn't know which registers were going to be used by the ISR I had to save and restore all the ones that might possibly be used, unlike the compiler which knows which are actually used and can optimise the save/restores accordingly. I also needed to 'wrap' the body of the ISR in my instrumentation code, which meant calling the body of the original ISR as a subroutine rather than having it as inline code. The consequence of this was that the instrumented ISR was going to take longer than the un-instrumented one, but I decided I'd rather have overestimates of the time they took than underestimates. The macros I ended up with are below:

// Redefine the ISR macro to toggle the TRACE_ISR pin before/after each ISR
// body.  Also saves *all* registers and uses a RCALL to call the real ISR and
// will therefore be slower than the uninstrumented ISR as a result.
#undef ISR
#define ISR_VECTOR_BODY(V) \
  asm("push r1\n" \
      "push r0\n" \
      "in r0,__SREG__\n" \
      "push r0\n" \
      "clr __zero_reg__\n" \
      "push r18\n" \
      "push r19\n" \
      "push r20\n" \
      "push r21\n" \
      "push r22\n" \
      "push r23\n" \
      "push r24\n" \
      "push r25\n" \
      "push r26\n" \
      "push r27\n" \
      "push r30\n" \
      "push r31\n" \
  ); \
  V(); \
  asm("pop r31\n" \
      "pop r30\n" \
      "pop r27\n" \
      "pop r26\n" \
      "pop r25\n" \
      "pop r24\n" \
      "pop r23\n" \
      "pop r22\n" \
      "pop r21\n" \
      "pop r20\n" \
      "pop r19\n" \
      "pop r18\n" \
      "pop r0\n" \
      "out __SREG__,r0\n" \
      "pop r0\n" \
      "pop r1\n" \
  );
#ifdef __cplusplus
#define ISR(V, ...) \
  extern "C" void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern "C" void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)
#else
#define ISR(V, ...) \
  extern void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)
#endif

Phew. There's a bit of C macro magic in there to generate an 'inner' routine that contains the body of the original ISR and call it. With that in place, I hooked up the OLS to the three monitoring pins and ran the application, tilting the accelerometer and tapping the piezo as I did so. Here's a small segment of an OLS trace of the application. The first channel is the canRun() trace - when the trace is high a canRun() method is running. The second channel is the run() trace - high when a run() method is running. Finally, the third trace is high when an ISR is running.

OLS trace 1

It's clear that most of the time is spent checking to see if there's anything that can run, and that is time that would otherwise be taken up with doing 'real' work, so in effect the top trace represents idle time, split between the task manager internals and the various canRun() methods. I wrote a little program to analyse the time the application spent in each mode, and the results were really rather surprising:

  • scheduler: 38.83%
  • canRun: 56.02%
  • run: 3.16%
  • interrupts: 1.98%
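That kind of analysis boils down to counting samples per mode. Here is a minimal sketch of the idea, under loud assumptions: the trace is taken to be an array of evenly spaced samples with one bit per monitoring pin, and the bit assignments, struct and function names are all illustrative rather than taken from the original program or from the OLS export format.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout of each sample: one bit per monitoring pin. */
enum { CH_CANRUN = 1 << 0, CH_RUN = 1 << 1, CH_ISR = 1 << 2 };

struct Usage {
    uint32_t scheduler;
    uint32_t canRun;
    uint32_t run;
    uint32_t isr;
};

/* Count how many samples fall into each mode. An ISR can interrupt
 * anything, so it takes priority; if no pin is high the time is charged
 * to the scheduler itself. */
static struct Usage analyseTrace(const uint8_t *samples, size_t count)
{
    struct Usage u = { 0, 0, 0, 0 };
    for (size_t i = 0; i < count; i++) {
        uint8_t s = samples[i];
        if (s & CH_ISR) {
            u.isr++;
        } else if (s & CH_RUN) {
            u.run++;
        } else if (s & CH_CANRUN) {
            u.canRun++;
        } else {
            u.scheduler++;
        }
    }
    return u;
}

/* Percentage of the trace spent in one mode. */
static double percent(uint32_t part, size_t total)
{
    return total ? 100.0 * part / total : 0.0;
}
```

With evenly spaced samples the sample counts are proportional to time, so percent(u.run, count) gives the share of CPU time spent in run() methods directly.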

So with all that processing going on, we are only using just over 5% of the available CPU. How is that possible? Well, it's fairly easy to explain. Most of the heavyweight processing is being done by dedicated hardware on the MCU - for example, the piezo is being sampled by the ADC hardware, and the ADC values are read in by an ISR rather than just triggering a conversion and spin-waiting for it to complete, as done by the standard Arduino analogRead function. As the Arduino documentation says "It takes about 100 microseconds (0.0001 s) to read an analog input". That's an awful lot of CPU cycles wasted in a spin loop, 1600 to be exact. Wherever possible my application uses ISRs to handle events, buffering up data in both directions and scheduling a Task to handle the parts that can't be done directly in an ISR.

To push up the load as much as possible I enabled all the debugging output I could. That results in about 20-50 characters of output every millisecond as the samples being read from the piezo are dumped out. Here are the results:

  • scheduler: 15.03%
  • canRun: 21.49%
  • run: 53.14%
  • interrupts: 10.21%

OK, now we are cooking, that's now using 63% of CPU - however that still means there's up to 37% unused CPU available! Looking at the corresponding OLS trace is informative:

OLS trace 1

Notice the big chunks where the run() channel (second one down) is high, what's going on there? Well, this trace was taken with the application's debugging turned up as high as possible so it was producing large amounts of serial output. My interrupt-driven serial IO code has an output buffer that's a limited size - we only have 2KB of available RAM after all. When the buffer is full, serial writes spin-wait until there is space in the buffer for the pending output. The only other option would be to throw away the output, which doesn't seem particularly useful. So what we are seeing isn't really 'real work', it's merely the time that's spent waiting for space to become available in the output buffer, which will happen as the serial ISR drains the buffer by copying the data to the USART. If that doesn't persuade you that it's a bad idea to use spin-waiting or polling in a tight loop as a way of implementing 'normal' program flow I don't know what will!

OK, what conclusions can we draw from this data? Well, I'd suggest at least these:

  • Wherever possible, use the hardware facilities of the AVR to offload work and help keep your application responsive.
  • Wherever possible, use interrupts to respond to external events. Also use them to handle output where there is a significant delay between each step of the output.
  • As far as possible, move anything other than trivial processing out of your ISRs into tasks which are prioritised relative to the other processing in your application.
  • Any kind of polling or spin-waiting is to be avoided as far as possible, the CPU cycles that such approaches consume can better be used for 'real' work.
  • Avoid using the standard Arduino libraries, because they are focused on simplicity and ease of use rather than performance.
  • Watch out for 'hidden' costs creeping into your code, such as the use of floating-point arithmetic when you don't absolutely need to use it - examining the symbol table of your application is a good way of figuring out if you've inadvertently triggered its use.
  • A better algorithm will always outperform micro-optimisation approaches such as hand-coded assembler - I'll discuss this further in a later post where I describe how I implemented the RGB LED handling.

So in summary that humble AVR chip is actually a far more capable device than you probably think it is. The Arduino environment is certainly a quick and easy way of getting something working, but you have to realise that convenience often comes at a very significant cost. Before deciding to replace your AVR with something that's more powerful (but usually more expensive and less ubiquitous) you should try to find out why your application is struggling. The Arduino ecosystem has brought the widespread availability of some very interesting and cheap hardware, not just the boards themselves but also all the surrounding add-ons that are available, and that's something you should not lightly forego. Just moving a poorly-performing application onto a faster platform won't necessarily make it any faster if it makes widespread use of non-performant approaches such as polling loops and spin-waiting. What will most likely happen is that you'll just end up burning the extra cycles the faster platform gives you waiting in various loops. At the very least you need to understand why your application is slow at present before assuming that a faster platform will help. I hope I've given you one approach for doing minimally-intrusive instrumentation and even if you can't use the exact approach I've described here, careful instrumentation of your application will almost certainly show that your preconceptions about where all the cycles are being burned are wrong.

"Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is."
-- Rob Pike
Categories : Tech, AVR

Mio 687 satnav - a review

I bought a Mio 687 satnav for my wife at Christmas and I must say it's been a great disappointment. It fares very badly in comparison with the free satnav that's on my now ancient Nokia E71. Here's a list of the reasons I think you should avoid the Mio 687, and as many of the issues are software related, you should probably give the entire Mio range a miss.

  • The first unit I received had a faulty screen. The vendor (Lemon Digital) gave me a load of grief about charging me £30 if it wasn't faulty - which is actually illegal under UK distance selling law - and to add insult to injury they didn't refund me the £8 it cost me to return the broken one.
  • The Mio website was down for most of the two weeks after Christmas due to it not being able to cope with the post-Christmas load, so I couldn't update any of the software on the satnav - you only get a limited period to download the latest maps.
  • When I did update the software it promptly blew away my initial 3-month speed camera subscription, and despite a long email exchange with Mio tech support, they didn't manage to actually get it fixed before the three months was up.
  • When you upgrade the software it deletes all your saved locations, no warning.
  • If the download of an update fails it leaves the partially downloaded file lying around and all subsequent download attempts then fail. I had to figure out the fix myself, Mio support were no help - in fact, they are pretty hopeless.
  • The satnav quite often gets in a state if you update it from the PC. When you reboot the satnav it either tells you it was disconnected during sync or that there is a problem with the maps, even when neither of those things is true. The only way I found of fixing this was to do a Windows format of the satnav USB disk device and reinstall everything.
  • The 'smart restore' function in the PC software is neither. It gets to about 25% and then hangs forever. And once that happens, your satnav won't work. The only fix is to reformat the satnav entirely and reinstall from scratch.
  • Mio support sent me a link to an unpublished version of the software as a 'fix' for my problems. It installed OK, but when I used the satnav after installing it, every route turn was preceded by a voice instruction either to exit, even if you were staying on the same road, or to do a u-turn if you were turning off.
  • The postcode search only uses the first 4 characters of the postcode you enter, you have to know the street name and number as well. Even the free Nokia satnav is better than that.
  • You can supposedly tether a mobile to the satnav so it can use it to do Google lookups. However the satnav only manages to pair with my Nokia about 1 time in 10, effectively making the feature useless.

I'm sure there are other things I've forgotten as well - I've had so many issues it's all become a bit of a blur, and I've lost count of how many times I've had to reformat and reinstall the bloody thing. I bought the Mio because it seemed like a good deal for the money, unfortunately not. I'd like to say I got what I paid for, but I don't consider £140 to be particularly cheap, although I would describe the 687 as rather nasty. If you need to buy a satnav, my advice would be: don't buy anything from Mio, and don't use Lemon Digital.

Categories : Personal, Tech

Grouse chicks

Grouse Chicks

I was out on patrol a couple of weeks ago on a rather blustery Saturday, and a hen grouse took off from right under my feet. Normally when they go up they call loudly, but this time the bird did the 'I have a broken wing' thing, so I guessed she might be on a nest. Sure enough, less than a meter away was a depression in the ground with around 8-10 chicks in it, all sitting perfectly still and quiet, and incredibly well camouflaged. I fumbled for my camera, but by the time I'd got it ready one of the chicks went 'chirup' and they all scattered from the nest. As it wasn't a particularly warm day I moved away quickly to let mum come back and gather them up, cursing my fumbling as I'd not got a picture.

About 15 minutes later, the same thing happened again - mum went up and there was another group of chicks, only about 4-5 this time, and I still had my camera out, took a quick shot and moved on as they'd already started to scatter. I think they are kinda cute :-)

Why I'm ditching the Arduino software platform

I'm getting set up for my next project and decided to update my development environment. I've finally decided to entirely ditch the Arduino software environment and just use the boards. I stopped using the Arduino IDE some time ago, but now I'm going whole hog and ditching the Arduino library as well. Why? Well, it's simple:

Significant parts of it are a pile of junk.

I know that's a pretty strong statement, so I'd better back it up with evidence. OK, let's start with the hardware serial IO code. Before version 1.0 of the Arduino platform, although reading from the serial ports was interrupt-driven, writing wasn't. Rather, the code went into a spin loop, polling the transmit status bit until the USART was idle before sending the next character. Why was that a problem? Well if you wrote an 80-character string at 9600 baud it would take (8 bits + 1 start bit + 1 stop bit) * 80 / 9600 = 0.083 seconds, i.e. 83 milliseconds. That's a huge amount of time for the CPU to be spending just to do some output. I found a number of posts where people were complaining that doing reasonable amounts of IO screwed up all the other bits of their sketches, and no wonder. Admittedly the Arduino 1.0 release notes say that's been changed so that output now uses interrupts as well, but that's not the end of the problems.
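The transmit-time arithmetic above generalises easily. Here's a minimal sketch, with a made-up function name, that assumes the same 8 data bits, 1 start bit and 1 stop bit per character:

```c
#include <stdint.h>

/* Whole milliseconds needed to clock out `chars` characters at `baud`,
 * with 8 data bits, 1 start bit and 1 stop bit per character. */
static uint32_t txTimeMs(uint32_t chars, uint32_t baud)
{
    const uint32_t bitsPerChar = 8 + 1 + 1;
    return (chars * bitsPerChar * 1000u) / baud;
}
```

txTimeMs(80, 9600) gives 83, matching the figure above; at 16MHz that is well over a million CPU cycles spent polling a status bit.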

Let's take a peek at the HardwareSerial class in HardwareSerial.cpp. First thing to note is that two 64-byte buffers are allocated for each USART, even if it isn't used. That's 128 bytes on a Duemilanove and 512 bytes on a Mega, or about 6% of the available SRAM in each case (2K and 8K respectively). On the Duemilanove that's reasonable as there's only 1 USART, but on the Mega it represents a significant waste of precious memory, as three quarters of those 512 bytes sit idle when only 1 USART is normally going to be in use.

OK, let's look at the new write() function that does interrupt-driven output:

size_t HardwareSerial::write(uint8_t c)
{
  int i = (_tx_buffer->head + 1) % SERIAL_BUFFER_SIZE;

  // If the output buffer is full, there's nothing for it other than to
  // wait for the interrupt handler to empty it a bit
  // ???: return 0 here instead?
  while (i == _tx_buffer->tail)
    ;

  _tx_buffer->buffer[_tx_buffer->head] = c;
  _tx_buffer->head = i;

  sbi(*_ucsrb, _udrie);
  return 1;
}

Is there a problem? Let's look at the definition of _tx_buffer:

struct ring_buffer
{
  unsigned char buffer[SERIAL_BUFFER_SIZE];
  volatile unsigned int head;
  volatile unsigned int tail;
};

Oh dear. head and tail are declared as unsigned int, i.e. 16 bits, 2 bytes. They are accessed by both the write routine and the interrupt service routine that actually transmits the data, yet there's no locking in the write routine, so the accesses aren't atomic. Why is that an issue? On an 8-bit AVR a 16-bit value is read or written one byte at a time, so an interrupt can land between the two halves. The avr-libc documentation makes it clear:

A typical example that requires atomic access is a 16 (or more) bit variable that is shared between the main execution path and an ISR. While declaring such a variable as volatile ensures that the compiler will not optimize accesses to it away, it does not guarantee atomic access to it.

The documentation goes on to explain the sorts of symptoms you'll see if you ignore this; follow the link above if you want the full details. This is inexcusably shoddy code - the constraints on accessing variables that are shared between ISR and non-ISR code are well-known. What really concerns me is that people will use the Arduino code as an example of 'good' AVR code and it isn't; in many places it's frankly awful.
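One way round this, sketched below, is to make each index a single byte, since a one-byte load or store on the AVR can't be interrupted half way through. This is only an illustration under that assumption, not the Arduino library's code: TxRing, tx_put() and tx_get() are made-up names, and the buffer size is assumed to be a power of two so the wrap can be done with a mask. The other standard approach is to keep the 16-bit indices and wrap every access in avr-libc's ATOMIC_BLOCK from util/atomic.h.

```cpp
#include <stdint.h>

// Hypothetical names throughout; this is not the Arduino library's code.
const uint8_t TX_BUFFER_SIZE = 64;  // assumed to be a power of two

struct TxRing {
    uint8_t buffer[TX_BUFFER_SIZE];
    volatile uint8_t head;  // written only by the producer (main code)
    volatile uint8_t tail;  // written only by the consumer (the ISR)
};

// Producer side: returns false instead of spinning when the buffer is full.
bool tx_put(TxRing &rb, uint8_t c) {
    uint8_t next = (uint8_t)((rb.head + 1) & (TX_BUFFER_SIZE - 1));
    if (next == rb.tail) {   // single-byte read of tail: cannot be torn
        return false;
    }
    rb.buffer[rb.head] = c;
    rb.head = next;          // single-byte write publishes the new element
    return true;
}

// Consumer side: what a transmit ISR would call to fetch the next byte.
// Returns false when there is nothing left to send.
bool tx_get(TxRing &rb, uint8_t &c) {
    if (rb.head == rb.tail) {
        return false;
    }
    c = rb.buffer[rb.tail];
    rb.tail = (uint8_t)((rb.tail + 1) & (TX_BUFFER_SIZE - 1));
    return true;
}
```

Because one slot is always left empty to tell 'full' from 'empty', a 64-byte buffer holds at most 63 characters. Returning false on a full buffer also leaves it to the caller to decide whether to wait or drop the character, which is the question the '???' comment in the original code was asking.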

"So what?" you say, "That's only one chunk of code that's a bit naff." Unfortunately it's not an isolated instance. Let's move on now to look at one of the newer features that has been added to the Arduino platform, the re-implemented String class. OK, let's build a minimal program that uses it:

#include "WString.h"
int main(void) {
    String bloat = "hello world";
    return 0;
}

And let's build it:

WString.cpp: In member function ‘int String::lastIndexOf(char, unsigned int) const’:
WString.cpp:503:38: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
WString.cpp: In member function ‘int String::lastIndexOf(const String&, unsigned int) const’:
WString.cpp:519:63: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]

Sigh. One would think that the Arduino developers would at least turn on warnings when they are compiling their code, but they don't. And in this case, the consequence is a bug. So, temporarily comment out the offending lines so we get a successful build, and:

/opt/avr-gcc/bin/avr-size build/test.elf
   text    data     bss     dec     hex filename
  10194      20       5   10219    27eb build/test.elf

Can that really be right? 10K for a one-line program? Unfortunately it is. Any mention of String pulls in the entirety of the class, as well as all the other avr-libc routines it references. So on a Duemilanove that only has 32K of flash to start with, a third of the available program memory is gone before you start. At the time the class was being rewritten I expressed my opinion that it was probably a bad idea and that the Arduino developers really needed to target the platform they actually had and not the one they wished they had. And that's not the end of the issues with the String class. On a constrained-memory platform such as the AVR, providing a class like String that relies on malloc, creates lots of temporaries, fragments the (tiny) heap and has no real ability to deal with out-of-memory conditions is a recipe for problems, problems that will manifest themselves as random, mysterious and un-diagnosable run-time errors. And sure enough, a quick google shows that's exactly what tends to happen - just about the worst possible outcome for a platform that's targeted at neophytes.
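For contrast, here's a rough sketch of the sort of thing that works without a heap at all: format into a fixed-size buffer the caller owns, and report truncation rather than silently running out of memory. format_reading() is a made-up helper for illustration; snprintf itself is standard and is provided by avr-libc.

```cpp
#include <stdio.h>

// Hypothetical helper: format "label=value" into a caller-supplied buffer.
// Returns true if the whole message fitted, false if it had to be truncated.
// No malloc is involved, so there is no heap to fragment or exhaust.
bool format_reading(char *out, size_t out_size, const char *label, int value) {
    int n = snprintf(out, out_size, "%s=%d", label, value);
    // snprintf returns the length the full message needs (excluding the
    // terminator), so anything >= out_size means the output was cut short.
    return n >= 0 && (size_t)n < out_size;
}
```

The failure mode here is visible and local: the caller gets false back and a truncated but still terminated string, and can decide what to do about it. The memory cost is also known up front, as it's just the size of the buffer.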

That's just two examples - there are others as well, such as the well-known performance problems with pin access, which may be up to 50x slower than direct pin access. In fact the only two remaining parts of the Arduino libraries that I still use are the millisecond clock and the serial IO, and they are easy enough to replace, so that's what I'm doing.
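For anyone who hasn't seen it, direct pin access looks something like the sketch below. It assumes an ATmega328-based board such as the Duemilanove, where Arduino pin 13 is bit 5 of port B. led_init(), led_on() and led_off() are made-up names, and the #else branch is there only so the sketch also compiles on a desktop machine.

```cpp
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>   // avr-libc: defines DDRB, PORTB and PB5 for the target
#else
// Host stand-ins (an assumption for off-target builds, not real registers).
static volatile uint8_t DDRB = 0;
static volatile uint8_t PORTB = 0;
#define PB5 5
#endif

// Configure the pin as an output by setting its data-direction bit.
static inline void led_init() { DDRB |= (uint8_t)(1 << PB5); }

// Drive the pin high or low by setting or clearing its bit in the port
// register, leaving the other seven pins on the port untouched.
static inline void led_on()  { PORTB |= (uint8_t)(1 << PB5); }
static inline void led_off() { PORTB &= (uint8_t)~(1 << PB5); }
```

With a constant bit number the compiler can typically reduce each of these to a single instruction, whereas the library call has to look up the port and bit for the pin number at run time. The price is that the pin is now hard-wired to one particular chip's layout.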

While I applaud the aims of the Arduino project, the realities of the restricted hardware platform have to be taken into consideration. In addition, one of the aims of the project is to:

provide a well-designed, maintainable, and stable platform for the future
and despite its unquestionable success on many other fronts, on that one I feel the Arduino platform is less than entirely successful. I for one won't be using any of the software any more; it's just not what I consider to be acceptable quality.
