AVR performance monitoring using the OpenBench Logic Sniffer

Now that I've ditched the Arduino platform, I wondered just how much 'juice' I could squeeze out of a 16 MHz ATmega328P, as used in the Arduino Duemilanove. According to Atmel it should be capable of around 16 MIPS, which puts it ahead of a 1985-vintage Intel 80386DX, which could manage about 11 MIPS at 33 MHz, although admittedly the 80386 is a 32-bit processor.

Accordingly, I set about sticking as many things from my parts box as I could onto my Duemilanove board. I ended up with:

  • A piezo sensor, with its signal preconditioned by this circuit and sampled with the ADC every millisecond. The sampled signal was further processed for peak detection, trigger-level quantisation, hysteresis and double/triple beat detection.
  • Bit-banging a 16-LED chain driven by two daisy-chained TLC5916 LED drivers, animating the LEDs based on the piezo trigger events.
  • Reading (X,Y,Z) information from a LIS302DL accelerometer every 10 msec, over SPI running at 4 MHz. The Y axis data was a bit noisy, so I implemented an exponential moving average to smooth the values. I did this using fixed-point rather than floating-point arithmetic - on the AVR, floating point has to be done in software, so it's expensive. I'll describe the details in a later post, but a rough sketch appears after this list.
  • Bit-banging another 16-LED chain as above to display the output of the accelerometer as a 'moving dot' that goes back and forth as the accelerometer is tilted.
  • Doing continuous fades of an RGB LED, 8 bits per colour. The fades were implemented entirely in software at an effective PWM clock rate of 31 kHz, although I used a more efficient implementation than 'classic' PWM - I'll cover how I did that in a subsequent post.
  • A serial monitor to allow various parameters to be adjusted, e.g. the piezo trigger levels, and to allow debugging output to be enabled and disabled. The baud rate was set to 115 kbaud.
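
As a taste of the fixed-point smoothing mentioned above (the full write-up will come in a later post), a minimal sketch of the idea is below. The value scaling, the names and the smoothing factor are my assumptions for illustration, not necessarily what the application actually uses.

#include <stdint.h>

// Exponential moving average with 8 fractional bits of fixed point - a sketch
// only; the actual scaling and smoothing factor used in the application may differ.
#define EMA_ALPHA 26                    // ~0.1 expressed as a fraction of 256

static int32_t emaState;                // smoothed value, scaled by 256

int16_t emaUpdate(int16_t sample)
{
    // state += alpha * (sample - state), everything scaled by 256
    emaState += (EMA_ALPHA * (((int32_t)sample << 8) - emaState)) >> 8;
    return (int16_t)(emaState >> 8);    // scale back to the accelerometer's units
}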

All this was implemented with a mix of interrupts and a variant of my task manager. Even with the maximum level of debugging enabled, everything continued working smoothly. However, I had no real idea how much headroom the CPU had left, so I thought about how I might instrument the application to find out.

My first thought was to sample the value of timer0, which I'm using as the system clock. The issue there is that the overhead of housekeeping the performance counters would dwarf some of the shorter routines, particularly the interrupt service routines. Hmm...

I still had three spare pins, and toggling a pin is about the simplest thing you can do, typically taking only 1 clock cycle. All I needed was a way of capturing the pin state changes, which was simple to do with my OpenBench Logic Sniffer, a low-cost open source logic analyser with a Java GUI, which means it runs on Solaris, my development platform. I therefore used the three pins as follows:

  1. Toggled on just before the task manager calls a Task's canRun() method and off after the method completes. The canRun() method is used to poll the Tasks to find out the first one which is currently runnable.
  2. Toggled on just before the task manager calls a Task's run() method and off after the method completes. The run() method implements whatever functionality the task requires.
  3. Toggled on at the start of every interrupt service routine and off at the end.

Instrumenting the canRun() and run() methods was easy: I just needed to add a few lines to the task manager to toggle the pins at the appropriate points. Interrupts, however, proved to be somewhat of a challenge. The problem is that gcc automatically adds prologue and epilogue code to every ISR to save and restore the registers used in the ISR. For short ISRs, this prologue and epilogue code makes up the bulk of the ISR, so it's important to capture how long it takes; yet if we simply add the pin toggle code to the start and end of the ISR body we'll miss most of the time spent in the ISR, as that's spent executing the compiler-added prologue and epilogue code. Plus, having to hand-edit all the ISRs to add the pin toggle instrumentation would be a pain, to say the least.
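
For reference, the task manager change amounted to something like the sketch below. The loop structure, the tasks[] array and the TRACE_CANRUN/TRACE_RUN pin names are illustrative guesses rather than the real task manager's code, but the SET_BIT_HI/SET_BIT_LO macros are the same ones used in the ISR macro further down.

// Hypothetical sketch of the instrumented dispatch loop - toggle one trace pin
// around each canRun() poll and another around the chosen Task's run() call.
for (uint8_t i = 0; i < numTasks; i++) {
    SET_BIT_HI(TRACE_CANRUN_PORT, TRACE_CANRUN_PIN);
    bool runnable = tasks[i]->canRun();
    SET_BIT_LO(TRACE_CANRUN_PORT, TRACE_CANRUN_PIN);
    if (runnable) {
        SET_BIT_HI(TRACE_RUN_PORT, TRACE_RUN_PIN);
        tasks[i]->run();
        SET_BIT_LO(TRACE_RUN_PORT, TRACE_RUN_PIN);
        break;    // run the first runnable Task, then rescan from the top
    }
}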

I therefore resorted to a bit of a hack. I redefined the ISR macro to use my own prologue and epilogue code, where the first and last things done are to toggle the appropriate IO pin. Because I didn't know which registers were going to be used by the ISR, I had to save and restore all the ones that might possibly be used, unlike the compiler, which knows which registers are actually used and can optimise the saves/restores accordingly. I also needed to 'wrap' the body of the ISR in my instrumentation code, which meant calling the body of the original ISR as a subroutine rather than having it as inline code. The consequence is that the instrumented ISR takes longer than the un-instrumented one, but I decided I'd rather overestimate the time the ISRs took than underestimate it. The macros I ended up with are below:

// Redefine the ISR macro to toggle the TRACE_ISR pin before/after each ISR
// body.  Also saves *all* registers and uses a RCALL to call the real ISR and
// will therefore be slower than the uninstrumented ISR as a result.
#undef ISR
#define ISR_VECTOR_BODY(V) \
  SET_BIT_HI(TRACE_ISR_PORT, TRACE_ISR_PIN); \
  asm("push r1\n" \
      "push r0\n" \
      "in r0,__SREG__\n" \
      "push r0\n" \
      "clr __zero_reg__\n" \
      "push r18\n" \
      "push r19\n" \
      "push r20\n" \
      "push r21\n" \
      "push r22\n" \
      "push r23\n" \
      "push r24\n" \
      "push r25\n" \
      "push r26\n" \
      "push r27\n" \
      "push r30\n" \
      "push r31\n" \
  ); \
  V(); \
  asm("pop r31\n" \
      "pop r30\n" \
      "pop r27\n" \
      "pop r26\n" \
      "pop r25\n" \
      "pop r24\n" \
      "pop r23\n" \
      "pop r22\n" \
      "pop r21\n" \
      "pop r20\n" \
      "pop r19\n" \
      "pop r18\n" \
      "pop r0\n" \
      "out __SREG__,r0\n" \
      "pop r0\n" \
      "pop r1\n" \
  ); \
  SET_BIT_LO(TRACE_ISR_PORT, TRACE_ISR_PIN); \
  reti()
#ifdef __cplusplus
#define ISR(V, ...) \
  extern "C" void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern "C" void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)
#else
#define ISR(V, ...) \
  extern void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)
#endif
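
For illustration, here's what a handler looks like at the source level once the macro is in place - the timer0 overflow ISR below is a made-up example, but the point is that the ISR itself needs no changes:

#include <avr/interrupt.h>

volatile uint32_t systemTicks;          // hypothetical system clock counter

// Written exactly as before; the redefined macro generates a naked
// TIMER0_OVF_vect() wrapper that raises TRACE_ISR_PIN, saves the
// call-clobbered registers, calls the generated TIMER0_OVF_vect_inner()
// routine containing this body, restores the registers, drops the pin
// and finally executes reti.
ISR(TIMER0_OVF_vect)
{
    ++systemTicks;
}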

Phew. There's a bit of C macro magic in there to generate an 'inner' routine that contains the body of the original ISR, and to call it. With that in place, I hooked up the OLS to the three monitoring pins and ran the application, tilting the accelerometer and tapping the piezo as I did so. Here's a small segment of an OLS trace of the application. The first channel is the canRun() trace - when the trace is high, a canRun() method is running. The second channel is the run() trace - high when a run() method is running. Finally, the third trace is high when an ISR is running.

[OLS trace 1]

It's clear that most of the time is spent checking to see if there's anything that can run, and that is time that would otherwise be taken up doing 'real' work, so in effect the top trace represents idle time, split between the task manager internals and the various canRun() methods. I wrote a little program to analyse the time the application spent in each mode (a sketch of the idea appears after the results), and the results were really rather surprising:

  • scheduler: 38.83%
  • canRun: 56.02%
  • run: 3.16%
  • interrupts: 1.98%
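
The analysis program itself doesn't need to be anything fancy. As a rough sketch, assuming the capture has been exported as one hexadecimal sample value per line with bit 0 carrying the canRun channel, bit 1 the run channel and bit 2 the ISR channel (the exact export format and channel assignment will depend on how you've set things up), something like this will do:

#include <stdio.h>

// Rough sketch of the trace analysis - assumes one hex sample per line with
// bit 0 = canRun, bit 1 = run, bit 2 = ISR; adapt to your actual export format.
int main(void)
{
    unsigned long counts[4] = { 0 };    // scheduler, canRun, run, interrupts
    unsigned int sample;

    while (scanf("%x", &sample) == 1) {
        if (sample & 0x4)      counts[3]++;   // ISR pin high
        else if (sample & 0x2) counts[2]++;   // run() pin high
        else if (sample & 0x1) counts[1]++;   // canRun() pin high
        else                   counts[0]++;   // nothing running: task manager overhead
    }

    unsigned long total = counts[0] + counts[1] + counts[2] + counts[3];
    const char *names[4] = { "scheduler", "canRun", "run", "interrupts" };
    for (int i = 0; i < 4; i++)
        printf("%-10s %6.2f%%\n", names[i], total ? 100.0 * counts[i] / total : 0.0);
    return 0;
}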

So with all that processing going on, we are only using just over 5% of the available CPU. How is that possible? Well, it's fairly easy to explain. Most of the heavyweight processing is being done by dedicated hardware on the MCU - for example, the piezo is being sampled by the ADC hardware, and the ADC values are read in by an ISR rather than by triggering a conversion and spin-waiting for it to complete, as the standard Arduino analogRead function does. As the Arduino documentation says, "It takes about 100 microseconds (0.0001 s) to read an analog input". That's an awful lot of CPU cycles wasted in a spin loop - 1600 at 16 MHz, to be exact. Wherever possible my application uses ISRs to handle events, buffering up data in both directions and scheduling a Task to handle the parts that can't be done directly in an ISR.
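
To make the contrast with analogRead concrete, here's a minimal sketch of interrupt-driven ADC sampling on the ATmega328P. It's a standalone example using the standard avr-libc ISR macro, and the names (adcSample, adcReady) are mine rather than the application's; the application also paces its sampling from a timer rather than free-running like this.

#include <avr/io.h>
#include <avr/interrupt.h>

volatile uint16_t adcSample;            // latest conversion result
volatile uint8_t  adcReady;             // flag picked up by a Task's canRun()

void adcStart(void)
{
    ADMUX  = _BV(REFS0);                               // AVcc reference, channel 0
    ADCSRA = _BV(ADEN) | _BV(ADIE)                     // enable ADC + completion interrupt
           | _BV(ADPS2) | _BV(ADPS1) | _BV(ADPS0);     // /128 prescaler: 125 kHz ADC clock at 16 MHz
    ADCSRA |= _BV(ADSC);                               // kick off the first conversion
}

// The CPU is free to do other work while the conversion runs; this fires
// when the result is ready instead of spin-waiting on the ADSC bit.
ISR(ADC_vect)
{
    adcSample = ADC;                    // read the 10-bit result
    adcReady  = 1;
    ADCSRA   |= _BV(ADSC);              // start the next conversion
}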

To push up the load as much as possible I enabled all the debugging output I could. That results in about 20-50 characters of output every millisecond as the samples read from the piezo are dumped out. Here are the results:

  • scheduler: 15.03%
  • canRun: 21.49%
  • run: 53.14%
  • interrupts: 10.21%

OK, now we are cooking: that's using about 63% of the CPU - however, that still means there's up to 37% of the CPU going unused! Looking at the corresponding OLS trace is informative:

[OLS trace 2]

Notice the big chunks where the run() channel (second one down) is high - what's going on there? Well, this trace was taken with the application's debugging turned up as high as possible, so it was producing large amounts of serial output. My interrupt-driven serial IO code has an output buffer of limited size - we only have 2 KB of RAM available, after all. When the buffer is full, serial writes spin-wait until there is space in the buffer for the pending output. The only other option would be to throw away the output, which doesn't seem particularly useful. So what we are seeing isn't really 'real work', it's merely the time spent waiting for space to become available in the output buffer, which happens as the serial ISR drains the buffer by copying the data to the USART. If that doesn't persuade you that it's a bad idea to use spin-waiting or polling in a tight loop as a way of implementing 'normal' program flow, I don't know what will!
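
For completeness, here's a minimal sketch of what the interrupt-driven transmit side looks like, including the spin-wait on a full buffer that shows up in the trace. The buffer size and names are illustrative assumptions, not the application's actual serial code.

#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdint.h>

#define TX_BUF_SIZE 64                  // illustrative; sized to fit in the 2 KB of RAM
static volatile uint8_t txBuf[TX_BUF_SIZE];
static volatile uint8_t txHead, txTail;

void serialPutChar(uint8_t c)
{
    uint8_t next = (uint8_t)((txHead + 1) % TX_BUF_SIZE);
    while (next == txTail)
        ;                               // buffer full: spin-wait for the ISR to drain it
    txBuf[txHead] = c;
    txHead = next;
    UCSR0B |= _BV(UDRIE0);              // enable the data-register-empty interrupt
}

// Fires whenever the USART can accept another byte; drains the ring buffer.
ISR(USART_UDRE_vect)
{
    if (txHead == txTail)
        UCSR0B &= ~_BV(UDRIE0);         // nothing left to send: disable the interrupt
    else {
        UDR0   = txBuf[txTail];
        txTail = (uint8_t)((txTail + 1) % TX_BUF_SIZE);
    }
}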

OK, what conclusions can we draw from this data? Well, I'd suggest at least these:

  • Wherever possible, use the hardware facilities of the AVR to offload work and help keep your application responsive.
  • Wherever possible, use interrupts to respond to external events. Also use them to handle output where there is a significant delay between each step of the output.
  • As far as possible, move anything other than trivial processing out of your ISRs into tasks which are prioritised relative to the other processing in your application.
  • Any kind of polling or spin-waiting is to be avoided as far as possible; the CPU cycles that such approaches consume can better be used for 'real' work.
  • Avoid using the standard Arduino libraries, because they are focused on simplicity and ease of use rather than performance.
  • Watch out for 'hidden' costs creeping into your code, such as the use of floating-point arithmetic when you don't absolutely need to use it - examining the symbol table of your application is a good way of figuring out if you've inadvertently triggered its use.
  • A better algorithm will always outperform micro-optimisation approaches such as hand-coded assembler - I'll discuss this further in a later post where I describe how I implemented the RGB LED handling.

So in summary, that humble AVR chip is actually a far more capable device than you probably think it is. The Arduino environment is certainly a quick and easy way of getting something working, but you have to realise that convenience often comes at a very significant cost. Before deciding to replace your AVR with something more powerful (but usually more expensive and less ubiquitous) you should try to find out why your application is struggling. The Arduino ecosystem has brought the widespread availability of some very interesting and cheap hardware, not just the boards themselves but also all the surrounding add-ons, and that's something you should not lightly forego. Just moving a poorly-performing application onto a faster platform won't necessarily make it any faster if it makes widespread use of non-performant approaches such as polling loops and spin-waiting. What will most likely happen is that you'll just end up burning the extra cycles the faster platform gives you waiting in various loops. At the very least you need to understand why your application is slow at present before assuming that a faster platform will help. I hope I've given you one approach for doing minimally-intrusive instrumentation, and even if you can't use the exact approach I've described here, careful instrumentation of your application will almost certainly show that your preconceptions about where all the cycles are being burned are wrong.

"Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is."
-- Rob Pike