AVR performance monitoring using the OpenBench Logic Sniffer

Now I've ditched the Arduino platform I wondered just how much 'juice' I could squeeze out of a 16Mhz ATMega328P, as used in the Arduino Duemilanove. According to Atmel it should be capable of around 16 MIPS, which puts it ahead of a 1985-vintage Intel 80386DX which could do about 11 MIPS at 33Mhz, albeit the 80386 is a 32-bit processor.

Accordingly, I set about sticking as many things from my parts box as I could onto my duemilanove board. I ended up with:

  • A piezo sensor, signal preconditioned with this circuit and being sampled with the ADC every msec. The sampled signal was further processed to do peak detection, trigger level quantization, hysteresis and double/triple beat detection.
  • Bit-banging a 16-LED chain driven by two daisy-chained TLC5916 LED drivers, animating the LEDs based on the piezo trigger events.
  • Reading (X,Y,Z) information from a LIS302DL accelerometer every 10 msec, over SPI running at 4Mhz. The Y axis data was a bit noisy so I implemented an exponential moving average to smooth the values. I did this using fixed-point arithmetic rather than floating point - on the AVR, floating point arithmetic has to be done in software so it's expensive, I'll describe how I did this in a later post.
  • Bit-banging another 16-LED chain as above to display the output of the accelerometer as a 'moving dot' that goes back and forth as the accelerometer is tilted.
  • Doing continuous fades of a RGB LED, 8 bits per colour. The fades were implemented entirely in software at an effective PWM clock rate of 31KHz, although I used a more efficient implementation than 'classic' PWM - I'll cover how I did that in a subsequent post.
  • A serial monitor to allow various parameters to be adjusted, e.g. piezo trigger levels, and allowing debugging output to be enabled and disabled. The baud rate was set to 115 Kbaud.

All this was implemented with a mix of interrupts and a variant of my task manager. Even with the maximum level of debugging enabled, everything continued working smoothly. However I had no real idea of how much headroom the CPU still had left, so I thought about how I might instrument the application to find out.

My first thought was to sample the value of timer0 which I'm using as the system clock. The issue there is that the overhead of housekeeping the performance counters would dwarf some of the shorter routines, particularly the interrupt service routines. Hmm...

I still had three spare pins and toggling a pin is about the simplest thing you can do, typically taking only 1 clock cycle. All I needed was a way of capturing the pin state changes, which was simple to do with my OpenBench Logic Sniffer, a low-cost open source logic analyser with a Java GUI that means it runs on Solaris, my development platform. I therefore used the three pins as follows:

  1. Toggled on just before the task manager calls a Task's canRun() method and off after the method completes. The canRun() method is used to poll the Tasks to find out the first one which is currently runnable.
  2. Toggled on just before the task manager calls a Task's run() method and off after the method completes. The run() method implements whatever functionality the task requires.
  3. Toggled on at the start of every interrupt service routine and off at the end.

Instrumenting the canRun() and run() methods was easy, I just needed to add a few lines to the task manager to toggle the pins at the appropriate point. Interrupts proved to be somewhat of a challenge however. The problem is that gcc automatically adds prologue and epilogue code to all ISRs to save and restore the registers used in the ISR. For short ISRs, this prologue and epilogue code makes up the bulk of the ISR, so it's important to capture how long it takes, yet if we simply add the pin toggle code to the start and end ISR we'll miss most of the time spent in the ISR, as that's spent executing the compiler-added epilogue and prologue code. Plus having to hand-edit all the ISRs to add the pin toggle instrumentation would be a pain, to say the least.

I therefore resorted to a bit of a hack. I redefined the ISR macro to use my own prologue and epilogue code, where the first and last things that were done was to toggle the appropriate IO pin. Because I didn't know which registers were going to be used by the ISR I had to save and restore all the ones that might possibly be used, unlike the compiler which knows which are actually used and can optimise the save/restores accordingly. I also needed to 'wrap' the body of the ISR in my instrumentation code, which meant calling the body of the original ISR as a subroutine rather than having it as inline code. The consequence of this was that the instrumented ISR was going to take longer than the un-instrumented one, but I decided I'd rather have overestimates of the time they took rather than underestimates. The macros I ended up with are below:

// Redefine the ISR macro to toggle the TRACE_ISR pin before/after each ISR
// body.  Also saves *all* registers and uses a RCALL to call the real ISR and
// will therefore be slower than the uninstrumented ISR as a result.
#undef ISR
#define ISR_VECTOR_BODY(V) \
  asm("push r1\n" \
      "push r0\n" \
      "in r0,__SREG__\n" \
      "push r0\n" \
      "clr __zero_reg__\n" \
      "push r18\n" \
      "push r19\n" \
      "push r20\n" \
      "push r21\n" \
      "push r22\n" \
      "push r23\n" \
      "push r24\n" \
      "push r25\n" \
      "push r26\n" \
      "push r27\n" \
      "push r30\n" \
      "push r31\n" \
  ); \
  V(); \
  asm("pop r31\n" \
      "pop r30\n" \
      "pop r27\n" \
      "pop r26\n" \
      "pop r25\n" \
      "pop r24\n" \
      "pop r23\n" \
      "pop r22\n" \
      "pop r21\n" \
      "pop r20\n" \
      "pop r19\n" \
      "pop r18\n" \
      "pop r0\n" \
      "out __SREG__,r0\n" \
      "pop r0\n" \
      "pop r1\n" \
  ); \
#ifdef __cplusplus
#define ISR(V, ...) \
  extern "C" void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern "C" void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)
#define ISR(V, ...) \
  extern void V(void) __attribute__ ((signal, __INTR_ATTRS)) ISR_NAKED __VA_ARGS__; \
  extern void V ## _inner(void); \
  void V(void) { ISR_VECTOR_BODY(V ## _inner); } \
  void V ## _inner(void)

Phew. There's a bit of C macro magic in there to generate an 'inner' routine that contains the body of the original ISR and call it. With that in place, I hooked up the OLS to the three monitoring pins and ran the application, tilting the accelerometer and tapping the piezo as I did so. Here's a small segment of an OLS trace of the application. The first channel is the canRun() trace - when the trace is high a canRun() method is running. The second channel is the run() trace - high when a run() method is running. Finally, the third trace is high when an ISR is running.

OLS trace 1

It's clear that most of the time is spent checking to see if there's anything that can run, and that is time that would otherwise be take up with doing 'real' work, so in effect the top trace represents idle time, split between the task manager internals and the various canRun() methods. I wrote a little program to analyse the time the application spent in each mode, and the results were really rather surprising:

  • scheduler: 38.83%
  • canRun: 56.02%
  • run: 3.16%
  • interrupts: 1.98%

So with all that processing going on, we are only using just over 5% of the available CPU. How is that possible! Well, it's fairly easy to explain. Most of the heavyweight processing is being done by dedicated hardware on the MCU - for example, the piezo is being sampled by the ADC hardware, and the ADC values are read in by an ISR rather than just triggering a conversion and spin-waiting for it to complete, as done by the standard Arduino analogRead function. As the Arduino documentation says "It takes about 100 microseconds (0.0001 s) to read an analog input". That's an awful lot of CPU cycles wasted in a spin loop, 1600 to be exact. Wherever possible my application uses ISRs to handle events, buffering up data in both directions and scheduling a Task to handle the parts that can't be done directly in an ISR.

To push up the load as much as possible I enabled all the debugging output I could. That results in about 20-50 characters of output every millisecond as the samples being read from the piezo are dumped out. Here are the results:

  • scheduler: 15.03%
  • canRun: 21.49%
  • run: 53.14%
  • interrupts: 10.21%

OK, now we are cooking, that's now using 63% of CPU - however that still means there's up to 37% unused CPU available! Looking at the corresponding OLS trace is informative:

OLS trace 1

Notice the big chunks where the run() channel (second one down) is high, what's going on there? Well, this trace was taken with the application's debugging turned up as high as possible so it was producing large amounts of serial output. My interrupt-driven serial IO code has an output buffer that's a limited size - we only have 2Kb of available RAM after all. When the buffer is full, serial writes spin wait until there is space in the buffer for the pending output. The only other options would be to throw away the output, which doesn't seem particularly useful. So what we are seeing isn't really 'real work', it's merely the time that's spent waiting for space to become available in the output buffer, which will happen as the serial ISR drains the buffer by copying the data to the USART. If that doesn't persuade you that it's a bad idea to use spin-waiting or polling in a tight loop as a way of implementing 'normal' program flow I don't know what will!

OK, what conclusions can we draw from this data? We'll I'd suggest at least these:

  • Wherever possible, use the hardware facilities of the AVR to offload work and help keep your application responsive.
  • Wherever possible, use interrupts to respond to external events. Also use them to handle output where there is a significant delay between each step of the output.
  • As far as possible, move anything other than trivial processing out of your ISRs into tasks which are prioritised relative to the other processing in your application.
  • Any kind of polling or spin-waiting is to be avoided as far as possible, the CPU cycles that such approaches consume can better be used for 'real' work.
  • Avoid using the standard Arduino libraries, because they are focused on simplicity and ease of use rather than performance.
  • Watch out for 'hidden' costs creeping into your code, such as the use of floating-point arithmetic when you don't absolutely need to use it - examining the symbol table of your application is a good way of figuring out if you've inadvertently triggered its use.
  • A better algorithm will always outperform micro-optimisation approaches such as hand-coded assembler - I'll discuss this further in a later post where I describe how I implemented the RGB LED handling.

So in summary that humble AVR chip is actually a far more capable device that you probably think it is. The Arduino environment is certainly a quick and easy way of getting something working, but you have to realise that convenience often comes at a very significant cost. Before deciding to replace your AVR with something that's more powerful (but usually more expensive and less ubiquitous) you should try to find out why your application is struggling. The Arduino ecosystem has brought the widespread availability of some very interesting and cheap hardware, not just the boards themselves but also all the surrounding add-ons that are available, and that's something you should not lightly forego. Just moving a poorly-performing application onto a faster platform won't necessarily make it any faster if it makes widespread use of non-performant approaches such as polling loops and spin-waiting. What will most likely happen is that you'll just end up burning the extra cycles the faster platform gives you waiting in various loops. At very least you need to understand why your application is slow at present before assuming that a faster platform will help. I hope I've given you one approach for doing minimally-intrusive instrumentation and even if you can't use the exact approach I've described here, careful instrumentation of your application will almost certainly show that your preconceptions about where all the cycles are being burned are wrong.

"Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is."
-- Rob Pike
Categories : Tech, AVR

Re: AVR performance monitoring using the OpenBench Logic Sniffer

Excellent post!  I'm in a very similar position.  I'm still using the Arduino IDE, but I'm moving to custom boards, using ISP to program, rather than Arduino boards, mostly using smaller AVRs.  I've been realizing how much overhead there is in a lot of the Arduino functions, so I've been planning to move to using Atmel Studio for some of my projects.  This post really makes it apparent how much of a difference that could make.

Re: AVR performance monitoring using the OpenBench Logic Sniffer

Really nice writeup, it illustrates what actual bare-metal performance is really about and really like, as opposed to layers of drivers / API like, say, the Arduino environment. I personally work with PIC MCUs rather than AVRs, but CPU idle "slack" (or lack thereof) is much the same there too obviously. To be fair I never actually worked with Arduinos, but I understand they target sparse I/O on human timescale (wiggle a pin here or there sometimes, take an A/D reading every forthnight or so, adjust a servo setting, etc.) which I suppose they manage to do quite adequately (well, programming blunders aside).

The thing is, getting used to this sort of (non)performance and blind wasting of perfectly good MCU cycles leaves one quite oblivious to what that hardware is _actually_ capable of (like, say, generating a real-time composite video signal by bit-banging AND the music to go with it AND reading two gamepads AND casually implementing the game logic itself on top of all that), and I always considered that a sad state of affairs. Being aware of that sort of thing is conducive to a lot less surprise upon adorning an MCU with various I/O like a christmas tree only to find that meanwhile said MCU is actually dying of boredom. Which is why I'm happy to read things like your writeup that help shed some light on such matters... :)

As for me, I guess my mantra can be summed up mostly as "Never wait for anything. Ever." State machines are your friend, blocking calls are the enemy. Conversely, anything that you can trigger (detect by interrupt) instead of polling for it all the time, you should. Also, there is never a buffer big enough (at least on MCUs) to actually be guaranteed never to clog up under the right circumstances, so any sort of serial debug / output is, well, not your friend - as you yourself noted. I prefer to throw away stuff if I have to, but even more to make sure I never try to use a peripheral that's not a "fat enough pipe" that I can never fully saturate it with the aid of a modest buffer, in that particular application. Even so, flow control might force me at any time to throw stuff away - short of a magic bottomless buffer, nothing can guarantee I won't ever need to discard data; the best I can do is make a note that it happened. Regarding debugging, there's rarely a condition special enough that one cannot debug by toggling the proverbial LED on a spare pin. If one must (and there are enough spare pins), outputting a byte to a port is the "speed of light" of debugging - nothing can ever be faster, it never clogs up, and it can be picked up nicely in the outside world with a logic analyzer as mentioned above (OLS does display hex values for 8-pin trace groups and one could even write a dedicated plugin to decode those states as text, nicely timestamped).

Also, as you have already mentioned, looking at the code you generate (not necessarily at single machine instruction level) does indeed help you see what eats your CPU cycles (almost always unjustifiedly) - for instance, I have a really bad experience with indexed adressing: every X[Y] style access to an array seems to generate dozens of instructions (repeated from zero on the following line, then the following one, if they keep manipulating X[ ]), when a judiciously allocated buffer on a binary threshold could have been accessed in 2-3 instructions. It follows that a CommBuffers[SIZE, 2] works a lot faster divided as a CommRxBuffer[SIZE] plus a CommTxBuffer[SIZE]. Also, annoyingly, the same buffer implemented with direct pointer access actually seems to work even faster than that most of the time (yeah ok call me paranoid, I don't really trust pointers...).

Re: AVR performance monitoring using the OpenBench Logic Sniffer

One of the problems with the Arduino environment is it almost forces you into using delays for timing, as I detailed in a previous post, so using it to deal with any sort of concurrency is often difficult. That limitation has sometimes resulted in the AVR MCU being labelled as slow and not very capable when in fact it's a consequence of the Arduino software. I'd have more time for the Arduino software project if they were honest and open about the limitations of the environment. Getting people using Arduino who wouldn't otherwise have a way of doing Physical Computing is a laudable goal. However there doesn't seem to be anything anywhere in the project that says "Now you've learned the Arduino environment, here is where you should look to learn how to step beyond it" - any decent learning system will always encourage people to grow beyond it's bounds. Instead of which the Arduino project has occupied itself by attempting discourage people from really undersanding what is happening in their programs with idiocies such as the StringObject class. There's an unforgivably misleading statement in the documentation - "It takes more memory than a simple character array, but it is also more useful." which should actually be "Using this class in anything other than a trivial way will consume almost all of the very limited memory you have and will almost certainly cause your program to fail in random and undiagnosable ways". It isn't helpful to hide the truth from people, and the truth is that MCUs are a very constrained environment in which things such as heap-allocating copy-constructoring C++ classes are inappropriate.

As for your comment about pointer access, yes in a significant number of cases it be faster. An array index has to be converted into base address + (index * size of array element) whereas a pointer is a direct access. gcc is pretty good at optimising things where it can, e.g. a 'for' loop iterating over an array can be optimised to use a pointer and an increment, but that's not always going to be the case. A thorough understanding of pointers is an essential C/C++ skill :-)

Re: AVR performance monitoring using the OpenBench Logic Sniffer

PS - regarding optimized assembly vs. standard C: I found that AES encryption (which one should never use without an actual proper hardware encryption unit, unless one is prepared to regress to Arduino-like performance levels)  - despite being something one can implement in a really limited number of ways - as supplied in the official AVR C support libraries somewhere can be rewritten at least a full order of magnitude faster in assembly. And that's one place it really matters, considering any application making use of AES will be doing bazillions of such operations...

Re: AVR performance monitoring using the OpenBench Logic Sniffer

Interestingly, the XMEGA has a specific instruction for optimising the DES algorithm, see AVR1317 application note.

Re: AVR performance monitoring using the OpenBench Logic Sniffer

Alan, this is great stuff. I was able to get your example code running and monitoring the performance with my PoScope Logic Analyzer just as you described above. I'd like to take it to the next level with a simple means of detecting a button within a task (not an interrupt just yet), and then have that set another task as runnable. In the larger context of your program with all the I/O, how did you pass information between the various tasks? Did you use global variables and/or states?