# Merging Mercurial repositories

A quick note because if I ever need to do this again I know I'll have forgotten how :-) I had three Mercurial repos that were parts of the same thing, a web-based browser/searcher for engineering documents at work. The three parts were all NetBeans projects that were interrelated, so having them as separate repos didn't make much sense. However, merging them into a single, non-branched repo whilst maintaining history wasn't particularly straightforward, I couldn't find any description of how to do this that didn't end up with branches in the resulting repo, and mercurial extensions such as mq didn't seem to cut it either. After some experimentation, I came up with the following sequence of steps which did what I wanted:

- Use the Mercurial Convert extension to move each project one directory level down so they didn't clash when merged. This uses the
`--filemap`

flag with a simple filemap of the form. NewProjSubDir

to move each project into a subdirectory:$ hg convert --filemap filemap orig1 moved1

- Pull all the intermediate moved projects into a new repo:
$ hg clone moved1 merged $ cd merged $ hg pull -f ../moved2 $ hg pull -f ../moved3

The`-f`

flag is needed because the repos are unrelated. This will result in a repo with a branch for each source repo, with each branch sorted in date order but the repo overall won't be date sorted. - Reorder the history so it is sorted in date order, again using the Convert extension to do that:
$ hg convert --datesort merged merged1

- Reparent each changeset in the merged, sorted repo so that its parent is the immediately preceding one, in date order. The Convert extension's
`--splicemap`

feature is used to do this, along with a small perl script to create the splicemap:#!/bin/perl use strict; use warnings; my ($line, $parent); while (defined($line = <>)) { if ($line =~ m{\bchangeset:\s+\d+:([\dabcdef]+)\b}) { my $child = $1; if (defined($parent)) { printf("%s %s\n", $child, $parent); } $parent = $child; } }

Then:$ cd merged1 $ hg glog --debug | perl ../splicemap.pl > ../splicemap $ cd .. $ hg convert --splicemap splicemap merged1 merged2

The resulting merged2 repo will have the merged, date-sorted and branchless union of the contents of the original repos.

# Rohan? Rubbish

Rohan clothing = expensive, utter junk. Over the last couple of years I've been bought two Rohan products as presents by my mother, a Rohan Daybreak down vest and a Rohan long sleeved shirt. About one inch of the vest zip has pulled clean out of the rest of the garment half way down the front. On careful examination I can see that about an inch of the back edge of the zip has been cut off during manufacture and as a result the zip fabric has shredded and pulled out of the garment. That's already in the bin. The shirt has turned into a pilled static-laden mess and is also destined for the bin as it's unwearable,

Considering how much the clothing costs the quality is awful. The products may look nice on a hanger, or if all you want is to ponce around in a wine bar, but if my experience is anything to go by they clearly aren't serious outdoor equipment. Anyone thinking about buying Rohan should save their money and buy something else, you can buy something for less than half the price and still end up with significantly better quality.

Save your money, avoid http://www.rohan.co.uk/.

# Linux Mint LMDE, Xfce, Toshiba R700, suspend-resume and backlight keys

I've put Linux Mint Debian Edition on my Toshiba R700 laptop and I'm using the Xfce window manager. Initially everything seemed to work fine, but after a suspend-resume cycle the brightness keys no longer worked (Fn-F6 and Fn-F7) although everything else did. A fruitless search for a solution across the intertubes then ensued. It appears that the original problem was caused because somebody decided they'd rip support for a feature they wanted to deprecate out of the Toshiba ACPI driver more-or-less "to see what broke" - nice. Well, the answer is "lots of things" if the number of reports of this problem that I've found are any indication.

I found a lot of suggested workarounds and without going into the gory details, none of them actually worked and in fact they generally made things worse, for example making the backlight not come on at all after resume. I did get a few clues as to how to work around the problem by diddling the backlight driver files under `/sys/class/backlight`

, specifically the ones in the `/usr/local/bin`

and used the Keyboard settings manager to map the <Super>F7 and <Super>F6 keys to calls to "backlight -u" and "backlight -d" to adjust the brightness up and down. This works both before and after suspend-resume and I've set it up to have 16 different brightness levels as I find the standard 8 a bit too wide apart. In case it helps someone else I've put the source here and a precompiled 64-bit binary here - if you need the 32-bit version you'll have to build from source. Enjoy!

# Driving the WS2811 at 800KHz with a 16MHz AVR

Recently some of the folks at the Manchester Hackspace did a bulk-order of WS8211 LED pixels. These are available from several vendors on aliexpress.com (search for "WS2811 5050 RGB") and they combine a 5050 RGB LED with a WS2811 driver chip. The WS2811 provides 24-bit RGB colour plus constant-current output, so no external components are required, although the datasheet does suggest the addition of some impedance matching and decoupling resistors and capacitors. The WS2811 can run at a data rate of either 400KHz or 800KHz, although the 800KHz ones seem more common. Having got hold of some, I set about getting them working using a Minumus 32 board which has an Atmel ATmega32U2 running at 16MHz.

Note also that this part, the WS2811, is commonly confused with one with a very similar name, the WS2801, but they are radically different beasts. The WS2801 has a SPI interface which means you need to provide both a clock and a data signal. That in turn means you can send data at a wide range of speeds (the WS2801 datasheet says up to 25MHz) and still have everything work fine as the clock line signals the WS2801 when to sample the data line. Plus most MCUs have hardware SPI which makes driving the WS2801 pretty much a doddle.

The WS2811 oh the other hand uses a rather unusual control scheme. It uses a single combined clock and data line. You reset the chain by keeping the input low for around 50usec (less will usually work as well), then start sending 24-bit RGB sequences in a continuous stream. The first LED in the chain displays the first RGB value to be sent and passes the rest along the chain, the second displays the second value and so on. The datasheet gives the required timings, and there are a couple of writeups here and here. Mostly these are talking about the WS2811 in low speed (400KHz) mode, the ones I have are 800KHz. The way the WS2811 protocol works is that there is a low-to-high transition at the beginning of each bit cell, then a high-to-low transition at a variable point within the cell, depending if the bit value is 0 or 1. For a logical zero the transition is near to the beginning of the cell, for a logical one it is later on in the cell. The exact timings seem to vary depending on which source you believe (800KHz mode):

Source | cell timing | logical 0 high | logical 0 low | logical 1 high | logical 1 low |
---|---|---|---|---|---|

WS2811 datasheet | 1.25 usec | 250 nsec | 1000 nsec | 600 nsec | 650 nsec |

aliexpress.com | 1.25 usec | 350 nsec | 800 nsec | 700 nsec | 600 nsec |

doityourselfchristmas.com | 1.25 usec | 250 nsec | 1000 nsec | 1000 nsec | 250 nsec |

The timings seem to be all over the place, in particular the aliexpress ones don't even add up to the bit cell length of 1.25usec! The doityourselfchristmas.com explanation of how the WS2811 works made sense, so I used the timings from there and put together a simple test using the delay_x.h that's floating around the net. That worked OK for a single pixel but if I tried slow fades or driving more than one pixel I got a lot of jittering, Hmm, OK, let's look at the timings again. I'm using a 16MHz AVR, so each clock cycle is 62.5 nsec long. The short pulses in the WS2811 protocol are 250 nsec long and each bit cell is 1.25 msec long. Wow, that's only 4 clock cycles for the short pulses, and only 20 cycles for each bit cell, and the allowed +/- timing variation is 75 nsec, which is just over 1 clock cycle. Hmm, that means that driving these with a simple C routine is unlikely to be sufficient. I spent a bit of time looking to see if there was any sort of hardware assist that could be brought to bear, but even SPI at 4MHz, close to the maximum that the MCU can support, wouldn't be fast enough as it would still be necessary to marshal each byte into a series of 5-bit patterns to get the timings right for the WS2811 protocol. And anything interrupt-driven is also out, as it takes 5 clocks just to dispatch an interrupt and we only have 20 cycles to play with. That only leaves bit-banging which I generally try to avoid, but because of the relatively high speed of the WS2811 we could update 100 pixels every 10 msec using around 33% of the available CPU, which is perfectly acceptable. There's another oddity as well - although the WS2811 takes the 8-bit colour value in RGB order, the pixels have been wired up so the order is GRB, which makes life a little more complicated as the bytes need reordering on output.

OK, so the only realistic option looks like it is going to be some had-crafted assembler. Although this post on arduino.cc suggests it is not possible to meet the timing constraints, I thought it was possible - if not particularly simple, and indeed that's the case. Anyway, to cut to the chase, I've put a copy of the resulting code on SourceForge, and there's a demo of it in use there as well. Some notes about the implementation:

- The basic algorithm is to have an outer loop that iterates over the array of RGB values we've been passed and an inner loop that iterates over each 8-bit R, G or B value, setting the output pin as necessary. This is made somewhat more complicated than it should be because the WS2811 pixels I have are wired in (GRB) order rather than (RGB) order.
- This code requires instantiating for each port/pin combination it is used on. The reason is that dereferencing a port pointer and assigning a value to it takes 4 cycles, which is too long to be usable here bearing in mind we only have 4 instructions to toggle the pin low/high/low or high/low/high as appropriate to produce the short 250nsec pulse that's required.
- If we want to keep the timings accurate it is necessary to run with interrupts disabled.
- Conditional branch instructions on the AVR take a different number of clock cycles depending on whether they are true or false. It's therefore necessary to insert additional instructions to equalise the time taken by the true and false paths. That means a bit test and pin set takes 8 cycles, once code to equalise the timings is added. That's nearly half of the 20 cycles we have available per bit.
- We only need to do the inner 8-bit loop bit-test-and-set-pin once per bit, to see if it is a 0 bit. If it is, we set the output pin low at 250nsec into the bit cell. For 1 bit we don't need to test at all, we just need to unconditionally set the output pin low 1000nsec into the bit cell. That's because if we are outputting a 0 bit the output pin will already have been set low at 250nsec and the additional set to low at 1000nsec will have no effect. On the other hand, if we are outputting a 1 bit we'll correctly changing the pin from high to low at 1000nsec.
- We can't leave the outer loop testing, to see if we've reached the end of the array of RGB values, until after we've output each 24-bit RGB value. If we did we'd introduce jitter between one set of 24 bits and the next. We therefore have to interleave the necessary outer loop housekeeping with the inner 8-bit loops that do the actual bit output.
- We only have at most 6 cycles free per bit once all the inner loop testing pin setting and loop handling is accounted for. We've already established that it takes around 8 cycles to do a conditional bit-test-and-pin-set and perform the necessary adjustments to keep the timings the same - the bare minimum to do a test that takes the same time down the true and false paths is 4 cycles. We need to only do the interleaved outer loop handling on the last iteration of the inner loops so that we don't end up doing it multiple times per RGB value - but it's going to take a minimum of 4 cycles just to do the necessary test, and we only have 6 cycles available.
- To solve that problem we partially unroll the R and B loops. We loop over the R and B bit values 7 times and output the 8th bit with an unrolled version of the loop. That means there's no need to explicitly test if we are on the last iteration of the R or B loop as we just 'fall through' from the 7th iteration of the loop. That saves us sufficient cycles to be able to interleave the outer loop handling with the handling of the 8th bit of the R and B values.
- Setting an output pin doesn't change any of the flags in the status register, so it is possible to perform a test then set an output pin and then perform a conditional jump using the result of the prior test.
- Conditional jumps can only be made -64/+63 bytes relative to the current program counter, if we need to jump further it needs a combination of a local conditional brach and a long-range jump.

To validate the timings I hooked up the Minimus to a scope and verified that the timings were as expected, and they are as per the table above. In particular, the overall period per 8 bits is exactly 10 usec, with no jitter between one 24-bit RGB value and the next (click on the images for a larger version).

1 bit, low |
1 bit, high |

In addition, I also looked at the output of the pixel, which is passed down to the rest of the chain. That revealed that there is a delay of approximately 200 nsec per pixel, and that the signal is reshaped before being passed to the next pixel in the chain. The timings are not the same as those specified in the datasheet, which suggests to me that the datasheet timings are most likely just an average of the output timings of a sample of chips rather than being a characterisation of the operational input range of the chips.

**Input versus output of a pixel, high and low cells**

The output timings are as follows:

logical 0 high | logical 0 low | logical 1 high | logical 1 low |
---|---|---|---|

338 nsec | 912 nsec | 680 nsec | 570 nsec |

That leads me to suspect that the most important thing when driving the WS2811 is not the exact intra-cell timings for low and high bits, it is getting the bit rate as close to the specified 800 KHz as possible and in avoiding jitter between each block of 24 bits. The code I've linked to above does exactly that, so although I've only tested it on a short string it should be fine for driving much longer ones as well.

And finally, here is the obligatory YouTube video clip - enjoy ;-) The pixels are so bright I had to put 4 layers of paper in front of them to stop the camera overloading. The shot of the scope shows the input to the first pixel at the top, in red. The yellow trace below is the output of the first pixel and that's fed in to the input of the second pixel, and so on. As you can see, the bottom trace is 1/3 shorter than the top trace as this chain has 3 pixels in it. The overall pulse train is 90usec, each pixel taking 30usec to refresh. That comes out as a bit rate of 800KHz, as per the datasheet.

# Update

This post got mentioned on Hackaday, after which I've had a lot of feedback, both on Hackaday and here. Some of it has been good, some has been, well, let's just call it ill-informed.

**Please** don't bother telling me that you can do this with an Xmega, a PIC, an ARM or whatever. All that proves is you've entirely missed the point of this post.

Some of the comments have been along the lines of "Why don't you use hardware SPI, it works for me". Firstly, the WS2811 is **not** a SPI device but if you **do** have it working, please leave me a note saying what SPI settings you used, because I've not found an obvious way of getting the right 800KHz data rate and the right mark/space ratio that the WS2811 requires, or of avoiding jitter as the ATmega SPI hardware is not double-buffered. Note in particular if you are using the FastSPI library you are **not** using hardware SPI. When driving the WS2811, TM1809 or TM1804, FastSPI uses bit-banging, as does this code.

Another suggestion is to use the USART in synchronous mode and set to 5 bits per byte. The problem is that the ATmega sends start and stop bits even in synchronous mode, and the signal polarity is wrong as well. There's an inconclusive discussion on Hackaday about this option, but I don't think it's practical. And, as I note above, because of the data rates required, even if you use hardware you are still going to spend most of the available cycles managing it.

I've also been told that I've wasted my time because the FastSPI library can already drive these chips. FastSPI is a fine library, but if you search around you'll find people who have had trouble getting it to work with the WS2811 (including in the comments to this post). I've done an investigation into FastSPI and the possible causes of the problems people have getting it to work and I have the following comments to make:

- FastSPI depends on the Arduino environment and libraries, which I don't use. It's therefore no use to me. My code has no dependencies on the Arduino environment.
- The WS2811 datasheet specifies that the allowed variation is +-75 nsec, that's just over 1 clock cycle (62.5 nsec @ 16MHz) so to stay within spec the timings have to be accurate to +-1 cycle. For '0' bits the FastSPI library is spot-on at 1250 nsec but for '1' bits it is 1 cycle over at 1312.5 nsec. That's just within spec.
- The FastSPI code sends 3 blocks of 8 bits for each RGB value. The 8th bit of each block is significantly out of spec, 1625 nsec for '0' bits and 1687.5 nsec for '1' bits compared to the spec value of 1250 nsec.
- Between each RGB block (24 bits) there's an even bigger out of spec gap of 2062.5 nsec which is 65% longer than it should be.
- The overall effect is that, worst-case, the pulse train that FastSPI outputs can be up to 10% longer than it should be. With some batches of chips you may get away with this but with others you may not, which is most likely why for some people FastSPI works and OK and for others it doesn't - it's down to luck. And as I noted above, individual bit jitter is probably more problematic than a pulse chain that is slightly too fast or too slow.
- Finally, a simple test program that uses FastSPI to set 3 LEDs to a fixed value is 11048 bytes long. The equivalent program using my code is 450 bytes - about 25 times smaller. On the board I'm using, FastSPI would use up 1/3 of the available program memory and that's more than I can afford. The reason for this difference is simple, FastSPI supports multiple LED driver chips and even allows you to select them at run-time whereas mine is just intended to drive the WS2811. That's a classic flexibility/space design tradeoff that in my case doesn't work out in FastSPI's favour, your mileage may of course vary.

Finally, here's a scope trace showing 2 RGB values being output by FastSPI (top trace) and WS2811.h (bottom trace). You can see the jitter between the 8-bit blocks and the 32-bit RGB blocks on the FastSPI trace.

# Sensor smoothing and optimised maths on the Arduino

As part of my attempt to load as much as possible onto an Arduino, I added a LIS302DL accelerometer into the mix and used the Y axis output to drive a strip of LEDs via a couple of TLC5916 LED drivers. I was sampling the accelerometer every 10msec and it jittered a little, so I wanted to add some smoothing to the sensor values. There are all sorts of fancy smoothing algorithms available, but a simple Exponential moving average worked just fine, and was easy to calculate:

// choose a weighting factor W between 0 .. 1, then // current average = (current sensor value * W) + (last average * (1 - W)) static uint16_t avg = 0; uint8_t val = getAccelValue(); avg = val * 0.1 + avg * 0.9;

The actual sensor value is a signed 8-bit integer and the normal 1G reading (i.e. when it's being simply tilted and not waved around) ranges from around -55 to +55, so I capped the value at +/-50 and added 50 to it to get a value between 0 and 100 that I could then feed into the exponential moving average calculation before mapping the value onto the LED strip, with 'horizontal' resulting in the middle LED on the strip being illuminated, and the lit LED moving from side to side as the strip is tilted.

All well and good, the maths is simple and everything works fine. However the AVR doesn't have floating-point hardware support, so I wanted to see if I could improve the performance of the calculation above. Using integer arithmetic is the most obvious way, but that can't be done directly as there are fractional factors used in the calculation. However there's a standard way of doing this which is to use Fixed-point arithmetic. To do this, all the numbers involved are multiplied by a scaling factor to in effect move the decimal point to the right. An example would be doing financial calculations in pence rather than pounds. Addition/subtraction of scaled numbers works as normal as does multiplication/division of a scaled by an unscaled number. The operations that need special handling are multiplication/division of two scaled numbers. Here's why: if we have two numbers `a`

and `b`

that are scaled by `S`

and we multiply them we get `aS * bS`

. That simplifies to `abS`

, so to keep the scaling correct we need to divide by the scale factor after multiplication so we end up with ^{2}`abS`

. Division is the inverse, we have to multiply by S to keep things straight. Finally, as we are dealing with binary numbers, if we make the scale factor a power of two we can implement the scale factor calculations as efficient bit shifts rather than having to do multiplications and divisions.

There is in fact limited support for fixed-point math in AVR gcc but as far as I can tell it's only available as a patch, which would mean having to recompile my AVR toolchain. However it's simple enough to implement fixed-point yourself, so that's what I did. The first and important thing is to choose the scale factor, being careful to pick a value that doesn't result in overflow at any point in the calculation. In this case, I knew the range of sensor values was going to be between 0 and 100. If I choose a scale factor of 2^{5} the calculation above in fixed-point form would therefore be:

static uint16_t avg = 0; uint8_t val = getAccelValue(); avg = ((val << 5) / 10) + (avg * 9 / 10); uint8_t currVal = (avg + 16) >> 5;

So the maximum intermediate value would be `100 * 2`

or 28800 which is well within the bounds of a 16-bit unsigned integer. A couple of things to note: the scaling/unscaling is done by simply shifting by 5 bits left or right as appropriate, and the 0.1 and 0.9 multiplications of the floating-point version become ^{5} * 9`/ 10`

and `* 9 / 10`

respectively. Finally, to get the unscaled average value we need to do the equivalent of adding 0.5 and rounding before unscaling the value which is what the addition of 16 is for - `2`

. Next I wrote a small program that fed a range of values into both the floating and fixed point versions and compared the results. In all cases, the difference between the floating and fixed point versions was no more than +-1, which was perfectly adequate. Because the sensor values are being sampled at 100Hz the fixed-point exponential moving average converges to the same value as the floating-point value within only a few samples even when there is a difference.
^{5} / 2 = 16

The next step was to actually measure how long the floating-point and fixed-point calculations took. To do that, I wrote a little stand-alone program that fed the values 0 .. 100 into the calculation, and then repeated that 1000 times. I then recorded the start and end time in microseconds around the loop, which gave me the time to execute 100,000 of the calculations. I also timed an empty loop containing just a NOP instruction so I could subtract the loop overhead from the calculation. The CPU is running at 16MHz and each NOP takes 1 cycle, so I can subtract the time for the NOPs from the total overhead. Without the NOPs there's a good chance the gcc optimiser will figure out that the loop does nothing and optimise it away entirely, defeating the purpose of the overhead calculation.

// Overhead. uint32_t start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (int8_t val = 0; val <= 100; val++) { _NOP(); } } uint32_t ohead = micros() - start - (1000L * 100 / 16); printf_P(PSTR("overhead %.2f\n"), ohead / 100000.0); // Floating point 1:10. start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (int8_t val = 0; val <= 100; val++) { v1 = val * 0.1 + v1 * 0.9; } } printf_P(PSTR("floating point 1:10 %.2f\n"), (micros() - start - ohead) / 100000.0);avr200 // Fixed point 1:10. start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (uint8_t val = 0; val <= 100; val++) { v2 = ((val << 5) / 10) + (v2 * 9 / 10); v3 = (v2 + 16) >> 5; } } printf_P(PSTR("fixed point 1:10 %.2f\n"), (micros() - start - ohead) / 100000.0);

OK, let's run that and look at the results, the CPU clock rate is 16MHz and the timings are in microseconds per calculation:

overhead 0.20 floating point 1:10 27.95 fixed point 1:10 32.48

Wait a minute - the fixed-point calculation is **slower** than the floating-point version! How on earth can that be? Hmmm. Well, the floating and fixed-point versions aren't exactly equivalent. The floating-point version uses two multiplies, the fixed-point version uses two divisions and a multiply. Let's rewrite the floating-point version and make it exactly equivalent to the fixed-point version:

// Floating point 1:10 division. start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (uint8_t val = 0; val <= 100; val++) { v1 = val / 10.0 + v1 * 9.0 / 10.0; } } printf_P(PSTR("floating point 1:10 division %.2f\n"), (micros() - start - ohead) / 100000.0);

Now the results are:

overhead 0.20 floating point 1:10 27.95 fixed point 1:10 32.48 floating point 1:10 division 79.39

OK, that at least explains what the problem is - it's the division steps in the calculation. Whilst the AVR has 8-bit hardware multiply instructions, it has no hardware division. It turns out that division is harder and therefore and slower to implement than multiplication, both in hardware and software. Multiplication can be done by simple repeated addition, division requires test subtractions and comparisons. There's a good Atmel application note AVR200 that gives some comparative timings for multiplication and division implemented entirely in software:

Application | Code Size (Words) | Execution Time (Cycles) |
---|---|---|

8 x 8 = 16 bit unsigned (Code Optimized) | 9 | 58 |

8 x 8 = 16 bit unsigned (Speed Optimized) | 34 | 34 |

8 x 8 = 16 bit signed (Code Optimized) | 10 | 73 |

16 x 16 = 32 bit unsigned (Code Optimized) | 14 | 153 |

16 x 16 = 32 bit unsigned (Speed Optimized) | 105 | 105 |

16 x 16 = 32 bit signed (Code Optimized) | 16 | 218 |

8 / 8 = 8 + 8 bit unsigned (Code Optimized) | 14 | 97 |

8 / 8 = 8 + 8 bit unsigned (Speed Optimized) | 66 | 58 |

8 / 8 = 8 + 8 bit signed (Code Optimized) | 22 | 103 |

16 / 16 = 16 + 16 bit unsigned (Code Optimized) | 19 | 243 |

16 / 16 = 16 + 16 bit unsigned (Speed Optimized) | 196 | 173 |

16 / 16 = 16 + 16 bit signed (Code Optimized) | 39 | 255 |

So even without hardware division support there's a significant difference in the speed/space trade-offs between multiplication and division. Add in to that the fact that the AVR has hardware multiply but no hardware divide and it's not really surprising that there's a big performance difference between multiplication and division. The question is, is there anything we can do about it?

Well, it turns out there is. We avoided division in the fixed-point scaling operations by using bit shifts, we can do so with the exponential moving average calculation as well by choosing a decay factor that is a power of two. I chose 2^{4} as the value, but I then needed to recheck that the intermediate results wouldn't overflow. As above, the biggest term in the calculation is `100 * 2`

which is 48000, so we are OK. To keep the comparisons fair I modified the floating-point version to use the same decay factor as the fixed-point version, giving:
^{5} * 15

// Floating point 1:16 multiplication. start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (uint8_t val = 0; val <= 100; val++) { v1 = val * 0.0625 + v1 * 0.9375; } } printf_P(PSTR("floating point 1:16 multiplication %.2f\n"), (micros() - start - ohead) / 100000.0); // Fixed-point multiplication & shift. start = micros(); for (uint16_t i = 1000; i != 0; i--) { for (uint8_t val = 0; val <= 100; val++) { v2 = (val << 1) + ((v2 * 15) >> 4); v3 = (v2 + 16) >> 5; } } printf_P(PSTR("fixed point 1:16 multiplication & shift %.2f\n"), (micros() - start - ohead) / 100000.0);

Note the scaling of `val`

by 2^{5} followed by division by 2^{4} is the same as shifting left by one bit, and the division of `v2`

by 16 is implemented by shifting it left 4 bits. OK, what timings do we get now?

floating point 1:16 multiplication 28.43 fixed point 1:16 multiplication & shift 4.99

OK, that looks **much** better, the fixed-point version is now 5.7x faster than the floating-point version, which is the sort of speed up we were looking for. But before we declare victory, is there any more juice to be squeezed out? Well, it turns out there is. Firstly, C always promotes integer operands in arithmetic operations to ints, which are 16 bits on the AVR, so our 16x8 bit calculation becomes a 16x16 bit one. Also, looking at the assembler output reveals that the bit shifts are all implemented as loops - the AVR can only shift a register one bit at a time. We can probably get some benefit by implementing our own 16x8 bit multiplication and unrolling the bit shift loops - time to pull out the AVR assembler documentation! Here's the code:

// C equivalent // v1 = (val << 1) + ((v1 * 15) >> 4); // v2 = (v1 + 16) >> 5; asm( "; v1 *= 15\n" " ldi r16, 15\n" " mov r17, %B[v1]\n" " mul %A[v1], r16\n" " movw %[v1], r0\n" " mulsu r17, r16\n" " add %B[v1], r0\n" "; v1 >>= 4\n" " lsr %B[v1]\n" " ror %A[v1]\n" " lsr %B[v1]\n" " ror %A[v1]\n" " lsr %B[v1]\n" " ror %A[v1]\n" " lsr %B[v1]\n" " ror %A[v1]\n" "; val <<= 1\n" " mov r0, %[val]\n" " lsl r0\n" "; v1 += val\n" " clr r1\n" " add %A[v1], r0\n" " adc %B[v1], r1\n" "; v2 = v1\n" " movw %[v2], %[v1]\n" "; v2 += 16\n" " ldi r16, 16\n" " add %A[v2], r16\n" " adc %B[v2], r1\n" "; v2 >>= 5\n" " lsr %B[v2]\n" " ror %A[v2]\n" " lsr %B[v2]\n" " ror %A[v2]\n" " lsr %B[v2]\n" " ror %A[v2]\n" " lsr %B[v2]\n" " ror %A[v2]\n" " lsr %B[v2]\n" " ror %A[v2]\n" : [v1] "+a" (v4), [v2] "=a" (v5) : [val] "r" (val) : "r16", "r17" )

The only slightly tricksy bit is implementing a 16x8 bit multiplication using the 8x8 bit hardware multiplier on the AVR. In principle it's no different to the way you were probably taught to do long multiplication in primary school, with some tweaks to take advantage of the fact that we know the result of the multiplication will always fit in 16 bits rather than 24. If you want more details on how this works I can recommend this blog post and the Atmel AVR201 application note "Using the AVR hardware Multiplier" for more information.

OK, what are the timings for the assembler version?floating point 1:16 multiplication 28.43 fixed point 1:16 multiplication & shift 4.99 fixed point assembler 3.09

So the assembler version is 1.6x faster than the fixed-point C version, and nearly 10x faster than the floating-point version. Finally, what conclusions can we draw from all this? Well, the following ones leap out to me:

- Division on the AVR is
**slow**, whether it be floating-point or integer. Avoid it wherever you can. - If you
**have**to use division, it may well be faster to use floating-point and multiply by the reciprocal of the numbers you were dividing by. - Fixed-point math is fairly easy to do and can yield significant performance benefits, as long as you avoid those pesky divisions. If you can't it may not offer much benefit over multiply-only floating-point.
- Hand-coded assembler will help if you need to squeeze out every last cycle, but in absolute terms the speed-up you'll get by using C fixed-point multiply-and-shift-only will probably be sufficient.