Driving the WS2811 at 800KHz with a 16MHz AVR

Recently some of the folks at the Manchester Hackspace did a bulk order of WS2811 LED pixels. These are available from several vendors on aliexpress.com (search for "WS2811 5050 RGB") and they combine a 5050 RGB LED with a WS2811 driver chip. The WS2811 provides 24-bit RGB colour plus constant-current output, so no external components are required, although the datasheet does suggest adding some impedance-matching resistors and decoupling capacitors. The WS2811 can run at a data rate of either 400KHz or 800KHz, although the 800KHz ones seem more common. Having got hold of some, I set about getting them working using a Minimus 32 board, which has an Atmel ATmega32U2 running at 16MHz.

Note also that this part, the WS2811, is commonly confused with one with a very similar name, the WS2801, but they are radically different beasts. The WS2801 has a SPI interface which means you need to provide both a clock and a data signal. That in turn means you can send data at a wide range of speeds (the WS2801 datasheet says up to 25MHz) and still have everything work fine as the clock line signals the WS2801 when to sample the data line. Plus most MCUs have hardware SPI which makes driving the WS2801 pretty much a doddle.

The WS2811 on the other hand uses a rather unusual control scheme. It uses a single combined clock and data line. You reset the chain by keeping the input low for around 50usec (less will usually work as well), then start sending 24-bit RGB sequences in a continuous stream. The first LED in the chain displays the first RGB value to be sent and passes the rest along the chain, the second displays the second value and so on. The datasheet gives the required timings, and there are a couple of writeups here and here. Mostly these are talking about the WS2811 in low speed (400KHz) mode, whereas the ones I have are 800KHz. The way the WS2811 protocol works is that there is a low-to-high transition at the beginning of each bit cell, then a high-to-low transition at a variable point within the cell, depending on whether the bit value is 0 or 1. For a logical zero the transition is near the beginning of the cell, for a logical one it is later on in the cell. The exact timings seem to vary depending on which source you believe (800KHz mode):
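
The pass-along behaviour described above can be modelled in a few lines of C. This is only a host-side sketch of the protocol, not driver code, and the names `rgb_t`, `pixel_consume` and `chain_update` are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* One 24-bit colour value as it travels down the chain. */
typedef struct { uint8_t r, g, b; } rgb_t;

/*
 * Model of a single pixel: latch the first value seen and forward the
 * remainder. Returns a pointer to the forwarded stream and stores its
 * length in *out_len.
 */
const rgb_t *pixel_consume(const rgb_t *in, size_t in_len,
                           rgb_t *latched, size_t *out_len)
{
    if (in_len == 0) {          /* nothing arrived, keep the old colour */
        *out_len = 0;
        return in;
    }
    *latched = in[0];
    *out_len = in_len - 1;
    return in + 1;
}

/* Feed a stream into a chain of pixels; returns how many were updated. */
size_t chain_update(const rgb_t *stream, size_t len,
                    rgb_t *pixels, size_t num_pixels)
{
    size_t updated = 0;
    for (size_t i = 0; i < num_pixels && len > 0; i++) {
        stream = pixel_consume(stream, len, &pixels[i], &len);
        updated++;
    }
    return updated;
}
```

Sending three values into a chain of two pixels updates both, and the third value simply falls off the end of the chain.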

Source                      cell timing   logical 0 high   logical 0 low   logical 1 high   logical 1 low
WS2811 datasheet            1.25 usec     250 nsec         1000 nsec       600 nsec         650 nsec
aliexpress.com              1.25 usec     350 nsec         800 nsec        700 nsec         600 nsec
doityourselfchristmas.com   1.25 usec     250 nsec         1000 nsec       1000 nsec        250 nsec
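
A quick sanity check on each row is whether the high and low times add up to the 1.25 usec bit cell. The sketch below does that for the three sources; the struct and function names are my own, and the numbers are copied from the table:

```c
#include <stdint.h>

/* High/low durations in nanoseconds for one source's 800KHz timings. */
typedef struct {
    uint16_t t0h, t0l;  /* logical 0: high time, low time */
    uint16_t t1h, t1l;  /* logical 1: high time, low time */
} ws2811_timing_t;

#define CELL_NS 1250u   /* 1.25 usec bit cell at 800KHz */

/* Values copied from the table above. */
static const ws2811_timing_t DATASHEET  = { 250, 1000,  600, 650 };
static const ws2811_timing_t ALIEXPRESS = { 350,  800,  700, 600 };
static const ws2811_timing_t DIYC       = { 250, 1000, 1000, 250 };

/* Returns 1 if both the 0 and the 1 waveform fill the bit cell exactly. */
int timing_fills_cell(const ws2811_timing_t *t)
{
    return (t->t0h + t->t0l == CELL_NS) && (t->t1h + t->t1l == CELL_NS);
}
```

With these figures the datasheet and doityourselfchristmas.com rows pass, while the aliexpress.com row gives 1150 nsec for a 0 and 1300 nsec for a 1.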

The timings seem to be all over the place, in particular the aliexpress ones don't even add up to the bit cell length of 1.25usec! The doityourselfchristmas.com explanation of how the WS2811 works made sense, so I used the timings from there and put together a simple test using the delay_x.h that's floating around the net. That worked OK for a single pixel but if I tried slow fades or driving more than one pixel I got a lot of jittering.

Hmm, OK, let's look at the timings again. I'm using a 16MHz AVR, so each clock cycle is 62.5 nsec long. The short pulses in the WS2811 protocol are 250 nsec long and each bit cell is 1.25 usec long. Wow, that's only 4 clock cycles for the short pulses, and only 20 cycles for each bit cell, and the allowed +/- timing variation is 75 nsec, which is just over 1 clock cycle. That means that driving these with a simple C routine is unlikely to be sufficient.

I spent a bit of time looking to see if there was any sort of hardware assist that could be brought to bear, but even SPI at 4MHz, close to the maximum that the MCU can support, wouldn't be fast enough as it would still be necessary to marshal each byte into a series of 5-bit patterns to get the timings right for the WS2811 protocol. And anything interrupt-driven is also out, as it takes 5 clocks just to dispatch an interrupt and we only have 20 cycles to play with. That only leaves bit-banging, which I generally try to avoid, but because of the relatively high speed of the WS2811 we could update 100 pixels every 10 msec using around 33% of the available CPU, which is perfectly acceptable.

There's another oddity as well: although the WS2811 takes the 8-bit colour values in RGB order, the pixels have been wired up so the order is GRB, which makes life a little more complicated as the bytes need reordering on output.
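
The cycle budget is easy to check with a few lines of C. This is just the arithmetic from the paragraphs above, run on a PC; the macro and function names are hypothetical, and the reset gap between frames is ignored:

```c
#include <stdint.h>

#define F_CPU_HZ   16000000UL    /* 16MHz AVR clock */
#define NS_PER_SEC 1000000000UL

/* Convert a duration in nanoseconds to whole CPU cycles (rounds down). */
uint32_t ns_to_cycles(uint32_t ns)
{
    return (uint32_t)(((uint64_t)ns * F_CPU_HZ) / NS_PER_SEC);
}

/* Time to clock out `pixels` 24-bit values, in microseconds. */
uint32_t frame_time_us(uint32_t pixels)
{
    return pixels * 24u * 1250u / 1000u;   /* 24 bits of 1250 nsec each */
}

/* CPU share in percent when a frame is sent every `period_us` usec. */
uint32_t cpu_load_percent(uint32_t pixels, uint32_t period_us)
{
    return frame_time_us(pixels) * 100u / period_us;
}
```

Here `ns_to_cycles(250)` is 4 and `ns_to_cycles(1250)` is 20, and 100 pixels take 3000 usec of raw bit time, so a 10 msec refresh costs about 30% of the CPU before any per-frame overhead, in the same ballpark as the figure quoted above.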

OK, so the only realistic option looks like some hand-crafted assembler. Although this post on arduino.cc suggests it is not possible to meet the timing constraints, I thought it was possible, if not particularly simple, and indeed that's the case. Anyway, to cut to the chase, I've put a copy of the resulting code on SourceForge, and there's a demo of it in use there as well. Some notes about the implementation:

  • The basic algorithm is to have an outer loop that iterates over the array of RGB values we've been passed and an inner loop that iterates over each 8-bit R, G or B value, setting the output pin as necessary. This is made somewhat more complicated than it should be because the WS2811 pixels I have are wired in (GRB) order rather than (RGB) order.
  • This code requires instantiating for each port/pin combination it is used on. The reason is that dereferencing a port pointer and assigning a value to it takes 4 cycles, which is too long to be usable here bearing in mind we only have 4 instructions to toggle the pin low/high/low or high/low/high as appropriate to produce the short 250nsec pulse that's required.
  • If we want to keep the timings accurate it is necessary to run with interrupts disabled.
  • Conditional branch instructions on the AVR take a different number of clock cycles depending on whether they are true or false. It's therefore necessary to insert additional instructions to equalise the time taken by the true and false paths. That means a bit test and pin set takes 8 cycles, once code to equalise the timings is added. That's nearly half of the 20 cycles we have available per bit.
  • We only need to do the inner 8-bit loop bit-test-and-set-pin once per bit, to see if it is a 0 bit. If it is, we set the output pin low at 250nsec into the bit cell. For a 1 bit we don't need to test at all, we just need to unconditionally set the output pin low 1000nsec into the bit cell. That's because if we are outputting a 0 bit the output pin will already have been set low at 250nsec and the additional set to low at 1000nsec will have no effect. On the other hand, if we are outputting a 1 bit we'll correctly change the pin from high to low at 1000nsec.
  • We can't leave the outer loop testing, to see if we've reached the end of the array of RGB values, until after we've output each 24-bit RGB value. If we did we'd introduce jitter between one set of 24 bits and the next. We therefore have to interleave the necessary outer loop housekeeping with the inner 8-bit loops that do the actual bit output.
  • We only have at most 6 cycles free per bit once all the inner loop testing, pin setting and loop handling is accounted for. We've already established that it takes around 8 cycles to do a conditional bit-test-and-pin-set and perform the necessary adjustments to keep the timings the same, and the bare minimum to do a test that takes the same time down the true and false paths is 4 cycles. We need to do the interleaved outer loop handling only on the last iteration of the inner loops so that we don't end up doing it multiple times per RGB value, but it's going to take a minimum of 4 cycles just to do the necessary test, and we only have 6 cycles available.
  • To solve that problem we partially unroll the R and B loops. We loop over the R and B bit values 7 times and output the 8th bit with an unrolled version of the loop. That means there's no need to explicitly test if we are on the last iteration of the R or B loop as we just 'fall through' from the 7th iteration of the loop. That saves us sufficient cycles to be able to interleave the outer loop handling with the handling of the 8th bit of the R and B values.
  • Setting an output pin doesn't change any of the flags in the status register, so it is possible to perform a test then set an output pin and then perform a conditional jump using the result of the prior test.
  • Conditional jumps can only be made -64/+63 words relative to the current program counter. If we need to jump further it needs a combination of a local conditional branch and a long-range jump.
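
To make the "test once, then clear unconditionally" idea from the notes above concrete, here is a host-side model in C that produces one sample per CPU cycle. It is not the assembler from the download and it makes no attempt to be cycle-exact on a real AVR; it only shows the shape of the waveform. The function names are invented, and sending the most significant bit first is an assumption on my part:

```c
#include <stdint.h>

#define CYCLES_PER_BIT 20   /* 1.25 usec at 16MHz */
#define T_SHORT_CYCLES  4   /* 250 nsec: falling edge of a 0 bit */
#define T_LONG_CYCLES  16   /* 1000 nsec: falling edge of a 1 bit */

/*
 * Write one sample per CPU cycle for `value` into `out`, which must hold
 * 8 * CYCLES_PER_BIT entries. Assumption: most significant bit first.
 */
void sim_send_byte(uint8_t value, uint8_t *out)
{
    for (int bit = 7; bit >= 0; bit--) {
        uint8_t pin = 0;
        for (int cycle = 0; cycle < CYCLES_PER_BIT; cycle++) {
            if (cycle == 0) {
                pin = 1;    /* every cell starts with a rising edge */
            }
            if (cycle == T_SHORT_CYCLES && !(value & (1u << bit))) {
                pin = 0;    /* the only data-dependent step */
            }
            if (cycle == T_LONG_CYCLES) {
                pin = 0;    /* unconditional: a no-op for a 0 bit */
            }
            *out++ = pin;
        }
    }
}

/* Send one pixel in GRB wire order; `out` holds 3 * 8 * 20 samples. */
void sim_send_pixel(uint8_t r, uint8_t g, uint8_t b, uint8_t *out)
{
    sim_send_byte(g, out);
    sim_send_byte(r, out + 1 * 8 * CYCLES_PER_BIT);
    sim_send_byte(b, out + 2 * 8 * CYCLES_PER_BIT);
}

/* Count the high samples in one bit cell. */
int high_cycles(const uint8_t *cell)
{
    int n = 0;
    for (int i = 0; i < CYCLES_PER_BIT; i++) {
        n += cell[i];
    }
    return n;
}
```

In this model a 0 bit is high for 4 of its 20 cycles and a 1 bit for 16, and the green byte lands in the first 160 samples even though the caller passes red first.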

To validate the timings I hooked up the Minimus to a scope and verified that the timings were as expected, and they are as per the table above. In particular, the overall period per 8 bits is exactly 10 usec, with no jitter between one 24-bit RGB value and the next.

Figure: 1 bit, low
Figure: 1 bit, high

In addition, I also looked at the output of the pixel, which is passed down to the rest of the chain. That revealed that there is a delay of approximately 200 nsec per pixel, and that the signal is reshaped before being passed to the next pixel in the chain. The timings are not the same as those specified in the datasheet, which suggests to me that the datasheet timings are most likely just an average of the output timings of a sample of chips rather than being a characterisation of the operational input range of the chips.

Figure: Input versus output of a pixel, high and low cells

The output timings are as follows:

logical 0 high   logical 0 low   logical 1 high   logical 1 low
338 nsec         912 nsec        680 nsec         570 nsec

That leads me to suspect that the most important thing when driving the WS2811 is not the exact intra-cell timings for low and high bits, but getting the bit rate as close to the specified 800 KHz as possible and avoiding jitter between each block of 24 bits. The code I've linked to above does exactly that, so although I've only tested it on a short string it should be fine for driving much longer ones as well.

Figure: Two pixels, daisy-chained

And finally, here is the obligatory YouTube video clip - enjoy ;-) The pixels are so bright I had to put 4 layers of paper in front of them to stop the camera overloading. The shot of the scope shows the input to the first pixel at the top, in red. The yellow trace below is the output of the first pixel and that's fed in to the input of the second pixel, and so on. As you can see, the bottom trace is 1/3 shorter than the top trace as this chain has 3 pixels in it. The overall pulse train is 90usec, each pixel taking 30usec to refresh. That comes out as a bit rate of 800KHz, as per the datasheet.

Update

This post got mentioned on Hackaday, after which I've had a lot of feedback, both on Hackaday and here. Some of it has been good, some has been, well, let's just call it ill-informed.

Please don't bother telling me that you can do this with an Xmega, a PIC, an ARM or whatever. All that proves is you've entirely missed the point of this post.

Some of the comments have been along the lines of "Why don't you use hardware SPI, it works for me". Firstly, the WS2811 is not a SPI device but if you do have it working, please leave me a note saying what SPI settings you used, because I've not found an obvious way of getting the right 800KHz data rate and the right mark/space ratio that the WS2811 requires, or of avoiding jitter as the ATmega SPI hardware is not double-buffered. Note in particular if you are using the FastSPI library you are not using hardware SPI. When driving the WS2811, TM1809 or TM1804, FastSPI uses bit-banging, as does this code.

Another suggestion is to use the USART in synchronous mode and set to 5 bits per byte. The problem is that the ATmega sends start and stop bits even in synchronous mode, and the signal polarity is wrong as well. There's an inconclusive discussion on Hackaday about this option, but I don't think it's practical. And, as I note above, because of the data rates required, even if you use hardware you are still going to spend most of the available cycles managing it.

I've also been told that I've wasted my time because the FastSPI library can already drive these chips. FastSPI is a fine library, but if you search around you'll find people who have had trouble getting it to work with the WS2811 (including in the comments to this post). I've done an investigation into FastSPI and the possible causes of the problems people have getting it to work and I have the following comments to make:

  • FastSPI depends on the Arduino environment and libraries, which I don't use. It's therefore no use to me. My code has no dependencies on the Arduino environment.
  • The WS2811 datasheet specifies that the allowed variation is +/-75 nsec. That's just over 1 clock cycle (62.5 nsec @ 16MHz), so to stay within spec the timings have to be accurate to +/-1 cycle. For '0' bits the FastSPI library is spot-on at 1250 nsec but for '1' bits it is 1 cycle over at 1312.5 nsec. That's just within spec.
  • The FastSPI code sends 3 blocks of 8 bits for each RGB value. The 8th bit of each block is significantly out of spec, 1625 nsec for '0' bits and 1687.5 nsec for '1' bits compared to the spec value of 1250 nsec.
  • Between each RGB block (24 bits) there's an even bigger out of spec gap of 2062.5 nsec which is 65% longer than it should be.
  • The overall effect is that, worst-case, the pulse train that FastSPI outputs can be up to 10% longer than it should be. With some batches of chips you may get away with this but with others you may not, which is most likely why for some people FastSPI works OK and for others it doesn't - it's down to luck. And as I noted above, individual bit jitter is probably more problematic than a pulse chain that is slightly too fast or too slow.
  • Finally, a simple test program that uses FastSPI to set 3 LEDs to a fixed value is 11048 bytes long. The equivalent program using my code is 450 bytes - about 25 times smaller. On the board I'm using, FastSPI would use up 1/3 of the available program memory and that's more than I can afford. The reason for this difference is simple, FastSPI supports multiple LED driver chips and even allows you to select them at run-time whereas mine is just intended to drive the WS2811. That's a classic flexibility/space design tradeoff that in my case doesn't work out in FastSPI's favour, your mileage may of course vary.
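
The "up to 10%" figure can be reproduced from the per-bit numbers in the list above, under one assumption of mine about how they combine: each 24-bit value consists of 21 ordinary cells, 2 stretched 8th-bit cells at the end of the first two bytes, and one final cell that lasts the full 2062.5 nsec. The sketch works in 62.5 nsec CPU cycles and all names in it are invented:

```c
#include <stdint.h>

/* Cell lengths in CPU cycles of 62.5 nsec each, from the list above. */
#define NOMINAL_CELL 20u   /* 1250 nsec, the spec value          */
#define ZERO_CELL    20u   /* 1250 nsec, ordinary '0' bit        */
#define ONE_CELL     21u   /* 1312.5 nsec, ordinary '1' bit      */
#define ZERO_LAST    26u   /* 1625 nsec, 8th bit of a block, '0' */
#define ONE_LAST     27u   /* 1687.5 nsec, 8th bit of a block, '1' */
#define VALUE_GAP    33u   /* 2062.5 nsec between RGB values     */

/* Cycles for one 24-bit value of all ones or all zeros (assumed layout). */
uint32_t fastspi_cycles_per_value(int all_ones)
{
    uint32_t plain = all_ones ? ONE_CELL : ZERO_CELL;
    uint32_t last  = all_ones ? ONE_LAST : ZERO_LAST;
    return 21u * plain + 2u * last + VALUE_GAP;
}

/* Cycles for one 24-bit value at exactly 800KHz. */
uint32_t nominal_cycles_per_value(void)
{
    return 24u * NOMINAL_CELL;
}

/* How much longer `actual` is than `nominal`, in tenths of a percent. */
uint32_t stretch_permille(uint32_t actual, uint32_t nominal)
{
    return (actual - nominal) * 1000u / nominal;
}
```

Under that reading an all-ones value takes 528 cycles against a nominal 480, which is exactly 10% longer, and an all-zeros value takes 505 cycles, a little over 5% longer.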

Finally, here's a scope trace showing 2 RGB values being output by FastSPI (top trace) and WS2811.h (bottom trace). You can see the jitter between the 8-bit blocks and the 24-bit RGB blocks on the FastSPI trace.

Figure: FastSPI (top) versus WS2811.h (bottom)
