Project Background
First, some background information on the project. We had prototyped an Arduino and nRF24L01+ based sensor board using MySensors and Home Assistant. The main goal of this board was whole-home temperature, humidity, and light level monitoring. The prototype was built from standalone protoboards, breadboards, wires, etc. It worked great, so we decided to make a custom PCB that would have a socket for the Arduino Pro Mini, a socket for the nRF24L01+ boards, and all of the other circuitry (sensors, power supply, etc.).
Image 1: Sensor Board! (Resistor for scale)
Problem
After building up a number of the “New and Improved!” custom PCBs, we noticed that some of them just didn’t work very well. We could read the data from all the sensors via the Arduino, but the boards would fail in one of a few ways:
- Fail to present to the gateway / nothing seemingly happens.
- Present, but with terrible “message loss”. The original symptom was intermittent updates from the sensor board. To quantify the “message loss”, we created a debug program that sends a fixed number of messages from the sensor board, each with an ack request, and waits for each ack with a timeout. The final “message loss” was (number of messages sent) - (number of messages ack’d).
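The message-loss measurement described above can be sketched roughly as follows. This is a minimal illustration, not the actual debug program: `measureMessageLoss`, `LossResult`, and the `sendAndWaitForAck` callback are hypothetical names, and in a real sketch the callback would wrap whatever send-with-ack call and timeout handling you use.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Counters for one test run.
struct LossResult {
    uint32_t sent;   // messages transmitted with an ack request
    uint32_t acked;  // acks that arrived before the timeout
    uint32_t lost() const { return sent - acked; }
};

// Hypothetical helper: sends `count` messages through the supplied callback.
// The callback must send one message with an ack request and return true
// only if the matching ack arrived within `timeoutMs`.
LossResult measureMessageLoss(
    uint32_t count, uint32_t timeoutMs,
    const std::function<bool(uint32_t seq, uint32_t timeoutMs)>& sendAndWaitForAck) {
    assert(static_cast<bool>(sendAndWaitForAck));
    LossResult result{0, 0};
    for (uint32_t seq = 0; seq < count; ++seq) {
        ++result.sent;
        if (sendAndWaitForAck(seq, timeoutMs)) {
            ++result.acked;
        }
    }
    return result;
}
```

Keeping the radio call behind a callback means the counting logic can be checked on a PC with a fake sender before it goes anywhere near a sensor board.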
Initial observations and debugging (in no particular order of desperation).
- More bulk capacitance on the nRF24L01+ board. No luck. (BUT THE FORUMS PROMISED)
- Changing power source (Batteries, Benchtop, etc) didn’t change behavior.
- Changing the nRF24L01+ on our gateway (Raspberry Pi) didn’t change behavior.
- A “problem” sensor board got “better” when we swapped out the nRF24L01+ board. (AHA! It must be the questionably authentic nRF24L01+ boards we ordered). So, we built a sensor board with a header socket so we could quickly screen/test all of our nRF24L01+ boards. This was ineffective and led to confusion (Wait, I thought you said this nRF24L01+ board worked?) as some nRF24L01+ boards worked in the screening unit, but then didn’t work in their final sensor board.
- Changing distance between sensor boards and gateway didn’t change anything in a meaningful or coherent way.
- Accidentally holding our thumb on the antenna changed the behavior. (Hey, wait, what just happened? It’s working! Wait... no it isn’t... What?)
- Changing the data (symbol) rate of the nRF24L01+ board changed the behavior, but didn’t fix anything.
- Changing the SPI speed also changed the behavior, but didn’t fix anything.
- Changing the channel on the nRF24L01+ board helped at one person’s house but not at other project members’ houses.
- Manipulating the ground planes on the sensor board near the antenna (we dremeled them away).
- Using the previously mentioned sensor board with a header socket, we observed that almost any nRF24L01+ board worked when cabled (~6 inches) rather than directly plugging into the sensor board. (BUT WHY?????)
{A year, a kid, and job changes later}
Eventually, we decided the project had languished long enough. Enough sleep had been lost! We were going to figure out this problem!
How
Throwing stuff at the wall to see what stuck wasn’t working, so we decided we needed to get serious and capture the wireless data. We borrowed a software defined radio (SDR) to capture the 2.4 GHz spectrum, with the intent of first demodulating the signal to see if that showed the smoking gun. (It did, but we didn’t realize it at the time.)
Image 2: Demodulated Signal (Hey, that looks like a digital signal!)
Now that we have demodulated data to work with, we can decode the nRF24L01+ packet to see what’s going on. Specifically, the goal was to see if a packet was “good” or “bad”, and our criterion for a “good” packet was the Cyclic Redundancy Check (CRC) passing (see Image 3 below).
Image 3: Decoded nRF24L01+ Packet with CRC Check.
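For anyone repeating the pass/fail check, here is a hedged sketch of a bit-serial CRC. It assumes the 2-byte CRC mode, which the nRF24L01+ datasheet gives as polynomial x^16 + x^12 + x^5 + 1 with initial value 0xFFFF, computed over the address, the 9-bit packet control field, and the payload. Because that control field is 9 bits wide, the frame is not byte aligned, so the sketch works on individual bits. The helper names (`crc16Bits`, `toBits`, `crcPasses`) are made up for illustration, and the caller is assumed to have already stripped the preamble and trimmed the frame to its correct length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CRC-16, polynomial 0x1021, initial value 0xFFFF, fed one bit at a time
// (each element of `bits` must be 0 or 1, in on-air order).
uint16_t crc16Bits(const std::vector<uint8_t>& bits) {
    uint16_t crc = 0xFFFF;
    for (uint8_t bit : bits) {
        assert(bit <= 1);
        const bool feedback = ((crc >> 15) & 1u) != bit;
        crc = static_cast<uint16_t>(crc << 1);
        if (feedback) {
            crc ^= 0x1021;
        }
    }
    return crc;
}

// Expand bytes into individual bits, most significant bit first.
std::vector<uint8_t> toBits(const std::vector<uint8_t>& bytes) {
    std::vector<uint8_t> bits;
    bits.reserve(bytes.size() * 8);
    for (uint8_t byte : bytes) {
        for (int i = 7; i >= 0; --i) {
            bits.push_back(static_cast<uint8_t>((byte >> i) & 1u));
        }
    }
    return bits;
}

// True if the last 16 bits of the frame equal the CRC of everything before them.
bool crcPasses(const std::vector<uint8_t>& frameBits) {
    if (frameBits.size() <= 16) {
        return false;
    }
    const std::size_t bodyLen = frameBits.size() - 16;
    const std::vector<uint8_t> body(
        frameBits.begin(), frameBits.begin() + static_cast<std::ptrdiff_t>(bodyLen));
    uint16_t received = 0;
    for (std::size_t i = bodyLen; i < frameBits.size(); ++i) {
        received = static_cast<uint16_t>((received << 1) | (frameBits[i] & 1u));
    }
    return crc16Bits(body) == received;
}
```

A single flipped bit anywhere in the frame makes `crcPasses` return false, which is what makes the CRC a convenient “good”/“bad” label for captured packets.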
Full of excitement and curiosity, we wielded our newfound RF power to decode the MySensors payload as well (see Image 4 below). This didn’t prove to be very helpful for debug purposes, but it was interesting.
Image 4: Decoded MySensors Payload
Now that we can see “passing” and “failing” packets, we know we are on the right track. However, we need to measure the transmission “quality” beyond just pass/fail. Since the transmission is just binary data (using Frequency Modulation) we can parse the data to assemble the data Eye pattern (Image 5 below).
Image 5: Impressively “Bad” data Eye pattern
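Building an eye pattern from demodulated samples can be sketched as below. This is a simplified illustration rather than the processing used for the images: it assumes the samples are already symbol aligned, that there is a whole number of samples per symbol, and that the demodulated signal is centered on zero. A real capture would need clock recovery and resampling first. `foldIntoEye` and `eyeOpeningAt` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Trace = std::vector<float>;

// Cut the demodulated signal into overlapping two-symbol windows.
// Overlaying all windows on one plot gives the eye pattern.
std::vector<Trace> foldIntoEye(const std::vector<float>& samples,
                               std::size_t samplesPerSymbol) {
    assert(samplesPerSymbol > 0);
    const std::size_t window = 2 * samplesPerSymbol;
    std::vector<Trace> traces;
    for (std::size_t start = 0; start + window <= samples.size();
         start += samplesPerSymbol) {
        const auto first = samples.begin() + static_cast<std::ptrdiff_t>(start);
        traces.emplace_back(first, first + static_cast<std::ptrdiff_t>(window));
    }
    return traces;
}

// Vertical eye opening at one sample position inside the window:
// the gap between the lowest "high" sample and the highest "low" sample.
// Returns infinity if only one level was seen at that position.
float eyeOpeningAt(const std::vector<Trace>& traces, std::size_t index) {
    float lowestHigh = std::numeric_limits<float>::infinity();
    float highestLow = -std::numeric_limits<float>::infinity();
    for (const Trace& trace : traces) {
        assert(index < trace.size());
        const float value = trace[index];
        if (value > 0.0f) {
            lowestHigh = std::min(lowestHigh, value);
        } else {
            highestLow = std::max(highestLow, value);
        }
    }
    return lowestHigh - highestLow;
}
```

Evaluating `eyeOpeningAt` at the middle of a symbol gives a single number for how “open” the eye is: a clean two-level signal gives the full level spacing, while an interfering signal riding on top shrinks it toward zero.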
I think I found the problem… But what is causing the Eye to close (good data Eye patterns are wide open in the middle)? Also, sometimes I get really good looking data Eyes! What is going on?!? Let’s have a closer look at a bad transmission. Images 6 and 7, shown below, are single transmissions taken from the multiple transmissions, previously shown in image 2.
Image 6: A closer look at a single “bad” demodulated transmission (Wow, that’s bad!)
Image 7: A closer look at a single “good” demodulated transmission (That looks perfect! How is this the same board?)
Comparing Images 6 and 7 is very interesting; we might be looking at the root cause here. The bad transmission looks like it has an “other” digital signal riding on top of it. What is going on during the nRF24L01+ transmission?
We decided to grab a bus analyzer and record communication on the SPI bus during a transmission (see Image 8 below).
Image 8: A Bus Analyzer capture during transmission (Wait a minute! That looks familiar!)
It seems strange that the SPI bus is constantly active during transmission. What is MySensors doing? Grab the shovel...
Digging into the RF24.cpp file shows the RF24_getStatus() function called continuously at line 326. This pumps the SPI interface (sends 0xFF) to read the status register of the nRF24L01+ and see if the transmission is complete.
A snippet of code from GitHub (“MySensors\hal\transport\driver\RF24.cpp”), starting at line 321, is shown below.
Code Snippet 1
// go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
RF24_ce(HIGH);
// timeout counter to detect HW issues
uint16_t timeout = 0xFFFF;
do {
    RF24_status = RF24_getStatus();
} while (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
RF24_ce(LOW);
// reset interrupts
RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
// Max retries exceeded
if(RF24_status & _BV(RF24_MAX_RT)) {
    // flush packet
    RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n")); // max retries, no ACK
    RF24_flushTX();
We have found our smoking gun!
“Solution”
But wait! Don’t we need to check the transmission status? The answer is: kind of, but not like this. Ideally, the nRF24L01+ signals the gateway/sensor when the transmission is finished, and transmit success/failure can then be determined by checking the transmission status. (Alternatively, you could just place a time delay after TX enable that waits out the maximum duration of your transmission and its retries. However, this is application dependent and may not fully resolve the issue if you have many retries.)
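To get a feel for how long such a fixed delay would have to be, here is a rough back-of-the-envelope sketch. It assumes the Enhanced ShockBurst packet layout from the datasheet (1-byte preamble, 3 to 5 address bytes, 9-bit packet control field, payload, 1 or 2 CRC bytes), a 130 us standby-to-TX settling time per attempt, and that every attempt is followed by the full auto-retransmit delay. It ignores SPI upload time and other overhead, so treat the result as an estimate, not a guarantee. `RadioConfig` and the function names are hypothetical, and the example values are illustrative rather than read from any particular configuration.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative radio settings; check your own configuration for real values.
struct RadioConfig {
    uint32_t dataRateBps;   // 250000, 1000000 or 2000000
    uint8_t addressBytes;   // 3 to 5
    uint8_t payloadBytes;   // 0 to 32
    uint8_t crcBytes;       // 1 or 2
    uint32_t retryDelayUs;  // auto-retransmit delay (ARD)
    uint8_t retryCount;     // auto-retransmit count (ARC), 0 to 15
};

// Assumed standby-to-TX settling time per attempt, in microseconds.
const uint32_t kSettleUs = 130;

// Bits on the air: preamble + address + 9-bit control field + payload + CRC.
uint32_t packetBits(const RadioConfig& c) {
    return 8u + 8u * c.addressBytes + 9u + 8u * c.payloadBytes + 8u * c.crcBytes;
}

// Time on air for one packet, rounded up to a whole microsecond.
uint32_t timeOnAirUs(const RadioConfig& c) {
    assert(c.dataRateBps > 0);
    const uint64_t scaled = static_cast<uint64_t>(packetBits(c)) * 1000000u;
    return static_cast<uint32_t>((scaled + c.dataRateBps - 1) / c.dataRateBps);
}

// Rough worst case: the first attempt plus every retry, each followed by the
// full retransmit delay.
uint32_t worstCaseTxUs(const RadioConfig& c) {
    const uint32_t attempts = static_cast<uint32_t>(c.retryCount) + 1u;
    return attempts * (kSettleUs + timeOnAirUs(c) + c.retryDelayUs);
}
```

For example, with `RadioConfig{250000, 5, 32, 2, 1500, 15}` the packet is 329 bits, takes 1316 us on air, and the estimated worst case is 47136 us, or about 47 ms of blind waiting per message. That is why a fixed delay is a blunt tool compared to waiting on a completion signal.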
One way to get that completion signal is the handy IRQ (Interrupt Request) line on the nRF24L01+. MySensors can make use of the IRQ line; however, it does so in a different way than we want. MySensors uses the IRQ to signal the message handler to pull received messages from the nRF24L01+ internal FIFO to prevent overflow. This can be important for high-traffic networks but sadly doesn’t help our situation.
We decided to look at the datasheet (Image 9 below) for the nRF24L01+ for potential solutions and found a workaround.
Image 9: Snippet from nRF24L01+ datasheet describing what the IRQ line can be attached to
Great! TX_DS (TX Data Sent) and MAX_RT (Max TX retries reached) are the two flags we want to monitor to remedy our issue. It so happens that the IRQ line is already set up to respond to these flags by default! However, MySensors does not listen to the IRQ line during transmission (as shown in Code Snippet 1). So, let’s fix that!
Below you can see the code with our “fix”. The IRQ pin is active low, so the loop now waits while the pin reads high, and the status register is read over SPI only once, after the wait ends.
Code Snippet 2
// go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
RF24_ce(HIGH);
// timeout counter to detect HW issues
uint16_t timeout = 0xFFFF;
do {
    //RF24_status = RF24_getStatus();
    RF24_status = hwDigitalRead(MY_RF24_IRQ_PIN);
} while (RF24_status && timeout--);
//}while (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
RF24_status = RF24_getStatus();
RF24_ce(LOW);
// reset interrupts
RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
// Max retries exceeded
if(RF24_status & _BV(RF24_MAX_RT)) {
    // flush packet
    RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n")); // max retries, no ACK
    RF24_flushTX();
With the fix shown above, we recaptured transmissions from the same hardware shown in Image 5. Look how nice that data Eye pattern is! (as shown in Image 10 below).
Image 10: Data Eye pattern with transmission IRQ fix
WOW! That is an unbelievable improvement! Thankfully, all those “bad” or “questionable” sensor boards now work like a charm! They all successfully complete presentation and have very low message loss. Sadly, our custom PCB did not pin out the IRQ line so we had to solder a wire from IRQ to an unused pin.
Image 11: Workaround wire
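To see why waiting on the IRQ pin keeps the SPI bus quiet, the two wait loops can be compared on a desk without any radio at all. The following is a rough simulation, not MySensors code: `FakeRadio`, `waitByPolling`, and `waitOnIrq` are hypothetical names, and the fake radio simply “finishes” after a fixed number of loop passes. The STATUS bit positions (TX_DS is bit 5, MAX_RT is bit 4) follow the nRF24L01+ datasheet.

```cpp
#include <cassert>
#include <cstdint>

const uint8_t kTxDs = 1u << 5;   // TX_DS flag in the STATUS register
const uint8_t kMaxRt = 1u << 4;  // MAX_RT flag in the STATUS register

// Stand-in for the radio: the transmission completes after a fixed number
// of loop passes, then the chosen flag is set and IRQ goes low.
class FakeRadio {
public:
    explicit FakeRadio(uint32_t passesUntilDone, uint8_t finalFlag = kTxDs)
        : remaining_(passesUntilDone), flag_(finalFlag), spiReads_(0) {
        assert(finalFlag == kTxDs || finalFlag == kMaxRt);
    }
    void tick() { if (remaining_ > 0) { --remaining_; } }  // one loop pass elapses
    bool irqPinHigh() const { return remaining_ > 0; }     // IRQ is active low
    uint8_t readStatusOverSpi() {                          // counts bus traffic
        ++spiReads_;
        return remaining_ > 0 ? 0 : flag_;
    }
    uint32_t spiReads() const { return spiReads_; }

private:
    uint32_t remaining_;
    uint8_t flag_;
    uint32_t spiReads_;
};

// Same shape as Code Snippet 1: read STATUS over SPI on every pass.
uint8_t waitByPolling(FakeRadio& radio) {
    uint16_t timeout = 0xFFFF;
    uint8_t status;
    do {
        status = radio.readStatusOverSpi();
        radio.tick();
    } while (!(status & (kTxDs | kMaxRt)) && timeout--);
    return status;
}

// Same shape as Code Snippet 2: watch the IRQ pin, then read STATUS once.
uint8_t waitOnIrq(FakeRadio& radio) {
    uint16_t timeout = 0xFFFF;
    while (radio.irqPinHigh() && timeout--) {
        radio.tick();
    }
    return radio.readStatusOverSpi();
}
```

With a fake transmission that takes 36 loop passes, `waitByPolling` performs 37 SPI reads while the radio is “transmitting”, whereas `waitOnIrq` performs exactly one, after the transmission is over. If the IRQ never drops, `waitOnIrq` still returns once the timeout counter runs out, and the status it reads back has neither flag set.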
Also, of our 20ish nRF24L01+ boards, all but one are “clones” with the ShockBurst bit inversion. So, this problem may not be as pronounced on genuine modules or “better clones”. However, it is generally good practice to quiet or mute unnecessary digital communication during transmit/receive if possible.
HALP
We are hardware folk by trade, and things like “GitHub” or “software best practices” are not our forte. (For example, using a bus analyzer to sniff the SPI lines was much easier than digging into the code stack.) If somebody is willing, submitting this fix (or hopefully a better one!) would be great.