nRF24L01+ Communication Failure: Root Cause and “Solution”

    odritter
    wrote on last edited by
    #1

    Project Background

    First, some background information on the project. We had prototyped an Arduino and nRF24L01+ based sensor board using MySensors and Home Assistant. The main goal of this board was whole-home temperature, humidity, and light level monitoring. The prototype was built from standalone protoboards, breadboards, wires, etc. It worked great, so we decided to make a custom PCB that would have a socket for the Arduino Pro Mini, a socket for the nRF24L01+ boards, and all of the other circuitry (sensors, power supply, etc).

    Image 1: Sensor Board! (Resistor for scale)
    0_1559701732624_DSC_4767.jpg

    Problem

    After building up a number of the “New and Improved!” custom PCBs, we noticed that some of them just didn’t work very well. We could read the data from all the sensors via the Arduino, but the boards would fail in one of a few ways:

    • They did not present with the gateway; seemingly nothing happened.
    • They presented but had terrible “message loss”. The original symptom was intermittent updates from the sensor board. To quantify “message loss”, we wrote a debug program that sends a fixed number of messages from the sensor board, each with an ack request, and waits for each ack with a timeout. The final “message loss” is (number of messages sent) - (number of messages ack’d).
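
The loss count described above can be sketched as follows. This is a minimal, host-side illustration and not the debug program we actually ran: `measureMessageLoss` and the `sendAndWaitForAck` callback are hypothetical names, and the callback stands in for whatever sends one message with an ack request and reports whether the ack arrived before the timeout.

```cpp
#include <cstdint>
#include <functional>

// Tally for one loss-test run.
struct LossResult {
    uint32_t sent;
    uint32_t acked;
    // "Message loss" as defined in the post: sent minus ack'd.
    uint32_t lost() const { return sent - acked; }
};

// Sends `count` messages through `sendAndWaitForAck` (hypothetical callback:
// returns true if the ack arrived before the timeout) and counts the results.
inline LossResult measureMessageLoss(
    uint32_t count, const std::function<bool(uint32_t)>& sendAndWaitForAck) {
    LossResult result{0, 0};
    for (uint32_t seq = 0; seq < count; ++seq) {
        ++result.sent;
        if (sendAndWaitForAck(seq)) {
            ++result.acked;
        }
    }
    return result;
}
```

On real hardware the callback would wrap the radio send call and the ack wait; keeping it as a parameter lets the counting logic be checked without a radio attached.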

    Initial observations and debugging (in no particular order of desperation).

    • More bulk capacitance on the nRF24L01+ board. No luck. (BUT THE FORUMS PROMISED)
    • Changing power source (Batteries, Benchtop, etc) didn’t change behavior.
    • Changing the nRF24L01+ on our gateway (Raspberry Pi) didn’t change behavior.
    • A “problem” sensor board got “better” when we swapped out the nRF24L01+ board. (AHA! It must be the questionably authentic nRF24L01+ boards we ordered). So, we built a sensor board with a header socket so we could quickly screen/test all of our nRF24L01+ boards. This was ineffective and led to confusion (Wait, I thought you said this nRF24L01+ board worked?) as some nRF24L01+ boards worked in the screening unit, but then didn’t work in their final sensor board.
    • Changing distance between sensor boards and gateway didn’t change anything in a meaningful or coherent way.
    • Accidentally holding our thumb on the antenna changed the behavior. (Hey, wait, what just happened? It’s working! Wait.. no it isn’t.. What?)
    • Changing the data (symbol) rate of the nRF24L01+ board changed the behavior, but didn’t fix anything.
    • Changing the SPI speed also changed the behavior, but didn’t fix anything.
    • Changing the channel on the nRF24L01+ board helped at one person’s house but not at the other project members’ houses.
    • Removing the ground planes on the sensor board near the antenna (dremeled away) didn’t change the behavior.
    • Using the previously mentioned sensor board with a header socket, we observed that almost any nRF24L01+ board worked when cabled (~6 inches) rather than directly plugging into the sensor board. (BUT WHY?????)

    {A year, a kid, and job changes later}

    Eventually, we decided the project had languished long enough. Enough sleep had been lost! We were going to figure out this problem!

    How

    Throwing stuff at the wall to see what stuck wasn't working, so we decided we needed to get serious and capture the wireless data. We borrowed a software-defined radio (SDR) to capture the 2.4 GHz spectrum, with the intent of first demodulating the signal to see if that showed the smoking gun. (It did, but we didn't realize it at the time.)

    Image 2: Demodulated Signal (Hey, that looks like a digital signal!)
    0_1559701980778_Carrier Startup.bmp

    Now that we can see the data to train on, we can decode the nRF24L01+ packet to see what’s going on. Specifically, the goal was to see whether a packet was “good” or “bad”, and our criterion for a “good” packet was the Cyclic Redundancy Check (CRC) passing (see Image 3 below).

    Image 3: Decoded nRF24L01+ Packet with CRC Check.
    0_1559702168258_nRF24L01_Packets.png
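
For anyone wanting to repeat the good/bad classification, here is a rough sketch of the CRC check. It assumes the 2-byte CRC mode, which the nRF24L01+ datasheet specifies as polynomial x^16 + x^12 + x^5 + 1 with initial value 0xFFFF, calculated over the address, packet control field, and payload. The function names are made up for illustration, and this is not the decoder shown in Image 3.

```cpp
#include <cstdint>
#include <vector>

// Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF.
// Input is one bit (0 or 1) per element in on-air order, because the 9-bit
// packet control field means the protected span is not a whole number of bytes.
inline uint16_t crc16Bits(const std::vector<uint8_t>& bits) {
    uint16_t crc = 0xFFFF;
    for (uint8_t bit : bits) {
        const bool msb = (crc & 0x8000) != 0;
        crc = static_cast<uint16_t>(crc << 1);
        if (msb != (bit != 0)) {
            crc ^= 0x1021;
        }
    }
    return crc;
}

// Expands bytes into bits, most significant bit first.
inline std::vector<uint8_t> bytesToBits(const std::vector<uint8_t>& bytes) {
    std::vector<uint8_t> bits;
    bits.reserve(bytes.size() * 8);
    for (uint8_t b : bytes) {
        for (int i = 7; i >= 0; --i) {
            bits.push_back(static_cast<uint8_t>((b >> i) & 1u));
        }
    }
    return bits;
}

// A packet counts as "good" when the CRC computed over the received
// address + control field + payload bits matches the CRC that followed them.
inline bool packetIsGood(const std::vector<uint8_t>& protectedBits,
                         uint16_t receivedCrc) {
    return crc16Bits(protectedBits) == receivedCrc;
}
```

Working on individual bits is slower than a table-driven byte CRC, but it sidesteps the alignment problem caused by the 9-bit control field.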

    Full of excitement and curiosity, we wielded our newfound RF power to decode the MySensors payload as well (see Image 4 below). This didn’t prove to be very helpful for debug purposes, but it was interesting.

    Image 4: Decoded MySensors Payload
    0_1559702307036_MySensors_Message.png

    Now that we can see “passing” and “failing” packets, we know we are on the right track. However, we need to measure the transmission “quality” beyond just pass/fail. Since the transmission is just binary data (frequency modulated; the nRF24L01+ uses GFSK), we can slice the demodulated data into symbol-length segments and overlay them to assemble the data Eye pattern (Image 5 below).

    Image 5: Impressively “Bad” data Eye pattern
    0_1559702381443_Original EYE.bmp
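
The folding step behind an eye pattern can be sketched like this. It is an illustration under stated assumptions, not the tool that produced Image 5: it assumes the demodulated samples are already symbol-aligned with a whole number of samples per symbol, and `foldIntoEye` and `eyeOpeningAt` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Cuts a demodulated sample stream into overlapping traces, each two symbol
// periods long and starting one symbol later than the previous one.
// Overlaying the traces on one plot gives the eye pattern.
inline std::vector<std::vector<double>> foldIntoEye(
    const std::vector<double>& samples, std::size_t samplesPerSymbol) {
    std::vector<std::vector<double>> traces;
    if (samplesPerSymbol == 0) {
        return traces;
    }
    const std::size_t span = 2 * samplesPerSymbol;
    for (std::size_t start = 0; start + span <= samples.size();
         start += samplesPerSymbol) {
        traces.emplace_back(samples.begin() + start,
                            samples.begin() + start + span);
    }
    return traces;
}

// Vertical eye opening at one sample column: the gap between the lowest
// trace above the threshold and the highest trace below it. A small or
// negative value means the eye is closed at that column. If every trace is
// on the same side of the threshold, the result is +infinity.
inline double eyeOpeningAt(const std::vector<std::vector<double>>& traces,
                           std::size_t column, double threshold) {
    double lowestHigh = std::numeric_limits<double>::infinity();
    double highestLow = -std::numeric_limits<double>::infinity();
    for (const auto& trace : traces) {
        if (column >= trace.size()) {
            continue;
        }
        if (trace[column] >= threshold) {
            lowestHigh = std::min(lowestHigh, trace[column]);
        } else {
            highestLow = std::max(highestLow, trace[column]);
        }
    }
    return lowestHigh - highestLow;
}
```

Measuring the opening at the column in the middle of a symbol gives a single number to compare boards with, which is handier than eyeballing plots when screening many radios.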

    I think I found the problem… But what is causing the Eye to close? (Good data Eye patterns are wide open in the middle.) Also, sometimes I get really good-looking data Eyes! What is going on?!? Let’s have a closer look at a bad transmission. Images 6 and 7, shown below, are single transmissions taken from the multiple transmissions previously shown in Image 2.

    Image 6: A closer look at a single “bad” demodulated transmission (Wow, that’s bad!)
    0_1559702428757_Bad Continuous Transmission.bmp

    Image 7: A closer look at a single “good” demodulated transmission (That looks perfect! How is this the same board?)
    0_1559702454385_Good Continuous Transmission.bmp

    Comparing Images 6 and 7 is very interesting; we might be looking at the root cause here. The bad transmission looks like it has an “other” digital signal riding on top of it. What is going on during the nRF24L01+ transmission?

    We decided to grab a bus analyzer and record communication on the SPI bus during a transmission (see Image 8 below).

    Image 8: A Bus Analyzer capture during transmission (Wait a minute! That looks familiar!)
    0_1559702513350_DataBusCapture.jpg

    It seems strange that the SPI bus is constantly active during transmission. What is MySensors doing? Grab the shovel...

    Digging into the RF24.cpp file shows the RF24_getStatus() function being called continuously at line 326. Each call clocks a byte over the SPI interface (it sends 0xFF, the NOP command) to read the status register of the nRF24L01+ and see whether the transmission is complete.

    A snippet of code from GitHub (“MySensors\hal\transport\driver\RF24.cpp”) is shown below, starting at line number 321.

    Code Snippet 1

    // go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
    	RF24_ce(HIGH);
    	// timeout counter to detect HW issues
    	uint16_t timeout = 0xFFFF;
    	do {
    		RF24_status = RF24_getStatus();
    	} while  (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
    	// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
    	RF24_ce(LOW);
    	// reset interrupts
    	RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
    	// Max retries exceeded
    	if(RF24_status & _BV(RF24_MAX_RT)) {
    		// flush packet
    		RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n"));	// max retries, no ACK
    		RF24_flushTX();
    

    We have found our smoking gun!

    “Solution”

    But wait! Don’t we need to check the transmission status? The answer is: kind of, but not like this. Ideally, the nRF24L01+ signals the gateway/sensor when the transmission is finished, and only then is transmit success/failure determined by reading the transmission status. (Alternatively, you could just place a time delay after TX enable that waits for the maximum duration of your transmission plus retries. However, this is application dependent, and may not fully resolve the issue if you have many retries.)
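
For the fixed-delay alternative, the wait has to cover the worst case. The sketch below estimates that from the radio settings, using the packet layout in the nRF24L01+ datasheet (1 preamble byte, address, 9-bit packet control field, payload, CRC) and its 130 µs TX settling time. Treat it as a rough upper bound only: the struct and function names are made up, the settings are inputs rather than values read from MySensors, and ack turnaround is assumed to fit within the retransmit delay.

```cpp
#include <cstdint>

// Radio settings that determine how long a transmission can take.
struct RadioTiming {
    uint32_t dataRateBps;   // air data rate in bits per second, must be nonzero
    uint8_t addressBytes;   // 3..5
    uint8_t payloadBytes;   // 0..32
    uint8_t crcBytes;       // 0..2
    uint8_t retryCount;     // auto-retransmit count (ARC), 0..15
    uint32_t retryDelayUs;  // auto-retransmit delay (ARD) in microseconds
};

// Time on air of one packet in microseconds, rounded up.
inline uint32_t timeOnAirUs(const RadioTiming& t) {
    const uint32_t bits =
        8u * (1u + t.addressBytes + t.payloadBytes + t.crcBytes) + 9u;
    const uint64_t scaled = static_cast<uint64_t>(bits) * 1000000u;
    return static_cast<uint32_t>((scaled + t.dataRateBps - 1) / t.dataRateBps);
}

// Conservative worst case: every attempt pays settling time plus time on
// air, and every retry additionally waits the full retransmit delay.
inline uint32_t worstCaseTxUs(const RadioTiming& t) {
    const uint32_t settleUs = 130;  // TX settling time per the datasheet
    const uint32_t attempts = t.retryCount + 1u;
    return attempts * (settleUs + timeOnAirUs(t)) +
           t.retryCount * t.retryDelayUs;
}
```

As an example, 250 kbps with a 5-byte address, 32-byte payload, 2-byte CRC, 15 retries, and a 1500 µs retransmit delay comes out at roughly 46 ms, which shows why a blind delay gets expensive when retries are enabled.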

    One way to achieve this is to use the handy IRQ (Interrupt Request) line on the nRF24L01+. MySensors can make use of the IRQ line; however, it does so in a different way than we want. MySensors uses the IRQ to signal the message handler to pull received messages from the nRF24L01+ internal FIFO to prevent overflow. This can be important for high-traffic networks but sadly doesn’t help our situation.

    We decided to look at the datasheet (Image 9 below) for the nRF24L01+ for potential solutions and found a workaround.

    Image 9: Snippet from nRF24L01+ datasheet describing what the IRQ line can be attached to
    0_1559703016382_nRF24L01_Datasheet.png

    Great! TX_DS (TX Data Sent) and MAX_RT (Max TX retries reached) are the two flags we want to monitor to remedy our issue. It so happens that the IRQ line is already set up to respond to these flags by default! However, MySensors does not listen to the IRQ line during transmission (as shown in Code Snippet 1). So, let’s fix that!
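
To make the flag and mask relationship concrete, here is a small host-side sketch. The bit positions come from the nRF24L01+ datasheet register map (STATUS: RX_DR bit 6, TX_DS bit 5, MAX_RT bit 4; CONFIG has the matching mask bits in the same positions, where 0 means the flag is reflected on IRQ as an active-low interrupt). The constant and function names are invented for illustration and are not MySensors identifiers.

```cpp
#include <cstdint>

// Bit positions in the nRF24L01+ STATUS register.
constexpr uint8_t kRxDr = 6;   // RX data ready
constexpr uint8_t kTxDs = 5;   // TX data sent
constexpr uint8_t kMaxRt = 4;  // maximum number of TX retransmits reached

enum class TxOutcome { Pending, Sent, MaxRetries };

// Interprets a STATUS byte the same way the tail of the transmit routine
// does: MAX_RT means failure, TX_DS means success, neither means not done.
inline TxOutcome decodeTxOutcome(uint8_t status) {
    if (status & (1u << kMaxRt)) {
        return TxOutcome::MaxRetries;
    }
    if (status & (1u << kTxDs)) {
        return TxOutcome::Sent;
    }
    return TxOutcome::Pending;
}

// True when the IRQ pin would be pulled low: at least one of the three
// interrupt flags is set in STATUS and its mask bit in CONFIG is 0.
inline bool irqWouldBeAsserted(uint8_t status, uint8_t config) {
    const uint8_t flagBits = (1u << kRxDr) | (1u << kTxDs) | (1u << kMaxRt);
    const uint8_t flags = status & flagBits;
    const uint8_t masked = config & flagBits;
    return (flags & static_cast<uint8_t>(~masked)) != 0;
}
```

With the CONFIG reset value of 0x08, none of the three mask bits are set, which is why no extra radio configuration is needed before polling the pin.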

    Below you can see the code with our “fix”.

    Code Snippet 2

    // go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
    	RF24_ce(HIGH);
    	// timeout counter to detect HW issues
    	uint16_t timeout = 0xFFFF;
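    	do {
    		// Poll the IRQ pin (active low) instead of reading STATUS over SPI,
    		// so the SPI bus stays quiet while the radio is transmitting.
    		//RF24_status = RF24_getStatus();
    		RF24_status = hwDigitalRead(MY_RF24_IRQ_PIN);
    	} while  (RF24_status && timeout--);
    	//}while  (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
    	// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
    	// IRQ went low (or we timed out): read STATUS once to see which flag fired
    	RF24_status = RF24_getStatus();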
    	RF24_ce(LOW);
    	// reset interrupts
    	RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
    	// Max retries exceeded
    	if(RF24_status & _BV(RF24_MAX_RT)) {
    		// flush packet
    		RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n"));	// max retries, no ACK
    		RF24_flushTX();
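
The wait loop in Code Snippet 2 can also be written so that the caller can tell a timeout apart from a completed transmission. The version below is a host-testable sketch rather than a MySensors patch: `waitForIrqLow` and `WaitResult` are hypothetical names, and the pin read is passed in as a callback so the logic can be exercised without hardware.

```cpp
#include <cstdint>
#include <functional>

enum class WaitResult { IrqAsserted, TimedOut };

// Spins until readIrqPin() returns false (the IRQ line is active low) or
// the iteration budget is used up. Nothing inside the loop touches SPI.
inline WaitResult waitForIrqLow(const std::function<bool()>& readIrqPin,
                                uint16_t maxIterations = 0xFFFF) {
    for (uint16_t i = 0; i < maxIterations; ++i) {
        if (!readIrqPin()) {
            return WaitResult::IrqAsserted;
        }
    }
    return WaitResult::TimedOut;
}
```

On the sensor board the callback would wrap the digital read of the IRQ pin, and a single status read after the wait would still decide between TX_DS and MAX_RT, as in the snippet above.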
    

    With the fix shown above, we recaptured transmissions from the same hardware shown in Image 5. Look how nice that data Eye pattern is! (as shown in Image 10 below).

    Image 10: Data Eye pattern with transmission IRQ fix
    0_1559703369150_Improved EYE.bmp

    WOW! That is an unbelievable improvement! Thankfully, all those “bad” or “questionable” sensor boards now work like a charm! They all successfully complete presentation and have very low message loss. Sadly, our custom PCB did not pin out the IRQ line so we had to solder a wire from IRQ to an unused pin.

    Image 11: Workaround wire
    0_1559703466787_Sensor_Rework.png

    Also, out of the 20ish nRF24L01+ boards we have, all but one are the “clones” with the ShockBurst bit inversion. So, this problem may not be as pronounced on genuine parts or “better clones”. However, it is generally good practice to quiet or mute unnecessary digital communication during transmit/receive if possible.

    HALP

    We are hardware folk by trade, and things like “GitHub” or “software best practices” are not our forte. (For example, using a bus analyzer to sniff the SPI lines was much easier than digging into the code stack.) If somebody were willing, submitting this fix (or hopefully a better one!) would be great.

      mfalkvidd
      Mod
      wrote on last edited by mfalkvidd
      #2

      @odritter interesting analysis and fantastic writeup. I'll have to learn more about looking at data Eyes, seems like a very useful tool.

      The placement of the nRF's antenna on the board is very unusual. Most boards have the antenna away from surrounding conductive material (see https://www.openhardware.io/view/4/EasyNewbie-PCB-for-MySensors for example).

      How are the SPI lines placed on the board? Do they pass close to the antenna, or close to any other undriven line that passes close to the antenna?

        waspie
        wrote on last edited by
        #3

        This is great info; it could potentially help a lot of people on this forum.

          skywatch
          wrote on last edited by skywatch
          #4

          @odritter - Superb write up and sleuthing! - Thanks for sharing the results and findings.

          Do you believe that the interference is caused internally in the nRF24 chip due to SPI activity during TX mode, or possibly by power rail 'ringing' due to the SPI bus being used while TX is happening? I would guess the former, but at such low voltages and currents strange things can happen.... ;)

            Ben036
            wrote on last edited by
            #5

            @mfalkvidd Hello! I am one of the other "we" mentioned in this post. I don't have access to the layout files at this moment, but I checked the Gerber files and none of the SPI lines are routed under (or near) the antenna. They are also not routed near any other lines that route under the antenna.

            "The placement of the nrf's antenna on the board is very unusual." Bad

            Yeah..... In our original goal/planning for this sensor board, we were going for a tight footprint that would sit inside a 3D printed enclosure easily and didn't think about the antenna being parallel to the PCB underneath it. Once we started having issues, we used a dremel to remove all the copper we could from the board below the antenna, but that didn't change the behavior we saw.

            Technically, we did that dremel experiment on a different board (we made two different designs) that was showing the same issue. Below, you can see an image of that experiment.

            0_1559746475298_IMG_20170306_192556064.jpg

              odritter
              wrote on last edited by
              #6

              @skywatch I don't know for sure where the interference is happening. I do know it isn't due to power supply bounce (at least at the board level), as I placed a variety of low-ESR ceramic caps directly on the nRF board pins with zero change.

              My guess is that the interference is primarily internal to the chip, but I believe there is also a board-level interaction somewhere (based on observed changes during experimentation). I did experiment with slowing down the edges on the SPI lines coming from the Arduino board (series resistor with shunt cap at the Arduino pin), and it seemed to help a little bit, but nowhere near the improvement from complete SPI muting. I would have done the same thing on the nRF board itself for MISO, but the line pitch on that board is beyond my patience (and shaky hands). If SPI muting isn't a viable option for you for some reason, more experimentation might be needed to determine exactly where the problem is occurring.

                skywatch
                wrote on last edited by
                #7

                @odritter Thanks - I have already put your changes into my RF24.cpp and will try it out later on.

                1 Reply Last reply
                0
                • odritterO odritter

                  Project Background

                  First, some background information on the project. We had prototyped an arduino and nRF24L01+ based sensor board using mysensors and home assistant. The main goal of this board was whole home temperature, humidity, and light level monitoring. The prototype was built from stand alone protoboards, breadboards, wires, etc. It worked great, so we decided to make a custom PCB that would have a socket for the arduino pro mini, a socket for the nRF24L01+ boards, and all of the other circuitry (sensors, power supply, etc).

                  Image 1: Sensor Board! (Resistor for scale)
                  0_1559701732624_DSC_4767.jpg

                  Problem

                  After building up a number of the “New and Improved!” custom PCBs, we noticed that some of them just didn’t work very well. We could read the data from all the sensors via the arduino, but they would fail in one of a few ways:

                  • Not present with the gateway / Nothing seemingly happens.
                  • Present but have terrible “message loss”. The original behavior was intermittent updates from the sensors board. To determine our “message loss” we created a debug program that would send a finite number of messages from the sensor board with an ack request and wait for the ack with a time out. The final “message loss” was (number of messages sent ) - (number of messages ack’d).

                  Initial observations and debugging (in no particular order of desperation).

                  • More bulk capacitance on the nRF24L01+ board. No luck. (BUT THE FORUMS PROMISED)
                  • Changing power source (Batteries, Benchtop, etc) didn’t change behavior.
                  • Changing the nRF24L01+ on our gateway (Raspberry Pi) didn’t change behavior.
                  • A “problem” sensor board got “better” when we swapped out the nRF24L01+ board. (AHA! It must be the questionably authentic nRF24L01+ boards we ordered). So, we built a sensor board with a header socket so we could quickly screen/test all of our nRF24L01+ boards. This was ineffective and led to confusion (Wait, I thought you said this nRF24L01+ board worked?) as some nRF24L01+ boards worked in the screening unit, but then didn’t work in their final sensor board.
                  • Changing distance between sensor boards and gateway didn’t change anything in a meaningful or coherent way.
                  • Accidentally holding our thumb on the antenna changed the behavior. (Hey, wait, what just happened? It’s working! Wait.. no it isn't’.. What?)
                  • Changing the data (symbol) rate of the nRF24L01+ board changed the behavior, but didn’t fix anything.
                  • Changing the SPI speed also changed the behavior, but didn’t fix anything.
                  • Changing the channel on the nRF24L01+ board helped at one person’s house but not other project member’s houses.
                  • Manipulate ground planes on sensor board near antenna. (dremeled away).
                  • Using the previously mentioned sensor board with a header socket, we observed that almost any nRF24L01+ board worked when cabled (~6 inches) rather than directly plugging into the sensor board. (BUT WHY?????)

                  {A year, a kid, and job changes later}

                  Eventually, we decided the project had languished long enough. Enough sleep had been lost! We were going to figure out this problem!

                  How

                  Throwing stuff at the wall to see what stuck wasn't working, so we decided we needed to get serious and capture the wireless data. We borrowed a software defined radio (SDR) to capture the 2.4GHz spectrum with the intent of first demodulating the signal to see if that showed the smoking gun (It did but we didn't realize it at the time).

                  Image 2: Demodulated Signal (Hey, that looks like a digital signal!)
                  0_1559701980778_Carrier Startup.bmp

                  Now that we can see the data to train on, we can decode the nRF24L01+ packet to see what’s going on. Specifically, the goal was to see if a packet was “good” or “bad” and our criteria for a “Good” packet was the Cyclic Redundancy Check (CRC) passing. (See image below).

                  Image 3: Decoded NRF24L01+ Packet with CRC Check.
                  0_1559702168258_nRF24L01_Packets.png
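
For anyone wanting to repeat the pass/fail check on their own captures, here is a minimal sketch of a 2-byte CRC test. It assumes the polynomial and initial value given in the nRF24L01+ datasheet for the 2-byte CRC (x^16 + x^12 + x^5 + 1, initial value 0xFFFF) and that the CRC covers the address, packet control and payload bits. The function names are made up for illustration and are not from our decoder or from MySensors. It works bit by bit because the packet control field is 9 bits long, so the covered data is not a whole number of bytes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-16: polynomial 0x1021, initial value 0xFFFF, MSB first, no final XOR.
std::uint16_t crc16Bits(const std::vector<bool>& bits) {
    std::uint16_t crc = 0xFFFF;
    for (bool bit : bits) {
        const bool top = (crc & 0x8000u) != 0;
        crc = static_cast<std::uint16_t>(crc << 1);
        if (top != bit) {
            crc = static_cast<std::uint16_t>(crc ^ 0x1021u);
        }
    }
    return crc;
}

// Hypothetical helper: unpack bytes into bits, most significant bit first.
std::vector<bool> bytesToBits(const std::string& bytes) {
    std::vector<bool> bits;
    bits.reserve(bytes.size() * 8);
    for (unsigned char byte : bytes) {
        for (int i = 7; i >= 0; --i) {
            bits.push_back(((byte >> i) & 1u) != 0);
        }
    }
    return bits;
}

// A captured packet counts as "good" when the CRC computed over the covered
// bits matches the CRC field that followed them on air.
bool packetIsGood(const std::vector<bool>& coveredBits, std::uint16_t receivedCrc) {
    return crc16Bits(coveredBits) == receivedCrc;
}
```

The bit vector passed to packetIsGood would come from the demodulated capture, after the preamble has been stripped.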

                  Full of excitement and curiosity, we wielded our newfound RF power to decode the MySensors payload as well (see Image 4 below). This didn’t prove to be very helpful for debug purposes, but it was interesting.

                  Image 4: Decoded MySensors Payload
                  0_1559702307036_MySensors_Message.png

                  Now that we can see “passing” and “failing” packets, we know we are on the right track. However, we need to measure the transmission “quality” beyond just pass/fail. Since the transmission is just binary data (the nRF24L01+ uses GFSK, a form of frequency modulation), we can slice the demodulated signal into symbol periods and overlay them to assemble the data Eye pattern (Image 5 below).

                  Image 5: Impressively “Bad” data Eye pattern
                  0_1559702381443_Original EYE.bmp
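
Assembling an eye pattern from a demodulated capture can be sketched roughly as below. This is not the tool we used, just a minimal illustration that assumes the capture is already resampled to a whole number of samples per symbol and aligned to a symbol boundary. The names foldEye and eyeOpening are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Trace = std::vector<float>;

// Cut the sample stream into overlapping traces, each two symbol periods
// long and starting one symbol after the previous one. Plotting all traces
// on top of each other gives the eye pattern.
std::vector<Trace> foldEye(const std::vector<float>& samples, std::size_t samplesPerSymbol) {
    std::vector<Trace> traces;
    if (samplesPerSymbol == 0) {
        return traces;
    }
    const std::size_t span = 2 * samplesPerSymbol;
    for (std::size_t start = 0; start + span <= samples.size(); start += samplesPerSymbol) {
        traces.emplace_back(samples.begin() + static_cast<std::ptrdiff_t>(start),
                            samples.begin() + static_cast<std::ptrdiff_t>(start + span));
    }
    return traces;
}

// Vertical eye opening at one sample offset: the gap between the lowest
// "high" sample and the highest "low" sample across all traces.
// Returns 0 if only one level is present at that offset.
float eyeOpening(const std::vector<Trace>& traces, std::size_t offset, float threshold) {
    bool sawHigh = false;
    bool sawLow = false;
    float lowestHigh = 0.0f;
    float highestLow = 0.0f;
    for (const Trace& trace : traces) {
        if (offset >= trace.size()) {
            continue;
        }
        const float value = trace[offset];
        if (value > threshold) {
            lowestHigh = sawHigh ? std::min(lowestHigh, value) : value;
            sawHigh = true;
        } else {
            highestLow = sawLow ? std::max(highestLow, value) : value;
            sawLow = true;
        }
    }
    return (sawHigh && sawLow) ? (lowestHigh - highestLow) : 0.0f;
}
```

A wide-open eye gives an opening close to the full swing between the two levels at mid-symbol. Interference riding on the signal pushes samples toward the threshold and shrinks that number.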

                  I think I found the problem… But what is causing the Eye to close (good data Eye patterns are wide open in the middle)? Also, sometimes I get really good looking data Eyes! What is going on?!? Let’s have a closer look at a bad transmission. Images 6 and 7, shown below, are single transmissions taken from the multiple transmissions, previously shown in image 2.

                  Image 6: A closer look at a single “bad” demodulated transmission (Wow, that’s bad!)
                  0_1559702428757_Bad Continuous Transmission.bmp

                  Image 7: A closer look at a single “good” demodulated transmission (That looks perfect! How is this the same board?)
                  0_1559702454385_Good Continuous Transmission.bmp

                  Comparing Images 6 and 7 is very interesting; we might be looking at the root cause here. The bad transmission looks like it has an “other” digital signal riding on top of it. What is going on during the nRF24L01+ transmission?

                  We decided to grab a bus analyzer and record communication on the SPI bus during a transmission (see Image 8 below).

                  Image 8: A Bus Analyzer capture during transmission (Wait a minute! That looks familiar!)
                  0_1559702513350_DataBusCapture.jpg

                  It seems strange that the SPI bus is constantly active during transmission. What is MySensors doing? Grab the shovel...

                  Digging into the RF24.cpp file shows the RF24_getStatus() function called continuously at line 326. This pumps the SPI interface (sends 0xFF) to read the status register of the nRF24L01+ and see if the transmission is complete.

                  A snippet of code from Github (“MySensors\hal\transport\driver\RF24.cpp”) shown below starting at line number 321

                  Code Snippet 1

                  // go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
                  	RF24_ce(HIGH);
                  	// timeout counter to detect HW issues
                  	uint16_t timeout = 0xFFFF;
                  	do {
                  		RF24_status = RF24_getStatus();
                  	} while  (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
                  	// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
                  	RF24_ce(LOW);
                  	// reset interrupts
                  	RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
                  	// Max retries exceeded
                  	if(RF24_status & _BV(RF24_MAX_RT)) {
                  		// flush packet
                  		RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n"));	// max retries, no ACK
                  		RF24_flushTX();
                  

                  We have found our smoking gun!

                  “Solution”

                  But wait! Don’t we need to check the transmission status? The answer is: kind of, but not like this. Ideally, the nRF24L01+ signals the gateway/sensor when transmission is finished, and transmit success/failure can then be determined by checking the transmission status once. (Alternatively, you could just place a time delay after TX Enable that waits the max duration of your transmission and retries. However, this is application dependent and may not fully resolve the issue if you have many retries.)

                  A way that this could be achieved is through the use of the handy IRQ (Interrupt Request) line on the nRF24L01+. MySensors can make use of the IRQ line; however, it does so in a different way than we want. MySensors uses the IRQ to signal the message handler to pull received messages from the nRF24L01+ internal FIFO to prevent overflow. This can be important for high traffic networks but sadly doesn’t help our situation.

                  We decided to look at the datasheet (Image 9 below) for the nRF24L01+ for potential solutions and found a workaround.

                  Image 9: Snippet from nRF24L01+ datasheet describing what the IRQ line can be attached to
                  0_1559703016382_nRF24L01_Datasheet.png

                  Great! TX_DS (TX Data Sent) and MAX_RT (Max TX retries reached) are the two flags we want to monitor to remedy our issue. It so happens that the IRQ line is already set up to respond to these flags by default! However, MySensors does not listen to the IRQ line during transmission (as shown in Code Snippet 1). So, let’s fix that!
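
To make the "by default" part concrete, here is a small sketch of the relevant bits. It assumes the bit positions and the CONFIG reset value (0x08) from the nRF24L01+ datasheet register map. The constant and function names are only for illustration, and whether a given driver leaves the mask bits at their reset values is something to verify on your own setup.

```cpp
#include <cstdint>

// CONFIG register (0x00): a mask bit set to 1 keeps that event OFF the IRQ pin.
constexpr std::uint8_t MASK_RX_DR  = 1u << 6;
constexpr std::uint8_t MASK_TX_DS  = 1u << 5;
constexpr std::uint8_t MASK_MAX_RT = 1u << 4;
constexpr std::uint8_t CONFIG_RESET_VALUE = 0x08;  // only EN_CRC set, no masks

// STATUS register (0x07): event flags.
constexpr std::uint8_t STATUS_TX_DS  = 1u << 5;
constexpr std::uint8_t STATUS_MAX_RT = 1u << 4;

// True when both "TX done" events (TX_DS and MAX_RT) pull the IRQ pin low.
constexpr bool txDoneDrivesIrq(std::uint8_t config) {
    return (config & (MASK_TX_DS | MASK_MAX_RT)) == 0;
}

// True when STATUS reports that the transmission has ended, either
// successfully (TX_DS) or after running out of retries (MAX_RT).
constexpr bool txFinished(std::uint8_t status) {
    return (status & (STATUS_TX_DS | STATUS_MAX_RT)) != 0;
}
```

With the reset value, txDoneDrivesIrq(CONFIG_RESET_VALUE) is true, which matches what we saw: the IRQ pin dropped at the end of a transmission without any extra configuration.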

                  Below you can see the code with our “fix”.

                  Code Snippet 2

                  // go, TX starts after ~10us, CE high also enables PA+LNA on supported HW
                  	RF24_ce(HIGH);
                  	// timeout counter to detect HW issues
                  	uint16_t timeout = 0xFFFF;
                  	do {
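                  		//RF24_status = RF24_getStatus();
                  		// IRQ is active low: the pin reads HIGH until TX_DS or MAX_RT is asserted,
                  		// so keep waiting (without touching the SPI bus) while it is still HIGH
                  		RF24_status = hwDigitalRead(MY_RF24_IRQ_PIN);
                  	} while  (RF24_status && timeout--);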
                  	//}while  (!(RF24_status & ( _BV(RF24_MAX_RT) | _BV(RF24_TX_DS) )) && timeout--);
                  	// timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
                  	RF24_status = RF24_getStatus();
                  	RF24_ce(LOW);
                  	// reset interrupts
                  	RF24_setStatus(_BV(RF24_TX_DS) | _BV(RF24_MAX_RT) );
                  	// Max retries exceeded
                  	if(RF24_status & _BV(RF24_MAX_RT)) {
                  		// flush packet
                  		RF24_DEBUG(PSTR("!RF24:TXM:MAX_RT\n"));	// max retries, no ACK
                  		RF24_flushTX();
                  

                  With the fix shown above, we recaptured transmissions from the same hardware shown in Image 5. Look how nice that data Eye pattern is! (as shown in Image 10 below).

                  Image 10: Data Eye pattern with transmission IRQ fix
                  0_1559703369150_Improved EYE.bmp

                  WOW! That is an unbelievable improvement! Thankfully, all those “bad” or “questionable” sensor boards now work like a charm! They all successfully complete presentation and have very low message loss. Sadly, our custom PCB did not pin out the IRQ line so we had to solder a wire from IRQ to an unused pin.

                  Image 11: Workaround wire
                  0_1559703466787_Sensor_Rework.png

                  Also, out of the 20ish nRF24L01+ boards we have, all but one are the “clones” with the ShockBurst bit inversion. So, this problem may not be as pronounced on genuine boards or “better clones”. However, it is generally good practice to quiet or mute unnecessary digital communication during transmit/receive if possible.

                  HALP

                  We are hardware folk by trade, and things like “GitHub” or “software best practices” are not our forte. (For example, using a bus analyzer to sniff the SPI lines was much easier than digging into the code stack.) If somebody were willing to submit this fix (or hopefully a better one!), that would be great.

                  Yveaux
                  Mod
                  wrote on last edited by
                  #8

                  @odritter What puzzles me is that the interference in image 6 apparently isn't always present (see image 7), although the behavior of the MySensors stack is always identical (see image 8 and the matching code snippet 1).
                  Do you see a pattern when it is present and when not?

                  Also, switching to using the nRF24 interrupt line will break MySensors for a lot of (existing) boards that don't have the IRQ line connected.
                  So, if we decide to add a silent period to the stack, we also need a non-IRQ implementation based on e.g. a delay.

                  http://yveaux.blogspot.nl

                    mfalkvidd
                    Mod
                    wrote on last edited by
                    #9

                    Yes, breaking the sketches for 90% or more of MySensors users (since non-IRQ is the default setup) would be very bad. But we should be able to use the IRQ in case the user has defined MY_RF24_IRQ_PIN (just like we already do for RX).
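
A rough sketch of that idea: wait on the IRQ pin when it is wired, and fall back to status polling when it is not. This is not MySensors code. The RadioPins struct and waitForTxDone are hypothetical stand-ins so the logic can be tried on a PC. In the real driver the two callbacks would correspond to hwDigitalRead(MY_RF24_IRQ_PIN) and RF24_getStatus(), and the irqWired flag to whether MY_RF24_IRQ_PIN is defined.

```cpp
#include <cstdint>
#include <functional>

constexpr std::uint8_t STATUS_TX_DS  = 1u << 5;  // data sent
constexpr std::uint8_t STATUS_MAX_RT = 1u << 4;  // max retries reached

// Hypothetical hardware hooks, injected so the logic runs without a radio.
struct RadioPins {
    std::function<bool()> irqIsHigh;            // reads the IRQ pin (active low)
    std::function<std::uint8_t()> readStatus;   // one SPI transfer per call
};

// Returns the STATUS value seen when the wait ended (or when it timed out).
std::uint8_t waitForTxDone(const RadioPins& radio, bool irqWired,
                           std::uint16_t timeout = 0xFFFF) {
    if (irqWired) {
        // Quiet path: no SPI traffic while the packet is on air.
        while (radio.irqIsHigh() && timeout--) {
        }
        return radio.readStatus();  // a single SPI read once TX has ended
    }
    // Fallback for boards without the IRQ line: poll STATUS as before.
    std::uint8_t status;
    do {
        status = radio.readStatus();
    } while (!(status & (STATUS_TX_DS | STATUS_MAX_RT)) && timeout--);
    return status;
}
```

Boards without the IRQ wire keep today's behaviour, so existing sketches would not break, while boards with it get a silent SPI bus during transmission.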

                      tekka
                      Admin
                      wrote on last edited by
                      #10

                      Here is a modified RF24 stack with (among other little changes) a waiting period and no IRQ line (as @Yveaux suggested) for testing: https://github.com/tekka007/MySensors/tree/OptimizedRF24polling

                      @odritter

                      Using the previously mentioned sensor board with a header socket, we observed that almost any nRF24L01+ board worked when cabled (~6 inches) rather than directly plugging into the sensor board. (BUT WHY?????)

                      Is this setup referring to image 7?

                        Homer
                        wrote on last edited by
                        #11

                        Great info, and very well written!!! Thanks for sharing.

                          odritter
                          wrote on last edited by
                          #12

                          @Yveaux To be clear, I realize my code snippet "fix" would break anyone who doesn't already have the IRQ hooked up. This is where we are asking for a contributor more experienced with generalizing to the MySensors library and looking for #defines or such to play nice.

                          As for what is different between image 6 and image 7, I admit our description in that section gets a little hand-wavy for brevity. Image 7 is an auto-ACK from the nRF24L01+ ShockBurst, whereas Image 6 is an outgoing message. You can see evidence of this in image 3, where outgoing messages (like image 6) have a payload (containing MySensors info) and auto-ACKs have no payload (shown as a greyed-out array).

                          Once I saw that the hardware transmission can be pristine it started me thinking what was so different between a normal message and an auto-ACK. SPI communication.

                          A word of caution about reading too much into the early debug steps mentioned (just before image 2). We decided to include these to give background information on what kinds of troubleshooting steps we tried. Some were tried out of sleep-deprived desperation, and what appeared to "fix" the problem at the time may have only changed the conditions under which things failed. We were not in "rigorous testing mode" at this point in time. We were blindly stabbing in the dark hoping something would work. Additionally, these early debug steps were done without any knowledge of other spectral content in the environment. Once we started using the SDR, we observed the spectrum first and moved the nRF channel to a clear and free band to ensure our measurements were only of the nRF boards.

                            odritter
                            wrote on last edited by
                            #13

                            @tekka I peeked at the delay you added to the RF24.cpp file. It looks like your calculation is not considering auto-ACK, unless I am missing something. I have given some thought to what I think needs to be considered to completely avoid all possible transmissions. See below.

                            optimal delay with ShockBurst = TX state transition delay + (nRF packet length in bits / DataRate + auto-ACK timeout) * (ACKretries + 1)
                            optimal delay without SB = TX state transition delay + nRF packet length in bits / DataRate

                            The problem I ran into doing a static delay is that the delay time can get quite large if considering ACK retries, making the system very slow. Reducing the amount of polling should help with interference even if all SPI communication isn't avoided during every possible transmission. I would definitely prefer the IRQ line to be used if defined, and if not, then a delay (either with or without retries in mind).
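
To put numbers on this, here is a small sketch of the estimate. Time on air is the packet length in bits divided by the data rate. The packet is assumed to be 1 preamble byte, the address, a 9-bit packet control field, the payload and the CRC, per the nRF24L01+ datasheet. The struct and function names are made up for illustration, and the values in the note below are example settings, not numbers read from the MySensors configuration.

```cpp
#include <cstdint>

// Illustrative radio settings (hypothetical struct, not a MySensors type).
struct RadioTiming {
    std::uint32_t dataRateBps;   // on-air data rate in bits per second
    std::uint8_t  addressBytes;  // 3 to 5
    std::uint8_t  payloadBytes;  // 0 to 32
    std::uint8_t  crcBytes;      // 1 or 2
    std::uint32_t settleUs;      // TX state transition delay
    std::uint32_t ackTimeoutUs;  // auto-ACK timeout (auto retransmit delay)
    std::uint8_t  retries;       // auto retransmit count
};

// Time on air for one packet in microseconds, rounded up.
inline std::uint32_t airTimeUs(const RadioTiming& t) {
    const std::uint64_t bits =
        8ull * (1u + t.addressBytes + t.payloadBytes + t.crcBytes) + 9u;
    return static_cast<std::uint32_t>(
        (bits * 1000000ull + t.dataRateBps - 1) / t.dataRateBps);
}

// Delay covering a single transmission without auto-ACK.
inline std::uint32_t delayWithoutAckUs(const RadioTiming& t) {
    return t.settleUs + airTimeUs(t);
}

// Delay covering the first transmission plus every possible retry.
inline std::uint32_t delayWithAckUs(const RadioTiming& t) {
    return t.settleUs + (airTimeUs(t) + t.ackTimeoutUs) * (t.retries + 1u);
}
```

As an example, with 250 kbps, a 5-byte address, a 32-byte payload, a 2-byte CRC, 130 us settling, a 1500 us ACK timeout and 15 retries, one packet takes 1316 us on air and the all-retries delay comes to 45186 us. That is roughly 45 ms per message, which is why a static worst-case delay makes things slow.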

                              Yveaux
                              Mod
                              wrote on last edited by
                              #14

                              @odritter said in nRF24L01+ Communication Failure: Root Cause and “Solution”:

                              it started me thinking what was so different between a normal message and an auto-ACK. SPI communication.

                              And the all-important fact that the ACK message on air comes from the receiving node, instead of the sending one.
                              My gut feeling tells me you are masking a hardware (design) flaw with software. If it helps in your case, good for you, but I'd like to understand it completely before absorbing it in the stack.

                                skywatch
                                wrote on last edited by skywatch
                                #15

                                @tekka I have been testing this (2.3.2b) on a node for the last 9.5 hours and so far no problems.

                                I will probably flash the GW with this today and see how that goes, not expecting any issues though! ;)

                                One question though..... Does this 'fix' also apply to when the nrf is in Rx mode as well? If there is no spi activity during Rx then I guess it's a moot point. But I'd be interested in the answer.

                                    odritter
                                    wrote on last edited by
                                    #17

                                    @yveaux I agree this fix may be covering up a problem in our design. As @mfalkvidd mentioned in his post we did put the antenna in an unusual location (not ideal). So it is entirely possible our problems are caused by the antenna placement, nRF clones, or something else entirely. However, we believe we are not alone in the issues we encountered or any design flaw we made.

                                    That being said, constantly polling during TX transmission should be avoided as it is unnecessary. MySensors can calculate how long a transmission should take and hold off for at least the first (or all possible) transmission(s). Implementing a delay should be minimal risk and can only benefit. Implementing the IRQ properly is more involved and higher risk (though I think worth it).

                                      odritter
                                      wrote on last edited by
                                      #18

                                      @skywatch This fix only applies to the TX for three reasons.

                                      1. Our method for measuring transmission quality can only observe TX (not RX) so we would need a different way to assess RX
                                      2. TX timing is well known from start to finish so muting the communication while it is happening is relatively straightforward. RX is a different story
                                      3. MySensors already implements IRQ for RX and likely already gains whatever benefit there is to be had limiting communication during RX (though I didn't look into this extensively to confirm)

                                      I am open to ideas on how to assess the RX side if anyone has any suggestions. However for us, implementing the IRQ fix for TX made our boards go from barely working to working like a champ!

                                        skywatch
                                        wrote on last edited by
                                        #19

                                        @odritter Thanks for the clarification! - I learn something new (again)... :)

                                        My question was based on the fact that if enough energy is present in the spi signal to imprint on the Tx output, then it would be even worse for the Rx side as the levels of signal received would be much lower than anything transmitted.

                                          Yveaux
                                          Mod
                                          wrote on last edited by
                                          #20

                                          @odritter said in nRF24L01+ Communication Failure: Root Cause and “Solution”:

                                          constantly polling during TX transmission should be avoided as it is unnecessary

                                          There is no mention in the nRF24L01+ datasheet of (potential) issues caused by SPI transfers during TX, so what is your statement based on?
