[RFC] Improve package delivery for RF24 modules



  • Hi all, @Necromant has created an interesting PR #1477 in the MySensors repo with a request for comments. It is about packet transport reliability for nRF24L01+ modules and how to improve it.

    I have a setup similar to the one in the PR, with a gateway, a repeater and about 20 sensor nodes. To make things harder, there are two ping-pong nodes that constantly send each other maximum-length telegrams carrying a life-sign value.

    I have known for a while that the main reason for lost telegrams in my setup is a collision between two nodes that each try to send a telegram to the other. Both then repeat the telegram 15 times, and neither succeeds. Each node would have to be in receive mode for the other to get through, but instead of listening it keeps trying to send its own telegram - a classic deadlock.

    In the worst case other nodes then also try to transmit, and sometimes (every few days) this causes a traffic jam on the airwaves with a longer failure of the gateway. The error light then flashes constantly and nothing works anymore.

    I see two different RF24 problems which may be resolvable by the PR:

    1. Occasional traffic jam on the airwaves
    2. Packets lost if two nodes try to send almost simultaneously to each other

    So, your comments are requested - Thank you



  • @virtualmkr An interesting read about a problem that affects some people.

    My first reaction was similar to @mfalkvid's: setting a random time to wait before each resend might help.

    The solution I have used is to have multiple gateways to reduce traffic and the chance of collisions. It only needs a pro mini and an NRF, so not a huge outlay for a simple solution. But I agree that if it can be improved internally then it would be a good thing.

    The NRFs can listen on 6 'pipes' simultaneously, so maybe a second 'pipe' could be used as a 'network engaged' signal?

    I guess we also need to consider that MySensors is not just nRF but aims to be radio agnostic, as well as any impact on power usage for battery nodes.



  • Okay, reporting in to the discussion thread from the github PR.

    My setup is a bit different from @virtualmkr, since I don't have any N2N traffic in my network, all logic is handled by the controller (HomeAssistant). Summing up the details, here's what I've learned so far that affects packet loss on my setup.

    1. Time to switch from receiving to sending. With a repeater (e.g. 328p @ 12 MHz) buffering RX packets before sending, this takes a while. Suggested solutions: max out the MY_RF24_SET_ARD register value and implement exponential backoff when sending. (How to actually implement it in the MySensors code is a big question.)

    2. Interference from WiFi/Zigbee (including sidelobes!). This is pretty easy to solve if you don't have too many neighbors. My DIY usb<-->nRF24L01 dongles can sweep all available channels looking for a carrier, which works like a poor man's 2.4 GHz spectrum analyzer. After looking at the picture, I moved my WiFi to channels 6 & 11 and used the lower part of the spectrum for MySensors & Zigbee, leaving enough space in between for the sidelobes to fit in.

    3. Switching several actuators in one node very fast. I have LED dimmers that have 4 channels. If I switch them all at once in home assistant - some packets are lost. My guess is that once 3 packets are inside the RX FIFO of the nRF24 subsequent packets are lost. And since the gateway is an STM32 it can send those waaay faster. Maxing out MY_RF24_SET_ARD register value may help, I still have to try it.
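For point 2, the channel planning can be sketched in a few lines. This is only a rough helper, not something from the PR: nRF24L01+ channel n sits at 2400 + n MHz and a 2.4 GHz WiFi channel c is centred at 2407 + 5 * c MHz. The function names and the 11 MHz half width (nominal 22 MHz WiFi channel, sidelobes not included) are my own assumptions.

```cpp
#include <vector>

// nRF24L01+ RF channel n is centred at (2400 + n) MHz, n = 0..125.
constexpr int nrfChannelMHz(int channel) { return 2400 + channel; }

// 2.4 GHz WiFi channel c (1..13) is centred at (2407 + 5 * c) MHz.
constexpr int wifiCentreMHz(int wifiChannel) { return 2407 + 5 * wifiChannel; }

// True if the nRF24 channel lies within +/- halfWidthMHz of the WiFi centre.
// halfWidthMHz is an assumption: 11 MHz is the nominal half width of a
// 22 MHz WiFi channel; use a larger value to leave room for sidelobes.
bool overlapsWifi(int nrfChannel, int wifiChannel, int halfWidthMHz = 11)
{
    const int distance = nrfChannelMHz(nrfChannel) - wifiCentreMHz(wifiChannel);
    return distance >= -halfWidthMHz && distance <= halfWidthMHz;
}

// Lists all nRF24 channels that stay clear of every WiFi channel in use.
std::vector<int> clearChannels(const std::vector<int> &wifiChannels, int halfWidthMHz = 11)
{
    std::vector<int> result;
    for (int ch = 0; ch <= 125; ++ch) {
        bool clear = true;
        for (int wifi : wifiChannels) {
            if (overlapsWifi(ch, wifi, halfWidthMHz)) {
                clear = false;
                break;
            }
        }
        if (clear) {
            result.push_back(ch);
        }
    }
    return result;
}
```

With WiFi on channels 6 and 11, `clearChannels({6, 11})` leaves nRF24 channels 0 to 25 free at the low end, which matches using the lower part of the spectrum. A real sweep with the dongle is still the better guide, since neighbours and sidelobes are not modelled here.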

    Hardware issues:

    1. Crappy antennas/design on modules, suffering from 'magic finger' problem. I've written about these in my blog. Can be fixed with some manual labor: https://ncrmnt.org/2021/01/11/nrf24l01-manually-calibrating-the-antenna-with-mysensors-and-homeassistant/

    2. Weather. Wet snow kills 2.4 GHz signals if you're operating outside.

    3. Antenna direction. Especially if the gateway has a PA+LNA module and an external SMA antenna.



  • Hi @Necromant, welcome to the forum!
    BTW, about your profile image: is it this one?
    2021-03-02 18_43_05-Window.png
    This is a SN7440 clone from Soviet times, right?

    But back to topic and your issue 1:
    A delay before resending is required for sure. A wait() is not a good idea because it introduces recursion of unknown depth, depending on the user's receive() implementation.

    But a short delay() is possible. During delay() the RF24 is in receive mode and can actually receive the other node's packet that collided before.

    But exponential backoff is also not a good idea, because it can grow into a delay of a second or more, which blocks the main loop the whole time. But a short random delay of a few tens of ms (like mentioned by @skywatch) worked for me in my setup.
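A minimal sketch of the short random delay between resends could look like this. It is not the code from my branch: `sendOnce` and `sleepMs` are hypothetical callbacks (on a node, `sleepMs` would be something like delay()), the xorshift generator is just a dependency-free stand-in for a random source, and the 10 to 50 ms range in the usage note is an assumed example.

```cpp
#include <cstdint>

// Small xorshift32 pseudo random generator, used so the sketch has no
// platform dependency. On a real node the seed should differ per node.
struct Xorshift32 {
    uint32_t state;
    explicit Xorshift32(uint32_t seed) : state(seed ? seed : 1u) {}
    uint32_t next()
    {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }
};

// Returns a delay in milliseconds in the range [minMs, maxMs].
uint32_t randomResendDelayMs(Xorshift32 &rng, uint32_t minMs, uint32_t maxMs)
{
    return minMs + rng.next() % (maxMs - minMs + 1u);
}

// Tries to send up to maxAttempts times and waits a bounded random time
// between attempts. sendOnce() returns true on success; sleepMs(ms) blocks
// for ms milliseconds. Both are hypothetical callbacks for illustration.
template <typename SendFn, typename SleepFn>
bool sendWithRandomDelay(SendFn sendOnce, SleepFn sleepMs, Xorshift32 &rng,
                         uint8_t maxAttempts, uint32_t minMs, uint32_t maxMs)
{
    for (uint8_t attempt = 0; attempt < maxAttempts; ++attempt) {
        if (sendOnce()) {
            return true;
        }
        if (attempt + 1 < maxAttempts) {
            sleepMs(randomResendDelayMs(rng, minMs, maxMs));
        }
    }
    return false;
}
```

Because the delay is bounded, the worst-case blocking time is at most (maxAttempts - 1) * maxMs, e.g. 200 ms for 5 attempts with a 10 to 50 ms range. That bound is the main difference to exponential backoff.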

    But instead of speculating, we should try out the ideas on a test setup in a reproducible way, and we should do that outside the official MySensors repo.

    I will prepare a branch with the necessary core changes in my MySensors clone repo and also create test projects. I will start with a gateway and a sensor node.

    Then we can check your topic 3 by sending 4 or more packets in a row from the gateway (I will use an ESP8266) to a slow 12MHz Nano clone.



  • @virtualmkr

    This is a SN7440 clone from Soviet times, right?

    Exactly. Still have a handful of those sitting somewhere in the closet.

    A wait() is not a good idea because it introduces recursion of unknown depth, depending on the user's receive() implementation.

    Yep, I thought the same. Something like (ARD * 15 / 2) seems like a good delay.

    But exponential backoff is also not a good idea, because it can grow into a delay of a second or more, which blocks the main loop the whole time. But a short random delay of a few tens of ms (like mentioned by @skywatch) worked for me in my setup.

    Exponential backoff would be a nice feature for NodeManager, but it would only take care of delivering sensor data to the uplink (unless we ack data back all the time). It wouldn't help if the traffic jam occurs somewhere further up.

    Another idea I thought of would be to listen for a carrier (the nRF24L01+ has that feature) and only send packets once no carrier is present, plus a random delay. This would hopefully avoid collisions, even with other things like WiFi.



  • @Necromant said in [RFC] Improve package delivery for RF24 modules:

    Another idea I thought of would be to listen for a carrier (the nRF24L01+ has that feature) and only send packets once no carrier is present, plus a random delay. This would hopefully avoid collisions, even with other things like WiFi.

    Yes, this sounds interesting. Are you thinking of the RPD (Received Power Detector) register of the RF24?
    But it will only work with the RF24 transport drivers (as @skywatch remarked above).

    Meanwhile I have finished the first setup, intended to reproduce packet loss using your idea of sending multiple messages immediately one after the other. The test project repo with a gateway and one node is:
    MySensors.IssueProjects

    The MySensors core changes are in my project fork branch:
    topic/mkr/issueTransportHalRetry

    Unfortunately the #1477/Setting-01 setup doesn't produce any lost packets with an ESP8266/160MHz as gateway and a Nano/12MHz as node. So I would say that a fast gateway sending multiple messages to the same node in sequence, without wait() in between, is not the reason for lost packets. I have checked with up to 256 packets in a row. All were transferred perfectly, without a single loss.

    My next try is to use two nodes that send messages to the gateway at the same time. I am thinking of synchronizing the nodes with a wire between GPIOs. I will let you know.

    But maybe someone has a better idea for a setup that reliably produces packet loss?



  • @virtualmkr I gave your sketches a quick look. Seems like you have debug enabled on the gateway. The ESP8266 has to deal with the WiFi stack & MQTT handling, despite running at a pretty high speed. This may introduce a lot of delays.
    And my wild guess would be that those delays give the m328p enough time to chew through the packets.

    I'll set up my 'staging' gateway in a radio noisy environment and give your test a spin this weekend.



  • @Necromant You are right, an ESP8266 with WiFi may behave differently than an STM32. Unfortunately I don't have a working STM32 as gateway.

    Meanwhile I created two additional test settings with gateway and two nodes, where the used gateway type is not important. The test settings are available in my repo MySensors.IssueProjects.

    • Setting-02 creates a race condition of two nodes sending a message to the gateway at the same time.
    • Setting-03 creates a conflict when both nodes send a message to each other at the same moment. For that one node is in repeater mode and the other node is associated to the repeater.

    The approach of synchronizing the two nodes via GPIOs and a wire, so that they send at the same moment, works very well and reproducibly in Setting-02. The only problem was that when all messages in a row fail, the internal self-healing mechanism of MySensors' transport logic reinitializes the radio. To avoid this, I added logic that sends a successful message after 4 failed messages. This keeps the MySensors self-healing mechanism satisfied.

    Setting-03 does not work properly yet, because the N2N logic causes every message from node to the repeater to be sent twice.

    I will create an issue about this in the MySensors repo.



  • This is what Setting-02 looks like on my good old Saleae LA when both nodes send at the same moment:
    2021-03-08 20_10_35-Saleae Logic Software.png
    You can see that both nodes send the message 16 times, because ARC (the auto retransmit count) is 15 by default.
    And both without success.
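To put rough numbers on the retransmit behaviour: in the nRF24L01+ SETUP_RETR register, the ARD field (bits 7:4) encodes the retransmit delay as (code + 1) * 250 us, and the ARC field (bits 3:0) is the retransmit count. The helper names below are made up for illustration, and the ARD code 5 in the usage note is only an example value.

```cpp
#include <cstdint>

// ARD field of SETUP_RETR: code 0..15 encodes a delay of (code + 1) * 250 us.
constexpr uint32_t ardMicros(uint8_t ardCode)
{
    return (static_cast<uint32_t>(ardCode & 0x0F) + 1u) * 250u;
}

// SETUP_RETR register value: ARD in bits 7:4, ARC in bits 3:0.
constexpr uint8_t setupRetr(uint8_t ardCode, uint8_t arc)
{
    return static_cast<uint8_t>(((ardCode & 0x0F) << 4) | (arc & 0x0F));
}

// Transmissions per send: the initial attempt plus ARC retransmits.
constexpr uint8_t totalTransmissions(uint8_t arc)
{
    return static_cast<uint8_t>((arc & 0x0F) + 1u);
}

// Lower bound in us for how long the radio keeps retrying a packet that is
// never ACKed. The airtime of the packets themselves is ignored here.
constexpr uint32_t retryWindowMicros(uint8_t ardCode, uint8_t arc)
{
    return ardMicros(ardCode) * (arc & 0x0F);
}
```

For example, ARC = 15 gives 16 transmissions, and with an ARD code of 5 (1500 us) the radio is busy retrying for at least 22.5 ms. With ARD maxed out at code 15 (4000 us) this grows to at least 60 ms per failed send.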



  • @virtualmkr Great work. Meanwhile, I have updated my homeassistant installation and set up a second, 'staging' network with the modules I have around. I think I can arrange remote access to this setup later, if that's needed.

    It seems to me that the issue I've been experiencing is partially related to HomeAssistant's way of working with mysensors actuators. If I 'gang-switch' a bunch of lights I see the following in the log: (note the 7;X;1;1;2;1)

    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 7 child 8
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;3;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;4;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;5;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;6;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;7;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [mysensors.transport] Sending 7;8;1;1;2;1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Red Wisp 7 8: value_type 2, value = 1
    Mar 08 19:14:20 bladeling hass[26390]: 2021-03-08 19:14:20 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Red Wisp 7 8: value_type 3, value = 100
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;2;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;2;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;2;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;3;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;3;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Strip Wisp 4 2: value_type 2, value = 0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Strip Wisp 4 2: value_type 3, value = 0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 4;2;1;0;3;0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 4 child 2
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 7;253;1;0;37;-45
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 7 child 253
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [mysensors.transport] Receiving 7;253;1;0;37;-45
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.gateway] Node update: node 7 child 253
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Strip Wisp 4 2: value_type 2, value = 0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Strip Wisp 4 2: value_type 3, value = 0
    Mar 08 19:14:21 bladeling hass[26390]: 2021-03-08 19:14:21 DEBUG (MainThread) [homeassistant.components.mysensors.device] Entity update: Red Wisp 7 253: value_type 37, value = -45
    

    HomeAssistant requests acks for all packets and dumps them all at once to the gateway. With half-duplex nodes this is just asking for trouble. I'm diving right now into hass component code to see if I can add a little extra debugging.
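For anyone who wants to filter such logs, the lines follow the MySensors serial format node-id;child-sensor-id;command;ack;type;payload, so the fourth field shows whether an echo was requested. Below is a minimal parsing sketch. The struct and function names are my own and not part of MySensors or Home Assistant.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// One gateway message in the form node-id;child-sensor-id;command;ack;type;payload.
struct GatewayMessage {
    int nodeId = 0;
    int childId = 0;
    int command = 0;  // 1 = set
    int ack = 0;      // 1 = the sender requests an echo
    int type = 0;
    std::string payload;
};

// Parses one line; returns false if a numeric field is missing or malformed.
bool parseGatewayMessage(const std::string &line, GatewayMessage &out)
{
    std::istringstream in(line);
    std::string field;
    int *targets[] = {&out.nodeId, &out.childId, &out.command, &out.ack, &out.type};
    for (int *target : targets) {
        if (!std::getline(in, field, ';') || field.empty()) {
            return false;
        }
        char *end = nullptr;
        const long value = std::strtol(field.c_str(), &end, 10);
        if (*end != '\0') {
            return false;
        }
        *target = static_cast<int>(value);
    }
    std::getline(in, out.payload);  // rest of the line, may be empty
    return true;
}
```

Parsing `7;3;1;1;2;1` gives node 7, child 3, command 1 (set) and ack = 1, while the incoming `7;253;1;0;37;-45` has ack = 0. Counting the outgoing lines with ack = 1 inside one second gives a quick measure of how big a burst the controller produces.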



  • @virtualmkr Thanks for the nice pictures. With that in mind I'm pretty sure I'm getting a clash with the acks going back to the gw somewhere on the way, especially when a repeater or two is involved.



  • Hi @necromant, thanks for your HA setup and investigations. Yes, when a message is sent to an actuator, it needs to respond with the new status for HA. If you control multiple actuators quickly one after the other, it creates a perfect traffic jam in the MySensors network.

    Now we just have to find an algorithm that resolves these collisions best.
    Your HA project is then a perfect real-world test for the algorithm.



  • I see 2 ways and one hacky way:

    • Aggressive buffering with some delays, e.g. only switch modes N ms after the last packet is received, so that repeaters absorb all data bursts. Perhaps even TX buffering.
    • Adjust HomeAssistant controller code: It should not request ack, but instead expect devices to send back the new state some time soon after flipping state (and retry those packets)
    • The hacky way: Make HomeAssistant wait for the ack, before sending the next command for the node. Perhaps the easiest, but we'll have to bother some of the devs working on that integration.
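The hacky way is basically stop-and-wait. A minimal sketch of the idea follows, written in C++ only to show the control flow (the actual controller code is a separate matter). `sendFn` and `waitForEchoFn` are hypothetical callbacks, and the retry limit is an assumed parameter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sends commands one at a time and only moves on once the echo for the
// previous command has arrived or its attempts are used up.
// sendFn(cmd) transmits a command; waitForEchoFn(cmd) blocks until the echo
// arrives or a timeout expires and returns true on success. Both are
// hypothetical callbacks supplied by the caller.
// Returns the number of commands that were confirmed by an echo.
template <typename SendFn, typename WaitFn>
std::size_t sendStopAndWait(const std::vector<std::string> &commands,
                            SendFn sendFn, WaitFn waitForEchoFn, int maxAttempts)
{
    std::size_t delivered = 0;
    for (const std::string &cmd : commands) {
        for (int attempt = 0; attempt < maxAttempts; ++attempt) {
            sendFn(cmd);
            if (waitForEchoFn(cmd)) {
                ++delivered;
                break;
            }
        }
    }
    return delivered;
}
```

The price is latency: a gang-switch of N channels takes at least N round trips, but there is never more than one command in flight per burst, so the node's echo does not have to compete with the next command.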


  • I gave the HAL code a more detailed review, and I think there's a possibility to implement something like a simple 'collision avoidance' using the RPD register. First, here's some info from the datasheet:
    f5ece504-87a9-4736-a3e9-5c60587e26c3-image.png

    Now, let's do some calculations.
    We have ~42 byte packets max (1 + 5 + 2 + 32 + 2). That is 336 bits, which take 336/250000.0 * 1000 * 1000 = 1344 uS of radio time at 250 kbps. With that in mind, I'd try something like this.

    LOCAL bool RF24_sendMessage(const uint8_t recipient, const void *buf, const uint8_t len,
                                const bool noACK)
    {
        int retry = 5;
        while (retry--) {
            RF24_stopListening();
            if (!RF24_testRPD()) {
                break; // Channel looks clear, go ahead and send
            }
            // Something was talking on the radio, we have to wait for a while
            RF24_startListening(); // Start listening again and wait
            delayMicroseconds(1400 * 2); // Enough time to chew up at least 2 radio packets at 250Kbps
        }
        // ... continue with the existing send code
    }
    

    But that would be RF24-specific. Another idea is making the delays dependent on NODE_ID and implement something like a simple bus arbiter, so that nodes with lower NODE_ID have more priority. I will only be able to give this a shot on the weekend, so feel free to try it out.
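The NODE_ID idea can be sketched as a slotted delay, where one slot is the airtime of a maximum-length frame. Counting 8 bits per byte, a 42 byte frame takes 1344 us at 250 kbps. The function names and the `maxSlots` cap (to keep the blocking time bounded for high node IDs) are my assumptions, not existing MySensors code.

```cpp
#include <cstdint>

// Airtime in microseconds of a frame of frameBytes bytes at bitrate bits/s.
constexpr uint32_t airtimeMicros(uint32_t frameBytes, uint32_t bitrate)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(frameBytes) * 8u * 1000000u) / bitrate);
}

// Priority delay: a node waits nodeId slots before it (re)sends, so lower
// node IDs get the channel first. maxSlots caps the wait (an assumption).
constexpr uint32_t priorityDelayMicros(uint8_t nodeId, uint32_t slotMicros, uint8_t maxSlots)
{
    return static_cast<uint32_t>(nodeId < maxSlots ? nodeId : maxSlots) * slotMicros;
}
```

With a 1344 us slot, node 0 (the gateway) sends immediately, node 3 waits about 4 ms, and everything from node 8 upwards waits about 10.8 ms if `maxSlots` is 8. Nodes above the cap share the same delay, so they would still need a random component on top.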



  • Hi @Necromant, thank you for your comments regarding the RPD feature of the nRF24+.

    I have done some experiments with it. The result is a new tool, TrafficDetectorRF24, which is available in my MySensors.Tools repository. The tool scans a single channel and outputs the current status via a debug pin. This can be used to connect an LED or, better, an input of a logic analyser.

    At first I tried to use the RPD feature based on the MySensors example PassiveNode.
    Unfortunately, that didn't work at all for me. I then adapted the code from Rolf Henkel's Poor Man's Wireless 2.4GHz Scanner for my purposes.

    The detector works quite accurately (resolution approx. 140us) so that you can usually detect the transmitted telegram and the ACK response of the receiver individually:

    2021-03-13 23_16_23-Clipboard.png

    I hope you have more success in your attempts with the RPD feature.

