Node stops receiving after some time when using MY_RX_MESSAGE_BUFFER_FEATURE
My NRF24L01+-based MySensors network is struggling with nodes that randomly stop receiving messages. Until this week I was not able to find any pattern in this behaviour.
Now I think I have found a link: the nodes that use MY_RX_MESSAGE_BUFFER_FEATURE, with the transceiver's IRQ pin connected, all fail after some time. They are still able to send messages but no longer receive anything. Only a reboot fixes the problem. I found that I can reproduce this behaviour by sending messages to more than one child ID of the node in quick succession. But even without doing this, the node fails after some time.
I have played around with different buffer size values (between 10 and 35). It seems like the node is working longer with a higher buffer size.
As soon as I disable the buffer feature the node works OK and does not stop receiving.
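For readers following along, the setup being discussed looks roughly like the fragment below. It is only an illustrative sketch: the pin number and buffer size are assumptions, not values taken from the affected nodes.

```cpp
// Illustrative node configuration (assumed values, adjust to your hardware).
#define MY_RADIO_NRF24                 // nRF24L01+ transport (2.2.x naming)
#define MY_RF24_IRQ_PIN 2              // assumption: radio IRQ wired to D2 (INT0)
#define MY_RX_MESSAGE_BUFFER_FEATURE   // queue incoming messages from the IRQ handler
#define MY_RX_MESSAGE_BUFFER_SIZE 20   // assumption: one value from the 10 to 35 range tried
// Comment out the two MY_RX_MESSAGE_BUFFER_* lines to disable the feature.
#include <MySensors.h>
```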
I can observe this behaviour with version 2.2.0, 2.3.0 and the test version of 2.3.0 with the modified CE handling. Also, on my gateway I am using the PA version of the transceiver; on all of my nodes I have the normal version with a PCB antenna.
Does anybody else have the same problem? Is there maybe a bug in the buffer code? I would really like to use this feature, as it gives me much quicker and more reliable acknowledgement handling for as long as the nodes are running.
@mathea90 I've used it for over 2 years now on my ethernet gateway and it never failed me.
I can only imagine that it stops receiving because the stack missed some interrupts from the radio, causing the internal radio queue to be filled. When this queue is totally filled any new incoming messages will be dropped by the radio.
I don't know why the stack would miss interrupts from the radio though; I assume cabling (especially the interrupt line) is solid. I don't think size of the buffer matters, unless you use slow hardware at a very high message rate.
What Arduino are you using for the nodes?
Are you sending at a very high rate?
Could you post node logs and try with serial debug output disabled?
@yveaux Thanks for your answer. I am using PCBs of my own design with a bare Atmega328 soldered onto them. The interrupt traces are short with no sharp corners. I have this problem across multiple PCB layouts, so I would rule out a design fault. Some of my boards run the Atmega on its internal oscillator at 8MHz and some have an external 16MHz crystal. Both show the same behaviour.
My gateway is also based on the Atmega328, coupled with a Wiznet W5500 ethernet module. As the NRF24L01+ has to be driven by soft-spi in this configuration, the IRQ pin is not used on my gateway.
Unfortunately I have not broken out the RX / TX pins on my PCBs, so I have no way of logging the serial data.
My send rate is not high. One use case, for example, is an LED dimmer with a controllable ramp time. Each time before my home automation server sends a new dim value, it also sends the desired ramp time. So it's only two messages, one directly after the other. Admittedly, I do not know how fast my HA server is sending those messages. But in my understanding the MYS gateway should take care of the send timing, right?
Could it be plausible that the second interrupt fires while the µC is still reading the first message from the transceiver, thus getting a messed up message stack? Or maybe there is a problem with clearing the stack after a message has been received. Consequently, it fills up and after that everything is rejected... Unfortunately my C skills are very limited, so I cannot look for a probable cause in the code myself.
Could it be plausible that the second interrupt fires while the µC is still reading the first message from the transceiver, thus getting a messed up message stack?
The interrupt triggering the message reception from the radio will not preempt a running message reception.
Or maybe there is a problem with clearing the stack after a message has been received. Consequently, it fills up and after that everything is rejected... Unfortunately my C skills are very limited, so I cannot look for a probable cause in the code myself.
If you don't empty the buffer fast enough it will fill up and new messages get lost. See https://github.com/mysensors/MySensors/blob/development/hal/transport/RF24/MyTransportRF24.cpp#L44
The variable transportLostMessageCount will be increased for each lost message. Having serial debug output could really help here...
@yveaux I'll try to hack something together to get serial debug data if I find some time this week. I'm really curious what I will see there.