ACK -aka ECHO being missed by Serial Gateway with RF24 radios. V2.3.2



  • When communicating from the gateway to a node, there seems to be no delay, or an insufficient one, for the TX/RX turnaround. This causes the GW to not fully receive the incoming ACK message and therefore fail the transmission. With a protocol sniffer it can be observed that the receiving node sent out the ACK message immediately, but otherwise properly.

    The mere inclusion of MY_DEBUG serial print messages on the receiving node "solves" the issue by providing a slight delay between reception and the subsequent transmission of the ACK/ECHO packet, allowing the transmitting node to switch from transmission to reception properly.

    A delay of 3-4 ms in file MyTransport.cpp / line 703 in the code block of

    if ( msg.getRequestEcho()) {
         TRANSPORT_DEBUG(PSTR("TSF:MSG:ECHO REQ\n"));
    #ifdef MY_RADIO_RF24
         delay(4);   // workaround: pause before the ECHO is sent back
    #endif
    ....
    ...
    ..
    }
    

    solves the issue of the lost ACK messages.

    This has been tested on multiple RF24 modules, with and without PA/LNA, on both the gateway and node sides.

    I suggest adding the above code to the code base so that other people don't have to go through days of debugging to find this problem again.

    See the post further down for details and explanation.

    G.S.





  • @JeeLet said in ACK -aka ECHO being missed by Serial Gateway with RF24 radios. V2.3.2:

    Waiting

    "Waiting using the Arduino delay() command is not a good idea. " ... https://www.mysensors.org/download/sensor_api_20

    Agreed - change delay(2); to wait(2); and see if that solves the problem



  • @skywatch yes, wait() works too. My thinking for using delay() was that I did not want any possibility of an incoming message being processed during that time, hence delay() instead of wait(). But I did not dig further into the code to assess that risk, nor did I do any stress testing with the node also acting as a repeater, etc. I figure the implementers of the lib would know what's best when implementing a fix. It's also possible that the root cause lies in the Serial GW code, if other gateways do not have this problem. It could be that it just takes the GW code a little too long to switch into RX mode to catch the ACK. I figured @hek or @Yveaux would get to the bottom of it and know what to do. Do they still read the forums, or do I need to submit a pull request on GitHub?
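
    The difference between the two calls can be illustrated with a small PC-side model. This is only a sketch: SimNode, blockingDelay and cooperativeWait are hypothetical stand-ins, not MySensors APIs. It assumes that wait() keeps servicing incoming messages while delay() does not, and that one polling pass costs 1 ms of simulated time.

```cpp
#include <cstdint>

// Hypothetical PC-side model of a node; none of these names exist in MySensors.
struct SimNode {
    uint32_t now_ms = 0;   // simulated clock
    int pending = 0;       // messages waiting to be handled
    int processed = 0;     // messages handled so far

    // One pass of the message pump. Assumed to cost 1 ms of simulated time.
    void process() {
        if (pending > 0) { --pending; ++processed; }
        now_ms += 1;
    }

    // Models delay(): time passes, nothing else runs.
    void blockingDelay(uint32_t ms) { now_ms += ms; }

    // Models wait(): keeps pumping messages until the time has elapsed.
    void cooperativeWait(uint32_t ms) {
        const uint32_t start = now_ms;
        while (now_ms - start < ms) { process(); }
    }
};
```

    With one message pending, blockingDelay(3) leaves it untouched, while cooperativeWait(3) handles it during the pause. That second behaviour is what the choice of delay() was meant to rule out.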

    GS



  • I don't think adding a delay in "MyTransport.cpp" is a good idea.
    The Serial Gateway has many hours of use behind it and many people run it, so there are many possible causes for your problem:

    • a congested RF environment
    • the RF module
    • the structure of the sketch
    • ....

    Would a lower data rate help?
      #define MY_RF24_DATARATE (RF24_250KBPS)
      #define MY_RF24_PA_LEVEL (RF24_PA_HIGH)

    But thanks for sharing this solution; it can be useful for others.



  • @GaryStofer Hmmmm. I understand your situation and requirements but sadly I don't have an answer.

    Only thing I can think of would be to try the node without the repeater function and see if that is in any way causing the issue. That would narrow it down considerably.

    @JeeLet has hit on the RF module as a possible cause. Have you tried a different module? Are all RF modules the same? Are all GWs the same?



  • First let me say that the test "network" consists of a single node and a Serial GW with a Domoticz controller only. No repeater nodes are present, and it operates at 250 kbps on a frequency outside the WiFi spectrum that I verified to have no traffic. The test subjects are within 10 feet of each other, and the protocol sniffer receiver is placed roughly midway between the two communicating nodes. The test node has a pluggable connector for the RF modules so that I can swap them out individually. The GW has the PA/LNA module and the node is using the standard modules. Both GW and node are set to "High" power.

    Looking at the problem with the promiscuous protocol sniffer and Wireshark, I can clearly see that packets collide mid-air when both the node and the GW are transmitting at the same time. This happens when the controller/GW talks to the node with the protocol-level ACK request enabled and the node does not have the extra wait/delay programmed.

    Going through 12 RF modules on the node side, I can see that about 1/3 show the problem consistently, to the extent that all 16 retransmissions are used up, 1/3 show it to various degrees, and the rest show no problems at all. Before you say "ohh, fake chips etc.": these modules all came from the same batch, so it is more than likely not a matter of fake versus genuine, but rather some parameter for which the lib assumes too tight a margin.

    When the protocol-level ACK is turned off via the controller/GW, the packet goes out without a request for the node to echo it back, so only the HW ACK remains in the picture. Set up like this, I can see that the third of the modules that cause the persistent collisions provoke a resend of the packet by the GW after ~2 to 2.5 ms, while the group that shows no problem shows one single transmission from the GW only. Since the sniffer is not fast enough to capture the quick turnaround (spec says 130 µs) for the short HW ACK to go out, I cannot say whether the node is simply not sending it or just not sending it quickly enough. All I can say is that the GW code did not see a HW ACK and therefore sends out the packet a second time.
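
    For a rough sense of scale for these timings, the on-air time of one frame can be estimated from the Enhanced ShockBurst frame layout in the nRF24L01+ datasheet. This is a sketch, not a measurement: frameBits and airtimeUs are hypothetical helper names, and a 5-byte address and 2-byte CRC are assumed.

```cpp
#include <cstdint>

// Rough size of one Enhanced ShockBurst frame, in bits.
// Frame layout per the nRF24L01+ datasheet: 1-byte preamble, 3-5 byte address,
// 9-bit packet control field, 0-32 byte payload, 1-2 byte CRC.
// Assumed here: 5-byte address and 2-byte CRC.
constexpr uint32_t frameBits(uint32_t payloadBytes) {
    return 8u + 5u * 8u + 9u + payloadBytes * 8u + 2u * 8u;
}

// On-air time in microseconds for a given payload size and air data rate.
constexpr uint32_t airtimeUs(uint32_t payloadBytes, uint32_t bitsPerSecond) {
    // Multiply first (in 64 bits) so the integer division does not lose precision.
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(frameBits(payloadBytes)) * 1000000u) / bitsPerSecond);
}
```

    Under these assumptions, airtimeUs(32, 250000) gives about 1.3 ms for a full 32-byte frame and airtimeUs(0, 250000) about 0.3 ms for an empty ACK frame. MySensors messages are usually shorter than 32 bytes, so real figures would sit in between. Either way, the ~2 to 2.5 ms resend gap is only a frame or two long at 250 kbps.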

    Meanwhile, on the node side all is well: it just might get the same packet twice, and since no status changed there is no harm done. However, when you now turn on the protocol-level ACK request in the controller (as you should), the node has to turn around and send out the echo packet, which then collides with the resend of the original packet from the GW. Because of this the node doesn't get a HW ACK either, and it resends the echo packet over and over until it runs out of tries.

    When I then introduce the wait/delay in the code above to wait out the 2.5 ms for the GW to resend the initial packet, this ACK/resend storm goes away because the air is clear when the node sends its echo packet. If I wait/delay for 2 ms the problem is greatly reduced, and if I wait 3 ms I see no more evidence of collisions. Either wait or delay works; however, I chose delay() to be positively sure that no other communication is started by the node during that time.

    I conclude that, as a workaround, this solves the issue we are having with the RF24 transport. Without investing in faster RF sniffing equipment, I can't say for sure whether this could be solved better on the TX side by waiting longer for the HW ACK to arrive from the receiving node, but I think there is a good chance of that.

    There is also a chance that the PA/LNA module used on the GW could be part of the problem by not being able to switch back to RX mode quickly enough, and therefore losing the HW ACK for nodes that respond quicker than others.

    Edited: Checked with a non PA/LNA module on the GW and no change could be observed, so this theory can be discounted.

    BTW: All of this is very reminiscent of an email exchange I had with Ekblad in 2015!

    GS


