[SOLVED] Timing issue with 2.3.1, RF24 and Mega 2560



  • Hi,

    I discovered a timing issue with MySensors 2.3.1 and NR24 "standard" (non PA, with capacitor and good power).

    The gateway is a Nano with standard gateway. Signing is turned on. Basic debug is turned on.

    All is working perfectly as long as the nodes are also on Nano.

    However, this is not the case with Mega 2560.

    This seems to be linked with the radio ACK of node PING request at startup. The gateway properly receives nonce request form the node, and answer to it, but always get radio NACK. Here's the node trace:

    0 MCO:BGN:INIT NODE,CP=RNNNAS--,REL=255,VER=2.3.1
    4 MCO:BGN:BFR
    FF test sensor
    V3.5.1/1.8.6/2.3.1
    RNNNAS--
    CE=49, CS=53, MISO=50, MOSI=51, SCK=52
    46 TSM:INIT
    47 TSF:WUR:MS=0
    53 TSM:INIT:TSP OK
    55 TSM:INIT:STATID=127
    57 TSF:SID:OK,ID=127
    59 TSM:FPAR
    60 TSM:FPAR:STATP=0
    62 TSM:ID
    63 TSM:ID:OK
    64 TSM:UPL
    67 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    2074 TSM:UPL
    2077 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    4084 TSM:UPL
    4087 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    6094 TSM:UPL
    6097 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    8105 !TSM:UPL:FAIL
    8107 TSM:FPAR
    8109 TSM:FPAR:STATP=0
    8111 TSM:ID
    8112 TSM:ID:OK
    8113 TSM:UPL
    8116 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    10124 TSM:UPL
    10127 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    12134 TSM:UPL
    12137 TSF:MSG:SEND,127-127-0-0,s=255,c=3,t=24,pt=1,l=1,sg=0,ft=0,st=OK:1
    

    Here's gateway log:

    0;255;3;0;9;0 MCO:BGN:INIT GW,CP=RNNGAS--,REL=255,VER=2.3.1
    0;255;3;0;9;40 TSM:INIT
    0;255;3;0;9;43 TSF:WUR:MS=0
    0;255;3;0;9;51 TSM:INIT:TSP OK
    0;255;3;0;9;53 TSM:INIT:GW MODE
    0;255;3;0;9;56 TSM:READY:ID=0,PAR=0,DIS=0
    0;255;3;0;9;60 MCO:REG:NOT NEEDED
    0;255;3;0;14;Gateway startup complete.
    0;255;0;0;18;2.3.1
    0;255;3;0;9;64 MCO:BGN:STP
    0;255;3;0;9;71 MCO:BGN:INIT OK,TSP=1
    0;255;3;0;9;74 TSM:READY:NWD REQ
    0;255;3;0;9;112 TSF:MSG:SEND,0-0-255-255,s=255,c=3,t=20,pt=0,l=0,sg=0,ft=0,st=OK:
    0;255;3;0;9;329 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;335 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;375 !TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=NACK:1
    0;255;3;0;9;2338 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;2344 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;2384 !TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=NACK:1
    0;255;3;0;9;4347 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;4354 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;4393 !TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=NACK:1
    0;255;3;0;9;6366 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;6372 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;6412 !TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=NACK:1
    0;255;3;0;9;8375 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;8381 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;8420 !TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=NACK:1
    

    I'm thinking about timing issue as everything is getting well if I turn RF24 debug on on node. Here's gateway log (node log is too large to be included, but available on request):

    0;255;3;0;9;0 MCO:BGN:INIT GW,CP=RNNGAS--,REL=255,VER=2.3.1
    0;255;3;0;9;34 TSM:INIT
    0;255;3;0;9;36 TSF:WUR:MS=0
    0;255;3;0;9;44 TSM:INIT:TSP OK
    0;255;3;0;9;47 TSM:INIT:GW MODE
    0;255;3;0;9;50 TSM:READY:ID=0,PAR=0,DIS=0
    0;255;3;0;9;54 MCO:REG:NOT NEEDED
    0;255;3;0;14;Gateway startup complete.
    0;255;0;0;18;2.3.1
    0;255;3;0;9;58 MCO:BGN:STP
    0;255;3;0;9;64 MCO:BGN:INIT OK,TSP=1
    0;255;3;0;9;68 TSM:READY:NWD REQ
    0;255;3;0;9;106 TSF:MSG:SEND,0-0-255-255,s=255,c=3,t=20,pt=0,l=0,sg=0,ft=0,st=OK:
    0;255;3;0;9;15935 TSF:MSG:READ,127-127-0,s=255,c=3,t=24,pt=1,l=1,sg=0:1
    0;255;3;0;9;15942 TSF:MSG:PINGED,ID=127,HP=1
    0;255;3;0;9;15947 TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=25,pt=1,l=1,sg=0,ft=0,st=OK:1
    0;255;3;0;9;15987 TSF:MSG:READ,127-127-0,s=255,c=3,t=15,pt=6,l=2,sg=0:0101
    0;255;3;0;9;15997 TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=15,pt=6,l=2,sg=0,ft=0,st=OK:0101
    0;255;3;0;9;16032 TSF:MSG:READ,127-127-0,s=255,c=3,t=16,pt=0,l=0,sg=0:
    0;255;3;0;9;16048 TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=17,pt=6,l=25,sg=0,ft=0,st=OK:<NONCE>
    0;255;3;0;9;16102 TSF:MSG:READ,127-127-0,s=255,c=0,t=17,pt=0,l=5,sg=1:2.3.1
    127;255;0;0;17;2.3.1
    0;255;3;0;9;16134 TSF:MSG:READ,127-127-0,s=255,c=3,t=16,pt=0,l=0,sg=1:
    0;255;3;0;9;16149 TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=17,pt=6,l=25,sg=1,ft=0,st=OK:<NONCE>
    0;255;3;0;9;16203 TSF:MSG:READ,127-127-0,s=255,c=3,t=6,pt=1,l=1,sg=1:0
    127;255;3;0;6;0
    0;255;3;0;9;18237 TSF:MSG:READ,127-127-0,s=255,c=3,t=26,pt=1,l=1,sg=1:2
    0;255;3;0;9;18249 TSF:MSG:SEND,0-0-127-127,s=255,c=3,t=16,pt=0,l=0,sg=0,ft=0,st=OK:
    

    I didn't found any help on this forum, but tried to compare MySensor's RF24 code and Arduino's standard one, and found some differences on timing (the most significant one being after setting CE high or low).

    Before starting modifying the code to (try to) find a solution, it may be clever to ask the forum first πŸ˜‰


  • Mod

    @flyingdomotic there was a change between 2.3.0 and 2.3.1 that addressed a similar problem (not related to signing though): https://github.com/mysensors/MySensors/pull/1193
    But since you're already using 2.3.1 it is probably something else.



  • @mfalkvidd Yes, I saw that, and already back ported the 2.3.1 change to 2.3.0, with great improvement in 2.3.0 reliability.
    However, there's still something in timing with 2.3.1, at least when micro-controllers are not the same.


  • Mod



  • @Yveaux: I added the delayMicroseconds(1500) at top of rf24.cpp, and recompiled both node & gateway, but nothing changed...

    LOCAL bool RF24_sendMessage(const uint8_t recipient, const void *buf, const uint8_t len,
                                const bool noACK)
    {
    	uint8_t RF24_status;
    	delayMicroseconds(1500);	// *** FF_CHANGE ***
    	RF24_stopListening();
    
    

    I perhaps not properly got the idea...


  • Mod

    @flyingdomotic said in Timing issue with 2.3.1, RF24 and Mega 2560:

    I perhaps not properly got the idea...

    Perhaps not. The issue states it has to be added to MyTransport.cpp, exactly at this line: https://github.com/mysensors/MySensors/blob/development/core/MyTransport.cpp#L730



  • @Yveaux: ok, I was not doing change at the right place. However, even at the right place, nothing changed. If I correctly understood, the proposed change only deals with "software" ACK (meaning the receiver re-sends back the just received value to the sender), while the issue on ping is the "hardware" ACK (RF24 acks the correct reception at radio level). However, this probably means that sender is not ready to read the answer (in this case, the nonce from the gateway).

    I started reading RF24 datasheet. The only required timing I found are the 10 us CE should stay high at minimum in order to switch from standby to send mode, the 4 us minimum between CE up transition and CSR down and the 100 ms at power up.

    I also understood that PA+LN version needs 10 us (hopefully, this is hidden by the 10 us of CE transition to high).

    Did I miss something?


  • Mod

    @flyingdomotic I suspect it might have the same failure cause, when a node/gateway switches from receive to transmit (for the PONG response) too fast, preventing the hardware ACK from being transmitted to the sending node (the PING).
    Especially this code looks suspicious: https://github.com/mysensors/MySensors/blob/development/core/MyTransport.cpp#L786
    Try enabling that code on the fast node first.

    You could also try adding a delay after reception of a message to assure there's time for the hardware ACK just before the return statement here: https://github.com/mysensors/MySensors/blob/development/hal/transport/RF24/MyTransportRF24.cpp#L148



  • @Yveaux: I fixed the issue. I removed all delayMicroseconds in RF24.cpp (except those in RF24_standBy), and added a delayMicroseconds(1500) on very top of RF24_stopListening. I tested with a Nano gateway + Mega node, a Mega gateway and a Nano node, and a Nano node and a Nano gateway. Everything is running smoothly.

    Modified file is available (I tried to upload it on forum,, but "Upload file" was returning an error on file type, only allowing images.

    Does someone know how to propose the change for a new version of MySensors (sorry, I'm something new on that kind of process ;-)?

    In addition, I wrote a small Excel tool to decode MySensors log, mainly on TSF:MSG and RF24:* (see example of output here under). If someone is interested, I can push it by mail, but I would prefer making it available more widely. Is there any already existing way to do this (other than putting on my personal github account)?

    0_1552509706495_MySensorsLogParser.jpg


  • Mod

    @flyingdomotic the code contribution guidelines are available at https://www.mysensors.org/download/contributing

    But I doubt the delays were sprinkled randomly around the code for no reason, so while it seems to work based on your limited testing, performing the proposed change will probably break communication for a lot of people.

    For the log decode, maybe your tool could be merged with https://www.mysensors.org/build/parser ?
    An alternative might be to publish it as a google spreadsheet?


  • Mod

    @flyingdomotic I agree with @mfalkvidd. The delays follow the original nrf24 datasheet closely. If it works for you, it might break operation for many other users.
    Your changes likely mask the real issue.
    Did you try my suggestions?



  • @mfalkvidd: thanks for the pointer. For sure, I only did limited testing. However, I carefully read before the NRF24L01+ data-sheet, analyzed chronogram and tried to understand what may happens. But anyway, I must admit that I'm just a newbie in MySensors, and probably had wrong in the approach. Concerning Google spreadsheet, it's not able to understand VBA, and my code dirty code use it hardly πŸ˜‰

    @Yveaux: I'm interesting by the reference in RF24 datasheet that may explain the delays I suppressed, to learn about my error, as I miss them. As of now, I'm thinking that RF24 with auto ACK is properly designed, with symmetrical switch delay of 130 us, except if you turn CE down before sending ACK of the just received message. As the ACK may be repeated, I choose to use the same delay as ARD, even if the probability is not so high. However, you're probably right telling that will break operations for other users. I'll let you decide if adding some delay at top of stop_listening function may help or not. And yes, I tried to apply your reco first, but it didn't worked. I suspect the other delays influencing the issue I had, but removing them all won't worked too. It only worked after adding delay before turning CE down.

    Anyway, thanks to all for your help.

    Last thing, is someone can tell me how to add [SOLVED] in front of my post, that'll be nice!


  • Mod

    @flyingdomotic I agree that references to relevant part of the datasheet for each delay would be good. Now they are just magic numbers. Without such references, understanding, reviewing, extending and maintaining the code is extremely hard.

    For [solved], edit your first post and change the headline (you might need to scroll up to see the headline if you are on mobile)


 

373
Online

8.3k
Users

9.1k
Topics

96.7k
Posts