Problem with Recursive calls on signed node (Solved)



  • Dear All,
    I have a newish issue with a node, which only seems to have manifested since I upgraded to 2.3.2 on the node (gateway still on 2.2.0)
    The node is my "smart" thermostat, which has been operating at various revision levels for 3 years or more, generally without issue. this node platform is a Teensy 3.2 operating at 24MHz see this post
    Since upgrading to 2.3.2 to "resolve" issues previously encountered forum entry here although this particular node doesn't sleep, I am in the process of migrating nodes to 2.3.2. Subsequent to a code / library version revision to this thermostat node, it has started to "lock up / stop transmitting " of a semi regular basis. I recently asked about determining that a node is "off line" from the nodes perspective here and was spending a few happy hours playing, and was musing about leaving a laptop connected to the serial OP of the node and logging to file the output, to assist in fault finding. whilst doing this I have had the node connected to my PC for several hours, and have seen several occurrances of the situation described here and referenced above.
    example

    6173359 TSF:MSG:READ,0-0-10,s=1,c=3,t=16,pt=0,l=0,sg=0:
    6173369 TSF:MSG:SEND,10-10-0-0,s=255,c=3,t=17,pt=6,l=25,sg=0,ft=1,st=OK:<NONCE>
    6174314 TSF:MSG:READ,0-0-10,s=1,c=1,t=45,pt=0,l=4,sg=1:19.0
    6174316 TSF:MSG:ECHO REQ
    6175680 !TSF:MSG:SEND,10-10-0-0,s=1,c=1,t=45,pt=0,l=4,sg=0,ft=0,st=NACK:19.0
    *InMsgty :45 MsgComd:1 childID:1 Data:S/19.0 Data:I/ 19 Data:B/ 1
     Incoming SET TempSP:19.00
    6176899 !TSF:MSG:SEND,10-10-0-0,s=1,c=1,t=45,pt=7,l=5,sg=0,ft=1,st=NACK:19.0
    TX Error 1
    6176899 !MCO:WAI:RC=1
    6176899 !MCO:PRO:RC=1
    6176899 !MCO:PRO:RC=1
    6176899 !MCO:PRO:RC=1
    6176899 !MCO:PRO:RC=1
    6176899 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    6176900 !MCO:PRO:RC=1
    

    there are THOUSANDS of such lines, sequentially. way too many to put in a post, although I can make the full log available on a file share
    As I understand it this is an issue relating to 2.3.2 and signing.

    Is there any liklyhood of this being resolved (fixed) in the immediate future?
    If not, am I best to roll back to a previous revision of MySensors? If so, which?
    I should obviously prefer NOT to turn off signing for this node.
    The node is currently falling over every day to a week, which is clearly undesirable. This node prior to changing to 2.32. would go for many months, without falling over, the "issues" I have had related to not receiving data in particular (now mostly resolved by workarounds)

    I know this is going to be a awkward issue to help with, bur many thanks in advance...

    Regards
    Nigel



  • @Nigel31 These messages are warnings that you are recursively calling the wait() function.
    I think this warning is only issued since version 2.3.2.
    In older versions there is certainly the same problem, but you do not receive a warning.

    Under certain circumstances, this recursion can also occur during normal operation of the MyS lib. I think it is specifically in connection with the signing functionality. There was already a thread about this problem and the lib issue
    https://github.com/mysensors/MySensors/issues/1458.

    These "internal" warnings are unpleasant, but not a cause for concern.

    It is also possible that you are causing the recursion through your application program.
    This case is the real purpose of this warning.
    You should check whether you call the wait() function within your receive() function.



  • @virtualmkr

    Many thanks for the input. I don't use wait() within the receive function, however, as part of my "mittigation" for lost / failed coms, I have the following code. Do you believe looking at it, that the wait(repeatdelay); after the send(msg,ack); preceeding it, may end up with "collisions" in the case of a delayed response?

    If these occurrences are not harmful as such, I am not sure what else I may have done to cause the vast decrease in the node's stability.

    
    bool resend(MyMessage & msg, bool ack, int repeats)
    {
      if ((millis() - lastTXtime) < 200) wait(millis() - lastTXtime);
      Watchdog.reset();
      TX(1);// flash Tx iCON on HMI
      int repeat = 1;
      int repeatdelay = 0;
      volatile boolean sendOK = false;
    
      if (ack == false or ack == 0) {
        send(msg, false);
      }
      else {
        while ((sendOK == false) and (repeat < repeats)) {
          if (send(msg, ack)) {
            sendOK = true;
            //Serial.println(" Sent OK ACK Received");
          } else {
            sendOK = false;
            Serial.print("Tx Error ");
            Serial.println(repeat);
            repeatdelay += 200;
            if (repeatdelay >= 1000) {
              repeatdelay = 1000;
            }
            Watchdog.reset();
          }
          repeat++;
          wait(repeatdelay);
        }
        lastTXtime = millis();
      }
      return sendOK;
    }
    
    

    Also on requests

    bool RErequest(int Child, int Type, int repeats)
    {
      if ((millis() - lastRXtime) < 200) wait(millis() - lastRXtime);
      Watchdog.reset();
      int repeat = 1;
      int repeatdelay = 0;
      volatile boolean REQOK = false;
    
      while ((REQOK == false) and (repeat < repeats)) {
        if (request(Child, Type)) {
          REQOK = true;
          Serial.println(" Request OK ACK Received");
        } else {
          REQOK = false;
          Serial.print("Rx Error ");
          Serial.println(repeat);
          repeatdelay += 200;
          if (repeatdelay >= 1000) {
            repeatdelay = 1000;
          }
          Watchdog.reset();
        }
        repeat++;
        wait(repeatdelay);
        //wait(repeatdelay,message.getCommand(),message.getType());
      }
      lastRXtime = millis();
      return REQOK;
    }
    

    In an effrot to help with things, and to point either to my code, or the library version, I have rolled back the node to 2.2.0, only having to change isEcho() to isAck(), so I'll see how the stability goes.

    Many thanks again, Nigel



  • @Nigel31 Hi again, yes looking at your code these wait() calls inside a while loop may be the reason for the recursion. You could check this when you add debug messages before and after your waits. If then are in the log the recursion messages between your own wait begin/end messages you have the "smoking gun". In this case you should refactor your code to use state machines instead of the wait() functions. In general the loop() function with all sub function calls should always be as short as possible.



  • @virtualmkr
    The logs already show that preceding the recursive calls that TXError1 for example proceeds the entries. In and off itself I thought it a little suggestive, as I couldn't see any without a retransmitted effort. (There wouldn't be a entry at all if the first transmission successfully occurred)
    I'll wind things back to no retransmissions, and see how that behaves. I'm not sure how to use a state machine for this, I've got some reading and experimenting to do. Any hints or know of any other code out there in the mysensors world to check for successfull end to end transmissions, and consequently retransmissions?

    Many thanks for the input

    Regards Nigel



  • Dear all,

    I have removed code for "resending" if no hardware ack.

    I have come to the conclusion that having wait(200); or even wait(1000); after a send, but before another send, can result in the above recursive calls log entries, when "collision" occurs in the code,

    My node is now stable again, and running on 2.3.2.
    I still loose many transmissions to the node, but am compensating, (as I had prior to the resending code) by the setting of flags, both in the node and on the controller, such that the controller resends the data, untill the " I have had an update on child x" flag is reset, this works, although not in a near realtime way, but sufficient for the changing of setpoints on the thermostat.
    I did try a repeater, but that didn't help.

    Many thanks for the input
    Nigel


  • Contest Winner

    @Nigel31 Maybe it's best to use the wait outside a while loop. With this I mean. Right now you're calling a function that does the resends within the same while loop - so basically it blocks the main loop. The only thing I can think of, is that MySensors doesn't get time to execute all it needs to do before you call the next request.

    So if you'd rewrite the code a little bit so it does all the requests in different execution of the main loop() method. It might work much better. As I do it in a MessageQueue I've written and don't have the problems you have. Also sometimes it takes a while before a node starts using the repeater. If it's close enough to the gateway to get some responses. Than - if understand correctly - it will not start using the repeater until the repeater sees x-time of failures.


Log in to reply
 

Suggested Topics

  • 3
  • 1
  • 2
  • 2
  • 2
  • 24

0
Online

11.2k
Users

11.1k
Topics

112.5k
Posts