[Solved] Non-healing !TSM:MSG:LEN error hours into run



  • So I'm having a strange error that occurs hours into my runtime. I've narrowed it down to a communications issue with the node that occurs when I start a small DC motor attached to my node. At first look, I figured there could be a power issue causing the voltage to dip at the radio, but I figured I had enough hold up to bridge the in-rush of the motor start. Here's a quick description of my setup:

    • Arduino Pro-Mini 3.3V 8MHz
    • NRF24 set to PA_MAX
    • Project powered by a 12V 3A power supply
    • Arduino power from a 7805 regulator with a 10uF cap
    • NRF24 powered from an L4931 3.3V voltage regulator with a 47uF cap
    • Motor powered straight from the 12V supply through a Mosfet controlled by the arduino
    • Arduino, Radio, and motor housed in same project box

    With that configuration, I'm assuming there isnt a power draw issue, perhaps an RF/EMI issue from the brushed motor being in close proximity to the radio. With that said, I was under the impression that the network would heal itself and missed/partial messages would not cause a latched issue like I'm seeing. I'm even doing a isTransportReady() check prior to sending messages and that check is not failing.

    Here is what I'm seeing in the debug:

    Startup and steady state snippet for reference... no issues here:

    [Mon Oct 24 07:50:53.749 2016] 0 MCO:BGN:INIT NODE,CP=RNNNA--,VER=2.0.1-beta
    [Mon Oct 24 07:50:53.856 2016] 4 TSM:INIT
    [Mon Oct 24 07:50:53.856 2016] 4 TSM:INIT:TDC
    [Mon Oct 24 07:50:53.856 2016] 12 TSM:INIT:TSP OK
    [Mon Oct 24 07:50:53.856 2016] 14 TSM:INIT:STATID=1
    [Mon Oct 24 07:50:53.856 2016] 16 TSF:SID:OK,ID=1
    [Mon Oct 24 07:50:53.856 2016] 18 TSM:FPAR
    [Mon Oct 24 07:50:53.856 2016] 20 TSM:FPAR:STATP=0
    [Mon Oct 24 07:50:53.856 2016] 22 TSM:ID
    [Mon Oct 24 07:50:53.856 2016] 22 TSM:ID:OK
    [Mon Oct 24 07:50:53.856 2016] 24 TSM:UPL:DISABLED
    [Mon Oct 24 07:50:53.856 2016] 26 TSM:READY
    [Mon Oct 24 07:50:53.856 2016] 30 TSF:MSG:SEND,1-1-0-0,s=255,c=3,t=15,pt=6,l=2,sg=0,ft=0,st=OK:0100
    [Mon Oct 24 07:50:53.856 2016] 43 TSF:MSG:SEND,1-1-0-0,s=255,c=0,t=17,pt=0,l=10,sg=0,ft=0,st=OK:2.0.1-beta
    [Mon Oct 24 07:50:53.856 2016] 53 TSF:MSG:SEND,1-1-0-0,s=255,c=3,t=6,pt=1,l=1,sg=0,ft=0,st=OK:0
    [Mon Oct 24 07:50:53.856 2016] 61 TSF:MSG:READ,0-0-1,s=255,c=3,t=15,pt=6,l=2,sg=0:0100
    [Mon Oct 24 07:50:53.856 2016] 75 TSF:MSG:READ,0-0-1,s=255,c=3,t=6,pt=0,l=1,sg=0:I
    [Mon Oct 24 07:50:53.856 2016] 81 MCO:REG:NOT NEEDED
    [Mon Oct 24 07:50:53.856 2016] 83 MCO:BGN:STP
    [Mon Oct 24 07:50:53.856 2016] ... Starting Setup ...
    [Mon Oct 24 07:50:55.772 2016] 2007 TSF:MSG:SEND,1-1-0-0,s=255,c=3,t=1,pt=0,l=0,sg=0,ft=0,st=OK:
    [Mon Oct 24 07:50:55.772 2016] Requesting time from Gateway:
    [Mon Oct 24 07:50:55.772 2016] 2027 TSF:MSG:READ,0-0-1,s=255,c=3,t=1,pt=0,l=10,sg=0:1477295454
    [Mon Oct 24 07:50:55.803 2016] Time value received and updated...
    [Mon Oct 24 07:50:55.803 2016] Clock updated....
    [Mon Oct 24 07:50:55.803 2016] Sensor's time currently set to: 07:50:54am
    [Mon Oct 24 07:50:55.803 2016] 10/24/2016
    [Mon Oct 24 07:50:57.243 2016] Loading parameters from EEPROM from Month: 10  Day: 24
    [Mon Oct 24 07:50:57.243 2016] Fillcount: 5 and Filltime: 729
    [Mon Oct 24 07:50:57.243 2016] Initialization Complete!
    [Mon Oct 24 07:50:57.692 2016] 3942 TSF:MSG:SEND,1-1-0-0,s=255,c=3,t=11,pt=0,l=14,sg=0,ft=0,st=OK:Dog Water Bowl
    [Mon Oct 24 07:50:57.761 2016] 3952 TSF:MSG:SEND,1-1-0-0,s=255,c=3,t=12,pt=0,l=3,sg=0,ft=0,st=OK:1.0
    [Mon Oct 24 07:50:57.761 2016] 3962 TSF:MSG:SEND,1-1-0-0,s=1,c=0,t=3,pt=0,l=9,sg=0,ft=0,st=OK:Bowl Fill
    [Mon Oct 24 07:50:57.761 2016] 3973 TSF:MSG:SEND,1-1-0-0,s=2,c=0,t=36,pt=0,l=5,sg=0,ft=0,st=OK:Count
    [Mon Oct 24 07:50:57.761 2016] 3981 TSF:MSG:SEND,1-1-0-0,s=3,c=0,t=36,pt=0,l=6,sg=0,ft=0,st=OK:Volume
    [Mon Oct 24 07:50:59.520 2016] ...Starting main loop...
    [Mon Oct 24 07:50:59.546 2016] 5758 MCO:BGN:INIT OK,ID=1,PAR=0,DIS=1,REG=1
    [Mon Oct 24 07:50:59.546 2016] 07:50:57am: Checking periodic sensor data.
    [Mon Oct 24 07:50:59.571 2016] Sensor values:
    [Mon Oct 24 07:50:59.571 2016] A0 = 489
    [Mon Oct 24 07:50:59.571 2016] A1 = 748
    [Mon Oct 24 07:50:59.571 2016] A2 = 894
    [Mon Oct 24 07:50:59.571 2016] Bowl water level: Low
    [Mon Oct 24 07:51:04.253 2016] 10475 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 5 time(s)
    [Mon Oct 24 07:51:04.300 2016] 10485 TSF:MSG:READ,0-0-1,s=2,c=1,t=47,pt=0,l=21,sg=0:Bowl filled 5 time(s)
    [Mon Oct 24 07:51:04.300 2016] 10493 TSF:MSG:ACK
    [Mon Oct 24 07:51:04.300 2016] This is an ack from gateway
    [Mon Oct 24 07:51:04.300 2016] 10504 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 794 mL
    [Mon Oct 24 07:51:04.300 2016] 10512 TSF:MSG:READ,0-0-1,s=3,c=1,t=47,pt=0,l=13,sg=0:Filled 794 mL
    [Mon Oct 24 07:51:04.300 2016] 10520 TSF:MSG:ACK
    [Mon Oct 24 07:51:04.300 2016] This is an ack from gateway
    [Mon Oct 24 07:51:09.305 2016] 15486 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 5 time(s)
    [Mon Oct 24 07:51:09.305 2016] 15497 TSF:MSG:READ,0-0-1,s=2,c=1,t=47,pt=0,l=21,sg=0:Bowl filled 5 time(s)
    [Mon Oct 24 07:51:09.305 2016] 15505 TSF:MSG:ACK
    [Mon Oct 24 07:51:09.305 2016] This is an ack from gateway
    [Mon Oct 24 07:51:09.305 2016] 15517 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 794 mL
    [Mon Oct 24 07:51:09.336 2016] 15525 TSF:MSG:READ,0-0-1,s=3,c=1,t=47,pt=0,l=13,sg=0:Filled 794 mL
    [Mon Oct 24 07:51:09.336 2016] 15534 TSF:MSG:ACK
    [Mon Oct 24 07:51:09.336 2016] This is an ack from gateway
    

    Here is a snippet just prior to starting the motor followed by the issue I'm seeing:

    [Mon Oct 24 07:55:24.861 2016] 270477 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 5 time(s)
    [Mon Oct 24 07:55:24.861 2016] 270487 TSF:MSG:READ,0-0-1,s=2,c=1,t=47,pt=0,l=21,sg=0:Bowl filled 5 time(s)
    [Mon Oct 24 07:55:24.861 2016] 270495 TSF:MSG:ACK
    [Mon Oct 24 07:55:24.861 2016] This is an ack from gateway
    [Mon Oct 24 07:55:24.861 2016] 270503 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 794 mL
    [Mon Oct 24 07:55:24.861 2016] 270514 TSF:MSG:READ,0-0-1,s=3,c=1,t=47,pt=0,l=13,sg=0:Filled 794 mL
    [Mon Oct 24 07:55:24.861 2016] 270520 TSF:MSG:ACK
    [Mon Oct 24 07:55:24.861 2016] This is an ack from gateway
    [Mon Oct 24 07:55:29.874 2016] 275474 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 5 time(s)
    [Mon Oct 24 07:55:29.874 2016] 275484 TSF:MSG:READ,0-0-1,s=2,c=1,t=47,pt=0,l=21,sg=0:Bowl filled 5 time(s)
    [Mon Oct 24 07:55:29.874 2016] 275492 TSF:MSG:ACK
    [Mon Oct 24 07:55:29.874 2016] This is an ack from gateway
    [Mon Oct 24 07:55:29.874 2016] 275499 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 794 mL
    [Mon Oct 24 07:55:29.874 2016] 275509 TSF:MSG:READ,0-0-1,s=3,c=1,t=47,pt=0,l=13,sg=0:Filled 794 mL
    [Mon Oct 24 07:55:29.874 2016] 275517 TSF:MSG:ACK
    [Mon Oct 24 07:55:29.874 2016] This is an ack from gateway
    [Mon Oct 24 07:55:30.007 2016] 07:55:27am: Checking periodic sensor data.
    [Mon Oct 24 07:55:30.094 2016] Sensor values:
    [Mon Oct 24 07:55:30.094 2016] A0 = 777
    [Mon Oct 24 07:55:30.094 2016] A1 = 752
    [Mon Oct 24 07:55:30.094 2016] A2 = 914
    [Mon Oct 24 07:55:30.094 2016] Bowl water level: Empty
    [Mon Oct 24 07:55:30.094 2016] 07:55:27am: Starting Motor
    [Mon Oct 24 07:55:30.094 2016] Bowl filled 6 time(s) today.
    [Mon Oct 24 07:55:30.094 2016] 275701 TSF:MSG:SEND,1-1-0-0,s=1,c=1,t=2,pt=1,l=1,sg=0,ft=0,st=OK:1
    [Mon Oct 24 07:55:30.094 2016] 275712 TSF:MSG:READ,0-0-1,s=1,c=1,t=2,pt=1,l=1,sg=0:1
    [Mon Oct 24 07:55:30.094 2016] 275718 TSF:MSG:ACK
    [Mon Oct 24 07:55:30.094 2016] This is an ack from gateway
    [Mon Oct 24 07:55:30.094 2016] 07:55:27am: Incoming change for sensor:1, New status: 1
    [Mon Oct 24 07:55:30.094 2016] 275765 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.110 2016] 275771 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.402 2016] 275984 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.402 2016] 275988 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.402 2016] 275998 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.402 2016] 276004 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.402 2016] 276011 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.402 2016] 276017 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.402 2016] 276023 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.402 2016] 276029 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.402 2016] 276058 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.434 2016] 276064 !TSF:MSG:LEN,8!=7
    [Mon Oct 24 07:55:30.434 2016] 276074 TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    [Mon Oct 24 07:55:30.434 2016] 276080 !TSF:MSG:LEN,8!=7
    

    That is what I get continuously until I send another set() message for my sensor 1, at which point my sensor set messages seem to go out, but i stop getting ack... it changes to:

    [Mon Oct 24 07:58:30.366 2016] 455651 TSF:MSG:READ,0-0-0,s=0,c=0,t=0,pt=0,l=0,sg=0:
    [Mon Oct 24 07:58:30.381 2016] 455657 !TSF:MSG:LEN,28!=7
    [Mon Oct 24 07:58:30.435 2016] 07:58:27am: Fill timer reached.  Stopping motor.
    [Mon Oct 24 07:58:30.435 2016] 455677 TSF:MSG:SEND,1-1-0-0,s=1,c=1,t=2,pt=1,l=1,sg=0,ft=0,st=OK:0
    [Mon Oct 24 07:58:30.435 2016] 07:58:27am: Saving the following values to EEPROM: 
    [Mon Oct 24 07:58:30.435 2016] Fill Time: 909
    [Mon Oct 24 07:58:30.435 2016] 6, 3, 141, 10, 24
    [Mon Oct 24 07:58:35.202 2016] 460476 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 6 time(s)
    [Mon Oct 24 07:58:35.240 2016] 460496 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 990 mL
    [Mon Oct 24 07:58:40.243 2016] 465471 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 6 time(s)
    [Mon Oct 24 07:58:40.243 2016] 465489 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 990 mL
    [Mon Oct 24 07:58:45.233 2016] 470470 TSF:MSG:SEND,1-1-0-0,s=2,c=1,t=47,pt=0,l=21,sg=0,ft=0,st=OK:Bowl filled 6 time(s)
    [Mon Oct 24 07:58:45.233 2016] 470491 TSF:MSG:SEND,1-1-0-0,s=3,c=1,t=47,pt=0,l=13,sg=0,ft=0,st=OK:Filled 990 mL
    

    I know I've posted quite a bit, but i'm stumpped... Tonight i think i'm going to try to add a delay between starting the motor and sending the status messages and cross my fingers.

    If anyone has some insight on anything i can do to reset all the communications to clear this latched fault, please let me know.

    Thanks!!!


  • Mod

    @d4rkr3b3l said:

    TSF:MSG:READ,1-1-1,s=1,c=1,t=1,pt=0,l=0,sg=0:
    !TSF:MSG:LEN,8!=7

    Seems like communication between the Arduino and the Radio, or the radio itself has gone completely haywire.
    The MSG:READ reports only ones and zeroes.
    If the communication with the radio fails it will probably not recover itself automatically...
    I would surely think towards RF/EMI issue, as you already suggested.



  • I agree something is getting fubar'd. I think the snippets that I pulled only show 1's or 0's but there are other parts of the log that show other data... from the same power cycle:

    [Mon Oct 24 07:55:39.987 2016] 285577 TSF:MSG:READ,76-76-76,s=76,c=4,t=76,pt=2,l=9,sg=1:19532
    [Mon Oct 24 07:55:39.987 2016] 285583 !TSF:MSG:LEN,20!=32
    [Mon Oct 24 07:55:39.987 2016] 285593 TSF:MSG:READ,76-76-76,s=76,c=4,t=76,pt=2,l=9,sg=1:19532
    [Mon Oct 24 07:55:39.987 2016] 285599 !TSF:MSG:LEN,20!=32
    [Mon Oct 24 07:55:40.008 2016] 285624 TSF:MSG:READ,76-76-76,s=76,c=4,t=76,pt=2,l=9,sg=1:19532
    [Mon Oct 24 07:55:40.008 2016] 285632 !TSF:MSG:LEN,20!=32
    [Mon Oct 24 07:55:40.050 2016] 285667 TSF:MSG:READ,76-76-76,s=76,c=4,t=76,pt=2,l=9,sg=1:19532
    [Mon Oct 24 07:55:40.050 2016] 285673 !TSF:MSG:LEN,20!=32
    [Mon Oct 24 07:55:40.087 2016] 285689 TSF:MSG:READ,76-76-76,s=76,c=4,t=76,pt=2,l=9,sg=1:19532
    [Mon Oct 24 07:55:40.087 2016] 285696 !TSF:MSG:LEN,20!=32
    

    So, I thought that the communication between the Arduino and the radio was monitored and any errors would result in a transport layer error... I do a isTransportReady() and that does not ever fail.

    Or is that not in the scope of the transport layer monitoring?


  • Mod

    @d4rkr3b3l nRF has no checksum over its spi communication, so data getting messed up can only be detected from context (e.g. An invalid length as the library complains about).
    However, you are getting multiple corrupted messages in a row, which are all identical. Most probably the radio gets messed up internally and reports wrong data to the library.
    The nRF driver contains a method called RF24_sanityCheck() which tests if the communication to the radio still seems valid.
    This is automatically called at regular intervals if you define MY_RF24_SANITY_CHECK before the mysensors include in your sketch.
    @tekka please correct me if I'm wrong 😉



  • @Yveaux Thanks for the info... I'll add the sanity check and give it a try.


  • Admin

    @Yveaux You're totally right 👍
    @d4rkr3b3l GW and repeater nodes have regular sanity checks enabled by default, for a regular node this feature needs to be enabled as mentioned above. In addition, you can define the interval by setting MY_TRANSPORT_SANITY_CHECK_INTERVAL_MS.



  • So I seem to have figured out my issue. Long story short, it was the brushed motor putting noise on the power lines going to the nrf24 radio.

    I was able to isolate the root cause to be a conducted susceptibility of the radio (ie. The motor was inducing noise onto the power lines of the radio) which caused communication issues between the radio and the arduino. Due to not having an O'scope at home to characterize the noise I was isolating the issue through process of elimination. I was able to get the same message length errors by unplugging the power lines to he radio after communications had been established. I then tried running my sketch, with the motor unplugged, and I never got the errors. I also powered the motor through a separate external power supply and never got the errors. This last step showed that there was a conducted issue instead of a radiated issue from the sparking of the brushes inside the motor.

    I tried to clean up the power lines by doing the following, none of which caused the error to go away:

    • added 2 additional .1uF caps to the motor (already had 1 from motor terminal to terminal, added 1 from each terminal to motor case)
    • added a 100uF cap on he 3.3v bus of he arduino
    • added additional 100nF and 10nF ceramic caps to the radio terminals to filter out higher frequencies that he 100uF and 47uF caps can't filter.

    After doing all this I was surprised that the errors persisted, so I focused on cleaning up the regulation. After I did the following, most of my issues went away:

    • moved ceramic caps (10nF and 10nF) directly on the nrf24 board.
    • upgraded my 3.3v regulator to an LD33 TO-220 package
    • added a series inductor (33uH) after the regulator and before the 147uF worth of of bypass caps

    After I did this, most of the msg length errors went away with the exception of 1 or 2 right when i turn the motor on. I added a second 33uH inductor bringing my total series inductance to 66uH. With this last change all my message length errors went away.


  • Mod

    @d4rkr3b3l Good to hear you solved it!
    And thanks for reporting back how you solved it 👍


Log in to reply
 

Suggested Topics

  • 3
  • 6
  • 1
  • 2
  • 5

52
Online

11.4k
Users

11.1k
Topics

112.6k
Posts