pi gateway just stops communicating



  • I have a Pi gateway I just built this week.

    A strange scenario has presented itself. I'm not sure how to proceed, but it's definitely reducing my reliability.

    So, the status of the mysgw service is running.

    [Image: 0_1554052479726_639bb302-b256-4d90-8873-bfee088a1b71-image.png]

    But as you can see, the last log message was last night at 11:06 PM, even though the service is still running. I can see the process in ps.

    [Image: 0_1554052529117_39878c5a-7d16-4100-a3c0-fbb581caee80-image.png]

    The log output just abruptly stopped a few minutes before that. (I did a cat on the log.)

    [Image: 0_1554052586005_e02871c0-a7b8-4d83-a127-2761c14d5f5d-image.png]

    It's in debug mode right now, so the log is large, but it's currently only 78 MB in total.

    As of right now the state is exactly like that. I have not killed the process to restart it. I wanted to leave it exactly as is in case anyone has any ideas.

    Is there any sort of watchdog I could set up for this, maybe?

    Thanks!


  • Mod

    @crankycoder which version are you using? The stable branch does not flush the log file after each line, so there can be a lot of data that has not yet been written to the log file (multiple hours at least).

    Apart from not writing to the log, does the gateway still forward messages to/from your controller? If it does, the problem is probably just with the logging.

    If you can, switch to the development version.

    Pull request: https://github.com/mysensors/MySensors/pull/1269
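
To illustrate the buffering effect described above: the sketch below is not MySensors code, just a generic Python demonstration of how a writer that does not flush after each line can leave a log file looking stale to cat or tail. The helper name `logged_bytes_on_disk` and the 64 KiB buffer size are made up for the example.

```python
import os
import tempfile


def logged_bytes_on_disk(lines, flush_each_line):
    """Write lines to a temp log file and return the size another process
    (cat, tail) would see while the writer still has the file open."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        # The large buffer stands in for a logger that buffers its output.
        with open(path, "w", buffering=64 * 1024, newline="\n") as log:
            for line in lines:
                log.write(line + "\n")
                if flush_each_line:
                    log.flush()  # push the buffered text out to the file
            # Measured before the file is closed, i.e. while "running".
            return os.path.getsize(path)
    finally:
        os.remove(path)


sample = ["TSF:MSG:READ,200-200-0,s=2,c=1,t=23,pt=2,l=2,sg=0:25"] * 10
print("visible without flush:", logged_bytes_on_disk(sample, False))
print("visible with flush:   ", logged_bytes_on_disk(sample, True))
```

Without the flush, the lines sit in memory until the buffer fills or the file is closed, so a quiet log on its own does not prove the gateway has stopped.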



  • Sorry, I spaced on that. It's not sending messages any more; it's as if the process has hung.

    This is the version 2.3.1 MQTT Pi gateway.



  • @CrankyCoder
    I have experienced the exact same scenario many times.

    Can you give more details on your gateway and controller setup? It might help.

    Here is what I have tried to solve the issue so far.

    1. Tried many different microSD cards; I now boot and run the Pi 3 from an SSD.
    2. Added a heatsink and fan to the Pi.
    3. Used a quality power supply and many capacitors to smooth and de-glitch the power rails.
    4. Used the RPi serial ports via GPIO; currently using a USB-to-TTL converter instead.
    5. Tried many different nRF24L01+ modules, all with added capacitors.
    6. Tried many power cables for the system, as some are rather thin or poor quality.
    7. Replaced the gateway Pro Mini with several new ones, including 3.3 V and 5 V versions.
    8. Added power rail capacitors on the Pro Mini board too.
    9. Replaced as many Dupont connectors as possible with soldered wires.

    Maybe something in that list will help you, it's always worth a try.



  • This is running the Pi gateway with the radio connected directly to the Pi. No Pro Mini.

    Brand new Pi 3 power supply. I swapped out my 4.7 µF capacitor for a 47 µF one and thought that fixed it, but mysgw stopped working again today at 3:54 PM.

    A service restart fixed it. I didn't reboot the Pi or do anything other than issue a restart of the mysgw service.

    So I have to think maybe it's the software. I may have to go back a version from 2.3.1 to something prior, or maybe roll forward to the dev branch.



  • @mfalkvidd gonna try this now.



  • @crankycoder Crossing my fingers for you!

    I started with the nRF attached to the Pi, but there were plenty of problems, and I went to a Pro Mini as I found that the Pi 3 HW serial ports are actually software now (they nicked the HW serial for BT/WiFi) - Humph!

    You should also know that you can't do hardware signing with nrf attached to pi directly (afaik), so that might be a consideration at some point.

    Anyway, good luck with the dev branch and do let us know the results.



  • The dev branch was installed last night. So far so good. Logs are flushing to disk much faster, as previously indicated. But I did see this when I went back to the Pi gateway code... lol

    [Image: 0_1554212072217_848fd52a-12a5-484c-b6c5-1de236281b9e-image.png]

    I wonder if that's what I am up against haha. I'll post an update over the next couple of days.



  • My two cents worth:

    I have been battling this issue as well for the last couple of days.
    I am now using the master version on an RPi 1, a really old one. It worked for a few hours (anywhere from 1 to 10) and then stopped communicating.

    The discovery I made today was that I goofed up and built the MQTT version of the gateway with the option --my-rf24-irq-pin=15 set in the config file. All would have been well if I had not forgotten to connect the IRQ from the RF24 to pin 15 on the RPi. Stupid!

    So, I rebuilt the gw today without this option and it is still running well after a few hours. We'll see tomorrow.



  • I read that the IRQ thing may not be necessary. I haven't set up the IRQ on mine. I don't have enough traffic for the IRQ to be needed yet.

    Anyone else using the IRQ on the Pi gateway?



  • @CrankyCoder IRQ made no difference when I tried it. But that was over a year ago now.



  • So I have a new development. I noticed something strange. The mysgw stopped again today. I went digging through the syslog around the same time. I didn't see anything just before it went offline, but I did notice that shortly after it went offline, the log said the carrier dropped for the NIC.

    Apr  2 15:50:55 raspberrypi mysgw: TSF:MSG:READ,200-200-0,s=2,c=2,t=25,pt=0,l=0,sg=0:
    Apr  2 15:50:55 raspberrypi mysgw: GWT:TPS:TOPIC=mygateway1-out/200/2/2/0/25,MSG SENT
    Apr  2 15:51:07 raspberrypi mysgw: TSF:MSG:READ,200-200-0,s=2,c=1,t=23,pt=2,l=2,sg=0:25
    Apr  2 15:51:07 raspberrypi mysgw: GWT:TPS:TOPIC=mygateway1-out/200/2/1/0/23,MSG SENT
    Apr  2 15:52:27 raspberrypi dhcpcd[377]: eth0: carrier lost
    Apr  2 15:52:27 raspberrypi kernel: [  461.474897] smsc95xx 1-1.1:1.0 eth0: link down
    Apr  2 15:52:28 raspberrypi dhcpcd[377]: eth0: deleting address fe80::669:d81d:83fb:3aaa
    Apr  2 15:52:28 raspberrypi avahi-daemon[226]: Withdrawing address record for fe80::669:d81d:83fb:3aaa on eth0.
    Apr  2 15:52:28 raspberrypi avahi-daemon[226]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::669:d81d:83fb:3aaa.
    Apr  2 15:52:28 raspberrypi avahi-daemon[226]: Interface eth0.IPv6 no longer relevant for mDNS.
    Apr  2 15:52:28 raspberrypi dhcpcd[377]: eth0: deleting default route via 192.168.2.1
    Apr  2 15:52:28 raspberrypi dhcpcd[377]: eth0: deleting route to 192.168.2.0/24
    Apr  2 15:52:28 raspberrypi avahi-daemon[226]: Withdrawing address record for 192.168.2.71 on eth0.
    Apr  2 15:52:28 raspberrypi avahi-daemon[226]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.2.71.
    

    So I decided to check a few other times in the last 48 hours and, sure enough, same thing. It seems that something is causing the NIC to think it lost its connection to the switch, and mysgw then seems to go into a weird state. I'm not sure what error checking is in it or what I could do to log more, since I am already on debug for the mysgw logging.

    Next step, I guess, will be to assign a static IP instead of using a DHCP reservation and see if that changes anything, and to swap the network cable. Maybe hard set some things like the speed/duplex on the switch.

    Just figured I would throw out my latest finding.
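
A minimal sketch of how the syslog check above could be automated, assuming the classic syslog line format shown in the excerpt. Syslog lines carry no year, so the `year=2019` default is an assumption, and the function names are made up for illustration. It reports, for each "carrier lost" event, how long the gateway had already been silent.

```python
from datetime import datetime


def parse_syslog_time(line, year=2019):
    """Parse the 'Apr  2 15:50:55' prefix of a classic syslog line.
    The year is assumed, since syslog does not record it."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_before_carrier_loss(lines):
    """Seconds between the last mysgw line and each 'carrier lost' event."""
    gaps = []
    last_mysgw = None
    for line in lines:
        if " mysgw: " in line:
            last_mysgw = parse_syslog_time(line)
        elif "carrier lost" in line and last_mysgw is not None:
            delta = parse_syslog_time(line) - last_mysgw
            gaps.append(delta.total_seconds())
    return gaps


sample = [
    "Apr  2 15:51:07 raspberrypi mysgw: TSF:MSG:READ,200-200-0,s=2,c=1,t=23,pt=2,l=2,sg=0:25",
    "Apr  2 15:51:07 raspberrypi mysgw: GWT:TPS:TOPIC=mygateway1-out/200/2/1/0/23,MSG SENT",
    "Apr  2 15:52:27 raspberrypi dhcpcd[377]: eth0: carrier lost",
]
print(gaps_before_carrier_loss(sample))  # [80.0]
```

Run over the whole syslog, a consistent gap before each carrier drop would support the idea that the gateway stalls around the time of the link event rather than at random.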


  • Mod

    @crankycoder maybe adding MY_DEBUG_VERBOSE_GATEWAY and MY_DEBUG_VERBOSE_RF24 can give a clue to what is happening.



  • My Pi 3B does this as well, though it's few and far between. I even made a watchdog in openHAB to "ping" mysgw (requesting an ack); mysgw responds, but it doesn't send out new sensor readings. I thought about expanding the watchdog to watch for all sensor readings and trigger a restart of the gateway if it receives no readings in a 10-minute period, but mine doesn't fail nearly as consistently as yours, so I haven't worried about it too much...

    I would still like to see a resolution to this, as I've moved most of my PIRs and temp sensors to mysgw, so when it goes down it can be a bit of a problem.
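
A rough sketch of the "no readings in 10 minutes" watchdog idea, written in plain Python rather than openHAB rules. The class and function names are hypothetical, the service name `mysgw` is taken from this thread, and wiring `record_reading()` to an MQTT subscription or controller hook is left out on purpose.

```python
import subprocess
import time

# Assumes the gateway runs as a systemd service named "mysgw".
RESTART_COMMAND = ["systemctl", "restart", "mysgw"]


class ReadingWatchdog:
    """Tracks when the last sensor reading arrived."""

    def __init__(self, timeout_s=600, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_reading = clock()

    def record_reading(self):
        # Call this whenever any sensor reading arrives from the gateway.
        self.last_reading = self.clock()

    def is_stale(self):
        return self.clock() - self.last_reading > self.timeout_s


def check_and_restart(watchdog, run=subprocess.run):
    """Restart the gateway if no reading arrived within the timeout."""
    if watchdog.is_stale():
        run(RESTART_COMMAND, check=False)
        watchdog.record_reading()  # give the restarted gateway a fresh window
        return True
    return False


# Dry run with a fake clock and a fake runner, so nothing is restarted here.
fake_now = [0.0]
demo = ReadingWatchdog(timeout_s=600, clock=lambda: fake_now[0])
fake_now[0] = 601.0
print(check_and_restart(demo, run=lambda cmd, check=False: print("would run:", cmd)))
```

In real use, `check_and_restart(watchdog)` would be called periodically (for example once a minute) by a process with permission to restart the service.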



  • @mfalkvidd Do I need to modify the Pi code to add those in and recompile? Or can I add them to the config file in /etc/mysensors.conf?


  • Mod

    @crankycoder I think you can add them in MySensors.conf but I would add them to the ./configure command. See the section called "Advanced" at https://www.mysensors.org/build/raspberry



  • @mfalkvidd Compiling now. Side question: can the IP address be put in the config later? I hate that I have to hard code it at compile time.



  • Compiled with the new (very noisy) options lol

    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    Apr 03 13:50:48 DEBUG RF24:RBR:REG=23,VAL=17
    

    No idea what that means, but it seems that if I am seeing that, it's working lol.
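
A possible reading of those lines, assuming REG and VAL are printed in decimal and that RBR is a register read: register 23 is 0x17, which is FIFO_STATUS in the nRF24L01+ register map, and 17 is 0x11. The small decoder below (names made up for the example) shows which flags that value sets.

```python
# Bit layout of the nRF24L01+ FIFO_STATUS register (address 0x17).
FIFO_STATUS_REGISTER = 0x17  # 23 in decimal

FIFO_STATUS_BITS = {
    0: "RX_EMPTY",
    1: "RX_FULL",
    4: "TX_EMPTY",
    5: "TX_FULL",
    6: "TX_REUSE",
}


def decode_fifo_status(value):
    """Return the names of the FIFO_STATUS flags set in value."""
    return [
        name
        for bit, name in sorted(FIFO_STATUS_BITS.items())
        if value & (1 << bit)
    ]


print(decode_fifo_status(17))  # ['RX_EMPTY', 'TX_EMPTY']
```

Under those assumptions, the repeated line just means the gateway keeps polling the radio and finding both FIFOs empty, which fits an idle but responsive radio.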


  • Mod

    @crankycoder it doesn't seem to be possible to change the IP without recompiling. I checked https://www.mysensors.org/build/raspberry#configuration-file and the source code and couldn't find anything about IP.

    I have not thought about changing the IP before, since I only use my-gateway=serial. It would make sense to have the IP in the config file, so if someone wants to implement it, that would be great.

    A workaround might be to use --my-controller-url-address and a host name, which can then be added to /etc/hosts on the Raspberry Pi.

    Something like this should work:

    --my-controller-url-address=controller
    

    Then add the following to /etc/hosts:

    192.168.1.235  controller
    

    (assuming your controller has IP address 192.168.1.235). If you need to change the IP later, just change the IP in /etc/hosts and restart the gateway.
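
To make the indirection concrete, here is a small sketch of how a hosts-style file maps the name to an address. It only parses text passed to it and does not touch the real /etc/hosts; the function name `resolve_from_hosts` is hypothetical.

```python
def resolve_from_hosts(hosts_text, name):
    """Return the first IP address mapped to name in hosts-file text."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


hosts = """\
127.0.0.1      localhost
192.168.1.235  controller   # the address from the example above
"""
print(resolve_from_hosts(hosts, "controller"))  # 192.168.1.235
```

The gateway binary only ever sees the name controller, so moving the controller means editing one line in the hosts file rather than recompiling.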



  • So here's the latest.

    I checked and the Pi was already set to a static IP and not a DHCP reservation. However, I made 2 changes and have been OK for 3 days now.

    1. I changed ports on the switch. Not sure if it mattered, but I have plenty.
    2. This is the one I think may have more impact. Since it seemed the carrier was dropping but coming back, I started wondering if maybe something was triggering an auto-renegotiation of the NIC speed. So I went into the switch and hard set the port to 100 FULL instead of auto.

    I still have debug running and am going to let this run for quite some time, just to make sure it's OK.


 
