MQTT - RFM69 Gateway stops communicating randomly and doesn't recover
I have built a MySensors network based on RFM69H modules, with an Ethernet MQTT Gateway built on Leonardo which I detailed here: https://forum.mysensors.org/topic/6249/mqtt-ethernet-gateway-using-leonardo-32u4-w5100-rfm69h-hard-spi
My three sensor nodes are mains-powered, a mix of Atmega328 and 32u4 doing temperature, environment, energy and HVAC control.
I'm building on Windows 7, Arduino 1.8.1, MySensors 2.1.1
It works really great most of the time, but I have an issue which has cropped up randomly and is starting to bug me a lot. Occasionally it seems like the gateway will just stop sending or receiving radio messages. None of the sensor reports get in nor do control signals get out. To resolve this I have to power cycle the gateway, and since I'm often away for long periods my system is completely paralysed until then. Sometimes it will work for a week or more without issue, and sometimes it will occur several times a day. I haven't been able to correlate the occurrences with anything that happens on the sensor network or the OpenHAB bus. Once I reset the gateway all of my other nodes re-join the network and start working again, so for now I have assumed they're all working properly and the fault lies within the gateway.
I started logging the debug output from the gateway, and the issue has occurred twice since then. Both times it seems to be related to the radio receiving a garbage packet of some kind. After that, it seems like the gateway code cannot talk to the radio module correctly, but it doesn't do anything about it, e.g. reset the radio. It passes the sanity checks and assumes everything is good.
Here are two examples of the garbage packets being received:
0;255;3;0;9;TSF:MSG:READ,11-47-153,s=240,c=4,t=205,pt=6,l=22,sg=0:A4709004C242000524F00D0381C00057D048B18C0808
0;255;3;0;9;!TSF:MSG:LEN,59!=29
0;255;3;0;9;TSM:READY:NWD REQ
0;255;3;0;9;TSF:MSG:SEND,0-0-255-255,s=255,c=3,t=20,pt=0,l=0,sg=0,ft=0,st=OK:
0;255;3;0;9;TSF:SAN:OK

0;255;3;0;9;TSF:MSG:READ,144-0-0,s=160,c=5,t=0,pt=5,l=0,sg=1:117440612
0;255;3;0;9;!TSF:MSG:LEN,61!=32
0;255;3;0;9;TSF:SAN:OK
0;255;3;0;9;TSM:READY:NWD REQ
0;255;3;0;9;TSF:MSG:SEND,0-0-255-255,s=255,c=3,t=20,pt=0,l=0,sg=0,ft=0,st=OK:
0;255;3;0;9;TSF:SAN:OK
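For anyone who wants to search a saved gateway log for the same kind of event, here is a minimal sketch in Python. It assumes the debug output has been captured with one log entry per line, and that the two numbers in `!TSF:MSG:LEN,59!=29` are the received length and the expected length (that reading of the format is an assumption, not something confirmed in this thread). The function name `find_length_errors` is made up for illustration.

```python
import re

# Matches the length-mismatch error seen in the logs above,
# e.g. "!TSF:MSG:LEN,59!=29". Interpreting the two numbers as
# (received, expected) is an assumption about the log format.
LEN_ERROR = re.compile(r"!TSF:MSG:LEN,(\d+)!=(\d+)")


def find_length_errors(log_lines):
    """Return (line_index, first_number, second_number) for each LEN error."""
    errors = []
    for index, line in enumerate(log_lines):
        match = LEN_ERROR.search(line)
        if match:
            errors.append((index, int(match.group(1)), int(match.group(2))))
    return errors


if __name__ == "__main__":
    sample = [
        "0;255;3;0;9;TSF:MSG:READ,144-0-0,s=160,c=5,t=0,pt=5,l=0,sg=1:117440612",
        "0;255;3;0;9;!TSF:MSG:LEN,61!=32",
        "0;255;3;0;9;TSF:SAN:OK",
    ]
    print(find_length_errors(sample))  # prints [(1, 61, 32)]
```

Running this over a longer capture would show whether every lock-up is preceded by one of these errors, or whether some length errors are survived without a hang.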
The NWD - MSG SEND - SAN OK cycle repeats continuously thereafter, sometimes punctuated by attempts from the controller to send commands, which never get out. For example:
0;255;3;0;9;Message arrived on topic: sensors-in/2/4/1/0/2
0;255;3;0;9;!TSF:MSG:SEND,0-0-2-2,s=4,c=1,t=2,pt=0,l=1,sg=0,ft=0,st=NACK:0
0;255;3;0;9;TSM:READY:NWD REQ
0;255;3;0;9;TSF:MSG:SEND,0-0-255-255,s=255,c=3,t=20,pt=0,l=0,sg=0,ft=0,st=OK:
0;255;3;0;9;TSF:SAN:OK
It looks like the network discovery routine is succeeding, but I never see any activity on the gateway LEDs after the fault occurs so I assume it's returning a successful result without actually interrogating the network at all (maybe that's by design? I haven't gone digging to see).
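The stuck state described above has a recognisable shape in the log: after the last length error there are no more `TSF:MSG:READ` entries, while sends to specific nodes come back with `st=NACK`. A rough sketch of a check for that pattern follows. It assumes one log entry per line, and the function name `summarise_after_last_length_error` is hypothetical. It only summarises what the log shows and does not prove the radio is at fault.

```python
def summarise_after_last_length_error(log_lines):
    """Summarise radio activity logged after the last !TSF:MSG:LEN error.

    Returns None if the log contains no length error, otherwise a dict
    with the number of received messages ("reads") and failed sends
    ("nacks") that appear after that error.
    """
    last_error = None
    for index, line in enumerate(log_lines):
        if "!TSF:MSG:LEN" in line:
            last_error = index
    if last_error is None:
        return None

    tail = log_lines[last_error + 1:]
    return {
        "reads": sum("TSF:MSG:READ" in line for line in tail),
        "nacks": sum("st=NACK" in line for line in tail),
    }


if __name__ == "__main__":
    sample = [
        "0;255;3;0;9;!TSF:MSG:LEN,61!=32",
        "0;255;3;0;9;TSF:SAN:OK",
        "0;255;3;0;9;TSM:READY:NWD REQ",
        "0;255;3;0;9;!TSF:MSG:SEND,0-0-2-2,s=4,c=1,t=2,pt=0,l=1,sg=0,ft=0,st=NACK:0",
    ]
    print(summarise_after_last_length_error(sample))  # prints {'reads': 0, 'nacks': 1}
```

A result with zero reads and a growing NACK count over a long stretch of log matches the failure described here. A healthy gateway should show reads resuming soon after the error.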
The garbage packets could possibly be related to the Oregon Scientific weather station I have on my roof. I believe it communicates using PAM at 433MHz, and I wouldn't be surprised if it is unfriendly about sharing the spectrum and just talks over the MyS packets. This would fit with the randomness of the failure, since it requires an OS weather sensor transmission to clash with a MyS sensor message. I would've thought (hoped) that the gateway would be able to ignore and recover from such an event though.
Try adding the watchdog timer, which will reset the gateway in case the code hangs somewhere. Or you could move to MySensors 2.2.0, which has some bug fixes.
@gohan Thanks, I didn't realise 2.2 had been released yet.
Edit: My bad, 2.2 hasn't been released; it's the dev branch.
Yes, it's the dev branch, but it is working. Give it a try.
I have upgraded the whole network to 2.2.0-beta using the new RFM69 driver. I started by upgrading just the gateway but it wasn't backwards compatible with the sensors. Maybe the new radio driver changes something.
Will wait and see if it fails again.
If you have upgraded to the new RFM69 driver, you need all nodes and the gateway updated too.
@scalz Thanks, yeah, I have upgraded all the nodes and they're all talking again on the new driver.
So far so good. I'm feeling tentatively like this problem may be solved in 2.2-beta, and my sensor network is starting to gain my trust again.