[solved] RS485 nodes stop sending data after some hours or days
-
Thx for your answers.
So one more update after having the "tripple-header-fix" implemented in Nodes 1+2:
- Everything seemed to work fine at first sight. I could even see Node_2 "overrun" Node_1: it took some hours until the two seconds my controller originally showed as the gap between them had shrunk to zero. At the point where Node_2 began to report first at times, I only lost two of the regularly sent messages. If this happened only once per day or so, it wouldn't be an issue. This was yesterday around 4pm.
- Then I noticed Node_2 stopped sending messages around 7:30pm; all others were still fine until 11pm.
- In the morning, I saw messages from Node_2 reported from around 5:24am, so there must have been some activity in between that then again stopped. All others still seemed to be fine.
- I then unplugged the RS485 module: first only GND+Vcc, but even so the LED stayed on. So I also unplugged the other side of the module => no messages; then I pressed the Arduino's reset button => node was online again.
- At around 7pm all nodes seemed to be offline; last messages: Node_1: 12:10:09, Node_2: 16:05:03, Node_3: 16:09:58, Node_4: 15:52:56. (All presentation messages had pretty old timestamps, so no spontaneous reboots had happened, see below.)
Tried now to
-- reset the GW (via FHEM-connect, this is what originally seemed to work as reported in one of the first posts)
-- reset Node_4 (button): not even a presentation message
Still nothing happened.
BUT THEN I checked whether Node_1 was still "alive" wrt the "normal" Arduino functionality (pir=>light): completely DEAD. So I pushed the reset button but left everything else untouched (in particular, power was not cut to any node, nor to the RS485 module on that particular node): all nodes were there again!
- The presentation messages from Node_1 and Node_4 were also renewed, but not those from Nodes 2+3, which hadn't been reset.
- Other data was then updated in the regular way, so nothing that could be interpreted as "retained" message or so was kept in memory
- For the last couple of minutes while writing this down, all nodes reported as expected.
So still no clue how to solve this. I will review hard- & software on Node_1 (and later on Node_2), especially wrt. powering.
-
Can you measure bus voltage when everything is "dead"?
Is it idle, or is one of the nodes pulling it up or down?
-
@pjr I'll try to do some measurements next time everything is really dead.
But I really doubt that this is only a bus problem; it may rather be an unlucky combination of at least two things:
- Bus:
Prior to having read your post this morning, I noticed everything being offline again. So I began to reset some of the nodes.
Some more background: Yesterday I noticed Node_2 was sending again when I reset Node_1, so my first attempt was to start with that one, blaming it for being somehow faulty and expecting the rest to show up automatically. It indeed started sending again, and so did Node_3 (without reset!). But Node_2 still showed no sign of life. So I also reset that one, again with the effect that it reported data as expected. Node_4 also showed no pir data, so I finally reset that one as well.
- Second possible root cause:
https://forum.mysensors.org/topic/7743/node-with-ds18b20-relay-dies-also-with-watchdog
3 of my nodes also have relay functionality, two of them with several DS18B20.
Now there's someone reporting nodes "dying" with the same combination of attached hardware...
The only exception here is Node_4: it has no temp sensor at all and is also the node with the least data to write to the bus. So the only node that comes back is the one without "relay" and with just a BME280.
-
@rejoe2 said in RS485 nodes stop sending data after some hours or days:
[...]
Now there's someone reporting nodes "dying" also with the same combination of attached hardware... [...]
Yes. Same combination of sensors. I did not totally understand your entire setup (sorry, I'm a bit of a noob :) ), but we have the same sensor combination.
I will swap the temp sensor for an SHT31 and, more important, the barebone ATmega for an Arduino Mini 3.3V. I will update asap.
Good luck with your investigation. Really interested :)
-
One more:
Node_2 stopped transmitting for a longer period during the night, but was online again some minutes ago.
Node_1 was not transmitting, but still showed pir functionality. So the code still seemed to work; just communication was broken.
Node_3 was also transmitting, most likely also after a period of inactivity.
Now I cut power to Node_1 and then measured 0.03 V between A+B. So I'll leave the other three nodes online and see if they work fine.
Most likely I will have to thoroughly review the entire wiring on Node_1 one more time, including the 1-Wire networks attached to it.
-
@rejoe2 I did not understand one thing: what MCU are you using? Barebone ATmega328? If yes... how is the BOD set up?
Tonight I re-flashed the bootloader on my faulty node with BOD @2.7V. It seems more stable, after about 7h. Just to say... an idea....
-
@sineverba All nodes are ATMega32 based, running at 16MHz, 5V, Chinese Arduino clones. The GW is an FTDI-based Nano, Node_1 is a CH340G Nano, the others are Pro Micros. Communication is via LC-Tech RS485 modules.
When I checked the states some minutes ago, the situation was as follows: Node_2 sent its last messages around 4:30pm; Node_4 had been reset at around the same time (no watchdog defined), but no pir messages were sent when entering the room, so it seemed to be offline. Node_3 was alive, voltage A+B: around 0.03V.
So now I pulled off the LC-Tech module on Node_2 and powered Node_1 up again. I'll see if and when this one goes offline. If this also leads to no clear conclusions, I will think about first adding some caps on 5V or changing the 12V power supply.
Or is it necessary to also completely remove the modules when there's no power to them?
Should I try to use an older board definition? (GWs with board defs starting from 1.6.13 had some reboot troubles until version 1.6.18 or so; this is pretty unfunny shooting in the dark....)
Other ideas or recommendations?
-
@pjr As Node_2 was not sending any data some minutes ago, I measured between A+B: 2.23V...
Then I powered everything down. Shortly after powering up again, I have around 0.03V.
What to do with this info?
-
@rejoe2 said in RS485 nodes stop sending data after some hours or days:
[...] between A+B I measured 2.23V... [...] Shortly after powering up again, I have around 0.03V. What to do with this info?
±200 mV is the magic number with RS485. An RS485 line has three states:
- Va - Vb < -0.2V = "1"
- Va - Vb > 0.2V = "0"
- |Va - Vb| < 0.2V = "idle"
As far as I know, the line should be in the idle state when nobody is sending.
So to me it looks like something is constantly pulling the line to state "1" or "0", depending on which way round you measured it. This could be caused by a faulty transceiver, a bug in the library code, a bug in your code...
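As a rough sketch of that rule in plain C++ (the names `classifyBus` and `looksStuck` are made up for illustration and do not come from MySensors or any RS485 library), the two readings reported above can be sorted into the three states like this:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: classify a multimeter reading of Va - Vb (in volts)
// using the +/-0.2 V thresholds listed above. Readings of exactly +/-0.2 V
// are treated as idle here, which is an arbitrary choice for this sketch.
enum class BusState { One, Zero, Idle };

BusState classifyBus(double vaMinusVb) {
    assert(!std::isnan(vaMinusVb));  // a reading must be a real number
    const double threshold = 0.2;    // volts
    if (vaMinusVb < -threshold) return BusState::One;
    if (vaMinusVb > threshold) return BusState::Zero;
    return BusState::Idle;
}

// With no node transmitting, anything other than "idle" suggests that
// some driver is holding the line.
bool looksStuck(double vaMinusVb) {
    return classifyBus(vaMinusVb) != BusState::Idle;
}
```

With the values from this thread, `looksStuck(2.23)` is true and `looksStuck(0.03)` is false, which matches the observation that the bus was being driven while Node_2 was silent.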
Next time, can you measure what's coming from the Arduino? So measure between GND and TX (or pin 9 if using AltSoftSerial), and of course between GND and the DE pin. This way we can work out whether the problem is on the Arduino side or the transceiver side.
-
@rejoe2 said in RS485 nodes stop sending data after some hours or days:
[...] Other ideas or recommendations?
Hi,
just to share, I will also write a post in the coming days. I got a configuration that ran for 96h non-stop. Well, with some stops, but no trouble on restart.
Mains-powered node: optiboot 6.2 with 2.7V BOD.
Battery-powered nodes: optiboot 6.2 with 1.8V BOD.
Watchdog on startup at 2 s.
3 tries on startup, then go into loop.
If no ack is received 3 times on any single send (e.g. getting the link, sketch name, temp, relay state, et cetera), delay for 5 sec. << this delay does the "magic": the watchdog restarts the node(s) and they loop again.
I tested by disconnecting the serial Arduino acting as gateway for 1h and/or holding down the reset push button for 20 minutes (my poor finger :D )
As soon as the gateway is on, within several minutes all nodes are alive and transmitting. I also tried removing/refitting the radio on nodes while live. They reconnect like a charm.
So, I would force all your nodes to do a deep restart if some trouble occurs. Just my 2 cents....
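A minimal sketch of the counting logic described above, written as plain C++ so it can be tried off the node. `AckMonitor` is a hypothetical name, and resetting the counter after a successful send is an assumption; the real sketch may count differently.

```cpp
#include <cassert>

// Hypothetical helper: counts consecutive sends that got no ack.
class AckMonitor {
public:
    explicit AckMonitor(int maxFailures) : maxFailures_(maxFailures) {
        assert(maxFailures > 0);  // a limit of zero would trip immediately
    }

    // Record the outcome of one send. Returns true once the failure limit
    // is reached, i.e. when the node should stop feeding the watchdog.
    bool record(bool acked) {
        failures_ = acked ? 0 : failures_ + 1;  // assumption: success clears the count
        return failures_ >= maxFailures_;
    }

    int failures() const { return failures_; }

private:
    int maxFailures_;
    int failures_ = 0;
};
```

On a node, `record()` returning true would be the point to call `delay(5000)` while the 2 s watchdog is armed, so that the watchdog expires and restarts the node.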
-
@sineverba I have some problems with my RS485 sensors too. They work like a charm for a few days, and then one of them stops sending and receiving data. Most of the time it happens when I click a button and the relay switches the light. My wiring is OK; I have pull-ups and pull-downs in the middle at the master and termination on both ends. I have the watchdog enabled:
    #include <avr/wdt.h>

    void before() {
      wdt_disable();        // maybe redundant
      wdt_enable(WDTO_8S);  // reset the node if the watchdog is not fed for 8 s
      // sensors.begin();
    }
But even with that the node won't reboot, so I think it may not hang and has only lost communication. Maybe something is wrong with the AltSoftSerial lib?
I should mention that I'm using the OneButton lib to extend the functionality of my pushbuttons with long press and double click. Maybe that library has some issues with AltSoftSerial or MySensors?
-
@nofox try to remove as much code as you can. Does it still work if you operate it with the button? Is the relay opto-isolated?
-
Check the position of the nodes on the bus under failure conditions.
With RS485 bus drivers it is easily possible for one node to block communication on the entire bus by continuously driving a dominant state.
In this situation, nodes near the gateway can still "push" their messages through to the gateway; other nodes cannot.
-
Hi! Everything is working pretty well, but sometimes a random node stops communicating and reacting to button presses. I have watchdogs in every node, so I think only the communication is hanging. Is it possible that only the AltSoftSerial library hangs inside the Arduino code?
-
As I could nail down some more parts (but still do not have a reliable network), also a short update from my side:
- Node_1 (multi DS18B20 (*12@three pins) + other things) is the biggest troublemaker. It just pulled the voltage between A+B to +2.8V after some time. There is a delay of some hours between the last messages and the node also stopping the pir functionality (no wdt code implemented).
- Node_2 (also multi DS18B20 (*5@three pins) and other stuff) also stops communicating after some time (it originally worked; this may be related to whatever change happened in between). But this one doesn't kill the entire bus communication and seems to work internally (it switches the relay on when a rise in temperature is detected). This node also holds my pull-up + pull-down resistors for RS485.
Yesterday I switched Node_1 over to HW_SERIAL, as I also suspected AltSoftSerial to be part of the root cause. At first sight this seems to improve things a lot.
Next, I will review Node_2 for the use of HW_SERIAL.
What I have in mind (may not be correct):
- HW_SERIAL uses less memory, so this may prevent some kind of overflow on the node.
- There may be a conflict in the internal timers, as 1-Wire may also need a timer (amongst others, I also use PIN10 for 1-Wire).
-
There is one thing that we all need to try. When you are using RS485, the power supply usually sits somewhere far, far away from the nodes. Longer power lines mean higher inductance and far more noise on the power lines. I think we need to try putting a 10 - 100uF electrolytic cap on every node (I have 10uF on each node) and a few 100nF ceramic caps near the microcontroller on every node. If you use an ATmega328P you need at least 3 of the 100nF caps (I forgot to put them on my nodes). I've read that these 100nF caps are a very big improvement for the ATmega's power supply.