[most likely solved] RS485 nodes stop sending data after some hours or days

  • Hi all,

    My serial GW has recently started to stop transferring messages to the controller (FHEM) after several hours (or even days) of operation. As soon as a "connect" command is issued in FHEM, new messages from my nodes are processed again, but after a longer period it fails again. The "connect" seems to cause a complete reboot of the GW, although the date of the initial binding in the Linux filesystem does not change (the date shown by ls -l /dev/serial remains unchanged). Rebooting all or individual nodes has no effect.

    Does anyone else have similar observations?

    Some background information:

    • All nodes+GW use MySensors 2.2.0-beta, programmed via Arduino-IDE@linux

    • GW-Arduino is FTDI-based (seems to be no fake - I changed the USB identifiers -, Test-PIN is connected to ground)

    • Everything was fine for weeks using just one of my nodes (Node_1) and the respective GW

    • Problems started as soon as I added two new nodes some days ago.

    • Wiring:
      -- All nodes are wired on just one line, no stubs (had one first to Node_2, but changed this already)
      -- cable (CAT7) starts at GW, 15-20m (just one pair for data) to Node_2, 6-7m to Node_1 (+12V for power supply, provided at Node_1), 8-9m to Node_3 (again: data+12V)
      -- besides the screw terminals on the RS485 modules, there are one or two additional connections via small WAGO clamps (4 in total so far, at least one between every pair of nodes)
      -- all nodes + GW use the "long" modules (Chinese source, eBay), with all resistors still in place.

    • Nodes (only the sending child IDs are listed):
      -- Node_1: 7 DS18B20, 1 Counter = 10 infos to be sent every 5 minutes (some of them occasionally a bit more often)
      -- Node_2: 12 DS18B20, 1 Counter = 15 infos to be sent every 5 minutes (plus additional items such as 3x motion and a switch when relevant)
      -- Node_3: BME280 = 3-4 infos every minute

    • Power:
      -- GW is powered by an active USB hub, that also powers 4 other Arduino-based devices
      -- all other nodes are powered by the mentioned 12V-line using just internal regulator (Node_3) + an additional adjustable step-down-module (ok for up to 36V DC in).

    • Some delay in sending is already implemented in Node_2, and this seems to help a bit. (I was originally convinced this node had power supply issues, so I applied the known workaround from nRF transceivers. In fact, the node seems to work properly and the infos are lost on the GW side.)

    • All nodes have #define MY_TRANSPORT_WAIT_READY_MS xxx, with xxx being different on each node (3, 15, 30 seconds)

    Possible root causes and next steps:

    • Change the GW hardware (Arduino + transceiver module); maybe I damaged the latter while testing when I added the new nodes (I really doubt this now, given the long periods of correct operation, but we will see).
    • Replace the adjustable power modules with some LMS1117 modules (should provide more power), esp. on Node_2, where I can see that one of the 3 buses for DS18B20 (5 sensors) is also not working reliably
    • Try to reduce the amount of data sent by the nodes by adding some delay?
    • your ideas...

    Will keep you updated in any case!

  • Short update on the issue: I tested how things go when Node_1 is not online.
    Everything seemed to be fine until yesterday afternoon. Then both remaining nodes stopped working, but not at the same time:

    • the last info from Node_2 was received around 5 pm (all 8 values that should be updated were received)
    • Node_3 sent its last infos at 7 pm (2 of 3 values, temp+hum) and 7:30 pm (pressure).

    I then tried the GW reboot as mentioned in post #1, but I still don't get updates from the nodes.
    Conclusion: it does not seem to be a GW problem as originally suspected, but what else?!?

    I will now also depower and reboot the remaining two nodes, see when they fail next, and check the log of Node_2 to find out whether it was sending as regularly as it should until the failure.

    Any ideas how to solve this nasty problem?

  • Mod

    Did you put termination resistors at both ends of the bus? I remember there was a suggestion to modify the rs485 library to send 3 times the header message before sending the payload in order to avoid collisions.

  • Along the lines of what @gohan was saying about termination resistors, many of the modules that you buy these days have the termination resistors built in. The image below shows examples of two modules and their termination resistors.
    If you have multiple devices on your RS485 bus, typically only your ending node on the bus should have the termination resistors. This image shows a typical RS485 master/slave bus with termination.
    If slaves 1, 2 or 3 had termination resistors, there is the potential that the bus signal could get attenuated to the point of dropping off. Having only a few devices on the bus, all with termination resistors, may work. You may have drop-offs though, as you are seeing in your case. The more devices you put on your bus, the greater the chance of attenuation if the resistors are in place. You may want to try removing the termination resistors on your middle nodes, if any, and see if that fixes your problem. Take note of where these resistors are placed in case you need to re-solder them to the module. On the two modules shown above you have 20K ohm (203) resistors that go from VCC to B and GND to A, and then there is a 120 ohm (121) resistor between A and B. You need to remove all 3 for the middle nodes.

  • @dbemowsk & @gohan Thanks for pointing me back to the resistor topic; my recent observations also pointed towards problems in the electrical design of my bus.
    The modules used are all similar to the second on @dbemowsk's picture (got two versions differing slightly in colour), as already stated in post#1, all resistors are still present.
    The irritating thing is the bus working for hours before problems become visible.

    ( I missed the most recent discussion on resistors and rs485 in the "Build"-section, sorry for that).

    So first step will be to use two desoldered modules for nodes 1&2, shouldn't be a big issue to change these. I'll keep you updated.

    @gohan Just in case this doesn't lead to a permanently working solution: do you have a link to the suggestion of sending the header?

  • Those 20K resistors are such a small factor that removing them might not be necessary, but removing the extra termination is. I'm also using 600 ohm pull-ups and pull-downs in the middle of the bus, but I think 1k might also be OK, and it needs a little less juice from VCC.
    This might be helpful: http://alciro.org/tools/RS-485/RS485-resistor-termination-calculator.jsp
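
    As a rough cross-check of the pull-up/pull-down values mentioned above, the idle differential voltage can be estimated with a simple voltage divider. This is only a sketch under assumptions that are not from the thread: VCC = 5 V, exactly two 120 ohm terminators (one at each bus end), and no loading from the receivers or the 20K resistors. The function and constant names are made up for illustration.

```cpp
// Two resistors in parallel, e.g. the 120 ohm terminators at both bus ends.
constexpr double parallelOhms(double r1, double r2) {
    return (r1 * r2) / (r1 + r2);
}

// Idle voltage between A and B for the chain
// VCC - pull-up - (termination between A and B) - pull-down - GND.
// Assumption: receiver inputs and the on-board 20K resistors are ignored.
constexpr double idleDifferentialVolts(double vcc, double pullUpOhms,
                                       double pullDownOhms, double terminationOhms) {
    return vcc * terminationOhms / (pullUpOhms + pullDownOhms + terminationOhms);
}

// Assumed example values: 5 V supply, two 120 ohm terminators.
constexpr double kTermination = parallelOhms(120.0, 120.0);  // 60 ohm
constexpr double kIdleWith600 =
    idleDifferentialVolts(5.0, 600.0, 600.0, kTermination);    // about 0.24 V
constexpr double kIdleWith1k =
    idleDifferentialVolts(5.0, 1000.0, 1000.0, kTermination);  // about 0.15 V
```

    Under these assumptions, 600 ohm resistors leave about 0.24 V between A and B on an idle bus, while 1k leaves about 0.15 V. RS485 receivers are only specified to decide reliably beyond +/-200 mV, so the 1k variant may sit inside the undefined band; measuring the real bus is the better test.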

    @gohan said in Serial RS485 Gateway stops receiving after some hours or days:

    I remember there was a suggestion to modify the rs485 library to send 3 times the header message before sending the payload in order to avoid collisions.

    For the library code I made this modification to MyTransportRS485.cpp.

    Added this to the beginning of the lib file:

    #if !defined(MY_RS485_SOH_COUNT)
      #define MY_RS485_SOH_COUNT 3
    #endif

    And in the sending code:

    // Start of header by writing multiple SOH
    for(byte w=0; w<1; w++) {

    Changed to this:

    // Start of header by writing multiple SOH
    for(byte w=0; w<MY_RS485_SOH_COUNT; w++) {

  • So one more intermediate update:

    • Desoldered all resistors (R5 to R7 on the LC-tech RS485 modules) on nodes 1&2 (it was not much more work, so also the 20k's, just to be sure...) and added 1k pull-up/pull-down resistors on Node_2 (which is somewhere in the middle with respect to other planned nodes, and near the 12V power source) as proposed.
    • Didn't change anything on the code yet.

    Until now, everything looks fine, I get updated values as expected from all of the three.

    I'll keep you updated, but (I hope so) this will take some time to have longer-term-results.

  • (Changed title because problem seems not to be the GW only)

    Another update on the topic, unfortunately not with good news:

    • I added one more node (Node_4) at the end of my wiring. It is equipped with a "full-resistor" version of the (LC-tech) module (just 2 motion sensors attached, not regularly reporting data)
    • Resistors on Node_3 have been removed. So only GW (cable start) and Node_4 (wire end) have 120Ohm (and other) resistors on the RS485 modules, all other resistors have been removed. Additionally pullups/pulldowns at Node_2 (2*1k) are installed.
    • Code base still is "standard"

    Everything starts fine, if I put the 12V on (this powers all my nodes together). Last time I did this was yesterday around 5:30 pm. Today's findings:

    • Node_1 still is sending data as expected, code see here
    • Node_2 stopped transmission around 7:00 pm
    • Node_3 (BME280) sent its last data around half an hour later. I tried to bring it back online by pressing the reset button, but that didn't have any effect.
    • Node_4 seems to be still online, last motion was reported today 7:13 am

    Conclusions and working hypothesis for now:

    • The Bus itself seems to be ok, also the GW. Or did I miss something essential?
    • Also all transceiver modules in general are working (especially no hardware defect(?)), but at some point in time they fail and cannot be reset other than by depowering them.

    So next step will be to apply @pjr's modified version of MyTransportRS485.cpp...

  • @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    So next step will be to apply @pjr's modified version of MyTransportRS485.cpp...

    That should only help in case of collisions.

    Wondering if you could check if those "dead" nodes are trying to send anything by attaching USB-RS485 adapter to the bus and check if there is any activity after pressing reset on "dead" node.

    Other helpful thing could be a "datalogger" for testing nodes that will die eventually: https://forum.mysensors.org/topic/6340/debug-to-a-sd-card-module

  • @pjr Thx for the hint with the USB-RS485 adapter, I will try to find out if I get additional info this way. Building an SD-card logging node would be a new experience for me (the necessary hardware for both is lying around)...

    Wrt collisions: to me, it is not unlikely that this is the root cause of my troubles: two of the nodes are sending an unusually (?) high amount of data using more or less the same timing (300000 ms), and the third is sending every minute, so some overlap in transmission timings is very likely at some point.
    This also correlates with a recent finding: sometimes data is missing for a longer period, but then signals come in again, until finally the node really seems to stop all transmissions.

    So there seems also to exist some kind of buffering on this type of transceiver. Could failure also be related to a kind of buffer-overload? Is the arduino expecting some kind of feedback from the transceiver or just writing data to it as it would do to any serial line?

    One additional thought: Could results be more reliable if I use a higher baud rate on the bus? This measure should shorten transmission times and be ok wrt. the length of my bus. First trial could be with 38400.

    But this would not help if any buffer overflow leads to blocked transceivers, only available time slots for sending data would be increased.
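
    To get a feeling for how much a higher baud rate shortens the time a node occupies the bus, the raw transmission time can be estimated. A minimal sketch with assumed numbers: 10 bits per byte on the wire (8N1 framing), a hypothetical frame size of 20 bytes per message, and 9600 baud as the comparison rate (the thread does not state the original setting). Processing time, sensor reads and retries are not included, and the function names are made up for illustration.

```cpp
// Time in milliseconds to clock one frame onto the wire.
// Assumption: 8N1 framing, i.e. 10 bits on the wire per byte.
constexpr double frameTimeMs(unsigned frameBytes, unsigned long baud) {
    return frameBytes * 10.0 * 1000.0 / baud;
}

// Time for a burst of several frames sent back to back.
constexpr double burstTimeMs(unsigned frames, unsigned frameBytes, unsigned long baud) {
    return frames * frameTimeMs(frameBytes, baud);
}

// Hypothetical example: 15 messages of 20 bytes each.
constexpr double kBurstAt9600  = burstTimeMs(15, 20, 9600UL);   // about 312.5 ms
constexpr double kBurstAt38400 = burstTimeMs(15, 20, 38400UL);  // about 78.1 ms
```

    With these assumed numbers the burst shrinks from roughly 0.3 s to under 0.1 s, which narrows the window for collisions but does not remove them.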

    What to do first?

  • Short update:
    As I don't really like the idea of using anything other than "standard" code, the first measure was to set the baud rates to a higher speed. First impression after just a few hours: it seems to work.
    Next, I will review my code to make the timings a little more dynamic. So far, the time needed for transmission is not taken into account when the timers for measuring and transmission are reset. My hope: this may result in fewer collisions.
    Some explanation: timing is based on millis(). millis() is called only once per loop() (at the beginning), and the value is then used as a fixed variable for the rest of that loop(). If, based on this, 5 minutes (on two of the nodes) have passed, a lot of info is written to the bus. Even at 38400 baud, sending all the info takes quite some time (judging by what the controller reveals, the timestamps of the individual infos differ by 1 second in most cases). My conclusion: significant parts of a second are needed to transmit 8-15 individual measurement datasets.
    So resetting the timer values not only based on millis() at loop() entry, but also after all data has been sent, may lead to the nodes each having their own (slightly different) timing, perhaps also with some kind of "feedback" or "self-healing" effect if transmission is delayed due to actual collisions on the bus.
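
    The timing idea described above could be sketched roughly as follows. This is an illustration under assumptions, not the code running on the nodes: the helper names and the sendAll() call in the comment are hypothetical, and the 800 ms burst duration in the note below is just an example value.

```cpp
#include <stdint.h>

// Hypothetical helper: true once intervalMs has elapsed since lastMs.
// Unsigned subtraction keeps this correct when millis() wraps around.
constexpr bool isDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) >= intervalMs;
}

// Hypothetical helper: timestamp to store as the new "last report" time.
// Fixed policy: store loopEntryMs, so the period never changes.
// Proposed policy: store the time after the last send, so the period
// stretches by however long the burst (including any delays) took.
constexpr uint32_t restartAfterSend(uint32_t loopEntryMs, uint32_t sendDurationMs) {
    return loopEntryMs + sendDurationMs;
}

// In a sketch this would look roughly like:
//   uint32_t now = millis();
//   if (isDue(now, lastReport, 300000UL)) {
//       sendAll();              // hypothetical: sends all child values
//       lastReport = millis();  // restart the timer after the burst
//   }
```

    With a burst that takes 800 ms, a node that was due at 300000 ms would next report at 600800 ms instead of 600000 ms, so two nodes with the same nominal interval slowly drift apart instead of colliding at every cycle.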

  • @pjr said in RS485 nodes stop sending data after some hours or days:


    So I finally changed MyTransportRS485.cpp and will now see if this helps (baud rate still kept at 38400).
    For now, I didn't set the default value to 3 in the library but in the individual sketches (at the moment: Gateway + Nodes 1+2, those with the higher amount of data to be sent, use the triple-header initialisation method).
    This way, I will be reminded to also have a look at the cpp in case of future updates (as the SOH count setting is stored in each of the sketches).
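
    For reference, setting the option per sketch could look like the fragment below. This is a sketch under assumptions: MY_RS485_SOH_COUNT only has an effect with the patched MyTransportRS485.cpp described earlier in this thread, and it relies on the define appearing before the MySensors include. The other two defines are the usual RS485 transport settings.

```cpp
// Sketch-side settings; these must come before the MySensors include.
#define MY_RS485                   // select the RS485 transport
#define MY_RS485_BAUD_RATE 38400   // baud rate currently used on this bus
#define MY_RS485_SOH_COUNT 3       // assumption: only honoured by the patched MyTransportRS485.cpp

#include <MySensors.h>
```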

    In case this helps, I will make a pull request to make this option easier to use for others.

    Some more observations for my testing without this fix:

    • The bus itself seems to be pretty robust now, as even if one of the nodes fails to send in data, the controller still gets updates from the others.
    • Every now and then, individual values from single children may not have been updated as expected
    • At some point in time, communication of one of the "big" nodes will fail. Which one seems to depend on the starting order: when I powered up both together, it was Node 2 at least twice; powering Node 1 later led at least once to broken communication with that one.
    • The nodes themselves still seem to work as expected (one has PIR functionality, so it's easy to test...). But even pressing just the reset button will not bring a node back to RS485 communication. So I would bet the point of failure is the RS485 module/MAX485 IC, which needs to be reset by a complete power-off.

  • @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    • The nodes themselves still seem to work as expected (one has PIR functionality, so it's easy to test...). But even pressing just the reset button will not bring a node back to RS485 communication. So I would bet the point of failure is the RS485 module/MAX485 IC, which needs to be reset by a complete power-off.

    Hmm. I did read the max485 datasheet and there is this text:

    Drivers are short-circuit current limited
    and are protected against excessive power dissipation by
    thermal shutdown circuitry that places the driver outputs into
    a high-impedance state. The receiver input has a fail-safe feature
    that guarantees a logic-high output if the input is open circuit.

    Current-Limiting and Thermal Shutdown for Driver Overload Protection

    I'm just wondering if the bus can get into some kind of state where the protection kicks in?
    Can you try cycling power to the RS485 driver only, without depowering the Arduino, when the "jam" state happens? It would be interesting to know if that cures the data transmission.

  • Mod

    you could power it using a transistor that you could control from the arduino

  • Thx for your answers.

    So one more update after having the "triple-header fix" implemented in Nodes 1+2:

    • Everything seemed to work fine at first sight. I could even see Node_2 "overrun" Node_1: it took some hours until the two seconds that my controller originally showed as the gap between them had shrunk to zero. At the moment when Node_2 partly reported first, I only lost two of the messages that are regularly sent. If this happened only once per day or so, it wouldn't be an issue. This was yesterday around 4pm.
    • Then I noticed Node_2 stopped sending messages around 7:30pm; all others were still fine until 11pm.
    • In the morning, I saw messages from Node_2 reported from around 5:24am, so there must have been some activity in between that then again stopped. All others still seemed to be fine.
    • I unplugged the RS485-module then.
      First only GND+Vcc, but by doing so at least the LED stayed on. So I also unplugged the other side of the module => no messages; then pressed arduino's reset button => node was online again
    • At around 7pm all nodes seem to be offline, last message from Node_1: 12:10:09, Node_2: 16:05:03, Node_3: 16:09:58, Node_4: 15:52:56. (All presentation messages had pretty old timestamps, so no spontaneous reboots had happened, see below)
      Tried now to
      -- reset the GW (via FHEM-connect, this is what originally seemed to work as reported in one of the first posts)
      -- reset Node_4 (button): not even a presentation message
      Still nothing happened.

    BUT THEN I checked if Node_1 was still "alive" wrt the "normal" Arduino functionality (PIR => light): completely DEAD. So I pushed the reset button, but left everything else untouched (in particular, power to all nodes was not cut, nor to the RS485 module on that particular node): all nodes were there again!

    • Also the presentation messages from Node_1 and Node_4 were renewed, but not those from Nodes 2+3, which hadn't been reset.
    • Other data was then updated in the regular way, so nothing that could be interpreted as "retained" message or so was kept in memory
    • For the last couple of minutes while writing this down, all nodes reported as expected.

    So still no clue how to solve this. I will review hard- & software on Node_1 (and later on Node_2), especially wrt. powering.

  • Can you measure the bus voltage when everything is "dead"?
    Is it idle, or is one of the nodes pulling it up or down?

  • @pjr I'll try to do some measurements next time everything is really dead.
    But I really doubt that this is only a bus problem; it may be an unlucky combination of at least two things:

    • Bus:
      Prior to having read your post this morning, I noticed everything being offline again. So I began to reset some of the nodes.
      Some more background: Yesterday I noticed Node_2 was sending again when I reset Node_1, so my first attempt was to start with that one and blaming it to be somehow faulty and expected the rest to show up automatically. It indeed started sending again, and so did Node_3 (without reset!). But still Node_2 showed no sign of life. So I also reset that one - again with the effect it was reporting data as expected. Node_4 also showed no pir data, so I finally also reset that one.

    • Second possible root cause:
      3 of my nodes also have relay functionality, two of them with several DS18B20.
      Now there's someone reporting nodes "dying" also with the same combination of attached hardware...
      The only exception here is Node_4 - it has no temp at all, and is also the node with the least data to be written on the bus. So the only node that comes back is the one without "relay" and just a BME280.

  • @pjr As Node_2 was not sending any data some minutes ago: between A+B I measured 2.23V...
    Then I depowered everything. Shortly after repowering, I have around 0.03V.

    What to do with this info?

  • Hardware Contributor

    @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    Now there's someone reporting nodes "dying" also with the same combination of attached hardware... [...]

    Yes. Same combination of sensors. I did not totally understand your entire setup (sorry, I'm a bit of a noob 🙂 ), but we have the same sensor combination.

    I will swap the temp sensor with an SHT31 and - more important - the barebone ATmega with an Arduino Mini 3.3V. I will update asap.

    Good luck with your investigation. Really interested 🙂

  • One more:
    Node_2 stopped transmitting for a longer period during the night, but was online again some minutes ago.
    Node_1 was not transmitting, but still showed PIR functionality. So the code still seemed to work, just communication was broken.
    Node_3 was also transmitting, most likely also after a period of inactivity.

    Now I cut power to Node_1 and then measured 0.03 V between A+B. So I'll leave the other three nodes online and will see, if they work fine.
    Most likely I will have to intensively review the entire wiring on Node_1 one more time, including the 1wire-Networks attached to it.