[solved] RS485 nodes stop sending data after some hours or days



  • Hi all,

    my serial GW has recently started to stop transferring messages to the controller (FHEM) after several hours (or even days) of operation. As soon as a "connect" command is issued in FHEM, new messages from my nodes are processed again, but after a longer period it fails again. The "connect" seems to cause a complete reboot of the GW, although the date of the initial binding in the Linux filesystem does not change (the date in ls -l /dev/serial remains unchanged). Rebooting all or individual nodes does not have any effect.

    Does anyone else have similar observations?

    Some background information:

    • All nodes+GW use MySensors 2.2.0-beta, programmed via Arduino-IDE@linux

    • GW-Arduino is FTDI-based (seems to be no fake - I changed the USB identifiers -, Test-PIN is connected to ground)

    • Everything was fine for weeks using just one of my nodes (Node_1) and the respective GW

    • Problems started as soon as I added two new nodes some days ago.

    • Wiring:
      -- All nodes are wired on just one line, no stubs (had one first to Node_2, but changed this already)
      -- cable (CAT7) starts at GW, 15-20m (just one pair for data) to Node_2, 6-7m to Node_1 (+12V for power supply, provided at Node_1), 8-9m to Node_3 (again: data+12V)
      -- besides the screw connections to the RS485 modules, there are one or two additional connections via small WAGO clamps (4 in total so far, at least one between each pair of nodes)
      -- all nodes+GW have the "long" modules (Chinese source, eBay), all resistors still in place.

    • Nodes (only the sending childIDs are mentioned):
      -- Node_1: 7 DS18B20, 1 Counter = 10 infos to be sent every 5 minutes. (some of them in some cases a bit more often)
      -- Node_2: 12 DS18B20, 1 Counter = 15 infos to be sent every 5 minutes (again additional things like 3x motion and a switch when relevant)
      -- Node_3: BME280 = 3-4 infos every minute

    • Power:
      -- GW is powered by an active USB hub, that also powers 4 other Arduino-based devices
      -- all other nodes are powered by the mentioned 12V-line using just internal regulator (Node_3) + an additional adjustable step-down-module (ok for up to 36V DC in).

    • Some delay in sending is already implemented in Node_2, and this seems to help a bit (I was originally convinced I had powering issues with this node, so I applied the known workaround from nRF transceivers. But in fact the node seems to work properly and the infos are lost on the GW side).

    • All nodes have #define MY_TRANSPORT_WAIT_READY_MS xxx, with xxx being different on each node (3, 15, 30 seconds)

    Possible root causes and next steps:

    • Change GW-HW (Arduino+transceiver module); maybe I damaged the latter while testing when adding the new nodes (I really doubt this now, given the longer periods of correct operation, but we will see).
    • Replace the adjustable power modules with some LMS1117 modules (should provide more power), esp. on Node_2, where I can see that one of the 3 buses for DS18B20 (5 sensors) is also not working reliably
    • Try to reduce the amount of data sent by the nodes by adding some delay?
    • your ideas...

    Will keep you updated in any case!



  • Short update on the issue, as I tried, how the things go when Node_1 is not online:
    Everything seemed to be fine until yesterday afternoon. Then both remaining nodes stopped working, but not at the same point in time:

    • last info from Node_2 was received around 5 pm, (all 8 values that should be updated have been received)
    • Node_3 sent its last infos at 7 pm (2 of 3 values, temp+hum) resp. 7:30 pm (pressure).

    I then tried the GW reboot as mentioned in post#1, but I still don't get updates from the nodes.
    Conclusion: seems not to be a GW problem as originally suspected, but what else?!?

    I will now also depower and reboot the remaining two nodes, see when they fail next, and have a look at the log of Node_2 to check whether it was sending as regularly as it should until the failure.

    Any ideas how to solve this nasty problem?


  • Mod

    Did you put termination resistors at both ends of the bus? I remember there was a suggestion to modify the rs485 library to send 3 times the header message before sending the payload in order to avoid collisions.



  • Along the lines of what @gohan was saying about termination resistors, many of the modules that you buy these days have the termination resistors built in. The image below shows examples of two modules and their termination resistors.
    [image: two RS485 modules with their termination resistors]
    If you have multiple devices on your RS485 bus, typically only your ending node on the bus should have the termination resistors. This image shows a typical RS485 master/slave bus with termination.
    [image: typical RS485 master/slave bus with termination]
    If slaves 1, 2 or 3 had termination resistors, there is the potential that the bus signal could get attenuated to the point of dropping off. Having only a few devices on the bus, all with termination resistors, may work, but you may get dropouts like the ones you are seeing in your case. The more devices you put on your bus, the greater the chance of attenuation if the resistors are in place. You may want to try removing the termination resistors on your middle nodes, if any, and see if that fixes your problem. Take note of where these resistors are in case you need to re-solder them to the module. On the two modules shown above you have 20K ohm (203) resistors that go from VCC to B and GND to A, and then there is a 120 ohm (121) resistor between A and B. You need to remove all 3 for the middle nodes.
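
    To put rough numbers on the attenuation argument above: every 120 ohm resistor left between A and B sits in parallel with the others, so the load a driver has to push against drops quickly. A minimal sketch in plain C++ (not MySensors code; the function names are made up for illustration, and the 54 ohm figure is the differential test load RS-485 drivers are specified into, used here only as a comparison point):

    ```cpp
    #include <cstdio>

    // Combined resistance of n identical termination resistors in parallel.
    constexpr double parallel_termination_ohms(double r_ohms, int n)
    {
        return r_ohms / n;
    }

    // RS-485 drivers are specified into a 54 ohm differential load;
    // anything below that is heavier than what the driver is rated for.
    constexpr double kRatedLoadOhms = 54.0;

    constexpr bool overloaded(double r_ohms, int n)
    {
        return parallel_termination_ohms(r_ohms, n) < kRatedLoadOhms;
    }

    // Print the load for 1..max_nodes modules that still carry their 120 ohm resistor.
    void print_termination_table(int max_nodes)
    {
        for (int n = 1; n <= max_nodes; ++n) {
            std::printf("%d x 120 ohm -> %.1f ohm%s\n", n,
                        parallel_termination_ohms(120.0, n),
                        overloaded(120.0, n) ? " (below 54 ohm)" : "");
        }
    }
    ```

    With four modules that all keep their 120 ohm resistor, the driver would see 30 ohm; with only the two end modules terminated it sees 60 ohm.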



  • @dbemowsk & @gohan Thanks for pointing me back to the resistor topic, my recent observations also directed towards problems in the electrical design of my bus.
    The modules used are all similar to the second on @dbemowsk's picture (got two versions differing slightly in colour), as already stated in post#1, all resistors are still present.
    The irritating thing is the bus working for hours before problems become visible.

    ( I missed the most recent discussion on resistors and rs485 in the "Build"-section, sorry for that).

    So first step will be to use two desoldered modules for nodes 1&2, shouldn't be a big issue to change these. I'll keep you updated.

    @gohan Just in case this doesn't lead to a permanently working solution: Do you have a link to the suggestion of sending the header?



  • Those 20K resistors have so little effect that removing them might not be necessary, but removing the extra termination is. I'm also using 600 ohm pull-ups and pull-downs in the middle of the bus, but I think 1k might also be OK, and it draws a little less current from VCC.
    This might be helpful: http://alciro.org/tools/RS-485/RS485-resistor-termination-calculator.jsp
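
    A quick way to compare 600 ohm and 1k bias resistors is the idle voltage they leave between A and B. The sketch below treats the bus as a plain voltage divider (pull-up, total termination, pull-down) and ignores receiver input loading and the 20K resistors on the modules, so it is only an estimate. The function names and the 5 V supply are assumptions for illustration.

    ```cpp
    // Idle differential voltage across the terminated bus, modelled as a
    // simple divider: Vcc -> pull-up -> termination -> pull-down -> GND.
    // Assumes pull-up and pull-down have the same value r_bias_ohms.
    constexpr double idle_bias_volts(double vcc, double r_bias_ohms, double r_term_total_ohms)
    {
        return vcc * r_term_total_ohms / (2.0 * r_bias_ohms + r_term_total_ohms);
    }

    // Receivers need at least 0.2 V of differential input to see a defined level.
    constexpr bool bias_sufficient(double v_diff)
    {
        return v_diff >= 0.2;
    }
    ```

    With two 120 ohm terminators (60 ohm total) this simplified model gives roughly 0.24 V for 600 ohm and roughly 0.15 V for 1k, so it is worth measuring the idle voltage on the real bus before settling on 1k.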

    @gohan said in Serial RS485 Gateway stops receiving after some hours or days:

    I remember there was a suggestion to modify the rs485 library to send 3 times the header message before sending the payload in order to avoid collisions.

    For the library code I made this modification to MyTransportRS485.cpp

    Added this to beginning of the lib file:

    #if !defined(MY_RS485_SOH_COUNT)
      #define MY_RS485_SOH_COUNT 3
    #endif
    

    And in the sending code:

    // Start of header by writing multiple SOH
    for(byte w=0; w<1; w++) {
        _dev.write(SOH);
    }
    

    Changed to this:

    // Start of header by writing multiple SOH
    for(byte w=0; w<MY_RS485_SOH_COUNT; w++) {
        _dev.write(SOH);
    }
    


  • So one more intermediate update:

    • Desoldered all resistors (R5 to R7 on the LC-tech rs485-modules) on nodes 1&2 (was not much more work, so also the 20k's, just to be sure...) and added 1k pull-up/pull-down resistors on Node_2 (which is somewhere in the middle with respect to other planned nodes and near the 12V power source) as proposed.
    • Didn't change anything on the code yet.

    Until now, everything looks fine, I get updated values as expected from all of the three.

    I'll keep you updated, but (I hope so) this will take some time to have longer-term-results.



  • (Changed title because problem seems not to be the GW only)

    Another update on the topic, unfortunately not with good news:

    • I added one more node (Node_4) at the end of my wiring. It is equipped with a "full-resistor" version of the (LC-Tech) module (just 2 motion sensors attached, not regularly reporting data)
    • Resistors on Node_3 have been removed. So only GW (cable start) and Node_4 (wire end) have 120Ohm (and other) resistors on the RS485 modules, all other resistors have been removed. Additionally pullups/pulldowns at Node_2 (2*1k) are installed.
    • Code base still is "standard"

    Everything starts fine, if I put the 12V on (this powers all my nodes together). Last time I did this was yesterday around 5:30 pm. Today's findings:

    • Node_1 still is sending data as expected, code see here
    • Node_2 stopped transmission around 7:00 pm
    • Node_3 (BME280) sent its last data around half an hour later. I tried to bring it back online by pressing the reset button, but that didn't have any effect.
    • Node_4 seems to be still online, last motion was reported today 7:13 am

    Conclusions and working hypothesis for now:

    • The Bus itself seems to be ok, also the GW. Or did I miss something essential?
    • Also all transceiver modules in general are working (especially no hardware defect(?)), but at some point in time they fail and cannot be reset other than by depowering them.

    So next step will be to apply @pjr's modified version of MyTransportRS485.cpp...



  • @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    So next step will be to apply @pjr's modified version of MyTransportRS485.cpp...

    That should only help in case of collisions.

    I wonder if you could check whether those "dead" nodes are trying to send anything, by attaching a USB-RS485 adapter to the bus and checking whether there is any activity after pressing reset on a "dead" node.

    Another helpful thing could be a "datalogger" for testing nodes that will eventually die: https://forum.mysensors.org/topic/6340/debug-to-a-sd-card-module



  • @pjr Thx for the hint with the USB-RS485 adapter, I will try to find out whether I get additional info this way. Building an SD-card logging node would be a new experience for me (the necessary hardware for both is lying around)...

    Wrt. collisions: To me, it is not unlikely that this is the root cause of my troubles: two of the nodes are sending an unusually (?) high amount of data with more or less the same timing (300000 ms), the third is sending every minute, so some overlap in transmission timings is very likely at some point.
    This also correlates with a recent finding: sometimes data is missing for a longer period, but then signals come in again until finally the node really seems to stop all transmission.

    So there seems also to exist some kind of buffering on this type of transceiver. Could failure also be related to a kind of buffer-overload? Is the arduino expecting some kind of feedback from the transceiver or just writing data to it as it would do to any serial line?

    One additional thought: Could results be more reliable if I use a higher baud rate on the bus? This measure should shorten transmission times and be ok wrt. the length of my bus. First trial could be with 38400.

    But this would not help if any buffer overflow leads to blocked transceivers, only available time slots for sending data would be increased.
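
    For a feel of what a higher baud rate buys: assuming 8N1 framing (10 bits on the wire per byte), the time a message occupies the bus scales inversely with the baud rate. The helper names and the 20-byte message length used in the note below are made-up example values, not measured from MySensors.

    ```cpp
    // Time in milliseconds that `bytes` bytes occupy the bus at `baud`,
    // assuming 8N1 framing: 1 start + 8 data + 1 stop = 10 bits per byte.
    constexpr double frame_time_ms(unsigned bytes, unsigned long baud)
    {
        return bytes * 10.0 * 1000.0 / baud;
    }

    // Bus time for a burst of `messages` messages of `bytes_each` bytes,
    // not counting any pauses between messages.
    constexpr double burst_time_ms(unsigned messages, unsigned bytes_each, unsigned long baud)
    {
        return messages * frame_time_ms(bytes_each, baud);
    }
    ```

    Under these assumptions, a burst of 15 messages of 20 bytes takes about 78 ms at 38400 baud versus about 313 ms at 9600 baud, so the window in which two nodes can collide shrinks to a quarter.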

    What to do first?



  • Short update:
    As I don't really like the idea to use other than "standard" code, first measure was to set baud rates to a higher speed. First impression after just some hours: seems to work.
    Next, I will review my code to make the timings a little more dynamic. So far, the time needed for transmission is not taken into account when resetting the timers for measuring and transmission. My hope: this may result in fewer collisions.
    Some explanation: Timing is based on millis(). millis() is read only once per loop() (at the beginning) and then used as a fixed value for the rest of that loop(). If, based on this, 5 min. (on two of the nodes) have passed, a lot of info is written to the bus. Even at 38400 baud, sending all the info takes quite some time (in most cases, looking at what the controller reveals, the timestamps of the individual infos differ by 1 sec.). My conclusion: significant parts of a second are needed to transmit 8-15 individual measurement datasets.
    So resetting the timer values not only based on millis() at loop() entry but also after all data has been sent may give each node its own (slightly different) timing, perhaps also with some kind of "feedback" or "self-healing effect" if transmission is delayed due to actual collisions on the bus.
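
    The idea above could look roughly like this. It is a sketch in plain C++ so the logic can be checked off-target; on the node, the timestamps would come from millis(). The names Schedule, is_due and rearm are hypothetical, and the unsigned subtraction keeps the comparison valid across the millis() rollover.

    ```cpp
    #include <cstdint>

    // Send scheduler that restarts its interval when sending has finished,
    // not when loop() was entered, so the time spent on the bus shifts the
    // next slot and nodes with the same interval can drift apart.
    struct Schedule {
        uint32_t interval_ms;
        uint32_t last_done_ms;

        // True once interval_ms has elapsed since the last completed send.
        // Unsigned arithmetic makes this rollover-safe.
        bool is_due(uint32_t now_ms) const
        {
            return static_cast<uint32_t>(now_ms - last_done_ms) >= interval_ms;
        }

        // Call with a fresh timestamp taken after the last message went out.
        void rearm(uint32_t now_after_send_ms)
        {
            last_done_ms = now_after_send_ms;
        }
    };
    ```

    In loop(), a node would check is_due(millis()), send all its values, and then call rearm(millis()) with a new reading rather than the value cached at loop() entry.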



  • @pjr said in RS485 nodes stop sending data after some hours or days:

    MY_RS485_SOH_COUNT

    So I finally changed MyTransportRS485.cpp and will now see if this helps (baud rate still kept at 38400).
    For now, I didn't set the default value to 3 in the library but set it in the individual sketches (so far, the gateway and Nodes 1+2, i.e. those with the higher amount of data to send, use the triple-header initialisation method).
    Doing so, I will be reminded to also have a look at the cpp in case of future updates (as the SOH-count setting is stored in each of the sketches).

    In case this helps, I will make a pull request to make this option more easy to use for others.

    Some more observations for my testing without this fix:

    • The bus itself seems to be pretty robust now, as even if one of the nodes fails to send in data, the controller still gets updates from the others.
    • Every now and then, individual values from single children may not have been updated as expected
    • At some point, communication of one of the "big" nodes fails. Which one seems to depend on the starting order. When I powered up both together, it was Node 2 at least twice; powering Node 1 later led at least once to broken communication with that one.
    • The nodes themselves still seem to work as expected (one has PIR functionality, so it's easy to test...). But even pressing the reset button does not bring the node back to RS485 communication. So I would bet the point of failure is the RS485 module/MAX485 IC, which needs to be reset by a complete power-off.


  • @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    • The nodes themselves still seem to work as expected (one has PIR functionality, so it's easy to test...). But even pressing the reset button does not bring the node back to RS485 communication. So I would bet the point of failure is the RS485 module/MAX485 IC, which needs to be reset by a complete power-off.

    Hmm. I did read the max485 datasheet and there is this text:

    Drivers are short-circuit current limited
    and are protected against excessive power dissipation by
    thermal shutdown circuitry that places the driver outputs into
    a high-impedance state. The receiver input has a fail-safe feature
    that guarantees a logic-high output if the input is open circuit.

    Current-Limiting and Thermal Shutdown for Driver Overload Protection

    I'm just wondering whether the bus can get into some kind of state where the protection kicks in?
    Can you try power-cycling only the RS-485 driver, without depowering the Arduino, when the "jam" state happens? It would be interesting to know if that cures the data transmission.


  • Mod

    you could power it using a transistor that you could control from the arduino



  • Thx for your answers.

    So one more update after having the "triple-header fix" implemented in Nodes 1+2:

    • Everything seemed to work fine at first sight. I could even see Node_2 "overrun" Node_1: it took some hours until the two seconds my controller originally stated as the time between them had gone to zero. At the moment when Node_2 started to report partly first, I only lost two of the messages that are regularly sent. If this happened only once per day or so, it wouldn't be an issue. This was yesterday around 4pm.
    • Then I noticed Node_2 stopped sending messages around 7:30pm; all others were still fine until 11pm.
    • In the morning, I saw messages from Node_2 reported from around 5:24am, so there must have been some activity in between that then again stopped. All others still seemed to be fine.
    • I unplugged the RS485-module then.
      First only GND+Vcc, but by doing so at least the LED stayed on. So I also unplugged the other side of the module => no messages; then pressed arduino's reset button => node was online again
    • At around 7pm all nodes seem to be offline, last message from Node_1: 12:10:09, Node_2: 16:05:03, Node_3: 16:09:58, Node_4: 15:52:56. (All presentation messages had pretty old timestamps, so no spontaneous reboots had happened, see below)
      Tried now to
      -- reset the GW (via FHEM-connect, this is what originally seemed to work as reported in one of the first posts)
      -- reset Node_4 (button): not even a presentation message
      Still nothing happened.

    BUT THEN I checked whether Node_1 was still "alive" wrt. the "normal" Arduino functionality (PIR=>light): completely DEAD. So I pushed the reset button but left everything else untouched (in particular, power to all nodes was not cut, nor to the RS485 module on that particular node): All nodes were there again!

    • Also the presentation messages from Node_1 and Node_4 were renewed, but not from Nodes 2+3, which hadn't been reset.
    • Other data was then updated in the regular way, so nothing that could be interpreted as "retained" message or so was kept in memory
    • For the last couple of minutes while writing this down, all nodes reported as expected.

    So still no clue how to solve this. I will review hard- & software on Node_1 (and later on Node_2), especially wrt. powering.



  • Can you measure bus voltage when everything is "dead"?
    Is it idle or is some of the nodes pulling it up or down?



  • @pjr I'll try to do some measurements next time everything is really dead.
    But I really doubt that this is only a bus problem; it may also be an unlucky combination of at least two things:

    • Bus:
      Prior to having read your post this morning, I noticed everything being offline again. So I began to reset some of the nodes.
      Some more background: Yesterday I noticed Node_2 was sending again when I reset Node_1, so my first attempt was to start with that one and blaming it to be somehow faulty and expected the rest to show up automatically. It indeed started sending again, and so did Node_3 (without reset!). But still Node_2 showed no sign of life. So I also reset that one - again with the effect it was reporting data as expected. Node_4 also showed no pir data, so I finally also reset that one.

    • Second possible root cause:
      https://forum.mysensors.org/topic/7743/node-with-ds18b20-relay-dies-also-with-watchdog
      3 of my nodes also have relay functionality, two of them with several DS18B20.
      Now there's someone reporting nodes "dying" also with the same combination of attached hardware...
      The only exception here is Node_4 - it has no temp at all, and also is the node with the least data to be written on the bus. So the only node that comes back is the one without "relay" and just a BME280.



  • @pjr As Node_2 was not sending any data some minutes ago: between A+B I measured 2.23V...
    Then I depowered everything. Shortly after repowering, I have around 0.03V.

    What to do with this info?


  • Hardware Contributor

    @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    [...]
    Now there's someone reporting nodes "dying" also with the same combination of attached hardware... [...]

    Yes. Same combination of sensors. I did not totally understand your entire setup (sorry, I'm a bit of a noob 🙂), but we have the same sensor combination.

    I will swap the temp sensor for an SHT31 and, more importantly, the barebone ATmega for an Arduino Mini 3.3V. I will update asap.

    Good luck with your investigation. Really interested 🙂



  • One more:
    Node_2 stopped transmitting for a longer period during the night, but was online again some minutes ago.
    Node_1 was not transmitting, but still showed PIR functionality. So the code still seemed to work, just communication was broken.
    Node_3 was also transmitting, most likely also after a period of inactivity.

    Now I cut power to Node_1 and then measured 0.03 V between A+B. So I'll leave the other three nodes online and will see, if they work fine.
    Most likely I will have to intensively review the entire wiring on Node_1 one more time, including the 1wire-Networks attached to it.


  • Hardware Contributor

    @rejoe2 I did not understand one thing: which MCU are you using? Barebone ATmega328? If yes... what is the BOD setup?
    Last night I re-bootloaded my faulty node with BOD @2.7V. It seems more stable after about 7h. Just to say... an idea....



  • @sineverba All nodes are ATMega32 based, running at 16MHz, 5V, Chinese Arduino clones. GW is FTDI-based Nano, Node_1 is a CH340G-Nano, the others are pro micros. Communication is via LC-Tech RS485 modules.

    When I checked the states some minutes ago, situation was as follows: Node_2 sent last messages around 4:30pm, Node_4 had been reset at around the same time (no watchdog defined), but no pir messages were sent when entering the room, so it seemed to be offline. Node_3 was alive, voltage A+B: around 0.03V.

    So now I pulled off the LC-Tech module on Node_2 and put power on again on Node_1. I'll see, if and when this one will go offline. If this leads also to no clear conclusions, I will think about first adding some caps on 5V or changing the 12V power supply.

    Or is it necessary to completely remove also the modules when there's no power to them?

    Should I try to use an older board definition (GW's with board defs starting from 1.6.13 had some reboot troubles until version 1.6.18 or so; this is pretty unfunny shooting in the dark....)
    Other ideas or recommendations?



  • @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    @pjr As Node_2 was not sending any data some minutes ago: between A+B I measured 2.23V...
    Then I depowered everything. Shortly after repowering, I have around 0.03V.

    What to do with this info?

    +-200mV is the magic number with RS485. An RS485 line has 3 states:

    • Va - Vb < -0.2V = "1"
    • Va - Vb > 0.2V = "0"
    • |Va - Vb| < 0.2V = "idle"

    As far as I know, the line should be in the idle state when nobody is sending.

    So to me it looks like something is constantly pulling the line to state "1" or "0", depending on which way round you measured it. This could be caused by a faulty transceiver, a bug in the library code, a bug in your code...
    Next time, can you measure what's coming from the Arduino? So measure between GND and TX (or pin 9 if using AltSoftSerial), and of course between GND and the DE pin. This way we can work out whether the problem is on the Arduino side or the transceiver side.
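
    The thresholds from this post can be written as a small helper for interpreting multimeter readings. This is a plain C++ sketch; the BusState and classify names are made up for illustration.

    ```cpp
    enum class BusState { Idle, One, Zero };

    // Classify a differential reading Va - Vb (in volts) using the
    // +-0.2 V thresholds: below -0.2 V is "1", above +0.2 V is "0",
    // anything in between is treated as idle.
    constexpr BusState classify(double va_minus_vb)
    {
        return va_minus_vb < -0.2 ? BusState::One
             : va_minus_vb >  0.2 ? BusState::Zero
             : BusState::Idle;
    }
    ```

    A reading of 2.23 V therefore counts as a driven line (which of the two levels depends on the polarity of the probes), while 0.03 V is idle.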


  • Hardware Contributor

    @rejoe2 said in RS485 nodes stop sending data after some hours or days:

    @sineverba All nodes are ATMega32 based, running at 16MHz, 5V, Chinese Arduino clones. GW is FTDI-based Nano, Node_1 is a CH340G-Nano, the others are pro micros. Communication is via LC-Tech RS485 modules.

    When I checked the states some minutes ago, situation was as follows: Node_2 sent last messages around 4:30pm, Node_4 had been reset at around the same time (no watchdog defined), but no pir messages were sent when entering the room, so it seemed to be offline. Node_3 was alive, voltage A+B: around 0.03V.

    So now I pulled off the LC-Tech module on Node_2 and put power on again on Node_1. I'll see, if and when this one will go offline. If this leads also to no clear conclusions, I will think about first adding some caps on 5V or changing the 12V power supply.

    Or is it necessary to completely remove also the modules when there's no power to them?

    Should I try to use an older board definition (GW's with board defs starting from 1.6.13 had some reboot troubles until version 1.6.18 or so; this is pretty unfunny shooting in the dark....)
    Other ideas or recommendations?

    Hi,
    just to share, I will also write a post in a few days. I got a configuration that ran for 96h non-stop. Well, with some stops, but with no trouble on restart.
    Power-fed node: optiboot 6.2 with 2.7V BOD.
    Battery-fed nodes: optiboot 6.2 with 1.8V BOD.
    Watchdog on startup at 2 s.
    3 tries on startup, then go into the loop.

    If no ack is received 3 times on any single send (e.g. getting the link, sketch name, temp, relay state, et cetera), delay for 5 sec. << this delay does the "magic": the watchdog restarts the node(s) and they loop again.

    I tested this by disconnecting the serial Arduino acting as gateway for 1h and/or holding down the reset push button for 20 minutes (my poor finger 😄)

    As soon as the gateway is on, within several minutes all nodes are alive and transmitting. I also tried removing/reinserting the radio on nodes while live. They reconnect like a charm.

    So, I would force all your nodes to do a deep restart if some trouble occurs. Just my 2 cents....
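
    A rough sketch of the recovery logic described above, in plain C++ so it can be tested off-target. The SendGuard name and its interface are hypothetical; on a real node, needs_restart() returning true would be followed by the long delay that lets the watchdog expire and restart the node.

    ```cpp
    // Counts consecutive unacknowledged sends and signals when the node
    // should stop feeding its watchdog so that it gets restarted.
    class SendGuard {
    public:
        explicit SendGuard(unsigned max_failures) : max_failures_(max_failures) {}

        // Record the result of one send attempt (true = ack received).
        // Any successful send resets the failure counter.
        void record(bool acked)
        {
            failures_ = acked ? 0 : failures_ + 1;
        }

        // True once max_failures sends in a row went unacknowledged.
        bool needs_restart() const
        {
            return failures_ >= max_failures_;
        }

    private:
        unsigned max_failures_;
        unsigned failures_ = 0;
    };
    ```

    Counting only consecutive failures means an occasional lost message does not trigger a restart; only a link that stays down does.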



  • @sineverba I have some problems with my RS485 sensors too. They work like a charm for a few days and then one of them stops sending and receiving data. Most of the time it happens when I press the button and the relay switches the light. My wiring is OK; I have pull-ups and pull-downs in the middle on the master and termination on both ends. I have the watchdog enabled:

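    #include <avr/wdt.h> // wdt_disable(), wdt_enable(), WDTO_8S

    void before()
    {
        wdt_disable();       // maybe redundant
        wdt_enable(WDTO_8S); // reset the node if the watchdog is not fed for 8 s
        // sensors.begin();
    }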
    

    But even with that the node won't reboot, so I think it may not hang and has only lost communication. Maybe there is something wrong with the AltSoftSerial lib?

    I should mention that I'm using the OneButton lib to extend the functionality of my pushbuttons with long press and double click. Maybe that library has some issues with AltSoftSerial or MySensors?


  • Mod

    @nofox try to remove code as much as you can. Does it still work if you operate it with the button? Is relay opto isolated?



  • @gohan No, I can't operate it with the button. The relays are not opto-isolated; I don't isolate them because my node boards are powered by 5V and I can't provide external power only for the relays.


  • Mod

    I know that switching loads on and off can generate EMI that the Arduino doesn't like. Maybe you could try an optocoupler between the Arduino and the relay.



  • Check the position of the nodes on the bus in the failure condition.
    With RS485 bus drivers it is easily possible for one node to block communication on the entire bus by sending a dominant state.
    In this situation, nodes near the gateway can still "push" their messages to the gateway; other nodes cannot.


  • Mod

    How do you determine the bus failure?



  • Hi! Everything is working pretty well, but sometimes a random node stops communicating and reacting to button presses. I have watchdogs in every node, so I think that only the communication is hanging. Is it possible that only the AltSoftSerial library hangs inside the Arduino code?



  • As I could nail down some more parts (but still do not have a reliable network), here is also a short update from my side:

    • Node_1 (Multi DS18B20 (*12@three pins) + other things) is the biggest troublemaker. It just pulled the voltage between A+B to +2.8V after some time. There are some hours of delay between the last messages and the node also stopping the PIR functionality (no wdt code implemented).
    • Node_2 (also Multi DS18B20 (*5@three pins) and other stuff) also stops communicating after some time (it originally worked, this may be related to whatever change happened in between). But this one doesn't kill the entire bus communication and seems to work internally (switches a relay on in case a rise of temperature is detected). This node also holds my pull-up+pull-down resistors for RS485.

    Yesterday I switched over Node_1 to use HW_SERIAL, as I also suspected altsoftserial to be part of the root causes. At first sight this seems to improve things a lot.
    Next, I will review Node_2 for the use of HW_SERIAL.

    What I have in mind (may not be correct):

    • HW_SERIAL uses less memory. So this may prevent the node from running into some kind of overflow
    • there may be a conflict in internal timers, as 1wire may also need a timer (I use amongst others also PIN10 for 1wire).


  • There is one thing that we all need to try. When you use RS485, the power supply is usually somewhere far away from the nodes. Longer power lines mean higher inductance and far more noise on the power lines. I think we need to try putting a 10 - 100uF electrolytic cap on all nodes (I have 10uF on each node) and a few 100nF ceramic caps near the microprocessor on every node. If you use an ATmega328P you need at least 3 of the 100nF caps (I forgot to put them on my nodes). I've read that these 100nF caps are a very big improvement for the ATmega's power supply.



  • Hi to all,

    I have had the same problem since I changed some nodes to RS485. Setup:
    FHEM 5.8
    MySensors 2.2rc1
    Gateway: Arduino Nano, USB to FHEM.
    Nodes: 5 pcs, all Arduino MiniPro; energy meter, relay, temp DS18B20

    What I found out is that the nodes are not hanging. There is just no communication to the gateway. Rebooting the node does not help. After a reboot of the node, pairing does not work.
    Reconnecting the gateway to FHEM does not help.
    A FHEM restart works (shutdown restart). After the restart all nodes show up again by themselves.

    Today I will build a new gateway, using ESP8266 and RS485.
    I have used a USB gateway before, with several problems. After I changed the gateway for NRF24 and RFM69 from USB to ESP8266, most of the problems were gone.

    I hope this will work. If not, RS485 is history, and for this use case MySensors will be replaced by 1-Wire.

    It will take one or two days to get the first results.

    Have a good day

    Stefan



  • @Stefan_NE
    Strange, imo using a serial GW is the most reliable option. (This may be different if the Nano is a bad fake; problems with CH340G Nanos have also been reported in VM environments.)

    Are all nodes referring to the correct IO, and how is the RS485 GW defined?
    Explanation: I also use a second GW and in some cases nodes are assigned to the wrong GW. If you use several /dev/ttyUSBx-defines, the IO may not be functional. See output of "ls -l /dev/serial/by-id".

    Most likely there is an electrical problem on your bus. Did you measure voltages, esp. between A+B? What type of modules or PCBs do you use?



  • @rejoe2
    good idea, that's what I thought in the meantime, and I set up a serial gateway with an original Arduino Mega using hardware serial for the RS485. It has been running for 90 min. now. Let's see what happens. The serial gateway is the only one of its kind I use; all other nodes use wireless gateways.



  • If you have more than one MySensors GW defined, imo it doesn't matter what type they are. Under some circumstances, nodes may be routed through the "wrong" GW. I would recommend checking that first (it may be irritating, but even with the wrong GW assigned as IO, some readings are nevertheless updated when the node is reset (presentation info)).
    Hardware serial is a good idea, but at least according to my personal experience (and opposite to my estimations in the beginning) my (FTDI-Nano-) GW is one of the most reliable parts in my MySensors-RS485-environment.
    Node_2 - my "troublemaker Nr. 1" - also performs reliably now (running without issues since 5+ days) since switched to HW-serial. But also a altsoftserial-Node with BME280 works at the same level of reliability for several weeks now (with less free memory left!) .
    Powering issues and capacitors may also be helpful as @nofox suggested. I may do some tests wrt this after switching to HW-serial for Node_2 in case it's still not performing as expected.

    Last: What modules do you use? In case of the LC-Tech ones, I would recommend to desolder at least the 120Ohm resistor on the "middle" nodes.



  • New status:
    The serial gateway based on the Arduino Mega with HW serial failed after 27 hours. Same way to fix it: I only needed to restart FHEM, and all nodes started working again without any reboot.
    I don't think it is a bus problem.
    The next try is a gateway with an ESP8266. The setup is done and all nodes are online. The major difference is the serial buffer of 256 bytes.
    Since I changed my wireless nodes over to the ESP8266, failures of these nodes have decreased a lot.


  • Mod

    Do you have an Ethernet shield? If so, try to build an Ethernet gateway with the Mega, and when it hangs, try to connect to it with MYSController and see if you get a response.



  • @Stefan_NE Did you measure voltage A-B before restarting FHEM/the attached Arduino?
    I also had very strange effects and was convinced not to have any electrical problem on the bus - I was completely wrong (see reports above).



  • One more update: This night my BME280-Node (Node_3 using altsoftserial) stopped transmitting - after around 8 days of operation... Strange!
    As Node_2 didn't crash completely in the past, I just desoldered the pullup and pulldown resistors that had been placed on that node (1k each). As I use the LC-Tech modules, the full set of resistors (including the 2*20k) now remains only on the GW and on the last module in line.
    Communication seems to be stable from all nodes. As this is just one more snapshot after only around one hour of operation, I'm pretty interested in what will happen next - or whether that's just another small step in whatever direction.



  • Hi! My nodes stop working at random, once a week, once a month etc. I am now upgrading my node sketches with watchdog timers, not with the avr/wdt library but with some code I found on the internet. I haven't uploaded the code to the nodes yet, but I have checked that the watchdog works by putting a delay() into the sketch. The problem is not that a node hangs; the problem is that it doesn't restart itself.



  • New update, no good news
    It was working for 10 hours, and after a restart for another 4 hours. I will stop using RS485 now. It is too weak for my use case, and I have spent too much time on this.
    A couple of weeks ago I switched from NRF24 to RFM69, which is very stable. I will set up a second RFM69 network for this use case.



  • One more: Node_2 has stopped transmission, so I resoldered for the use of HW-serial...
    I really suspect AltSoftSerial to be incompatible with 1wire (at least using PIN10).



  • Hi to all,

    after the change to RFM69 last Sunday, all 7 nodes are running without any lost connection. I added an alive message every 2 min to the sketch. No messages have been lost.



  • @Stefan_NE Good to hear you finally succeeded in getting a reliable network.

    News from my side:
    After 5 days of operation it seems Node_2 is back to reliable communication, and there are no more issues with Node_1 either. These two are the ones sending a lot of data, and both use HW serial + "triple-headed" message initialisation.

    BUT: Node_3 (BME280) is no longer continuously present, and Node_4 (SW serial + "single-headed") also seems to have communication problems (I didn't investigate in depth yet). So the next step will be to change these sketches to use the triple initialisation as well. If that works, I'll report - it may then be a good idea to change the defaults in the MySensors lib (to be discussed).



  • @rejoe2 how is it going with your RS485 network?



  • @pjr Short story: Still no satisfying results, but to be honest, I didn't spend too much time on that for now. The - for the moment - most important part (Node_1) works pretty reliably, the others I have to restart from time to time (Node_2 is always the first to fail)

    Longer story:

    • ordered some MAX487 chips to replace the MAX485 - this took some weeks from China and they still need to be soldered when there's time to do that...
    • GW (seems to work reliably by now):
      -- tried to use a Pro Micro with hw-serial as gw - didn't work as expected, I reported about that some weeks ago (may have been in the fhem-forum).
      -- Next step is to review it (Pro Micro, Nano or STM32F103) once more when replacing the transceivers and do some testing wrt resistor values
    • The timing on the nodes may also offer room for improvement - for now, my plan is to deliberately delay the startup procedures (or the first measurement) and pin the measurement times to fixed values. This should avoid overlap of the nodes' sending slots towards the GW as much as possible.
    • last step could be a review on powering issues, seems Node_1 at some point in time suffered from issues wrt that; maybe there are other nodes with similar effects too (all nodes have a lot of wires attached).
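
    The idea of pinning the measurement times could be sketched roughly as below: every node gets its own offset inside the shared reporting period, so two nodes never start sending at the same moment. The 5-minute period is taken from the node description at the start of the thread; the function names `slotOffsetMs` and `nextSendMs`, the zero-based node index and the equal-sized slots are assumptions made for illustration and are not part of the MySensors library. Clock drift between nodes is ignored here, so in practice the nodes would need a common time reference from time to time.

```cpp
#include <cassert>
#include <cstdint>

// Offset (in ms) of a node's send slot inside one reporting period.
// nodeIndex is a hypothetical zero-based position of the node on the bus.
uint32_t slotOffsetMs(uint32_t nodeIndex, uint32_t nodeCount, uint32_t periodMs) {
    assert(nodeCount > 0 && nodeIndex < nodeCount);
    return (periodMs / nodeCount) * nodeIndex;
}

// Next point in time (>= nowMs) at which the node may send.
// nowMs is the time since a reference shared by all nodes; offsetMs < periodMs.
uint32_t nextSendMs(uint32_t nowMs, uint32_t offsetMs, uint32_t periodMs) {
    const uint32_t phase = nowMs % periodMs;                        // position inside the current period
    const uint32_t wait = (offsetMs + periodMs - phase) % periodMs; // time left until our slot starts
    return nowMs + wait;
}
```

    With 5 nodes and a 300000 ms period, the slots start 60 s apart, which leaves plenty of room for a few dozen short messages per node.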


  • Short update, thx for reminding me there's still work to do 😁:

    • Moved pullup/pulldown resistors (440 Ohm) from one end of the network to the other (now: GW).
    • Replaced all MAX485 with MAX487 (all placed on LC-Modules, most of the resistors 5-7 are desoldered, the 120Ohm's remain only on GW and last node).
    • SOH-Count is now set to 3 on GW and Nodes 1 to 3 and 5, so only Node 4 (BME280) is remaining with default (1)

    At first sight, everything's working, and Node 2 for now seems not to fail as quickly as it did before these changes; but as always: whether this is really reliable over time, we'll see. So expect at least one more update, hope this will be the last 😀
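
    Whether a pair of pullup/pulldown resistors is strong enough can be estimated with a plain voltage divider. The sketch below assumes a 5 V supply, one pullup from A to 5 V and one pulldown from B to GND of equal value, two 120 Ohm terminators in parallel and no other load on the bus. The function names and the 200 mV receiver threshold (the usual RS-485 input sensitivity) are assumptions added here, not values measured on this network; the input resistance of the transceivers lowers the real figure somewhat.

```cpp
#include <cassert>
#include <cmath>

// Two end-of-line terminators appear in parallel between A and B.
double parallelOhms(double r1, double r2) {
    return (r1 * r2) / (r1 + r2);
}

// Idle differential voltage A-B (in volts) of an undriven bus.
// Divider: Vcc -> pullup -> A -> termination -> B -> pulldown -> GND.
double idleBiasVolts(double vcc, double biasOhms, double terminationOhms) {
    return vcc * terminationOhms / (2.0 * biasOhms + terminationOhms);
}

// Assumed threshold: receivers only guarantee a defined idle state for A-B >= 200 mV.
bool biasIsSufficient(double volts) {
    return volts >= 0.2;
}
```

    Under these assumptions, 440 Ohm against 60 Ohm of termination gives about 0.32 V, 1 kOhm gives about 0.15 V, and the 20 kOhm pair alone gives only a few millivolts.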



  • I had quite a strange problem with my smaller network. There is a Nano with an ENC28J60 shield as gateway, and there were 2 relay/FET nodes for controlling lights. Everything was fine until I added one light switch node to the network. After that, only the light switch was working. Strange...

    I disconnected all the nodes from the network and checked what was causing the problem. It was the GW. The voltage measured on the bus between A and B was ~2.5V, so it was pulling the bus to logical one all the time.

    Changed the RS485 module... no help. Then I measured the "MY_RS485_DE_PIN", for which I was using pin 2. The ENC28J60 shield was pulling the pin to 0.6V, and that was causing the RS485 module to drive the bus to ~2.5V. I changed the pin to 3 and now everything is working like a dream. Of course none of these nodes are sending all the time, so most likely there won't be any collisions.

    So when some node/bus is hanging next time, measure the voltage of the DE pin 😄



  • I am thinking about changing my MySensors nodes from NRF24L01+ to RS485 because of stability. But while reading this, I am not sure anymore whether this is a good idea.
    Does anyone have this up and running successfully already?



  • So once more a short update: No significant changes achieved by changing the transceivers and the placement of the resistors - only two of 4 nodes are working as expected 😭 , the 5th (Node_2) is still turned off to prevent possible interference with Node_1.

    As two of them have been online since my last post (around 18 days), I'm quite sure it's not a GW issue like the one @pjr reported, and as one of the nodes is powered from a different source than the other 3 and the GW, it also seems not to be power-related. So I'm running a little out of ideas on how to debug further 😭 .

    Now I'm thinking about reverting the baud rate to 9600 and - in case this doesn't help (which is most likely) - splitting the bus into two lines; this may help to find out what is going on with individual nodes.

    So @otto001 At this point in time I'd say: It really depends...
    If you have only a few nodes (2-3+GW) you want to attach, RS485 is a simple and secure option. But as soon as there are more, one failing will affect the entire communication - that's really no fun. So stay with nRF24 (or other wireless transceivers) for nodes just sending in data and try RS485 with a few important switching/security relevant nodes first.

    Just my2ct...


  • Mod

    The problem with cables and signals is that every environment is different and cables are different; there are a lot of possible causes that can screw up communication on the bus.



  • @gohan I absolutely agree. Wrt wiring: Most of the wires I use are twisted pairs of CAT6 network cables (one pair for signal, and - when also distributing 12V - one for 12V+GND). Some newer parts (trouble began before that) are 4-wire telephone cables with around the same copper diameter per single line.
    Connections: Just one Wago between GW and Node_1; the others are either screwed directly to the modules or form short stubs (<20cm) from a Wago clamp with three connections (in/stub to node/out).

    So if you see room for improvement, suggestions are welcome 😄


  • Mod

    Unfortunately I can't add much as I haven't had my hands on the RS485 network yet



  • @rejoe2

    That is why I always suggest using CAN bus drivers instead of RS485 bus drivers.
    A CAN bus driver adds some safety, because it disconnects the microcontroller from the bus in hardware if it sends the dominant state for too long (when the program hangs etc.).
    So a single node cannot break all communication on the bus.

    And try node IDs other than 1 - 4.
    In the wrong situation they may collide with the packet framing characters, taken from the standard ASCII table, that the RS485 transport protocol uses:

    #define SOH 1
    #define STX 2
    #define ETX 3
    #define EOT 4
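
    To make the point about the framing characters more concrete, here is a rough sketch of how such a frame could be put together. The byte order used below (SOH repeated, destination, source, command, length, STX, payload, ETX, checksum, EOT) is an assumption modelled on ICSC-style serial framing and should not be read as the confirmed MySensors wire format; `buildFrame` and `looksLikeFramingByte` are hypothetical helper names. The `sohCount` parameter mirrors the "triple-headed" initialisation mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Framing characters as quoted above.
constexpr uint8_t SOH = 1, STX = 2, ETX = 3, EOT = 4;

// Build one frame (assumed layout, see text). The checksum is the 8-bit sum
// of destination, source, command, length and all payload bytes.
// Payloads longer than 255 bytes are not supported by this sketch.
std::vector<uint8_t> buildFrame(uint8_t dest, uint8_t source, uint8_t command,
                                const std::vector<uint8_t>& payload, unsigned sohCount) {
    std::vector<uint8_t> frame(sohCount, SOH);
    const uint8_t len = static_cast<uint8_t>(payload.size());
    uint8_t checksum = static_cast<uint8_t>(dest + source + command + len);
    frame.insert(frame.end(), {dest, source, command, len, STX});
    for (uint8_t b : payload) {
        frame.push_back(b);
        checksum = static_cast<uint8_t>(checksum + b);
    }
    frame.insert(frame.end(), {ETX, checksum, EOT});
    return frame;
}

// True if an address has the same value as one of the framing characters.
bool looksLikeFramingByte(uint8_t address) {
    return address >= SOH && address <= EOT;
}
```

    In a layout like this, a node ID between 1 and 4 puts a byte into the header that looks exactly like SOH, STX, ETX or EOT, which only matters once a receiver has lost sync and is hunting for the next frame start.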



  • @kimot Thx for referring to CAN.
    Some questions and remarks on that:

    • The node IDs assigned in reality are 97 and higher; the node numbers mentioned here are just for simplifying the explanation by following the physical order in which they are attached to the bus.
    • How to set up a CAN network with MySensors? I saw some suggestions on that in the past, but that seemed not to be "ready to use" code and hardware. So is there an option to just replace the MAX48x with a different chip and use the MyS RS485 communication layer?

    I have some MCP2515 modules lying around, but these use SPI as the connection to the MCU and would require an appropriate communication layer in the sketches (at least as far as I understood).

    But in general, standard RS485 also claims to be robust and not rocket science. So it's really frustrating to experience that amount of problems and drawbacks.

    EDIT: I found this thread: https://forum.mysensors.org/topic/5327/can-bus-transport-implementation-for-mys. Most likely, really understanding most of its content will need a lot of rereading. But as far as I understood, integration of CAN would still need a lot of development?



  • @rejoe2
    I don't mean the CAN protocol.
    Only CAN bus drivers:
    ebay

    RS485 is robust, but not designed for multi-master communication, where two nodes can occupy the bus at the same time. CAN bus drivers are designed for this situation.

    You can use CAN drivers like RS485 ones, just forget about RE and DE.
    A CAN bus driver always listens.
    And you must use higher speeds (57600) because the driver cuts off the controller if it sends the dominant state for longer than 250 μs (a 0x00 byte must be sent faster than this timeout).
    Or use the MCP2551, where this time is 1.25 ms (9600).
    ebay
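
    The baud rate limits above follow from simple arithmetic, sketched below. It assumes 8N1 framing and that a low TXD level maps to the dominant bus state, so a 0x00 byte holds the bus dominant for the start bit plus eight data bits, i.e. 9 bit times. The 250 μs and 1.25 ms time-outs are the figures quoted in the post; the function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Longest dominant (low) stretch a UART can produce with 8N1 framing:
// the start bit plus the eight zero data bits of a 0x00 byte = 9 bit times.
double worstCaseDominantMicros(double baud) {
    const double lowBits = 9.0;
    return lowBits * 1e6 / baud;
}

// True if a 0x00 byte fits inside the transceiver's dominant time-out.
bool baudIsSafe(double baud, double timeoutMicros) {
    return worstCaseDominantMicros(baud) < timeoutMicros;
}
```

    At 57600 baud the worst case is 156.25 μs, which fits the 250 μs limit. At 19200 baud it is 468.75 μs, too long for a 250 μs time-out but fine for 1.25 ms, and at 9600 baud it is 937.5 μs, which still fits 1.25 ms.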


  • Mod

    Did you try whether they work well with MySensors?



  • @kimot Thanks for the clarification, I just ordered a bunch of TJA1050 modules and a couple of bare MCP2551 chips (they seem to be pin-compatible with the TJAs) 😁 . That may take some time all the way from China (New Year is coming...).
    The next step will then be to change the baud rate to 57600 (seems to be the upper limit when using software serial, as needed for my Nano GW. Btw.: I did some really disappointing tests with a Pro Micro GW that didn't seem to work; I will most likely have to make another attempt at this to use HW serial 😁 ).

    Then I'll replace the MAX48x-modules by these TJA's and see, if everything's fine then.

    Just one remark: If it's as easy as that, wouldn't it be good to just recommend using that type of module as a standard instead of the problematic standard RS485 types?

    EDIT: One more question: Mixing both (or all three) types of transceiver should be possible, or am I wrong? (This would not completely eliminate the MAX485-disadvantages, that's clear to me)



  • @rejoe2 :
    Thanks a lot!
    I have some unusual nodes (washing machine monitoring, entry system with fingerprint) for which I use self-written sketches with MySensors. Maybe I should wait. It is just annoying that the radio stuff does not always work as it should. Indeed, ESP with MQTT could be a solution too. But I do not know yet about the stability of the ESP stuff 😞

    Cheers,
    Otto



  • @pjr Is the network still operating ok? Curious to see if anyone has it fully operational as of yet.



  • @mick It's still in use. I had one bus freeze since the last "update post".

    The gateway pulled the line up to 190mV, which seems to be enough to freeze traffic. After a "reboot" of the gateway it was still pulling the bus to 160mV, so it must be some solder in the wrong place or bad Chinese pin headers or protoboards... I have to rebuild the "motherboard" of the gateway.



  • @pjr that sounds promising. My serial gateway and one node ran for months with no problems. I’ve since added a node in between the two, and that’s when I started having issues. Removing the resistors on the middle node (terminating, pull-up and pull-down) and removing the pull-up and pull-down on the end node seems to have fixed the issues for now; however, when I added an extra 20m to the cable run I started having issues again. I’d love to get this network working 100%, but I think a lot of persistence and patience is needed! 🙂



  • Hints on RS-485 networks:

    Termination resistors
    You always need them. Termination resistors should be added to the nodes located at the ends of the line. Communication may work without them if the wire is short enough and/or the bit rate is low.

    Pull up/down resistors a.k.a failsafe bias resistors
    Why you need it is well explained here: https://electronics.stackexchange.com/a/284788/88486
    When you need it: It depends on the RS485 transceiver IC. Most modern transceivers include these.

    Common ground
    Do you have a common ground between your transceivers? RS-485 is not a 2 wire network. Besides the A-B lines it requires ground.
    See: http://store.chipkin.com/articles/rs485-rs485-cables-why-you-need-3-wires-for-2-two-wire-rs485
    Schematics at http://www.analog.com/media/en/technical-documentation/application-notes/AN-960.pdf page 4.

    Isolation
    When you deal with long links, you have to take care of isolation.
    Page 8 at http://www.analog.com/media/en/technical-documentation/application-notes/AN-960.pdf



  • @rejoe2 - have you received the CAN modules yet? I’m very interested to see how they go



  • @mick Thanks for coming back to this topic. The stuff already arrived some weeks ago, but unfortunately I was busy with other things, amongst them btw. a modification of the MySensors plugin for FHEM to show problems more easily. So finally some soldering work was on my list yesterday, and today I can at least give some first impressions:

    • Network is @57600 for the use of the TJA1050 modules.
    • Soldered some adapter boards for direct replacement of the modified LC-Tech MAX487 modules (crossed RX/TX as usual, there's no DE/RE; later I also desoldered the 120 Ohm resistor)
    • plugged it into Node_2's socket, and

    WOW: It worked!

    Tried a second one on the same node (module without the 121R): Also worked 😄

    So I continued and tried to replace the transceiver at Node_3 with the tested (now resistorless) one:

    NO LUCK!

    Continued with the last transceiver in line, the one holding the pullup/pulldown resistors: Also no luck 😒

    So I decided to stop at that point to see how Node_2 will perform over night and meditate on the question why the behaviour on the other two nodes is different. Kept Node_1 and Node_4 not connected to the bus, so starting the experiment just with 3 nodes and the GW online.

    This morning's findings:

    • (Node_2 has spontaneously rebooted at least once, but was still sending in data. I'd rate this as some kind of partial success, but obviously the node needs to be reviewed; most likely there is a problem not related to RS485.)
    • (Node_5 was offline for whatever reason, no issues - as always - on Node_3)
    • These transceivers can be used for MySensors@RS485
    • They can be used in a mixed network together with normal transceivers
    • BUT: (hypothesis) They may not be used in combination with altSoftSerial; Node_2 is HW-Serial, the other two still use software, so for them DE/RE might be essential (ISR?)

    Additional remarks: I tried also a different power source for the 12V a couple of weeks ago - no difference (MAX487-only bus)
    The GW was online during all of the time, it's uptime is 7 days by now (also SW-serial).

    So I'll have to do some more testing on some of the topics and perhaps also build some more test nodes to avoid touching my "normal" bus too often - this seems to also cause additional trouble...

    I'll come back when I know more, but again, this will take some time.

    @bakcsa Thanks for this great summary!



  • I think the problem is not on the RS485 side, not the cabling etc. I have similar problems on my RS485 network: some nodes stop sending data for no reason after some time (sometimes 2 days, sometimes 4 weeks). I have changed from AltSoftSerial to HWSERIAL and it's the same... Maybe it's about the ENABLE pin 2? Has anyone tried changing this pin to another one?



  • @nofox Did you measure voltages on the bus?

    Some additional info: I did some further experiments, also using the MCP2551 as transceiver chip. As they are pin-compatible with the TJAs, it was just a modification of the modules. I can confirm these work as well.
    General "restriction" is one has to use HW-serial. For me, that's ok, but if you want serial output for debugging, you also have to use altSoftSerial and use it for debug output.

    This morning, I replaced the end-of-line module with a resistorless one (so also NO termination resistor is used on this side, only at the GW!).

    At least, the bus has been working for some minutes now @57600 with just the GW (MAX487) and one node for each of the transceiver chips (MAX487, MCP2551 and TJA1050).

    If that works more reliably, I'll switch all nodes to MCP2551 and lower the transmission rate, most likely to 19200 baud. That will again take some time, but I'll keep you updated. For this I plan to use an STM32F103 as GW; if possible, I'll also try to expose the other two serial interfaces to the OS, so one could attach up to two other GWs (for other physical transport layers) to the STM using just one USB connection. But that's another project on another planet, and if there's someone out there with more experience on how to do that: Thanks a lot...



  • Does anyone know how to integrate https://github.com/MichaelJonker/HardwareSerialRS485/wiki into MySensors? It can operate in multimaster mode and avoid collision issues.



  • Found this some days ago: https://github.com/mysensors/MySensors/pull/1142

    Adding these changes to a 2.3.0-alpha base seems to make a big difference in reliability 😀 .

    So if there are other users having similar problems: please test this patch as well.

    You will get further updates and some more info on my recent setup, so far: Thanks a lot for all the ideas and hints to improve things!



  • As everything still seems to work as expected, some further remarks on my findings/hypotheses and today's setup:

    General remark: As a lot of things have changed over time and some of my tests turned out to be counterproductive, it's hard to sort out THE root cause now. But as others had some issues with colliding messages too, I'd bet on that and would really appreciate it if patch #1142 found its way into everyone's codebase (@seeers seems to have some issues at GitHub, can one of the mods help him out with that, please?) 😀.

    Back to my setup:

    • Nodes are powered now through a central 5V DC supply, only 3.3V conversion remains locally, GW through USB
    • GND of central 5V is not connected to GW GND (in case of trouble, I'd add a "resistored" connection)
    • All nodes use Hardware-Serial, Baudrate is @19200
    • No debugging messages on nodes activated (if someone wants/needs it: swap debugging output to altsoftserial...)
    • Transceiver used: MCP2551, most of them on modded TJA1050 boards (don't forget to desolder the R120) => no DE pins necessary
    • Gateway is a Pro Micro
    • Termination resistors: 2kOhm at the last node in line (CANhi->5V and CANlo->GND), 120Ohm (A-B) only at last node and GW
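
    For anyone who wants to try a similar setup, a node sketch header along these lines might be a starting point. It is only a sketch under assumptions: it presumes a board whose single UART (`Serial`) is given to the bus, uses the 19200 baud from the list above, an example node ID of 97 and the SOH count of 3 mentioned earlier in the thread; the sketch name and version strings are placeholders. The exact options can differ between MySensors versions, so check MyConfig.h of the release you use.

```cpp
// Configuration must come before the MySensors include.
#define MY_RS485                     // select the RS485 transport
#define MY_RS485_HWSERIAL Serial     // hardware UART used for the bus (assumed: the board's only UART)
#define MY_RS485_BAUD_RATE 19200     // must match the gateway
#define MY_RS485_SOH_COUNT 3         // "triple-headed" start, library default is 1
// #define MY_RS485_DE_PIN 2         // only needed for MAX48x modules; CAN transceivers have no DE/RE
// MY_DEBUG stays undefined: the UART is busy with the bus, so no debug output on it.
#define MY_NODE_ID 97                // example static ID, adjust per node

#include <MySensors.h>

void presentation() {
    sendSketchInfo("RS485 test node", "0.1");  // placeholder name and version
}

void setup() {}

void loop() {}
```

    On a board with a second UART, debug output can stay on `Serial` while the bus moves to `Serial1`.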

    Remarks:

    • My long-time GW was a regular MAX485/MAX487 Nano using AltSoftSerial. IMO this has been one of the most reliable components in my setup over time. Conclusion: Most likely this lib is NOT to blame for any trouble I ran into.
    • Nevertheless, hanging nodes with regular RS485 transceivers caused problems to some extent. So I really like the CAN logic of switching the transceiver off when its MCU seems not to work properly, to avoid disturbing communication from other nodes. What I didn't test yet: using CAN transceivers together with AltSoftSerial. So expect some additional info on that later, as this might be helpful for debugging over USB as on any other node
    • To some extent all of the transceivers seem to be able to transmit even when the A-B voltage level is above the "critical value" - it depends a little on the strength of the power source. So I'd now see this just as some kind of "most likely" indicator of a bus problem.

    Hope someone might find that summary helpful and once again:

    Thanks a lot to all those people here and at the FHEM forum trying to help me out of that never-ending mys(t)ery! 😁 👍



  • Thanks for helping @seeers at Github.

    Still no communication issues to report, so just two additional remarks on the MCP2551 usage following some short tests on that:
    They also seem to work at 9600 Baud and also using the MySensors standard (AltSoftSerial at PINs 8 and 9) is possible.

    So for now, I'd recommend everybody thinking about new nodes to give this type of hardware a try. The only disadvantages may be

    • the limitation in payload - but that seems not to be a practical issue when used in a MySensors environment
    • they might be more sensitive to swapped BusHigh and BusLow connections, so make sure all highs are on the same wire and also all lows 😁

    My favorites are kind of violet modules. Didn't come across them earlier, so I'd see all PINs on one side as a modest point for improvement. As there are also no resistors at all, just pads for soldering, there's also not the need of starting with desoldering unnecessary parts - just make sure, you have the right stuff to add 😜 .



  • Everything's working like a charm now 😀 , so the thread is marked as [solved].
    One more big "Thank you" to everybody who helped to finally get this done 🎊 👍 !

    In addition, a picture of my Node 2 - the one I have reworked the most over time - now also showing the violet MCP2551 module mentioned above.
    0_1530172133491_MCP2551_Node_A.PNG


 
