Problems with ENC28J60 losing connection/freezing (using UIPEthernet or etherShield)? READ THIS!
-
Well, for me it was actually @ntruchsess's latest UIP release prior to fix_errata12 that did the real fix. I had stopped using my Enc-Uip-gw a few weeks before due to increasingly frequent hang-ups (from weekly/daily to hourly). (I wrongly thought then that it was a memory issue and planned to try without the bootloader.) Unfortunately I don't remember my uptime before fix_errata. I just know it was more than enough to conclude that (one) problem was solved.
I'm using this Nano and this Enc28j60. I have 10 nodes with 1-3 sensors each. Mean update period ~2 min. Datamine logs 40 channels to the NAS.
-
@m26872 that sounds interesting, what kind of sensors do you use? Also, how do you power the enc28? I'm using a 3.3V Mini Pro and this enc28. Vraw on the Arduino and Vcc on the enc28 breakout are connected to 5V from the USB adapter. Currently my sensors (DS18B20) are connected directly to my computer for logging, but my plan is to connect them via this Arduino instead.
-
Shouldn't the ENC28J60 be connected to 3.3V? I know it is 5V tolerant, but still. Maybe it makes a difference?
@frol Have you been able to run some tests with ESTAT?
-
@frol Then maybe you should try a Nano instead. I tried really hard to get it working with my Arduino Pro Minis (3.3V, 5V or both, I don't remember) at first, but with the Nano it worked right away. I know it sounds silly... and I can't explain why, maybe a power issue? Now the Nano is powered from a "wall wart" to its USB port and the enc module from the Nano's 5V pin.
My nodes look like this with some variations. Sensors used are mostly DS18B20 (one or more per node), DHT22, BMP180, and a digital switch.
@Thomas-Ihmann My Enc module (linked above) has "5V" printed on the pin and that's what it wants. I've tried 3.3V but then the power LED and everything else is dead.
-
@frol Thank you for the data.
Well, it looks as if the workaround for errata 12/13 doesn't work as described. TXERIF is never set in the 4 uip_debug logs :-(, hence the timeout. (BTW: it would make sense to return false when the timeout occurs, so the packet can be retransmitted. As it stands, because TXERIF is not set, sendPacket returns true even in case of error.) The strange thing is that even if a packet is not transmitted (and that goes undetected), the next outgoing packet will reset the transmit logic anyway, but this doesn't seem to re-enable transmission, does it? (Is any output truncated after the timeout?)
Maybe one should poll ESTAT for TXABRT instead of TXERIF (or both at the same time)?
- Norbert
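The idea discussed above (poll both TXERIF and TXABRT, and report failure when the wait times out) can be sketched roughly as follows. This is a host-side illustration, not the UIPEthernet code: `ChipIo`, `waitForTransmit`, `TxResult` and `sendSucceeded` are hypothetical names, and the register reads are injected as callbacks so the logic can be exercised without hardware. Only the bit masks are taken from the ENC28J60 datasheet.

```cpp
#include <cstdint>
#include <functional>

// Bit masks as given in the ENC28J60 datasheet.
constexpr uint8_t EIR_TXIF     = 0x08;  // transmit finished
constexpr uint8_t EIR_TXERIF   = 0x02;  // transmit error
constexpr uint8_t ESTAT_TXABRT = 0x02;  // transmit aborted

enum class TxResult { Sent, Aborted, TimedOut };

// Hypothetical accessors standing in for the real SPI register reads
// and for millis(). They are assumptions made for this sketch.
struct ChipIo {
    std::function<uint8_t()>  readEir;
    std::function<uint8_t()>  readEstat;
    std::function<uint32_t()> nowMs;
};

// Wait until the chip reports completion, an error, or the timeout expires.
TxResult waitForTransmit(const ChipIo& io, uint32_t timeoutMs) {
    const uint32_t start = io.nowMs();
    while (true) {
        const uint8_t eir   = io.readEir();
        const uint8_t estat = io.readEstat();
        // Check the error flags first, so an abort is never reported as success.
        if ((eir & EIR_TXERIF) || (estat & ESTAT_TXABRT)) return TxResult::Aborted;
        if (eir & EIR_TXIF) return TxResult::Sent;
        // Unsigned subtraction keeps working across a millis() rollover.
        if (io.nowMs() - start >= timeoutMs) return TxResult::TimedOut;
    }
}

// A timeout is treated like any other failure, so the caller can retransmit.
bool sendSucceeded(TxResult result) { return result == TxResult::Sent; }
```

Returning a three-way result rather than a plain bool keeps the timeout case visible in debug logs, which is the case that shows up in this thread without any error flag being set.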
@ntruchsess I have now added polling of ESTAT as well. When the timeout happens nothing different is seen (no TXABRT). I do see some TXERIF in my logs, seemingly caused by collisions, and it looks like they are handled correctly. I pushed new logs and a diff to GitHub, and I would be very grateful for any more tips.
Currently I am testing returning false when the timeout happens, as you suggested. I'll let you know the results tomorrow.
@Thomas-Ihmann the enc28 chip can only handle 3.3V, but the breakout board I use has a dedicated 3.3V regulator.
@m26872 maybe I should try a different Arduino. But I still believe these should work, so I won't give up yet. I also suspected power issues before, but now I have one 3.3V regulator for the Arduino and one for the enc28, so I don't think that should be an issue anymore.
Nice sensor node, very similar to the design I had in mind for my own (even the same box etc.) :-)
-
I'm using an Arduino Nano and still they are losing connection regularly :-(
-
@Thomas-Ihmann About how often? ~4hrs every time? Does it come back by itself and if so what's the downtime?
-
After 2-8 hours. Then I have to power-cycle it; until I do, it doesn't respond to ping.
-
@ntruchsess after I added the timeout my sketch has been running for almost 24h without crashing. The timeout has been triggered 19 times or so, and the library has recovered nicely. I committed my changes to my fork of the repo:
https://github.com/frolswe/arduino_uip/tree/fix_errata12
I don't really think this is a fix for the problem. The hang should not happen at all, but maybe we can use it as a workaround.
@Thomas-Ihmann please try this branch, hopefully your issue is the same as mine :)
-
@frol I will try it as soon as I am back at home, probably tomorrow evening. At the moment I am abroad without access to the Arduino..
-
@frol Thank you for the data. Too bad TXABRT is not set when the timeout occurs. I have to give this workaround a second thought. I hope I can somehow avoid the 1 sec busy wait. It's not about having unresponsive Ethernet for 1 sec, but it stalls any other processing during that time as well :-(
I merged your pull request so others can test it easily.
- Norbert
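One possible way to avoid the busy wait mentioned above is to record when the transmit was started and check the flags once per pass through the main loop. The following is only a sketch of that idea under stated assumptions: `TxMonitor` and `TxState` are hypothetical names that do not exist in UIPEthernet, and the two flags are passed in as plain booleans instead of being read from the chip.

```cpp
#include <cstdint>

enum class TxState { Idle, Pending, Sent, Failed };

// Hypothetical helper: tracks one outstanding transmit without blocking.
class TxMonitor {
public:
    explicit TxMonitor(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call right after the transmit request has been issued.
    void begin(uint32_t nowMs) {
        state_ = TxState::Pending;
        startMs_ = nowMs;
    }

    // Call once per loop iteration; returns immediately in every case.
    TxState poll(uint32_t nowMs, bool txDone, bool txError) {
        if (state_ != TxState::Pending) return state_;
        if (txError) {
            state_ = TxState::Failed;
        } else if (txDone) {
            state_ = TxState::Sent;
        } else if (nowMs - startMs_ >= timeoutMs_) {
            // No flag was ever set: treat it as a failure after the timeout.
            state_ = TxState::Failed;
        }
        return state_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t startMs_ = 0;
    TxState state_ = TxState::Idle;
};
```

The trade-off is that the caller has to keep the packet buffer untouched while the state is `Pending`, which a blocking wait gets for free.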
-
@ntruchsess Thank you very much. I have uploaded the new version to the three Arduinos just now. So we just have to wait and see....
-
All Arduinos: Running for almost 12 hours, looking good so far :-) :+1: . I will keep you updated.
Update: 18h and no lost connection.
Update 2: 24h and no lost connection, looking good.
-
@Thomas-Ihmann that sounds great, thanks for reporting back :-)
I'm not so lucky. But I am pretty sure my current problems are unrelated to this bug. I haven't had time to debug it further though.
-
I was unlucky as well: after 25h I got a disconnection with one Arduino, though that is the one which used to disconnect after a few hours (I am using the Arduinos in connection with FHEM). Any ideas for further investigation?
-
Hmm..., guess mine isn't perfect after all.
At 2:30 last night my sensors stopped reporting. After I power-cycled the gateway and reloaded vera-luup it was up again. Both the Vera and the gateway had been running continuously for at least a week before this happened, and it's not impossible that it's related to my use of the Datamine plugin prior to this (due to the datamine-to-NAS logging). I'll keep a close eye on it from now on.
-
@frol, @Thomas-Ihmann, @m26872
My sensors are still working after I changed the code according to my own findings and solution from the first post. That means I don't have a while() loop which could hang the whole system. In my case I probably don't always get feedback if the packet transmitted OK or not and I probably will lose some packets too. But my gateway never hung itself either :).
I might have some hints for debugging you guys may or may not already have checked.
- Don't just reset the gateway, try other things first.
- Like pinging the gateway
- If you can ping the gateway, try Reload(ing) the Vera system which also will restart the MySensors-plugin and re-establish the connection. Check if this helps.
- Make sure the sensor(s) are still working, start the serial monitor on the gateway and check for incoming messages (might add some debugging code for that)
- DON'T activate the DEBUG-flag in MyConfig.h because this definitely will 'break' the gateway.
- In case you are using a pressure sensor (BMP085/180), make sure NOT to use the sample() function, because this will/might make your sensor hang after 180 minutes.
I didn't test Norbert's fix myself, might do this when I feel like it... I have my system running perfectly now and am logging with DataMine, so I would rather not f... this up.
-
I discovered that if I power on my Vera and Enc gateway without my NAS connected (while it's still mounted for use by Datamine), the gateway will no longer respond to ping.
@MagKas I will try your while-loop solution... when I feel like it. :-)
-
@MagKas great that your sensors are working correctly.
My sketch to debug this is basically a loop that opens a TCP connection to my server each second, sends a timestamp, and closes the connection, so not much else is involved. I think my current problem is that not all code paths in the library that call sendPacket() correctly handle sendPacket() returning FALSE. The reason I say this is that, in some cases, I can see the memhandle used when calling sendPacket() increasing when FALSE is returned. Eventually the library runs out of available memhandles and my sketch stops working. (This is mostly speculation, as I haven't debugged or understood it well enough yet.)
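To illustrate the kind of leak suspected above, here is a small host-side sketch. It is not the library's memory pool: `HandlePool`, `sendLeaky` and `sendAndRelease` are hypothetical names, and the pool size of 4 is arbitrary. It only shows how a path that skips the release on a failed send eventually exhausts a fixed set of handles.

```cpp
#include <array>
#include <cstddef>

constexpr int kNoHandle = -1;

// Hypothetical fixed-size pool standing in for the library's memhandles.
class HandlePool {
public:
    int allocate() {
        for (std::size_t i = 0; i < used_.size(); ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return static_cast<int>(i);
            }
        }
        return kNoHandle;  // pool exhausted
    }

    void release(int handle) {
        if (handle >= 0 && static_cast<std::size_t>(handle) < used_.size()) {
            used_[static_cast<std::size_t>(handle)] = false;
        }
    }

    int inUse() const {
        int count = 0;
        for (bool slotUsed : used_) {
            if (slotUsed) ++count;
        }
        return count;
    }

private:
    std::array<bool, 4> used_{};
};

// Suspected pattern: the handle is only released when the send succeeds.
template <typename SendFn>
bool sendLeaky(HandlePool& pool, SendFn send) {
    const int handle = pool.allocate();
    if (handle == kNoHandle) return false;
    if (!send(handle)) return false;  // early return leaks the handle
    pool.release(handle);
    return true;
}

// Every path releases the handle, whatever the send reports.
template <typename SendFn>
bool sendAndRelease(HandlePool& pool, SendFn send) {
    const int handle = pool.allocate();
    if (handle == kNoHandle) return false;
    const bool ok = send(handle);
    pool.release(handle);
    return ok;
}
```

With a send function that always fails, `sendLeaky` uses up all four handles after four calls, while `sendAndRelease` leaves the pool empty no matter how often it is called.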
Is the error return value in sendPacket() needed?
Even if the packet is sent correctly, it may be dropped by the network, so the library still needs to handle lost packets. Knowing that a packet failed to send is just a special case of a lost packet, which I don't think really needs to be handled separately. @MagKas's solution without the while loop basically does this by returning OK even when the enc chip's transmit logic freezes and doesn't send the packet, and that appears to work. The enc chip will always be reset before sending the next packet, so it won't hang in any strange state. (I can imagine one problem when sending large packets and re-entering sendPacket() before the previous packet has finished transmitting. I don't know if this actually happens in real life.)
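The approach described above (reset the transmit logic, start the transmit, return without waiting) might look roughly like this. It is a sketch on a fake register set, not @MagKas's actual change: `FakeEnc` and `sendFireAndForget` are hypothetical, and real code would use the library's SPI register operations instead of plain variables. The bit masks are taken from the ENC28J60 datasheet.

```cpp
#include <cstdint>

// Bit masks as given in the ENC28J60 datasheet.
constexpr uint8_t ECON1_TXRST = 0x80;  // transmit logic reset
constexpr uint8_t ECON1_TXRTS = 0x08;  // transmit request to send
constexpr uint8_t EIR_TXIF    = 0x08;  // transmit finished
constexpr uint8_t EIR_TXERIF  = 0x02;  // transmit error

// Hypothetical in-memory stand-in for the chip registers.
struct FakeEnc {
    uint8_t econ1 = 0;
    uint8_t eir = 0;
    unsigned txLogicResets = 0;
};

// Start a transmit and return at once; lost packets are left to the
// upper layers (for example TCP retransmission).
bool sendFireAndForget(FakeEnc& chip) {
    // Pulse TXRST so a previously frozen transmit cannot block this one.
    chip.econ1 |= ECON1_TXRST;
    chip.econ1 &= static_cast<uint8_t>(~ECON1_TXRST);
    ++chip.txLogicResets;

    // Clear stale status flags left over from the previous packet.
    chip.eir &= static_cast<uint8_t>(~(EIR_TXIF | EIR_TXERIF));

    // Request transmission, with no wait loop afterwards.
    chip.econ1 |= ECON1_TXRTS;
    return true;
}
```

Because nothing waits for TXIF, the re-entry concern raised above still applies: a second call while a large packet is still going out would reset the transmit logic in the middle of that packet.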
Maybe @ntruchsess or anyone else can educate me as to why the return value is needed or why it is a bad idea to remove it?