Problems with ENC28J60 losing connection/freezing (using UIPEthernet or etherShield)? READ THIS!
-
Hi all!
I just found a bug/flaw in the code for the ENC28J60 chip which is widely used in combination with Arduino cards.
The problem I encountered was that the MySensors plugin in my Vera Lite controller stopped updating values from sensors in an inexplicable way. At first I thought it was because of the NRF24L01 radios, which can become unstable in some cases. A restart of the gateway solved the problem temporarily.
But while debugging I noticed that the messages from the sensors were arriving at the gateway but not reaching the plugin. The gateway didn't hang, but the ethernet part controlled by the ENC28J60 chip just stopped sending packets to the Vera controller (MySensors plugin). The best way to test this is to ping the gateway from your PC. If you don't get a response you probably have the same problem as I had.
This 'hanging' would sometimes occur after a few minutes, and sometimes it could take hours between two hangs.
After many hours of debugging and searching the internet I found that a lot of people are experiencing the same problem, and most of them simply changed to a W5100 based ethernet card. That card also seems to have some problems in combination with the NRF24L01 radio, but that's a different story.
I then simplified the sketch in the gateway so that it is not a real gateway anymore but just sends a 'temperature message' for Node 1 to the gateway every second. This way I was certain that the problem was only in the ENC28J60 chip or code.
And now to the solution:
Most of the developers who have made or adjusted libraries for the ENC28J60 chip are more or less aware of the problems which come with this chip and which are written down in the following document from Microchip:
http://ww1.microchip.com/downloads/en/DeviceDoc/80349c.pdf
It is the ENC28J60 Silicon Errata and Data Sheet Clarification, and the above problem is caused by point 12 (and maybe even point 14?).
Looking at the code for the sendPacket() function (in Enc28J60Network.cpp) in the UIPEthernet library, you will find some lines of code which ought to take care of this problem, but this fix is implemented incorrectly:
// TX start
writeRegPair(ETXSTL, start);
// Set the TXND pointer to correspond to the packet size given
writeRegPair(ETXNDL, end);
// send the contents of the transmit buffer onto the network
writeOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRTS);
// Reset the transmit logic problem. See Rev. B4 Silicon Errata point 12.
if( (readReg(EIR) & EIR_TXERIF) )
{
  writeOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRTS); // note the last two letters: TS
}
In the errata document it says that you have to "reset the internal transmit logic" BEFORE setting the TXRTS flag. In the code above you will see that the reset code comes AFTER the code for setting this flag. The second problem is that the wrong flag/bit is used for this, namely the ECON1_TXRTS bit instead of the intended reset bit ECON1_TXRST. It's just a mix-up of the last two letters S and T, but because of it the fix doesn't work at all.
Because all of the libraries available for the ENC28J60 are based on each other, the faulty fix has been copied again and again. After looking for just this fix I found that there are a few different versions of it, and the ones from the tuxgraphics.org, EtherCard and NanodeUIP libraries are the same and the best ones.
The only problem I encountered when implementing their fix is that a few times per day my whole Arduino hangs (deadlock). I also have the same problem with my weather station based on the tuxgraphics board. There I 'solved' this by enabling the watchdog. I removed their while() loop and the problem of the hanging Arduino disappeared.
So, after this VERY long explanation, my solution for everyone experiencing problems with this ENC chip is to change or add the following lines in the enc28j60.cpp or Enc28J60Network.cpp file for the function sendPacket(), like this:
// Check no transmit in progress
// while (readOp(ENC28J60_READ_CTRL_REG, ECON1) & ECON1_TXRTS) // Might lead to deadlocks and not explicitly advised by Microchip Errata point 12 so commented out this, MagKas 2014-10-25
// {
// Reset the transmit logic problem. See Rev. B4 Silicon Errata point 12.
if( (readReg(EIR) & EIR_TXERIF) )
{
  writeOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRST);
  writeOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRST);
  writeOp(ENC28J60_BIT_FIELD_CLR, EIR, EIR_TXERIF); // Might be overkill but advised by Microchip Errata point 12, //MagKas 2014-10-25
}
// }
This code has to come BEFORE setting the ECON1_TXRTS flag according to Microchip. And don't forget to remove all other existing code in sendPacket() function that tried to fix this problem.
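To see why the order and the bit both matter, here is a minimal off-target sketch in C++. It is not library code: `FakeEnc`, `sendFaulty` and `sendFixed` are hypothetical names, the mask values are the ones I read from the ENC28J60 data sheet (check them against your library's header), and the "stalled until TXRST is pulsed" behaviour is a deliberately simplified assumption about the chip.

```cpp
#include <cstdint>

constexpr uint8_t ECON1_TXRST = 0x80; // transmit logic reset
constexpr uint8_t ECON1_TXRTS = 0x08; // transmit request to send
constexpr uint8_t EIR_TXERIF  = 0x02; // transmit error interrupt flag

// Toy model of the chip. ASSUMPTION for illustration only: after a transmit
// error the transmit logic stays stalled until TXRST is pulsed.
struct FakeEnc {
  uint8_t econ1 = 0;
  uint8_t eir = 0;
  bool stalled = false;
  int packetsSent = 0;

  void injectTxError() { eir |= EIR_TXERIF; stalled = true; }

  void setEcon1(uint8_t mask) {
    econ1 |= mask;
    if (mask & ECON1_TXRST) stalled = false;   // reset pulse recovers the logic
    if ((mask & ECON1_TXRTS) && !stalled) {    // healthy logic sends and clears TXRTS
      ++packetsSent;
      econ1 &= static_cast<uint8_t>(~ECON1_TXRTS);
    }
  }
  void clrEcon1(uint8_t mask) { econ1 &= static_cast<uint8_t>(~mask); }
  void clrEir(uint8_t mask)   { eir   &= static_cast<uint8_t>(~mask); }
};

// Old order: request transmission first, then clear the wrong bit (TXRTS).
void sendFaulty(FakeEnc& enc) {
  enc.setEcon1(ECON1_TXRTS);
  if (enc.eir & EIR_TXERIF) enc.clrEcon1(ECON1_TXRTS);
}

// Corrected order: pulse TXRST and clear TXERIF first, then request transmission.
void sendFixed(FakeEnc& enc) {
  if (enc.eir & EIR_TXERIF) {
    enc.setEcon1(ECON1_TXRST);
    enc.clrEcon1(ECON1_TXRST);
    enc.clrEir(EIR_TXERIF);
  }
  enc.setEcon1(ECON1_TXRTS);
}
```

In this model, a chip with a pending transmit error never sends anything through `sendFaulty`, while `sendFixed` recovers and sends the packet. Real hardware is more complicated, so treat this only as a way to reason about the ordering.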
Please let me know if this solved your problem!
I already contacted Guido Socher (tuxgraphics.org), Jean-Claude Wippler (EtherCard), Norbert Truchsess (UIPEthernet), Pascal Stang (AVRLib), Stephen Early (NanodeUIP) and Jonathan Oxer (etherShield) about this and asked them to verify my findings and, if necessary, to update their libraries.
Of the above libraries, UIPEthernet and etherShield are the ones that have the wrong implementation of this fix. The others have implemented it correctly but with the while() statement, which in my opinion could lead to deadlocks.
With best regards,
Magnus Kasper, Sweden
-
Great finding Magnus!
@ntruchsess has been seen on this forum lately.
-
Thanks hek!
Does anyone have a valid email address for Stephen Early? I tried the one I found on GitHub but am getting a 'Failure Notice'.
-
@MagKas I did investigate a bit - the implementation in UIPEthernet was based on an older version of the Silicon Errata. Rev B7 has clarified this issue in more detail; in fact, Issue 13 contains pseudocode that should also solve your 'deadlock' on 'while (readOp(ENC28J60_READ_CTRL_REG, ECON1) & ECON1_TXRTS)'
The issue is that TXRTS may never be cleared by the transmission logic after packet transmission, so the while loop will never exit. As a workaround the code should wait for either TXIF or TXERIF to be set.
Here is my code that I just committed to UIPEthernet (https://github.com/ntruchsess/arduino_uip/blob/fix_errata12/utility/Enc28J60Network.cpp#L233):
// Reset the transmit logic problem. See Rev. B7 Silicon Errata issues 12 and 13
writeOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRST);
writeOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRST);
writeOp(ENC28J60_BIT_FIELD_CLR, EIR, EIR_TXERIF | EIR_TXIF);
// send the contents of the transmit buffer onto the network
writeOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRTS);
// wait for transmission to complete or fail
while (((eir = readReg(EIR)) & (EIR_TXIF | EIR_TXERIF)) == 0);
writeOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRTS);
The retransmission logic that is described in Issue 13 is implemented outside of the sendPacket method. On a transmission error it returns false, the packet will not be freed in UIPEthernet::network_send(), and transmission will be reattempted on the next call to UIPEthernet.tick().
I also added a fix that allocates the 7 bytes of the transmit status vector to prevent corruption of other outstanding packets. The code is in branch 'fix_errata12' (https://github.com/ntruchsess/arduino_uip/tree/fix_errata12). Maybe you'd like to test before I release?
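The "keep the packet until it really went out" idea can be sketched roughly like this. This is not the UIPEthernet code: `RetryingSender`, `SendFn`, `queue` and `pendingCount` are hypothetical names, and the send function is injected so the example runs on a PC instead of the chip.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the driver call: returns true when the packet
// went out, false on a transmit error.
using SendFn = std::function<bool(const std::string&)>;

class RetryingSender {
public:
  explicit RetryingSender(SendFn send) : send_(std::move(send)) {}

  // Hand a packet over for transmission; it stays queued until it is sent.
  void queue(const std::string& packet) { pending_.push_back(packet); }

  // One pass of the main loop: try the oldest packet and free it only if
  // the send succeeded. On failure it is simply tried again on the next tick.
  void tick() {
    if (pending_.empty()) return;
    if (send_(pending_.front())) pending_.pop_front();
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  SendFn send_;
  std::deque<std::string> pending_;
};
```

A caller would queue packets and call `tick()` from its main loop. The trade-off is that a packet which keeps failing blocks the ones behind it, so a real implementation would probably want a retry limit.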
- Norbert
-
This looks great, thanks Magnus.
I am currently testing one of my projects using the fix_errata12 branch of UIPEthernet. It has been running for more than an hour with no problems, while it previously froze very often.
Hopefully this solves my enc28 problems.
Thanks
/Fredrik
EDIT:
Froze after three hours or so. Still responds to ping, but not tcp connections.
-
@frol thank you for the feedback. If it still responds to ping, that indicates the enc28j60 transmit logic does not stall and the changes in the fix_errata12 branch seem to work. I think the freeze must have a reason that is unrelated to this low-level patch. What sketch are you using for testing?
-
@ntruchsess (Sorry for the delay..)
That's kind of what I expected; it didn't seem right that it wasn't completely hung. I tried to reproduce it using a simpler sketch, and couldn't get to the "responds to pings" state. But several times my new simple sketch has hung completely in sendPacket() waiting for the send to complete. To help debug it, I set the ENC28J60DEBUG define and added a timeout in sendPacket() after 10 seconds. To me this looks like what is described in errata 12, but the workaround implemented in the branch looks correct, so I don't know.
My sketch is available here. Do you have any tips on what I could try to understand this better?
https://github.com/frolswe/uip_debug
-
@frol Thank you for the data.
Well, it looks as if the workaround for errata 12/13 doesn't work as described. TXERIF is never set in the 4 uip_debug logs :-(, hence the timeout. (BTW: it would make sense to return false in case the timeout occurs so a packet might be retransmitted in that case. As TXERIF is not set, sendPacket returns true even though an error occurred.) The strange thing is that even if a packet is not transmitted (and that is not detected), the next outgoing packet will reset the transmit logic anyway - but this doesn't seem to re-enable transmission, does it? (Any output truncated after the timeout?)
Maybe one should poll ESTAT for TXABRT instead of TXERIF (or both at the same time)?
- Norbert
-
I tried it with a sketch. After approx. 4 hours the arduino disappeared. No ping, no TCP communication. There has to be a different problem...
-
@ntruchsess I'll try checking ESTAT as well (probably tonight). There is no truncated output since I deliberately stopped with a while(1); whenever the timeout happened. I can try to return false instead and see if everything recovers.
/Fredrik
-
I've been running fix_errata12 since the day after release without any freeze or lost connection. (Maybe one or two restarts for other reasons.) I only have some inclusion mode issues, but they are probably not related to this. (Using VeraLite UI5.)
-
@m26872 that is good to hear, so it is possible to get the enc28 stable. Maybe I should try with a different switch / network environment.
Did you have the problem with hangs prior to the fix_errata12 branch?
-
Well, to me it was actually @ntruchsess' latest UIP release prior to fix_errata12 that did the real fix. I had stopped using my Enc-Uip-gw a few weeks before due to increasingly frequent hang-ups (from weekly/daily to hourly). (I wrongly thought then that it was a memory issue and planned to try without a bootloader.) Unfortunately I don't remember my uptime before fix_errata. I just know it was more than enough to conclude that (one) problem was solved.
I'm using this Nano and this Enc28j60. I have 10 nodes with 1-3 sensors. Mean update period ~2min. Datamine 40 channels to NAS.
-
@m26872 that sounds interesting, what kind of sensors do you use? Also, how do you power the enc28? I'm using a 3.3V Mini Pro and this enc28. Vraw on the Arduino and Vcc on the enc28 breakout are connected to 5V from the USB adapter. Currently my sensors (DS18B20) are connected directly to my computer for logging, but my plan is to connect them via this Arduino instead.
-
Shouldn't the ENC28J60 be connected to 3.3V? I know it is 5V tolerant, but still. Maybe it makes a difference?
@frol Have you been able to run some tests with ESTAT?
-
@frol Then maybe you should try a Nano instead. I tried really hard to get it working with my Arduino Pro Minis (3.3V, 5V or both, I don't remember) at first, but with the Nano it worked right away. I know it sounds silly... and I can't explain why, maybe a power issue? Now the Nano is powered from a "wall wart" to its USB and the enc module from the Nano's 5V pin.
My nodes look like this with some variations. Sensors used are mostly DS18B20 (one or more per node), DHT22, BMP180, and digital switch.
@Thomas-Ihmann My enc module (linked above) has "5V" printed on the pin and that's what it wants. I've tried 3.3V but then the power LED and everything else is dead.
-
@ntruchsess I now added polling of ESTAT also. When the timeout happens nothing different is seen (no TXABRT). I do see some TXERIF in my logs, seemingly caused by collisions, and it looks like they are handled correctly. I pushed new logs and diff to github, and I would be very grateful for any more tips.
Currently I am testing returning false when the timeout happens, as you suggested. I'll let you know the results tomorrow.
@Thomas-Ihmann the enc28 chip can only handle 3.3V, but the breakout board I use has a dedicated 3.3V regulator.
@m26872 maybe I should try a different Arduino.. But I still believe these should work, so I won't give up yet. I also suspected power issues before, but now I have one 3.3V regulator for the Arduino and one for the enc28, so I don't think that should be an issue anymore.
Nice sensor node, very similar to the design I had in mind for mine (even the same box etc..) (translated from Swedish)
-
Tut tut.. no Swedish
-
I'm using Arduino Nanos and they are still losing connection regularly
-
@Thomas-Ihmann About how often? ~4hrs every time? Does it come back by itself and if so what's the downtime?
-
After 2-8 hours. After that I have to repower. Before that it doesn't respond to Ping.
-
@ntruchsess after I added the timeout my sketch has been running for almost 24h without crashing. The timeout has been triggered 19 times or so, and the library has recovered nicely. I committed my changes to my fork of the repo:
https://github.com/frolswe/arduino_uip/tree/fix_errata12
I don't really think this is a fix to the problem. The hang should not happen at all, but maybe we can use it as a workaround.
@Thomas-Ihmann please try this branch, hopefully your issue is the same as mine
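For readers who want the gist of a timeout around the wait loop without digging through the fork, here is a rough off-target sketch in C++. It is not the code from the branch: `waitForTx`, `TxResult`, and the injected `readEir` / `nowMs` callbacks are hypothetical (on an Arduino they would presumably be the driver's EIR register read and millis()), and the mask values are the ones I read from the ENC28J60 data sheet.

```cpp
#include <cstdint>
#include <functional>

constexpr uint8_t EIR_TXIF   = 0x08; // transmit done interrupt flag
constexpr uint8_t EIR_TXERIF = 0x02; // transmit error interrupt flag

enum class TxResult { Sent, Error, TimedOut };

// Poll EIR until the chip reports success or error, but give up after
// timeoutMs so a stalled transmit logic cannot hang the whole sketch.
// readEir and nowMs are injected (assumption) so this runs off-target.
TxResult waitForTx(const std::function<uint8_t()>& readEir,
                   const std::function<uint32_t()>& nowMs,
                   uint32_t timeoutMs) {
  const uint32_t start = nowMs();
  for (;;) {
    const uint8_t eir = readEir();
    if (eir & EIR_TXERIF) return TxResult::Error;
    if (eir & EIR_TXIF) return TxResult::Sent;
    // Unsigned subtraction keeps working when the millisecond counter wraps.
    if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
      return TxResult::TimedOut;
    }
  }
}
```

A caller could treat both `Error` and `TimedOut` as "not sent" and return false from its send function so the packet gets another attempt later. Note that this is still a busy wait, so everything else in the sketch stalls for up to `timeoutMs`.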
-
@frol I will try it as soon as I am back at home, probably tomorrow evening. At the moment I am abroad without access to the Arduino..
-
@frol Thank you for the data. The bad thing is that TXABRT is not set when the timeout occurs. I have to give this workaround a second thought. I hope I can somehow avoid the 1 sec busy wait. It's not about the 1 sec of having unresponsive ethernet, but it stalls any other processing during that time as well.
I merged your pull request so others may test it easily.
- Norbert
-
@ntruchsess Thank you very much. I have uploaded the new version to the three Arduinos just now. So we just have to wait and see....
-
All Arduinos: Running for almost 12 hours, looking good so far. I will keep you updated.
Update: 18h and no lost connection
Update 2: 24h and no lost connection, looking good
-
@Thomas-Ihmann that sounds great, thanks for reporting back
I'm not so lucky. But I am pretty sure my current problems are unrelated to this bug. I haven't had time to debug it further though.
-
I was unlucky as well: after 25h I got a disconnection with one Arduino, though that is the one which used to disconnect after a few hours (I am using the Arduinos in connection with FHEM). Any ideas for further investigation?
-
Hmm..., guess mine isn't perfect after all.
At 2.30 last night my sensors had stopped reporting. After I repowered the gateway and reloaded vera-luup it was up again. Both Vera and the gateway had been running continuously for at least a week before this happened, and it's not impossible that it's related to my use of the Datamine plugin prior to this (due to the datamine-to-nas logging). I'll keep a close eye on it from now on..
-
@frol, @Thomas-Ihmann, @m26872
My sensors are still working after I changed the code according to my own findings and solution from the first post. That means I don't have a while() loop which could hang the whole system. In my case I probably don't always get feedback if the packet transmitted OK or not and I probably will lose some packets too. But my gateway never hung itself either :).
I might have some hints for debugging you guys may or may not already have checked.
- Don't just reset the gateway, try other things first.
- Like pinging the gateway
- If you can ping the gateway, try Reload(ing) the Vera system which also will restart the MySensors-plugin and re-establish the connection. Check if this helps.
- Make sure the sensor(s) are still working, start the serial monitor on the gateway and check for incoming messages (might add some debugging code for that)
- DON'T activate the DEBUG-flag in MyConfig.h because this definitely will 'break' the gateway.
- In case you are using a pressure-sensor BMP085/180 make sure NOT to use the sample() function because this will/might make your sensor hang after 180minutes.
I didn't test Norbert's fix myself, I might do this when I feel like it... I have my system running perfectly now and am logging with DataMine, so I would rather not f... this up.
-
I discovered that if I power on my Vera and Enc-gateway without my nas connected (but it's still mounted to be used by datamine) the gateway will no longer respond to ping.
@MagKas I will try your while-loop solution... when I feel like it.
-
@MagKas great your sensors are working correctly.
My sketch to debug this is basically a loop that each second opens a tcp-connection to my server, sends a timestamp, and closes the connection, so not much else is involved. I think my current problem is that not all code paths in the library calling sendPacket() correctly handle sendPacket() returning FALSE. The reason I say this is that I can see the memhandle used when calling sendPacket() increasing when FALSE is returned in some cases. Eventually the library runs out of available memhandles and my sketch stops working. (This is mostly speculation, as I haven't debugged / understood it enough yet.)
Is the error return value in sendPacket() needed?
Even if the packet is sent correctly, it may be dropped by the network, thus the library still needs to handle lost packets. Knowing that a packet failed to send is just a special case of lost packets that I don't think really needs to be handled separately. @MagKas' solution without the while-loop basically does this by returning OK even when the enc chip's transmit logic freezes and doesn't send the packet, and that appears to work. The enc chip will always be reset before sending the next packet, so it won't hang in any strange state. (I can imagine one problem when sending large packets and re-entering sendPacket() before the previous packet has finished transmitting. I don't know if this actually happens in real life.)
Maybe @ntruchsess or anyone else can educate me as to why the return value is needed or why it is a bad idea to remove it?
-
For TCP you are right - the boolean return value is not required, as the library does not free memory until a packet is acknowledged or the connection is closed. I guess I can remove some code from the lib here. UDP does not retransmit and would lose packets just because of collisions. UDP should lose packets only when they time out or get dropped due to physical failure.
-
Hi Everyone,
I wanted to report my experience here as I thought you all would probably be able to understand it far better than I can, and maybe even find it interesting. I'm a newbie, so bear with me...
For the longest time, I was using the UIPEthernet library 'improperly'. I wanted to open a connection and keep it open (indefinitely if necessary), and "stream" my data to my program - a long string of text parameters, about 200 bytes. The easiest way to do this seemed to be to simply not disconnect. Using the basic examples as a starting point however, this results in the need to "ping" the Arduino every time you want data, resulting in a perpetual game of "ping/pong". The connection never actually closes, but using client = server.available(); to call your actions causes this I guess, since it only returns 1 if there is data waiting to be read. This is what all of the examples seemed to use, so I thought it was the right way. Adding fuel to the flames, I found a blog that seemed to call this a bug, and showed how to fix it with the stock Ethernet library, so I thought I was on the right track. I wanted to use UIPEthernet though, and the ping/pong scheme seemed workable, if kludgey, so I implemented it. I had it working for a long time - a couple of months - without any trouble overall. I mean, I could open a telnet session with PuTTY and have it sit there open for days and randomly "ping" the Arduino and get a packet.
However, I just knew it didn't seem right, and like I said it was kludgey, made all of the other things I was having the Arduino do all the more difficult. I wanted to fix it, to the point of making a fool of myself asking about it on the Arduino forums.
So anyway, after studying some other examples for a while, and reading some other things, I figured out the "proper" way to do things - store your (up to 4 connections) in an array; that way you can check them for (dis)connections, send data to individual ones, etc. Elementary I assume, but like I said .. you gotta start somewhere. Fantastic! No more pinging!
Except now I've lost the stability I had. Couldn't keep a connection for more than 12 hours it seemed, let alone the days and weeks I had before. Frustrated, I basically dropped the project for a couple of weeks. The other day I decided to pick it up in earnest again. I couldn't find anything wrong with my code after going over and over it, so I hit google again..
This thread came up, and seemed pertinent. I installed the fix version of the library, and what do you know - stability. Been almost 48 hours now. Nothing else has changed, so I'm going to call it fixed. I don't really know what to make of all that, but I find it interesting that I had such great stability using the first method overall, even though it was "wrong".
I want to thank everyone for their efforts, especially Norbert for the superb support of his library. It truly makes it a pleasure to use. I can only hope I can contribute to the general community on such a high level one day.
Edit: 72 hours+ and still going strong.
-
Using official Arduino Mega ADK via proper 3.3V level shifter. Powering ENC with 3.3V from the same Arduino.
ENC stopped responding to pings after just one night. The main program continues to work.
The main program polls 1-wire temperature sensor, displays the temperature to an SPI display and controls another device using standard digital outputs.
I strongly suspect that the problem is in Errata 14.
If I have some time (which is not likely to happen any time soon) I'll look into it and will try to fix it for the EtherCard library, as it is smaller and I will need all that stuff to fit into a Pro Mini based on the ATmega 328.
-
Hi!
Was any fix applied to the 1.4.1 EthernetGateway?
I just built it based on an ATmega128A (I need more flash to implement DHCP later).
It is running without any issue for a day.
-
ha!
Everyone has to take care of the power source. I had to measure the supply current and added a multimeter. This probably adds a very small voltage drop, but it was more than enough for ping to start losing packets! The chip is very sensitive to the power source. After adding just 0.2V, pings stopped being lost.
-
@axillent said:
After adding just 0.2V, pings stopped being lost.
You're saying that you got it back to life without doing a reset or interrupting the supply first?
-
@m26872 said:
You're saying that you got it back to life without doing a reset or interrupting the supply first?
I'm not looking for extreme recovery abilities :)
My steps were:
- multimeter added to measure current between the +3.3V source and the VCC pin on the ENC shield
- power on
- result: many pings were lost
- power off
- 3.3V source was turned up by 0.2V
- power on
- result: no pings were lost
-
So glad to have found this thread! Sounds like the exact problem I've been having. Ethernet gateway, with enc28j60 shield would randomly stop responding to Vera, to telnet (on 5003) and to pings. Sometimes in hours, occasionally making it a few days. Looking at my sensors, the radio stack is still working fine, the Ethernet side of the GW is just a dead stick.
I replaced my UIPEthernet files with the ones from GitHub above, disabled DEBUG and UDP to get it to fit, re-compiled, and re-deployed. Will report back how it goes!
BTW - the statement above says: "DON'T activate the DEBUG-flag in MyConfig.h because this definitely will 'break' the gateway." I had to turn off DEBUG to fit the code on my UNO anyway, but I'm curious as to why having DEBUG on breaks gateways? (I had DEBUG on in my previous version with the old UIPEthernet files. Did it contribute to my issue?)
-
My Mega app was surviving maybe as much as 5 minutes using the UIPEthernet library for Arduino >= 1.5. I made the changes as you suggest, noting in particular that the reset has to come first. It's now been running flawlessly for 48hrs. I have great hopes for this continuing.
I don't understand why this is still not in the main distribution after so long.
Anyway many thanks for posting this solution. It's a lifesaver.
-
Thank you for this post! Saved the day!
Just for information, I'm using EtherShield, with this lib: https://github.com/jonoxer/etherShield/blob/master/enc28j60.c
(it has ugly limitations, but has the smallest memory requirement in my case)
The solution fits almost exactly, with some small modifications (around line 290):
if( (enc28j60Read(EIR) & EIR_TXERIF) )
{
  enc28j60WriteOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRST);
  enc28j60WriteOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRST);
  enc28j60WriteOp(ENC28J60_BIT_FIELD_CLR, EIR, EIR_TXERIF); // Might be overkill but advised by Microchip Errata point 12, //MagKas 2014-10-25
}
Thanks again!
-
Hi! Thanks for the topic! I'm new on this forum and I'm using the enc28j60 with the UIPEthernet lib. I get a lot of hang-ups in my sketch.
Can somebody tell me which is the best adapter/library combination to make a reliable client-server Arduino based sensor?
enc28j60 or WIZnet W5100? UIPEthernet, EtherCard or EtherShield?
The proposed solution is to modify enc28j60.cpp at about line 215, replacing
Enc28J60Network::sendPacket(memhandle handle)
.....
if( (readReg(EIR) & EIR_TXERIF) )
{
writeOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRTS);
}
...
}
by
if( (enc28j60Read(EIR) & EIR_TXERIF) )
{
enc28j60WriteOp(ENC28J60_BIT_FIELD_SET, ECON1, ECON1_TXRST);
enc28j60WriteOp(ENC28J60_BIT_FIELD_CLR, ECON1, ECON1_TXRST);
enc28j60WriteOp(ENC28J60_BIT_FIELD_CLR, EIR, EIR_TXERIF);
}
Is this correct?
Sorry for my English. Thanks again. Daniel
-
@amoarg69 AFAIK the Enc28J60 has no stable setup yet. Use the W5100.
-
@m26872 Thanks!
Which library is better to use? Does UIPEthernet work well with the W5100? I need a library that can handle client/server requests.
Thanks very much!!!
-
@amoarg69 The W5100 is also the Arduino standard Ethernet shield and hence works well with the Arduino standard Ethernet lib. Server/client examples for the lib are provided through the Arduino IDE.
-
I chose to reply to this very old topic because I updated to the latest UIPEthernet and this issue was still happening to me. I was about to replace the chip with a W5100 but came across this thread and tried the fix: compiled, uploaded, and here it goes. 24 hours of continuous web serving using an Arduino Mega 2560 and still counting!
I wish this would get incorporated into UIPEthernet
-
@mhdayusuf Do you have a copy of the changes you made? I am really struggling to make it work and it is not obvious to me what changes you made... Thanks.