Posts made by GaryStofer

GaryStofer

First let me say that the test "network" consists of a single node and a Serial GW with a Domoticz controller only. No repeater nodes are present, and it operates at 250Kbd on a frequency outside of the WiFi spectrum that I verified to have no traffic. The test subjects are within 10 feet of each other and the protocol sniffer receiver is placed roughly midway between the two communicating nodes. The test node has a pluggable connector for the RF modules so that I can switch them out separately. The GW has the LNA/PA module and the node is using the standard modules. Both GW and node are set to "High" power.

Looking at the problem with the promiscuous protocol sniffer and Wireshark, I can clearly see that packets collide mid-air when both the node and the GW are transmitting at the same time. This happens when the controller/GW talks to the node while having the protocol ACK request enabled and the node does not have the extra wait/delay programmed.

Going through 12 RF modules on the node side, I can see that about 1/3 show the problem consistently, to the extent that all 16 retransmissions are used up, 1/3 show it to various degrees, and the rest show no problems at all. Before you say "ohh fake chips etc.": these modules all came from the same batch, so more than likely it is not a matter of fake versus non-fake, but rather some parameter for which the lib assumes too tight a margin.

When turning off the protocol level ACK via the controller/GW, the package goes out without a request for the node to send back the package; only the HW ack remains in the picture. Set up like this, I can see that the third of the modules that cause the persistent collisions evoke a re-send of the package by the GW after ~2 to 2.5 ms, while the group that shows no problem shows one single transmission from the GW only. Since the sniffer is not fast enough to capture the quick turnaround (spec says 130 us) for the short HW ack to go out, I cannot say whether the node is simply not sending it or just not sending it quickly enough. All I can say is that the GW code did not see a HW ack and therefore sends out the package a second time.
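To put the ~2 to 2.5 ms figure in perspective, here is a minimal sketch that estimates how long one nRF24L01+ frame occupies the air. It uses the Enhanced ShockBurst frame layout from the datasheet (1 preamble byte, 3-5 address bytes, 9-bit packet control field, 0-32 payload bytes, 1-2 CRC bytes). The function name is made up for illustration, and the 5-byte address and 2-byte CRC used in the comment are assumptions, not values read from this setup.

```cpp
#include <cstdint>

// Rough on-air time of one nRF24L01+ Enhanced ShockBurst frame, in microseconds.
// bits = 8 * (preamble + address + payload + CRC) + 9-bit packet control field.
// airTimeUs is a hypothetical helper, not part of the MySensors library.
constexpr uint32_t airTimeUs(uint32_t payloadBytes,
                             uint32_t addressBytes,
                             uint32_t crcBytes,
                             uint32_t bitsPerSecond)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(8u * (1u + addressBytes + payloadBytes + crcBytes) + 9u)
         * 1000000u) / bitsPerSecond);
}

// Assuming a 5-byte address and 2-byte CRC at 250 kbit/s:
//   a full 32-byte payload takes airTimeUs(32, 5, 2, 250000) microseconds,
//   an empty HW ack frame takes  airTimeUs(0, 5, 2, 250000) microseconds.
```

Under those assumptions an empty HW ack is only a few hundred microseconds long, which is consistent with a slow sniffer not catching it.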

Meanwhile on the node side all is well; it just might get the same package twice, and since no status changed there is no harm done. However, when you now turn on the protocol level ACK request in the controller (as you should), the node has to turn around and send out the echo package, which then collides with the resend of the original package from the GW. The node also doesn't get a HW-ACK because of this and tries to resend the echo package over and over until it runs out of tries.

When I then introduce the wait/delay in the code above to wait out the 2.5 ms for the GW to resend the initial package, this ACK/resend storm goes away because the air is clear when the node sends its echo package. If I wait/delay for 2 ms the problem is greatly reduced, and if I wait 3 ms I see no more evidence of collisions. Either wait or delay works; however, I chose delay() to be positively sure that no other communication is started during that time by the node.
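The reasoning above boils down to an interval check: the echo is safe only if it starts after the GW resend has left the air. A minimal sketch of that check follows. All names are hypothetical, time zero is whatever reference the caller picks, and any numbers fed into it are illustrative rather than measured values.

```cpp
#include <cstdint>

// True if two transmissions [startUs, startUs + lenUs) overlap in time.
// Hypothetical helper for illustration only.
constexpr bool burstsOverlap(uint32_t aStartUs, uint32_t aLenUs,
                             uint32_t bStartUs, uint32_t bLenUs)
{
    return aStartUs < bStartUs + bLenUs && bStartUs < aStartUs + aLenUs;
}

// True if the node's echo, started echoStartUs after the reference point,
// would hit the GW resend that starts at resendStartUs.
constexpr bool echoCollides(uint32_t echoStartUs, uint32_t echoLenUs,
                            uint32_t resendStartUs, uint32_t resendLenUs)
{
    return burstsOverlap(echoStartUs, echoLenUs, resendStartUs, resendLenUs);
}
```

For example, with an assumed resend occupying 2000 to 2600 us, an echo starting at 1800 us overlaps it, while one starting at 2600 us or later does not. The real turnaround of the node is not known here, so this only illustrates why waiting past the resend clears the collision.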

I conclude that, as a workaround, this solves the issue we are having with the RF24 transport. Without further investing in faster RF sniffing equipment I can't say for sure whether this could be solved better on the TX side by waiting longer for the HW ack to arrive from the receiving node, but I think there would be a good chance of that.

There is also a chance that the PA/LNA module used on the GW could be part of the problem by not being able to switch back to RX mode quickly enough and therefore losing the HW-ACK for nodes that respond quicker than others.

Edited: Checked with a non PA/LNA module on the GW and no change could be observed, so this theory can be discounted.

BTW: All of this is very reminiscent of an email communication in 2015 I had with Ekblad !!

GS

GaryStofer

@skywatch yes, wait() works too. My thinking for using delay() was that I did not want any possibility of an incoming message being processed during that time, so therefore delay instead of wait. But I did not do any further digging in the code to assess such risk, or do any stress testing with the node also being a repeater, etc. I figure the implementers of the lib would know what's best when implementing a fix. It's also possible that the root cause lies in the Serial GW code if other gateways do not have this problem. It could be that it just takes the GW code a little too long to switch into RX mode to catch the ACK. I figured @hek or @Yveaux would get to the bottom of it and know what to do. Do they still read the forums or do I need to do a pull request on the GIT?
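The difference between the two calls can be modelled off-target. The sketch below is a host-side model, not library code: FakeClock, blockingDelay and pumpingWait are hypothetical names, and the callback stands in for whatever message processing a wait()-style loop performs while time passes.

```cpp
#include <cstdint>
#include <functional>

// Stand-in for a millisecond clock; each call advances time by 1 ms.
struct FakeClock {
    uint32_t nowMs = 0;
    uint32_t millis() { return nowMs++; }
};

// delay()-style: time passes, nothing else is handled.
// Returns the number of process() calls made, which is always 0.
inline uint32_t blockingDelay(FakeClock& clock, uint32_t ms)
{
    const uint32_t start = clock.millis();
    while (clock.millis() - start < ms) {
        // busy wait, no message processing
    }
    return 0;
}

// wait()-style: keep running the message pump until the time is up.
// Returns the number of process() calls made.
inline uint32_t pumpingWait(FakeClock& clock, uint32_t ms,
                            const std::function<void()>& process)
{
    const uint32_t start = clock.millis();
    uint32_t calls = 0;
    while (clock.millis() - start < ms) {
        process();   // incoming messages could be handled here
        ++calls;
    }
    return calls;
}
```

In this model the pumping variant gives incoming traffic a chance to be handled (and possibly answered) during the pause, which is exactly the possibility the delay() choice rules out.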

GS

GaryStofer

When communicating from the gateway to a node there seems to be no delay, or an insufficient one, for the TX/RX turnaround time. This causes the GW not to fully receive the incoming ACK message and therefore fail the transmission. With a protocol sniffer it can be observed that the receiving node sent out the ACK message immediately, but otherwise properly.

The mere inclusion of MY_DEBUG serial print messages on the receiving node "solves" the issue by providing a slight delay between reception and subsequent transmission of the ACK/ECHO package, allowing the transmitting node to switch from transmission to reception properly.

A delay of 3-4 ms in file MyTransport.cpp / line 703 in the code block of

if ( msg.getRequestEcho()) {
     TRANSPORT_DEBUG(PSTR("TSF:MSG:ECHO REQ\n"));
#ifdef MY_RADIO_RF24
     delay(4);
#endif
....
...
..
}

Solves the issue of the lost ack messages.

This has been tested on multiple RF24 modules -- With and without LNA/PA both on the gateway and node sides.

I suggest adding the above code to the code base so that other people don't have to go through days of debugging to find this problem again.

See the post further down for details and explanation.

G.S.

GaryStofer

@mfalkvidd No, the issue with sleep(0) still exists in V2.3.2. That is the version that the Arduino Library installer currently installs. -- The workaround is to call hwPowerDown(WDTO_SLEEP_FOREVER) as in the above code if you want the battery to last more than a day.

GS

GaryStofer

Solved by adding a short delay between the RX/TX turnaround when replying with an ACK message.

In file MyTransport.cpp at line 706, in the "if (msg.getRequestEcho())" block, right after the debug message "TSF:MSG:ECHO REQ", I added a 2 ms delay to allow the caller to switch from TX to RX mode.

GS

GaryStofer

@ferro, @monte, @hek, @Yveaux
I guess I have not mentioned this before:
When turning on MY_DEBUG or MY_DEBUG_VERBOSE_RF the ACK problem goes away. The debug trace looks as expected and the GW is happy with what it gets back from the node in terms of ACK. No error message .... turn debug off and the problem reappears.

This is why I think that there is a timing issue with the RX/TX changeover time. Adding the Serial prints slows the RX/TX changeover down enough for the GW to be able to catch the ACK package and be happy.

If Hek or Yveaux is not going to put his hat in the ring here, I guess I have to either run with debug permanently on -- heck of a mess -- or dive down and debug the library myself.
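As a rough plausibility check on how much time a debug print can add, the sketch below computes how long a UART needs to shift out a string at 8N1 framing (10 bits per byte). The helper name is hypothetical and the 115200 baud used in the comment is an assumption about the node's serial setting. Note also that the Arduino core buffers serial output, so a print only blocks for this long once the transmit buffer is full; treat the result as an upper bound per line.

```cpp
#include <cstdint>

// Time to shift `chars` bytes out of a UART at 8N1 (10 bits per byte),
// in microseconds. serialTimeUs is a hypothetical helper for illustration.
constexpr uint32_t serialTimeUs(uint32_t chars, uint32_t baud)
{
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(chars) * 10u * 1000000u) / baud);
}

// Assuming 115200 baud, the 17 characters of "TSF:MSG:ECHO REQ\n"
// take serialTimeUs(17, 115200) microseconds to leave the UART.
```

At the assumed baud rate a single short debug line is on the order of a millisecond or more, which is in the same range as the 2 to 4 ms delay that fixes the problem.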

GS

GaryStofer

@ferro The systems use the Serial Mysensor GW -- some connect via USB some connect directly to the serial port on a Pi.

I can't change the controller to anything else, not that I think it is controller related -- The customer is happy with Domo and has many man hours invested in scripts and interfacing.

ACK, when transmitting from node to GW/controller, seems to work fine, both HW-ACK and "protocol" ack. ACK from GW/Controller to node seems to be flat out broken, but I can't tell yet if the node is not sending out, or the GW is not receiving it properly.

Also noticed that the GW code is not using the IRQ line from the radio, I see the SPI clock running on the GW constantly.

Thanks

Also

GaryStofer

@monte
Yes --- I know why it "Works" without ACK enabled -- That's not the point. But of course one needs to have the ACK working on the GW otherwise the controller and Node can get out of sync status wise.

Yes -- all of the above you mentioned was tried -- It's not a HW problem, nor RF interference, nor a range problem -- Spectrum analyzer shows no other traffic on the chosen freq.

Like I said -- All this worked under 1.5 reliably -- with the same HW -- Flawlessly, even on a somewhat noisy RF channel.

GaryStofer

@ejlane No -- Not every time -- Just when I need it ---

GaryStofer

I am working on upgrading a couple of existing MySensors 1.5 networks to the latest version and am again running into this behavior of getting the dreaded error message: "Error sending switch command, check device/hardware !" from the Domoticz UI when trying to send commands from the controller to Relay nodes. Even just using the relay example from the lib shows the problem. I am convinced that this is not a common HW issue as often blamed. The issue has been verified with a multitude of HW and I see no other communication issues such as range etc.

So, the problem shows up on any message sent from the controller to a node. For testing purposes the network has only the Gateway and one Relay node. When clicking on the Lightbulb device to switch the relay, most times, but not always, you see the error message after the timeout, even though the command was received and handled by the device right away. You also see that the controller sends the command a second time when this happens (about 90% of the time). No change in network frequency or phase of the moon makes a difference here.

Now here is what I found that did make a difference:

When I go on the Domoticz UI under "Setup/Hardware" and drill down through the Gateway "Setup" link to the Node and then select the child under scrutiny, I see at the bottom of the page an On/Off selector labeled "ACK" together with a timeout value of 1200. If I turn this "ACK" off then the node works flawlessly -- a million times back and forth, and even through power failures of the gateway/controller. I see only one command sent and no error messages ever.

Upon further investigation I found that right after initial inclusion of a node this "ACK" feature seems to be off. The node works without retries or error messages, but only until you go and update it, with a name for example; then the ACK feature gets turned on by default.

Now the question to @hek is: What is this "ACK" feature doing to, or expecting from, a node? Is it looking at the HW ACK from the NRF24 radio, and is that what is getting missed by the GW most of the time? Is there some sort of TX/RX turnaround time that is possibly too short? Or is this based on some response packet the node is expected to send from its receive function? I did not see anything mentioned in the examples or the .h files to that effect.

When checking the HW-ACK from the Node to the GW ( in the node sketch ) I find that it works as expected. I only get NACKs when I block the antenna sufficiently to suppress the transmission.

Disabling the "ACK" feature for the node via Setup/Hardware/MyGateway/node/child on Domoticz is a workaround but probably not the desirable solution, especially as long as the problem is not fully understood.

@hek @Yveaux ??

GaryStofer

@ejlane

yes -- EEprom gets erased by code in the before() section

GaryStofer

@Elfnoir @hek I see this old issue is still with us in MySensors 2.3.2 / Domoticz 4.10717.

GaryStofer

@eiten The puzzle is: why does the existing network function without fail under the same conditions, while the new network cannot even connect when the nodes are a foot apart? But then, if they eventually connect, I get the full expected range with no missing acks.

GaryStofer

@skywatch duh !!

GaryStofer

I am upgrading a network that has been running for probably 7 or 8 years now to the latest version of the MySensors Library (2.3.1). The nodes and the serial gateway are purpose-built PCBs that contain a 328P, power regulators and the radio boards. The old network operated on the default ch #76 without any significant problems for years.
Now with the latest v2 Library I have the problem that the nodes fail to connect to the gateway during power up, i.e. network init and presentation fail with NACKs. On the LEDs of the gateway I see that the communication attempts are marked with the RED error LED in unison with the NACKs from the debug output of the connecting node.

Here comes the strange bit: If I eventually get the node to connect with the gateway by putting the node right over the gateway and waiting long enough (minutes) , I can then move the node away across the house and it communicates without a single failure all day and all night long.

At this point there is only one node and one gateway in the picture and the Node's parent is directly set to the gateway address of 0. All "user mode" communication is done with the HW-ACK feature on since we have only a single hop network. Network operates at 250kbits/s and changing power settings do not improve the issue. Nodes use IRQ signaling and I changed RF24.cpp to poll the IRQ line for transmit as well instead of polling status via SPI.

So the question is: What is different during the init and presentation that makes the nodes not connect, while the network functions after all once they are eventually hooked up?

When switching to a channel above about 100, i.e. out of the WiFi spectrum, I get no, or almost no, initial NACK failures.

So what has changed that it can't get through the initial phases of the node connect on a slightly noisy channel, even when mere centimeters apart?
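For reference, the relation between nRF24 channel numbers and the WiFi band can be sketched as follows. The nRF24L01+ carrier sits at 2400 + channel MHz, and 2.4 GHz WiFi channels 1 to 13 are centred at 2407 + 5 * n MHz. The helper names are hypothetical, and the 22 MHz WiFi channel width (centre +/- 11 MHz) is an assumption that ignores wider modes and the nRF24's own bandwidth.

```cpp
#include <cstdint>

// nRF24L01+ RF channel n is centred at 2400 + n MHz.
constexpr uint32_t nrfChannelMHz(uint32_t channel) { return 2400u + channel; }

// Centre frequency of 2.4 GHz WiFi channel n (valid for n = 1..13).
constexpr uint32_t wifiCentreMHz(uint32_t wifiChannel) { return 2407u + 5u * wifiChannel; }

// True if the nRF24 carrier lies inside the span covered by WiFi channels
// 1..highestWifiChannel, assuming each occupies centre +/- 11 MHz.
// All three helpers are illustrative, not part of any library.
constexpr bool insideWifiSpan(uint32_t nrfChannel, uint32_t highestWifiChannel)
{
    return nrfChannelMHz(nrfChannel) >= wifiCentreMHz(1) - 11u &&
           nrfChannelMHz(nrfChannel) <= wifiCentreMHz(highestWifiChannel) + 11u;
}
```

Under these assumptions the default channel 76 (2476 MHz) still falls inside the span of WiFi channel 13 but lies above WiFi channel 11, while channel 100 (2500 MHz) is clear of both.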

GaryStofer

For people who like a clean PCB sensor or serial gateway node based on the Atmel 328P for battery or 5-12V operation, check out the PCB I made here on https://oshpark.com/shared_projects/5RV25Fc0. It incorporates the voltage regulators, capacitors and connector to simply plug in an NRF24 module and be done with it. Soldering is required but not terribly difficult. I usually just depopulate an Arduino nano for most of the parts ...

GaryStofer

@alex28 Hi Alex, sorry for your frustration.

I think documentation, either incomplete or outdated, is the crux of most open source projects.

I cannot speak to the hardware you used to try to get a network up, as I have stuck to the simpler approach of using plain Arduino (Atmel 328P) nodes for sensors, repeaters and gateways alike, while using a RPi as the network controller with something like Domoticz running on it. I started years ago, even before the RFM69 was an option, and made PCBs that incorporate the NRF24 and the Atmel 328P along with the necessary voltage regulators etc. Maybe it was easier to get started then because there were fewer options and less misleading documentation was available, but I don't recall running into any problems worth mentioning or having the level of frustration you have encountered.

I have two sites running with 8 and 12 nodes each. All sensors are running on batteries. The range of the NRF24 is limited in that I only get through one or two sheet rock walls inside the house, but using the NRF24 module with the built-in PA/LNA on the gateway and one repeater opened up the range considerably. I'm fairly positive that all the NRF modules I have are clones....

I use the serial gateway on the Atmel 328P and connect that directly to the serial port of the RPi Zero-W, without USB adapters, then run Domoticz on the Pi to get onto the internet.

If you look on OSH - PCB you will find many good PCBs that you can make MySensors nodes with, using the simpler Arduino platform.

Most of my frustrations stemmed from the Linux configuration for the RPI so that it doesn't clobber the SD card on surprise power failure.

Cheers -- Gary

GaryStofer

@neverdie said in 2AAA battery NRF24 Sensor PCB with ATmega 328p:

Hmmm ... is that accurate? I thought digital LO may be higher than actual GND.

Yes, perfectly accurate. The output stage is a FET switch and the current it switches to GND is ~2 microamps, so it's 0V. One could measure VCC another way that does not need a voltage divider, by configuring the ADC so that it uses VCC as the reference and then measuring the internal 1.1V reference. However, that takes re-configuring the ADC each time into and out of this mode if you need the ADC for other measurements as well, and it doesn't save any pins.
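The arithmetic behind that alternative method can be sketched as follows. With VCC as the ADC reference, a 10-bit conversion of the internal reference gives ADC = Vref_internal * 1024 / VCC, so VCC = Vref_internal * 1024 / ADC. The function name is hypothetical, the ADC register setup is deliberately left out, and the nominal 1100 mV is an assumption since the internal reference varies from chip to chip.

```cpp
#include <cstdint>

// Supply voltage in millivolts, derived from a 10-bit ADC reading of the
// internal reference taken with VCC as the ADC reference.
// vccMillivolts is a hypothetical helper; bandgapMillivolts defaults to the
// nominal 1100 mV and may need per-chip calibration.
constexpr uint32_t vccMillivolts(uint32_t adcReading,
                                 uint32_t bandgapMillivolts = 1100)
{
    return adcReading == 0 ? 0u : (bandgapMillivolts * 1024u) / adcReading;
}
```

A reading of 341 counts, for instance, corresponds to roughly 3.3 V under the nominal reference value.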

GaryStofer

@neverdie
The LDO is only used when running from 5 or 12V supplies. Can't measure 3.3V with a 1.1V reference.