First let me say that the test "network" consists of a single node and a Serial GW with a Domoticz controller only. No repeater nodes are present and it operates at 250Kbd on a frequency outside of the WiFi spectrum verified to have no traffic. The test subjects are within 10 feet of each other and the protocol sniffer receiver is placed roughly midway between the two communicating nodes. The test node has a plugable connector for the RF modules so that I can switch them out separately. The GW has the LNA/PA module and the node is using the standard mudules. Both GW and node are set to "High" power.
Looking at the problem with the promiscuous protocol sniffer and wireshark I can clearly see that packets collide mid air when both the node and the GW transmitting at the same time. This happens when the controller/GW talks to the node while having the protocol ACK request enabled and the node does not have the extra wait/delay programmed.
Going through 12 RF modules node side, I can see that about 1/3 shows the problem consistently to the extent that all 16 retransmissions are used up , 1/3 to various degrees and the rest shows no problems at all. Before you say "ohh fake chips etc." these modules all came from the same batch, so more than likely not fake and non fake, but rather some parameter that the lib assumes to tight of a margin for.
When turning off the protocol level ACK via the controller/GW, the package goes out without an request for the node to send back the package, only the HW ack remains in the picture. Setup like this I can see that the third of the modules that cause the persistent collisions evoke a re-send of the package by the GW after ~2 to 2.5ms , while the group that shows no problem shows one single transmission from the GW only. Since the sniffer is not fast enough to capture the quick turn around ( spec says 130us) for the short HW ack to go out I can not get say whether the node is simply not sending it or just not sending it in quick enough. All I can say is that the GW code did not see a HW ack and therefore sends out the package a second time.
Meanwhile on the node side all is well, it just might get the same package twice and since no status changed there is no harm done. However when you now turn on the protocol level ACK request in the controller ( as you should) the node has to turn around and send out the echo package which then collides with the resend of the original package from the GW. The node also doesn't get an HW-ACK becasue of this and tries to resend the echo package over again until it runs out of tries.
When I then introduce the wait/delay in the code above to wait out the 2.5Ms for the GW to resend the initial package this ACK/resend storm goes away because the air is clear when the node sends it's echo package. If I wait/delay for 2 ms the problem is greatly reduced and if I wait 3 ms I see no more evidence of collisions. Either wait or delay works however, I choose delay() to be positively sure that no other communication is started during that time by the node.
I conclude that as a work around this solves the issue we are having with the RF24 transport, but without further investing into faster RF sniffing equipment I can't say for sure if this could be solved better on the TX side by waiting longer for the HW ack to arrive from the receiving node, however I think there would be a good chance for that.
There is also a chance that the PA/LNA module used on the GW could be part of the problem by not being able to switch quickly enough back to RX mode and therefore loose the HW-ACK for nodes that respond quicker than others.
Edited: Checked with a non PA/LNA module on the GW and no change could be observed, so this theory can be discounted.
BTW: All of this is very reminiscent of an email communication in 2015 I had with Ekblad !!
GS