Best practice for hardware ack and software ack when using a battery node

evb

There are plenty of old and recent topics about the hard- and software ack, but I didn't find a best practice to follow :-(

Let's begin with the documentation

bool send(MyMessage &msg, const bool requestEcho = false)
Sends a message to gateway or one of the other nodes in the radio network

Parameters
* msg: Message to send
* requestEcho: Set this to true if you want destination node to echo the message back to this node. Default is not to request echo. If set to true, the final destination will echo back the contents of the message, triggering the receive() function on the original node with a copy of the message, with message.isEcho() set to true and sender/destination switched.

Returns
Returns true if message reached the first stop on its way to destination.

send(msg.set(value==HIGH ? 0 : 1));

If the method send returns with true, the hardware ack did work.

battery node <==> gateway : we are sure, the message is delivered to the gateway
battery node <==> repeater <==> gateway : we are sure, the message is delivered to the repeater, but further on, we don't know.

If the method send returns with false, the hardware ack did not work, the message is not delivered.
So, what is here the best practice?
Doing this for example in the battery node?

if (!send(msg.set(value==HIGH ? 0 : 1)))
  if (!send(msg.set(value==HIGH ? 0 : 1)))
    if (!send(msg.set(value==HIGH ? 0 : 1)))

3 times or 4 times or 5 times or...
==> a method that retries x times, goes to sleep for a short time, retries again, etc.?
It should be possible to combine this with the normal interrupt-based sleep of the battery door node.

How does the repeater node handle the hardware ack?
Everything is handled by the MySensors library. Is there a way to do the same there when the hardware ack fails?
==> if each node (normal and repeater nodes) auto retries to deliver the message, we are sure that the message gets delivered.
But then another question arises: what if the repeater receives 10 messages from different nodes to forward while it is still retrying to deliver the first? It would have to keep some sort of queue and have enough memory to store the messages...

Is using the software ack better?

send(msg.set(value==HIGH ? 0 : 1), true);

and implement in the receive method the check of the returning message.
But this means that a battery node must stay awake...
What if no return ack message is received after 5 seconds or 10 seconds?
Resend the message? But then we are inside the receive method, and I don't know whether we can send messages from there.
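One possible way to structure this, as a minimal sketch rather than a confirmed MySensors pattern: receive() only records that the echo arrived, and the code that called send() decides whether to resend. The sketch is plain C++ so the logic can be tried off-device. EchoLink, sendWithEcho, sendFn and pollEchoFn are hypothetical names; on a real node sendFn would wrap send(msg, true), and pollEchoFn would call wait() for a short slice and then return a flag that receive() sets when message.isEcho() is true.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the radio layer (not part of MySensors).
struct EchoLink {
    std::function<bool()> sendFn;                      // true = first hop acknowledged
    std::function<bool(uint32_t sliceMs)> pollEchoFn;  // true = echo seen during this slice
};

// Send, then wait up to timeoutMs for the echo; resend up to maxAttempts times.
// Returns the 1-based attempt number that was confirmed by an echo, or 0 if none was.
int sendWithEcho(const EchoLink &link, int maxAttempts,
                 uint32_t timeoutMs, uint32_t sliceMs)
{
    if (sliceMs == 0) {
        return 0; // a zero slice would never advance the wait loop
    }
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        // The return value only covers the first hop, so it is ignored here.
        link.sendFn();
        for (uint32_t waited = 0; waited < timeoutMs; waited += sliceMs) {
            if (link.pollEchoFn(sliceMs)) {
                return attempt; // echo arrived: the final destination got the message
            }
        }
        // Timed out: either the message or the echo was lost, so try again.
    }
    return 0;
}
```

The node stays awake for up to maxAttempts * timeoutMs in the worst case, so on a battery node both values would have to be kept small.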

The reason for this question is a battery door node at the edge of reliable radio connection with the gateway. I did add a repeater node in the middle, but still sometimes the message gets lost...

mfalkvidd

@evb basically you are trying to solve what's called "the two generals' problem", which has been proved to be unsolvable.

Some corrections:

If the method send returns with false, the hardware ack did not work, the message is not delivered.

This statement above is incorrect. The message could very well have been delivered, but the ack was lost.

if each node (normal and repeater nodes) auto retries to deliver the message, we are sure that the message gets delivered.

This statement above is incorrect. By default, for nrf24, a node will retry 15 times and after that the send() function will return false. No more retries will be made after that. So even though auto retry is on by default, it does not guarantee delivery.
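To illustrate why a fixed number of retries can never reach certainty, here is a rough back-of-the-envelope helper. It ASSUMES every attempt fails independently with the same probability, which real radio links usually do not (interference and blocked paths come in bursts), so the result is an optimistic estimate. The function name allAttemptsFail is made up for this example.

```cpp
#include <cmath>

// Probability that every one of `attempts` transmissions fails, ASSUMING each
// attempt fails independently with probability pFail (0.0 to 1.0).
// Correlated losses on a real link make the true figure worse than this.
double allAttemptsFail(double pFail, int attempts)
{
    return std::pow(pFail, attempts);
}
```

For example, with a 50% loss per attempt and 4 attempts the result is 0.0625. More attempts shrink the number quickly, but it only reaches zero if pFail itself is zero.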

Is using the software ack better?

This statement above is incorrect. MySensors does not have software ack. If you think software ack exists, you have been fooled, just like many many many many other MySensors users (myself included).

Since you are trying to solve a problem that has been proved unsolvable, there are no best practices. Workarounds can be made, but each workaround would have to be tailored to the specific use case.

evb

Hi @mfalkvidd

Ok, I understand your point. I did read through plenty of topics on this ;-)
Should be an article on the official website and not spread over so many different topics, hint, hint ;-)

Ok, let's change the title then to Best practice for the possible workarounds for hardware and software ack when using a battery node.

Physically, because we are using radio, we can't reach 100%, but we can aim for it.

1. Checking the return value of the send method and retrying x times in the sketch code can already help to aim for 100%.
The nrf24 radio (and, I imagine, the rfm69 too?) already does hardware retries. Together with the retries in the sketch, this increases the chance that the message is delivered... with the caveat that a message may arrive several times at the controller!
Is there a best practice for this workaround? Experiences from members?
Can my first point also be done on a repeater? Because the 'repeating' is done by the library...
Can we override or intercept this library behaviour?
2. The echo reply mechanism, or what I'm calling the software ack.
That's the one I doubt is usable in battery nodes, but maybe members have positive or negative experiences with it?

BearWithBeard

I think sending a message repeatedly is really only helpful for occasional hiccups or connection issues. Say, somebody walks by a node and physically weakens or blocks the signal right when it attempts to send a message. Or the gateway / repeater is busy with another task, or reboots / reconnects for some reason.

If the connection issue is persistent, because the node is too far away from any parent node and you already know that you will need multiple TX attempts more often than not, the best practice would be to address the range issue itself, IMHO. Either increase the output power of the transceiver, or use (another) repeater with a higher receiving sensitivity. Remove the cause of the problem instead of fighting the symptoms.

But I understand that you want certain nodes to be as reliable as possible. I address occasional connection issues on my coin cell powered contact nodes by attempting to send the state up to a limited number of times, with a decent pause in between attempts. If all those attempts fail, I increase a failedTxAttempts counter variable and send it separately as a crude "reliability indicator". If this occurs too often, I need to address the issue. That being said though, my nodes are fairly reliable. Last time I checked, 1 out of 1250 messages failed on average (0.08%).

Here's a snippet from my contact node sketch. Hope it makes sense to you, ~~I just copy and pasted it~~.
Edit: I removed the functions and put everything inside the loop for brevity. I also added some comments.

void loop()
{
	static bool contactState;
	static uint8_t failedTxAttempts;

	contactState = digitalRead(PIN_CONTACT);
	
	bool sent = false;
	uint8_t txAttempt = 0;

	// Attempt to send the contact state up to MAX_TX_ATTEMPTS times
	do
	{
		sent = send(msgContact.set(contactState));
		if (!sent)
		{
			// Message didn't reach parent or didn't get ACK from parent
			sleep(FAILED_TX_PAUSE); // Sleep for a while (500ms or so)
			++txAttempt;
		}
		else
		{
			// Received ACK, give visual feedback
			digitalWrite(PIN_LED_OK, HIGH);
			sleep(MY_DEFAULT_LED_BLINK_PERIOD);
			digitalWrite(PIN_LED_OK, LOW);
		}
	} while (!sent && txAttempt < MAX_TX_ATTEMPTS); // MAX_TX_ATTEMPTS: 5
	
	if (!sent)
	{
		// Contact state couldn't be sent, increase TX error counter
		++failedTxAttempts;
	}

	if (failedTxAttempts != 0) 
	{
		// Report that there were contact state change(s), which failed to be sent
		wait(TX_PAUSE);
		if (send(msgFailedTxAttempts.set(failedTxAttempts)))
		{
			// Reset counter variable if ACK was received
			failedTxAttempts = 0;
		}
	}
	
	// Do other stuff & sleep
}

A similar goal to those failed TX indicators can be achieved using the internal indication handler. You can read more about it in this post. This can also be used on the gateway and repeaters. I don't know of any way to intercept or "force repeat" relayed messages on a repeater manually.

Regarding the echo, I guess you could easily calculate how much requesting an echo would impact the battery life. Take a timestamp with millis() right before and after send() without requesting an echo and check how long it takes on average. I suspect this will be about 80ms. Then compare this to a message with an echo where you take the time right before sending and right after receiving the echo in receive(). The difference between the two times should roughly equal the time the transceiver spends in a high power state to listen for the echo.

evb

@BearWithBeard thanks for your code samples, it confirms my own thoughts and gives a code guideline at the same time :-)

@mfalkvidd, in the reply of @BearWithBeard, he linked to another topic post where you use the indication handler in your sketch.
TxOk and TxErr are sent to the controller. Their values are always incremented to infinity by your sketch?
How do you interpret this at the controller?

@sundberg84 says : INDICATION_GW_TX sounds like a good plan. This is a great tool I think for the future to evaluate and debug your network. I used S_CUSTOM and a utility meter (hourly) in HA to get the values.
Do you have a code sample of your sketch? An example generally says more than 1000 words ;-)
I think you send a value of '1' each time an error occurs, correct? And on the HA side you use the hourly utility meter integration to sum the values on an hourly basis?

mfalkvidd

@evb said in Best practice for hardware ack and software ack when using a battery node:

TxOk and TxErr are sent to the controller. Their values are always incremented to infinity by your sketch?
How do you interpret this at the controller?

Yes. A counter is good because missed messages don't matter very much: the value is cumulative, so the next report that does arrive still carries the full total.

I don't do anything at all with it. But if I did, I would probably calculate

(new_counter_value - previous_counter_value)/(new_timestamp - previous_timestamp)

and plot that, perhaps with some averaging over time, depending on how often the node sends.
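As a sketch of that calculation on the controller side (counterRate is a made-up name, and the timestamps are assumed to be in seconds):

```cpp
#include <cstdint>

// Average events per second between two samples of an ever-increasing counter.
// Unsigned subtraction keeps the delta correct across one wrap-around of the
// 32-bit counter. A node reboot that resets the counter to 0 is NOT handled
// here and would need separate detection.
// Returns -1.0 if the timestamps are not strictly increasing.
double counterRate(uint32_t prevCount, uint32_t newCount,
                   uint32_t prevTimeS, uint32_t newTimeS)
{
    if (newTimeS <= prevTimeS) {
        return -1.0;
    }
    const uint32_t delta = newCount - prevCount; // modulo 2^32 by definition
    return static_cast<double>(delta) / static_cast<double>(newTimeS - prevTimeS);
}
```

For example, a counter going from 10 to 16 between t = 100 s and t = 103 s gives 2 events per second. A lost report in between does not change the result, it only makes the interval longer.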

evb

A second thought about using the MySensors method indication.

void indication(indication_t ind)
{
  switch (ind)
  {
    case INDICATION_TX:
      txOK++;
      break;
    case INDICATION_ERR_TX:
      txERR++;
      break;
  }
}

A successful or failed transmission of the error message itself will also trigger this method, which then actually gives a distorted picture.
Or has this been thought through too far? :thinking_face:

mfalkvidd

@evb if an error message is transmitted in the ether, and no-one is around to acknowledge it, was it really sent?

BearWithBeard

@evb The indication handler will not only be triggered by the error message, but also by (almost) all internal messages, like registration requests, node and sensor presentation, nonce, discovery responses, find parent requests, OTA firmware upgrades and others. Basically everything that uses the default transportRouteMessage() function.

It's up to you if this gives a wrong picture. If you want to know how reliable the uplink of a node is, it's exactly what you would want to use, isn't it? If you, on the other hand, only care about your own messages, or perhaps even only a few of them, you would probably want to avoid the indication handler, as it may result in "false positives".

It's basically the opposite of the "track only a single MyMessage object" approach I was showing above.
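If you want something in between, one option might be to keep the indication handler but skip counting while the statistics report itself is being sent. This is an untested sketch with two assumptions: that indication() is called synchronously from inside send(), and that no other traffic you care about is sent during the report. reportingStats and sendStatsReport are hypothetical names, and the small enum only mirrors two library values so the example compiles off-device (remove it when building against MySensors).

```cpp
#include <cstdint>

// Off-device stand-in for the library type; MySensors defines the real one.
enum indication_t { INDICATION_TX, INDICATION_ERR_TX };

static uint16_t txOK = 0;
static uint16_t txERR = 0;
static bool reportingStats = false; // hypothetical flag, not part of the library

void indication(indication_t ind)
{
    if (reportingStats) {
        return; // ignore indications caused by the stats report itself
    }
    switch (ind) {
        case INDICATION_TX:     ++txOK;  break;
        case INDICATION_ERR_TX: ++txERR; break;
        default:                break; // the real enum has many more values
    }
}

// Wraps the send of the counters so that it does not count itself.
// sendFn stands in for something like send(msgTxErr.set(txERR)).
template <typename SendFn>
bool sendStatsReport(SendFn sendFn)
{
    reportingStats = true;
    const bool ok = sendFn();
    reportingStats = false;
    return ok;
}
```

The counters then reflect everything except the reports about the counters, which is closer to what @evb asked for, at the price of a slightly incomplete picture of the uplink.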

skywatch

@mfalkvidd If the message was sent by a cat, maybe...... ;)
