Discussion: Reliable delivery

mfalkvidd

Alright. I have done some reading :-)

First, anyone interested in this discussion should probably read this thread where the protocol specification was discussed. Awesome work by @Yveaux @Zeph @ToSa @Damme and @hek !

If there are any other references on the design decisions for the protocol, please let me know.

Note: I use the term originating node in the text below. By that I mean a single MySensors node, a leaf node in the routing tree (or the gateway, if reliable messages are sent from the gateway to a MySensors node).

After reading up on the RadioHead RF library and a blog post about the protocol used for ulpnode, combined with my experience from the "internet" world, I realized one thing: there are two philosophies of reliable delivery, and we need to decide which one suits MySensors best.

hop-to-hop acknowledgment
As identified in the protocol thread, Radiohead uses a "hop-to-hop acknowledgment" philosophy. The way I see it (everyone is welcome to offer a different view on this) is that the world we live in when we want the "hop-to-hop acknowledgment" looks like this:

End nodes are considered extremely low power. They have little memory and/or need to conserve power
End nodes want to get rid of the responsibility of a message as soon as possible. Either because they want to go back to sleep, or because they don't have enough memory to buffer messages.

This puts very high reliability requirements on all relay nodes. If the originating node has gotten an ack, the node that sent the ack is 100% responsible for making sure the message is delivered. Since the originating node might have gone back to sleep, there is no way to let the originating node know if something bad happens. So all relay nodes MUST be able to buffer a lot of messages, maybe not indefinitely but at least for a reasonably long time. The originating node wanted reliable delivery after all, so the relay node(s) really need to make sure the message is delivered. If the message wasn't important, the originating node would not have chosen the reliable delivery mechanism.

end-to-end acknowledgment
The most widely used protocol on the Internet, TCP, uses an "end-to-end acknowledgment" philosophy. Routers (the Internet's equivalent of MySensors' repeat nodes) can (and do) drop packets at any time. They are not responsible for buffering. In fact, excessive buffering tends to make performance in the network worse. The philosophy is that the originating node is the node that is most interested in getting the message delivered, and that node is also best equipped to make decisions on timeouts and re-transmission.

This philosophy also assumes that originating nodes are likely to be less busy than repeat nodes. My home computer does a lot less packet sending than the central routers on the Internet. The originating node might only be having a single conversation, and keeping a single message in memory is no problem: the node needs to have that message in memory anyway to call the send function.

.

Does this make sense? If so, which of the philosophies is more suitable when implementing reliable delivery in MySensors?

The current implementation is sort of "hop-to-hop acknowledgment". But MySensors does not offer high reliability on repeat nodes. As far as I understand, repeat nodes will not buffer and re-transmit messages, even though they have said "Don't worry little guy, I'll take care of this message and deliver it reliably" to the originating node. If the message fails later, the failure will not be detected.
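The gap described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the path is modelled as a hypothetical list of links that either work or fail, repeat nodes do not buffer or re-transmit, and the return trip of an end-to-end ack is treated as lossless. None of the names come from the MySensors library.

```cpp
#include <cassert>
#include <vector>

// Each entry says whether that link on the path delivers the packet.
// linkUp[0] is origin -> first repeat node; the last entry reaches
// the final destination.

// Hop-to-hop: the origin only learns whether the first link worked.
bool hopToHopAck(const std::vector<bool>& linkUp) {
    return !linkUp.empty() && linkUp.front();
}

// End-to-end: the ack is produced by the destination, so every link
// on the path must have delivered the packet.
bool endToEndAck(const std::vector<bool>& linkUp) {
    if (linkUp.empty()) {
        return false;
    }
    for (bool up : linkUp) {
        if (!up) {
            return false;
        }
    }
    return true;
}
```

With a path of `{true, false}` the hop-to-hop result is true even though the message never arrives, while the end-to-end result is false.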

I don't mean to complain about the current implementation. There were probably lots of design choices that I am not aware of, and what we have is awesome. I just think that we can do better when it comes to reliable delivery.

Sorry about the long post.

hek

Good summary @mfalkvidd.

There is actually another looong thread that should give you more perspective/history here:
http://forum.mysensors.org/topic/304/2-0-discussion-units-sensor-types-and-protocol
It's not been realised.

Just a note: the bool you're getting from send() was never meant to be used for detecting whether a message was delivered to its final destination. It merely exposes the "hardware" acking result from the NRF module. So it's an indication of whether the node is totally off the grid.

mfalkvidd

Thanks @hek! Now I have something to read before I go to sleep tonight :)

mfalkvidd

I have collected some real-world use cases and read the protocol design discussions in the forum. There is something I don't understand though:

What is the difference between hardware and software ack, and what is the difference between the ack in present/sendSketchInfo/send/sendBatteryLevel, mSetRequestAck and isAck() in incoming messages?
I have read these posts but I'm unable to figure out how the library is designed.
http://forum.mysensors.org/topic/1163/can-anyone-shed-some-clarity-in-this-ack-business-hek/4
http://forum.mysensors.org/topic/649/rc-from-send-and-how-to-identify-an-ack-message/7

From @hek's response in the first thread, it looks like the library is designed to support BOTH end-to-end acknowledgment and hop-to-hop acknowledgment, but how to use that isn't very clear, at least not to me. Could anyone care to explain?

hek

Low-level: When a node wants end-to-end ack, it sets a "req ack" bit in the message header. The node that receives this type of message sends an ack message with the "req ack" bit off and the "is ack" bit on. Otherwise the messages would create an endless loop.
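A minimal sketch of that flag handling, assuming a hypothetical, simplified `Message` struct (the real MySensors header layout is not shown in this thread) and assuming the ack echoes the original payload:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified message; field names are illustrative only.
struct Message {
    uint8_t sender;
    uint8_t destination;
    bool requestAck;  // the "req ack" bit
    bool isAck;       // the "is ack" bit
    uint8_t payload;
};

// Runs on the node that is the final destination of `in`.
// Fills `ack` and returns true if an ack should be sent back.
bool buildAck(const Message& in, Message& ack) {
    if (!in.requestAck || in.isAck) {
        return false;  // nothing requested, or this already is an ack
    }
    ack = in;                      // assumption: the ack echoes the payload
    ack.sender = in.destination;   // swap direction
    ack.destination = in.sender;
    ack.requestAck = false;        // cleared so the ack is not acked in turn
    ack.isAck = true;
    return true;
}
```

Because the ack leaves with "req ack" cleared, feeding it back into `buildAck` yields no further message, which is what prevents the endless loop.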

The "hardware ack" functionality is only used to detect if a node loses connection with its closest neighbor and should start looking for a new parent.

mfalkvidd

I think I'm most confused by this statement:

@hek said

It will first send the hardware ack and then the end-to-end software ack provided by the MySensors library.

What is the end-to-end software ack? Was it never implemented?

mfalkvidd

ok, let's see if I understand now. How much of this is correct?

Hardware ack is what I get when using send(msg, true) and some other functions like present, sendSketchInfo and sendBatteryLevel

Hardware ack is hop-to-hop acknowledgment
Sending is a blocking function which returns true if an ack was received and false if ~70ms passed without receiving a hardware ack.
Sending ack is handled automatically by the library(/radio)
Receiving an ack is handled automatically by the library
Re-send needs to be handled manually in each sketch (a while-loop with a counter counting up to maxRetries and using wait() between each resend seems to be the most common solution)

Software ack is what I get when setting msg.mSetRequestAck before sending the message (is this true? I'm unable to find any documentation on mSetRequestAck on http://www.mysensors.org/download/sensor_api_15)

Software ack is end-to-end acknowledgment
Sending is a non-blocking function (from the software ack perspective; using hardware ack together with software ack is allowed, and in that case the rules for hardware ack apply to the hardware ack, but the call is not blocked until the software ack comes through)
Sending an ack needs to be handled manually in each sketch (by checking msg.mGetRequestAck in incomingMessage? I'm unable to find any documentation on mSetRequestAck on http://www.mysensors.org/download/sensor_api_15 )
Re-send needs to be handled manually in each sketch (setting up a timer might be a good solution)
Receiving an ack needs to be handled manually in each sketch (by checking msg.isAck() in incomingMessage and clearing the timer mentioned above)

hek

Nope,

Hardware ack is always enabled in the MySensors library (except for broadcast messages).

End-to-end ack is enabled with the true flag.

mfalkvidd

But then end-to-end ack is horribly broken? Since the next hop node will respond with an ack even if it is a repeater?

hek

No, only the final destination node will answer with a soft ack message.

Note, the (soft) ack message has to be picked up by yourself in your incomingMessage function.

hek

You can look at the RelayWithButton example on how to use it.
https://github.com/mysensors/Arduino/blob/development/libraries/MySensors/examples/RelayWithButtonActuator/RelayWithButtonActuator.ino

Here the node sends a message with soft ack enabled when someone presses the button. It doesn't change the local (light) state until ack message is received.
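The pattern can be sketched without any radio code. This is not the RelayWithButtonActuator sketch itself; `Incoming` and `ButtonNode` are hypothetical stand-ins that only show the ordering: request first, change local state when the ack comes back.

```cpp
#include <cassert>

// Hypothetical stand-in for an incoming message as seen by the sketch.
struct Incoming {
    bool isAck;
    bool value;
};

struct ButtonNode {
    bool lightOn = false;

    // Button pressed: work out the desired state to send with soft ack
    // enabled. The local state is deliberately NOT changed here.
    bool onButtonPress() const { return !lightOn; }

    // Sketch-side handler, playing the role of incomingMessage.
    void onIncoming(const Incoming& msg) {
        if (msg.isAck) {
            lightOn = msg.value;  // destination confirmed, now update
        }
    }
};
```

If the ack never arrives, `lightOn` keeps its old value, so the local state does not drift away from what the destination actually received.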

mfalkvidd

@hek said:

No, only the final destination node will answer with a soft ack message.

Note, the (soft) ack message has to be picked up by yourself in your incomingMessage function.

Thanks for explaining. Just checking if my understanding is correct:

bool send(MyMessage &msg, bool ack);

will return the result of the hardware ack, regardless of whether the second parameter is true or false?

So my post above should have been like this:

Hardware ack is always on

Hardware ack is hop-to-hop acknowledgment
Sending is a blocking function which returns true if a hardware ack was received, and false if ~70ms passed without receiving a hardware ack.
Sending hardware ack (when the message has reached the next-hop node) is handled automatically by the library(/radio)
Receiving a hardware ack is handled automatically by the library, by setting the return value from the send call mentioned above
Up to 15 tries are made automatically by the library. Further re-send needs to be handled manually in each sketch (a while-loop with a counter counting up to maxRetries and using wait() between each resend seems to be the most common solution)
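The manual re-send loop mentioned in the last point might look roughly like this. It is a sketch only: `sendOnce` stands in for a call such as send(msg) returning the hardware ack result, `waitBetween` stands in for wait(), and `sendWithRetry` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <functional>

// Retries `sendOnce` until it reports a hardware ack or `maxRetries`
// attempts have been made. Returns the number of attempts used,
// or 0 if every attempt failed.
int sendWithRetry(const std::function<bool()>& sendOnce,
                  const std::function<void()>& waitBetween,
                  int maxRetries) {
    for (int attempt = 1; attempt <= maxRetries; ++attempt) {
        if (sendOnce()) {
            return attempt;  // next hop acknowledged
        }
        if (attempt < maxRetries) {
            waitBetween();   // back off before trying again
        }
    }
    return 0;
}
```

A success here still only means the next hop received the message; it says nothing about the final destination.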

Software ack is what I get when using send(msg, true) (and is also supported by some other functions like present(), sendSketchInfo() and sendBatteryLevel())

Software ack is end-to-end acknowledgment
Sending is a non-blocking function
Sending ack (when the message has reached its final destination) is handled automatically by the library
Re-send needs to be handled manually in each sketch (setting up a timer when sending the original message might be a good solution)
Receiving an ack needs to be handled manually in each sketch (by checking msg.isAck() in incomingMessage and clearing the timer mentioned above)

hek

Yes! Looks correct!

If you want to dig even deeper you can tune the hardware ack/message burst handled by the radio here:
https://github.com/mysensors/Arduino/blob/development/libraries/MySensors/core/MyTransportNRF24.cpp#L59

The radio sends a burst of maximum 15 messages (with a 5us interval) by itself unless it picks up a response from the other node.
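For anyone tuning this: on the nRF24L01+ the auto-retransmit delay and count share one register (SETUP_RETR), with the delay code in the upper four bits and the retry count in the lower four. Each delay step is 250 µs, and code 0 already means 250 µs. The helpers below are hypothetical and only show the encoding. If the "5" above is the raw delay code rather than a time in microseconds (an assumption, not confirmed in this thread), the interval would be 1500 µs.

```cpp
#include <cassert>
#include <cstdint>

// Packs a delay code (0-15) and a retry count (0-15) into the layout
// used by the nRF24L01+ SETUP_RETR register: delay in bits 7:4,
// count in bits 3:0.
constexpr uint8_t setupRetr(uint8_t delayCode, uint8_t retryCount) {
    return static_cast<uint8_t>(((delayCode & 0x0F) << 4) |
                                (retryCount & 0x0F));
}

// Delay between retransmissions in microseconds for a given delay code.
constexpr uint32_t retryDelayMicros(uint8_t delayCode) {
    return (static_cast<uint32_t>(delayCode & 0x0F) + 1u) * 250u;
}
```

For example, `setupRetr(5, 15)` gives `0x5F` and `retryDelayMicros(5)` gives 1500.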

mfalkvidd

Very cool! Thanks a lot @hek! Can't believe I have read several hundred forum posts about the protocol design and ack problems the last two weeks without realizing that what I needed was already included in the library.

From the discussions I've read it looks like almost no-one else has understood either.

Some ideas to help people understand how to use the built-in features for reliable delivery:

1. Add an example that has a complete implementation of end-to-end ack usage. This includes

setting up timer(s) for re-sending
storing sent messages so they can be re-sent
determining which message was acked when an ack message is received (there might be several messages that haven't been acked yet)
removing acked message from the sent message store and clearing the timer(s)
re-sending when the timers expire

2. Update the documentation for bool send(MyMessage &msg, bool ack); (and similar functions) to explain that the bool returned is the result of the next-hop ack.
3. Rename the bool ack parameter to bool end-to-end-ack or something similar to make it clear(er) that there are two types of acks
4. In the documentation for send() (and similar functions), refer to the example in (1) for information on how to use end-to-end ack.
5. When the example in (1) is good enough, add some of the required code to the MySensors library so people don't need to copy-paste a lot of code into their sketches. Surround the code with ifdefs to make it optional.
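As a starting point for discussion, the bookkeeping listed under the first idea could be sketched like this. Everything here is hypothetical: `Key`, `Pending` and `AckTracker` are illustrative names, acks are matched on sensor id plus message type (an assumption, since the thread does not say what an ack can be matched on), and time is passed in as a millisecond counter such as the one millis() returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical key identifying which message an ack refers to.
struct Key {
    uint8_t sensor;
    uint8_t type;
    bool operator<(const Key& o) const {
        return sensor != o.sensor ? sensor < o.sensor : type < o.type;
    }
};

struct Pending {
    int32_t payload;      // stored copy so the message can be re-sent
    uint32_t deadlineMs;  // when to re-send if no ack has arrived
    uint8_t attempts;     // how many times it has been sent so far
};

class AckTracker {
public:
    AckTracker(uint32_t timeoutMs, uint8_t maxAttempts)
        : timeoutMs_(timeoutMs), maxAttempts_(maxAttempts) {}

    // Record a message that was just sent with end-to-end ack requested.
    void sent(Key key, int32_t payload, uint32_t nowMs) {
        Pending p = {payload, nowMs + timeoutMs_, 1};
        pending_[key] = p;
    }

    // Ack received: drop the stored copy and its timer.
    // Returns false if nothing was pending for this key.
    bool acked(Key key) { return pending_.erase(key) == 1; }

    // Returns the keys whose timers expired and re-arms them.
    // Entries that already used maxAttempts are dropped instead.
    std::vector<Key> due(uint32_t nowMs) {
        std::vector<Key> resend;
        for (auto it = pending_.begin(); it != pending_.end();) {
            Pending& p = it->second;
            // Signed difference keeps working when the counter wraps.
            bool expired = static_cast<int32_t>(nowMs - p.deadlineMs) >= 0;
            if (!expired) {
                ++it;
            } else if (p.attempts >= maxAttempts_) {
                it = pending_.erase(it);  // give up on this message
            } else {
                ++p.attempts;
                p.deadlineMs = nowMs + timeoutMs_;
                resend.push_back(it->first);
                ++it;
            }
        }
        return resend;
    }

    // Stored payload for a pending key (throws if the key is unknown).
    int32_t payloadOf(Key key) const { return pending_.at(key).payload; }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    uint32_t timeoutMs_;
    uint8_t maxAttempts_;
    std::map<Key, Pending> pending_;
};
```

In a sketch, `sent()` would be called right after send(msg, true), `acked()` from incomingMessage when msg.isAck() is true, and `due()` from loop() to decide what to re-send. Keying on sensor and type means a newer message for the same sensor replaces the older pending one.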

(2) and (3) should be quite easy to do. Can it be done with a pull request or how are documentation improvements handled?

I'm willing to do 1 (will post the sketch in the forum for public scrutiny/feedback of course).

4 can be done when the community thinks the example is "good enough"

5 can wait until the example has been thoroughly vetted.

hek

Sounds like a plan.

The code documentation could be solved with a PR as you say. But the main site isn't available on github (a messy thing) so I have to update that part when the PR arrives/has been merged.

mfalkvidd

ok. I can create the PR. From where can I clone the repo?

mfalkvidd

After a quick chat with hek I now know that I should add code comments to the relevant functions in the code and he'll manually update the API documentation. I'll post a link to the PR when I'm done.

robosensor

It seems that it is currently impossible to tell which message an acknowledgement belongs to, for example when you have just sent two identical messages. Messages should contain a message ID field (also called packet ID or sequence ID, unique on the sending side), either in the protocol headers (which requires changing the protocol and breaks compatibility) or in the user-defined message body (which also breaks compatibility).

mfalkvidd

@robosensor Yes, it is impossible to distinguish between acks for two identical messages. But in which use case is that a problem?

robosensor

@mfalkvidd for example, to know last actuator state.

Imagine that you are sending three commands:

ON
OFF
ON

Then you receive two confirmations: ON and OFF. Which ON was acked? Is the current actuator state after receiving the two confirmations ON or OFF? How do you determine which message (the first or the third) was lost and what you need to resend?
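A sketch of how a per-sender sequence number would resolve this, assuming (hypothetically) that each ack echoes the sequence number of the message it confirms. `SequencedSender` is an illustrative name, not part of the current protocol or library.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical sender-side log keyed by a per-sender sequence number.
class SequencedSender {
public:
    // Assigns the next sequence number to this command and remembers it.
    // The 8-bit counter wraps after 256 messages.
    uint8_t send(bool on) {
        uint8_t seq = next_++;
        unacked_[seq] = on;
        return seq;
    }

    // An ack carrying `seq` confirms exactly one earlier message.
    // Returns false for unknown or already confirmed numbers.
    bool ack(uint8_t seq) { return unacked_.erase(seq) == 1; }

    // Messages still waiting for confirmation, in sending order.
    const std::map<uint8_t, bool>& unacked() const { return unacked_; }

private:
    uint8_t next_ = 0;
    std::map<uint8_t, bool> unacked_;
};
```

With ON, OFF, ON sent as sequence numbers 0, 1 and 2, acks for 0 and 1 leave only number 2 (the second ON) unconfirmed, so the sender knows exactly what to resend.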
