fault tolerance

CrankyCoder

So right now I have my MySensors gateways set up on 2 Raspberry Pis. Both are configured as MQTT gateways, with the MQTT broker on a different server.

I use keepalived to determine which pi should be running the mysgw binary.

This works pretty well. However, I am considering changing both over to the network gateway configuration and using a floating virtual IP.

Does anyone have any pros/cons for one method over the other?

The pro that I can see is being able to use other programs that communicate over the network gateway to do OTA updates, as well as being able to use bindings for things like discovery.

mfalkvidd

@crankycoder great idea. Gateway redundancy is something I've seen requested several times.

Would you mind sharing how you set up keepalived? I think it could be useful for a lot of people.

Just a note before diving in: the ethernet gateway can be configured in client mode (using --my-controller-url-address). The connection then originates from the gateway machine, just like with MQTT (in case you see that as a useful configuration).

Method 1 (your current method) - connection initiated from the gateway:
When the connection originates from the gateway, keepalived becomes one more thing that could break, so we add some complexity compared to a system with a single gateway.

Method 2 - connection initiated from the controller:
I only have experience with Domoticz, so other controllers might have a richer feature set, but the only thing Domoticz can do to detect and handle a gateway failure is to reconnect to the same IP/hostname if no data is received for a specified amount of time.
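The reconnect-on-silence behaviour described above can be sketched roughly like this. It is only an illustration of the idea, not Domoticz's actual code; the class name `IdleWatchdog` and its methods are made up for this example.

```python
import time


class IdleWatchdog:
    """Hypothetical helper: tracks how long it has been since data last arrived."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_data = clock()    # treat creation time as the last activity

    def data_received(self):
        """Call whenever a message arrives from the gateway."""
        self._last_data = self._clock()

    def should_reconnect(self):
        """True once no data has been seen for at least timeout_s seconds."""
        return self._clock() - self._last_data >= self._timeout_s
```

A controller loop would call `data_received()` on every message and, when `should_reconnect()` returns True, close the socket and connect again to the same IP/hostname. Nothing here knows which gateway is active, which is why something else has to move the address.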

For that method to be effective, there needs to be something that redirects the (virtual) IP or hostname to the currently active gateway. That something adds complexity to the solution, and there would be a nonzero risk that this something fails, rendering the redundancy useless. Keepalived is still needed, so method 2 has more complexity (keepalived + "something" instead of just keepalived) and therefore a larger risk of failing than method 1.

So I see a potential downside with method 2, and I can't really see any benefit. But maybe I'm missing something.

CrankyCoder

@mfalkvidd I will put up a write-up on what I did, including my health check script (which may address some of the other items on how I handle the failures, etc.).

Keepalived with the check script and health script is what I use to keep tabs on things. It also moves the virtual IP back and forth, so the same IP stays in use, and restarts the binary when needed.
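For anyone who wants something to start from, here is a minimal sketch of what a check script of this kind could look like. It is not the actual script from this setup, just an illustration: it assumes Linux (it reads `/proc`), assumes the gateway process is named `mysgw`, and the function names are made up. Keepalived treats a check script that exits with 0 as healthy and any other exit code as failed.

```python
import os


def is_process_running(name, proc_root="/proc"):
    """Return True if any process listed under proc_root has this command name."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # process exited during the scan, or is not readable
    return False


def main(name="mysgw", proc_root="/proc"):
    """Return the exit code a keepalived check script would use: 0 healthy, 1 failed."""
    return 0 if is_process_running(name, proc_root) else 1
```

To use it as a check script, end the file with `import sys; sys.exit(main())` and point the `script` line of a `vrrp_script` block in keepalived.conf at it. This only checks that the process exists; it does not verify that the gateway is actually passing messages.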
