I have been playing with Rhasspy on and off since the Snips debacle. I recently got a little more into it and have a neat little setup so far.
I have an instance of rhasspy running in a docker container on a pi3. That container will soon be moved to something more powerful, but for now it's working quite well. That pi just has a set of simple speakers and an old PlayStation Eye webcam (for the microphone).
Then I have a small pi zero w with a respeaker 2 mic hat and a tiny, roughly 2 inch "mini loud speaker". That pi obviously has little power, but I have all the heavy lifting being done by the other pi. So the zero is doing the wake word detection, audio recording, speech to text (for the moment) and audio playback.
The other pi is doing the same things but also intent analysis and intent handling. Both pis are sharing an mqtt broker on my network. So the pi3 is actually doing intent analysis and intent handling for itself AND my little satellite module.
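A rough sketch of how one handler on the shared broker could serve both the pi3 and the satellite. It assumes Hermes-style MQTT topics (`hermes/intent/<name>` in, `hermes/tts/say` out) with a `siteId` field in the JSON payload. The intent name, the site id and the `responses` table are made up for illustration, and the actual broker connection is left out.

```python
import json


def build_tts_reply(intent_topic, payload, responses):
    """Return (topic, json_payload) for a spoken reply, or None if unhandled.

    `responses` is a hypothetical lookup table: intent name -> text to speak.
    """
    prefix = "hermes/intent/"
    if not intent_topic.startswith(prefix):
        return None  # not an intent message, ignore it

    intent_name = intent_topic[len(prefix):]
    text = responses.get(intent_name)
    if text is None:
        return None  # no reply configured for this intent

    message = json.loads(payload)
    # Send the reply back to whichever device heard the command.
    site_id = message.get("siteId", "default")
    return "hermes/tts/say", json.dumps({"text": text, "siteId": site_id})


if __name__ == "__main__":
    reply = build_tts_reply(
        "hermes/intent/GetTime",
        b'{"siteId": "zero-satellite"}',
        {"GetTime": "It is noon"},
    )
    print(reply)
```

Because the reply carries the same `siteId` the request came in with, the answer gets played on the device that was spoken to, whichever pi that is.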
My text to speech is currently a docker container of marytts running on yet another machine, which handles the tts for both pis.
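For reference, a shared MaryTTS server is normally reached over plain HTTP through its `/process` endpoint. A minimal sketch of building such a request URL follows. The host name `marytts.local` is a placeholder, 59125 is MaryTTS's default port, and the locale is an assumption.

```python
from urllib.parse import urlencode


def marytts_url(text, host="marytts.local", port=59125, locale="en_US"):
    """Build a MaryTTS /process URL that returns WAV audio for `text`.

    `host` is a placeholder; point it at whatever machine runs the container.
    """
    query = urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    })
    return f"http://{host}:{port}/process?{query}"


if __name__ == "__main__":
    print(marytts_url("hello world"))
```

Any pi on the network can fetch that URL and play the returned WAV, so the speech synthesis only has to run in one place.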
For anyone that knows me and my setup, I am a big kubernetes guy, so things like marytts, intent handling and intent analysis are all being moved into my cluster. Then all my pis can basically share the same "brain" as it were.
My whole system for this, just like my home automation, is designed to run locally, which is why I am not using Google's speech to text or Amazon Polly or something like that. I know those are more natural sounding, and realistically if I need to switch later it's easy.
Rhasspy has DEFINITELY made it easy to swap pieces in and out, and the whole thing is very modular.