Smart Speakers

ejlane

Well, this is not super on-topic for the site, but there's probably a lot of cross-over interest among people here in building a local smart speaker.

It came up in another thread with @KevinT, but was really not the topic of that thread, so I'm starting this one to have a better place to hold the discussion.

So the current "smart speakers" of yours are really just output devices? Is the MQTT to speech internal to the phone, or does something else do that?

I'd definitely be interested in continuing the discussion. Kind of off topic for this whole site, since its use of MySensors would be minimal. There would be a two-way integration, but it wouldn't be very tightly coupled.

My thought was a raspberry pi, or even a powerful server if needed for the speech recognition and also for responses, and then just raspberry pi zeros for the satellite devices. (They would be doing wake word detection and then simply passing audio back and forth, so not a lot of power needed here.) It would be able to be a pretty cheap overall solution, while keeping full privacy because everything would be done here at the house, not sent out to the cloud.

I'm going to add links here to what I think are relevant projects.

https://snips.ai/ is what got me started on the idea and general architecture, but they got bought out and shut down before I could really get my whole plan off the ground. I was still fighting the learning curve and only had a single device kind-of working like I wanted when the announcement was made. Their website is now just a landing page that points to Sonos, so it's useless. For a while afterwards you could still view the old info there.

Project Alice is an open-source fork of what was snips. Snips had been mostly open-source, with the large exception of their web-configurator tool. Project Alice is not a finished product, but it looks like it's in a useable state, though it also looks like it's mostly the passion project of a single developer. I think this is the way that I'm going to go, once I get enough free time to wrap my mind around it all. https://github.com/project-alice-assistant/ProjectAlice https://community.projectalice.io/

Then there's Mycroft. https://mycroft.ai/ It looks reasonably good for a front-end, but they use Google for the speech recognition. At least they realize the privacy downside of this, and they aggregate everyone's speech snippets and proxy them through their own server, so that minimizes the amount of info that Google could get out of it. They have been supporting Mozilla's Deep Speech project, but I don't think that you can send requests to that over the cloud. They do say, though, that if you have hardware with enough power to run the Mozilla service locally, you can set it to do that. https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/stt-engine#default-engine

While Mycroft will run on a Raspberry Pi, it needs to be a larger, more powerful one. It can't just run on a zero. https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/get-mycroft/linux#system-requirements Although, now that the Zero 2 has come out, it might be enough.

Finally, just to be complete, there was the Jasper project. https://jasperproject.github.io/ I was especially interested in this one when it came out, both because it ran on a Raspberry Pi and because my son is named Jasper. I thought that would be cool. It seemed like a great start, but then it died off pretty quickly. I got the feeling that it was a student project, and once the students were done with school it didn't go anywhere.

That could be totally false - I didn't follow it closely. But it's been basically abandoned for quite a few years now. I didn't understand the code well enough to just pick it up and run with it, so I never did anything other than read up on it every year or two.

So that's my list for now. If nothing else, it's nice to have this all in one place for future reference. Right now I still think I'll end up going with Project Alice, but with how long it takes me to get to these things it might be a long time before I actually get anything going. If anyone else has suggestions, or things that I've missed, I'd love to hear about them!

KevinT

@ejlane Yes, my speakers are best described as MQTT Text to Speech output devices. Basically, I wrote a small application which runs on a phone using MIT App Inventor. It subscribes to the Speak topic on my local MQTT broker. Whenever someone publishes to the Speak topic, it converts the text to speech. I have quite a few automations in Home Assistant which publish to Speak. I have 3 old phones set up as speakers around my house.
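For anyone wanting to try the same pattern, the publishing side can be sketched in Python. This is only an illustration, not the App Inventor application described above: the `Speak` topic comes from the post, while the broker hostname `mqtt.local`, the function names, and the choice of the paho-mqtt library are assumptions.

```python
def build_speak_message(text, topic="Speak"):
    """Return the (topic, payload) pair for a text-to-speech announcement."""
    cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
    if not cleaned:
        raise ValueError("nothing to say")
    return topic, cleaned


def publish_speak(text, host="mqtt.local", port=1883):
    """Publish one announcement to the broker (requires `pip install paho-mqtt`)."""
    # Imported lazily so build_speak_message works without the library installed.
    import paho.mqtt.publish as publish

    topic, payload = build_speak_message(text)
    publish.single(topic, payload=payload, hostname=host, port=port)


if __name__ == "__main__":
    # Only builds the message; call publish_speak() once a broker is reachable.
    print(build_speak_message("The garage door\n is   open"))
```

Any phone or other device subscribed to the same topic would then receive the plain text and hand it to its own speech engine.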

Regarding AI smart speakers, there is a lot of new stuff out there, with the release of tinyML.
From what I've read, the easiest type of speech recognition is "Keyword spotting" - see Edge Impulse
The next level is "Speech to Intent" - see Wio Terminal TinyML course
And the highest level is Large-vocabulary continuous speech recognition.

ejlane

@KevinT What is it that actually does the conversion from text to speech? Is it on the phone? Did you create that part, or did it come with the phone, or something else?

Yes, I'm aware of keyword spotting, but I can't think of any applications in my house where that would be enough, except for using it to catch the wake word for a more fully-fledged speech to text. I hadn't seen the Edge Impulse learning site, though, so thanks for that!

I also hadn't even considered a mid-range option like the one in the Wio Terminal link you posted. I have to study that some and think about it. That could be enough for what I'm doing... But in any case I want to work through the tutorial and see what I can learn.

But overall I'm still a bit set on the full-fledged idea with Project Alice. Which might actually not be reasonable. I need to think about it. It could just be that I set myself on that a while back and now going with anything else feels like giving up? But in any case, these other options will be great learning experiences, so I'm going to look into them.

Thanks!

KevinT

@ejlane Oh, now I get your question. Text to Speech is built into the Android OS on the phone. There is a synthesis engine which runs on the phone and generates the speech. Although it likely has networked features, for example, loading different voices.
Project Alice looks quite interesting. I'll definitely be digging deeper. I see it runs on Raspberry Pi's, maybe it will run on my Ubuntu server too.

ejlane

@KevinT Yes, that was my question, thanks. I didn't realize that Android had that built in - I thought the voices you hear were generated in the cloud and then the wav or mp3 was sent to the phone. (Like when using GPS to get around or whatever.) Or is it the same voices that you hear while doing that? Is it more basic?

Anyway, thanks for the answer - that's interesting.

Yeah, Alice must be able to run on pretty much any server, but I don't know how much manual effort would be required to alter settings. Maybe they even have an easy-to-use config change that will set up everything?

What I like the most is that the 'satellites' don't need much power, because all they really need to do is wakeword detection and then piping audio back and forth. Just need one with enough grunt to do all the speech to text and back. With our use patterns, that one wouldn't even get worked all that hard for the most part.

Actually, I'm really looking forward to this, because I have a Death Star 3d printed to hold all the electronics, and an addressable 60-LED lightring to go around the outside. So it will also function as a clock and timer/stopwatch with the ring.
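As a rough sketch of the clock idea, here is one way the time of day could be mapped onto a 60-LED ring in Python. The function name and the assumption that LED index 0 sits at the 12 o'clock position (counting clockwise) are illustrative, and actually driving the LEDs is left out.

```python
from datetime import time

RING_SIZE = 60  # LEDs on the ring, as described in the post


def clock_leds(now, ring_size=RING_SIZE):
    """Map a time of day to LED indices for the hour, minute and second hands.

    Index 0 is assumed to be at 12 o'clock, with indices increasing clockwise.
    """
    second = now.second * ring_size // 60
    minute = now.minute * ring_size // 60
    # The hour hand creeps forward with the minutes: 720 minutes span the ring.
    minutes_into_half_day = (now.hour % 12) * 60 + now.minute
    hour = minutes_into_half_day * ring_size // 720
    return {"hour": hour, "minute": minute, "second": second}


if __name__ == "__main__":
    print(clock_leds(time(3, 30, 15)))
```

A timer or stopwatch could reuse the same mapping by feeding in elapsed time instead of wall-clock time.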

And I'm also adding a piezo mic to act as an impact sensor, so you can use them as targets for nerf darts. I got that idea from someone who made an alarm clock that you turn off by shooting. As soon as I saw that I knew I had to add it to my speaker.
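The impact sensing could be as simple as a threshold with a hold-off period so that the ringing after a hit is not counted again. A minimal sketch, assuming the piezo signal arrives as a list of integer samples centred on zero; the threshold and hold-off values are made-up numbers that would need tuning on real hardware.

```python
def detect_hits(samples, threshold=600, holdoff=50):
    """Return the sample indices at which a new impact is registered.

    A hit is any sample whose magnitude reaches `threshold`; further
    samples are ignored until `holdoff` samples have passed since that hit.
    """
    hits = []
    last_hit = None
    for i, value in enumerate(samples):
        if abs(value) < threshold:
            continue
        if last_hit is None or i - last_hit >= holdoff:
            hits.append(i)
            last_hit = i
    return hits


if __name__ == "__main__":
    # One dart strike with some ringing, a quiet stretch, then a second strike.
    signal = [0] * 10 + [900, -800, 700] + [0] * 100 + [1000]
    print(detect_hits(signal))
```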

Oh, and the depression for the laser beam from the Death Star is where the speaker mounts. I have all that 3d modeled and printed, just need to do all the electronics and programming... :) I just need to figure out a way to make some simulated lasers and to be able to light them up. That would also work as a bit of a speaker grille to protect it. I haven't spent much time on that part, but I think it would be a nice addition, and I need to see about it...

ejlane

@CrankyCoder pointed me to another candidate for the speech-to-text part, Rhasspy, in a post in another thread.

Funnily enough, in reading through the documentation, it states that it interoperates with Snips, and provides a link. Even though going straight to the Snips website provides nothing but a landing page, using their docs link ends up with all kinds of info. So it's still there, but hidden from a casual search.

rejoe2

How about Rhasspy?
That's kind of a successor to Snips and uses the same (Hermes) protocol over MQTT.
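To give an idea of what handling a Hermes message looks like, here is a small Python sketch that pulls the intent name and slot values out of a recognized-intent message. The topic prefix `hermes/intent/` and the field names (`intent.intentName`, `slots[].slotName`, `slots[].value.value`, `siteId`) reflect my understanding of the protocol and should be checked against the Rhasspy documentation; the sample payload and the function name are invented for illustration.

```python
import json

# Hypothetical example of what an intent message might look like.
SAMPLE_TOPIC = "hermes/intent/ChangeLightState"
SAMPLE_PAYLOAD = json.dumps({
    "input": "turn on the kitchen light",
    "siteId": "kitchen",
    "intent": {"intentName": "ChangeLightState", "confidenceScore": 1.0},
    "slots": [
        {"slotName": "state", "value": {"kind": "Unknown", "value": "on"}},
        {"slotName": "room", "value": {"kind": "Unknown", "value": "kitchen"}},
    ],
})


def parse_hermes_intent(topic, payload):
    """Extract the intent name, originating site and slot values."""
    prefix = "hermes/intent/"
    if not topic.startswith(prefix):
        raise ValueError(f"not an intent topic: {topic}")
    message = json.loads(payload)
    slots = {s["slotName"]: s["value"]["value"] for s in message.get("slots", [])}
    return {
        "intent": message["intent"]["intentName"],
        "site": message.get("siteId", "default"),
        "slots": slots,
    }


if __name__ == "__main__":
    print(parse_hermes_intent(SAMPLE_TOPIC, SAMPLE_PAYLOAD))
```

The resulting dictionary is the kind of thing an intent handler could pass on to a home automation system to switch blinds or lights.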

For just some "regular" commands to a home automation system, anything from a Pi 3B+ upwards should have enough power to do that.

In my setup (HP T620 as headless x86 server) this runs beside FHEM and deCONZ on my central machine and receives audio input from a mobile phone app (ESP32 satellites are possible as well), allowing full control of all my blinds, lights, ...

CrankyCoder

I have been playing with Rhasspy on and off since the Snips debacle. I recently got a little more into it and have a neat little setup so far.

I have an instance of rhasspy running on a pi3 with a docker container. That container will soon be moved to something more powerful, but for now it's working quite well. That pi just has a set of simple speakers and an old playstation eye webcam (for the microphone).

Then I have a small pi zero w with a respeaker 2 mic hat and a tiny like 2 inch "mini loud speaker". That pi obviously has little power, but I have all the heavy lifting being done by the other pi. So the zero is doing the wake word detection, audio record, speech to text (for the moment) and audio playback.

The other pi is doing the same things but also intent analysis and intent handling. Both pi's are sharing an mqtt broker on my network. So the pi3 is actually processing intents and intent handling for itself AND my little satellite module.

My text to speech is currently using a docker container of marytts running on yet another machine and handles the tts for both pis.

For anyone that knows me and my setup, I am a big kubernetes guy, so stuff like the marytts, intent handling, intent analysis all those things are being moved into my cluster. Then all my pi's can basically share the same "brain" as it were.

My whole system for this, just like my home automation, is designed to run locally. That is why I am not using Google's speech to text or Amazon Polly or something like that. I know those are more natural sounding, and realistically, if I need to switch later, it's easy.

Rhasspy has DEFINITELY made it nice to swap pieces in and out and made it very modular.

ejlane

@CrankyCoder Thank you so much for the detailed explanation of your setup! That sounds very much like how I want my final result to be. I still have no idea how I missed Rhasspy as a decent solution when snips got bought out. It was just never on my radar, which in hindsight is ridiculous. I have no explanation for the huge oversight.

However, now I've been looking into it more, thanks to you, and it's looking great! I also like that there's been some collaboration between Rhasspy and the guy behind Project Alice, so it looks like there's no bad blood there. I was looking at the LED control code, and surprisingly to me, I found out that the Raspberry Pi can even control the ws2812 LEDs directly! I thought the timing constraints were too much for it, but I see that people have made clever use of some of its hardware to be able to handle it, even with a non-realtime OS.

So with that, I'll be able to do without a supporting mcu, it looks like. I guess I'm running out of excuses and need to get started soon.

I was comparing the TTS modules, and although the online ones do sound better, as you say, I also want to keep everything local for privacy reasons. I did like the sound of the MaryTTS, but when I was listening to samples online, it sounded like picoTTS was a bit better to my ear.

I haven't figured out my final configuration for sure, but it could end up being a docker instance on an old repurposed desktop that has extra bandwidth or maybe on a dedicated Raspberry Pi 4. I think either one would be enough, because it would be relatively rare for more than one satellite to be interacted with at the same time.

Do you like Kubernetes much better than Docker for any certain reasons? I guess this is mostly curiosity for me - I currently have 5-10 docker instances running on a couple computers and they're mostly hands-off as far as any maintenance, so I would be hesitant to put in the time to start switching at this point.

CrankyCoder

if you want to play with some ws2812 stuff. i HIGHLY recommend looking at the wled project from aircookie. uses a wemos d1 mini. flash the firmware and off you go. integrates into EVERYTHING. tons of built in lighting effects and some crazy extras like supporting e1.31 protocol so you can add your led strips to christmas light shows using stuff like xlights, vixen2 and falcon pi player.

I haven't really tested pico but I would guess there is a way to use it as a drop in just like i have with my current marytts.

As far as Kubernetes goes, there are 2 main reasons I use it.

1. I currently have a 7 node cluster. So if a container dies, or a node needs to be patched or something, Kubernetes just gets it back up and running somewhere else quickly.
2. I am CKA certified and do a lot of Kubernetes for work. So it eventually bled over into my hobby. It's probably WAY overkill, but I even run my home automation software in it.

KevinT

@ejlane Your Death star speaker sounds pretty impressive! LEDs, timer/stopwatch, impact sensor, and of course speaker & microphone, she'll be loaded. You'll have to share a few pictures. How big will it be? Which Pi fits inside it?

ejlane

@CrankyCoder as far as the LED project, I'll look into that, thanks! I don't have any plans to go that crazy to need the e1.31 stuff, but I guess support for features you don't use doesn't hurt... :)

Yeah, it looked like there was just a selection box, and they were both choices. I haven't gone any deeper than that, or installed it on my own hardware yet. Day job and family stuff are keeping me too busy to spend much time on it other than just dreaming/planning.

Some of what you said might have gone over my head with Kubernetes. Makes sense that you would use what you're good at. But by saying you have a 7 node cluster, does that mean 7 physical machines that will actively share the load? If so that's pretty cool, but far more than what I'm needing any time soon. (I think. Unless I get really deep into some big project, but there's nothing on the horizon right now.)

ejlane

@KevinT It's sized for whatever size a 60 LED ring is. I bought them off Aliexpress and sized it to that. A regular Pi could probably fit in there, though I'm not 100% sure on that. I planned for them to be Zeros. Only thing is, it's been a couple of years since I started thinking about it, and I just haven't gotten around to actually doing much of it yet. So it might still be a while before I get anything finished.

Though just today I got a marketing email about an ESP32-S3 product that is aimed at machine learning and might very well be able to handle all the needs of a satellite. It would still need software support, so it's not ready to go, but that would be even lower cost and power budget for the satellites.

https://www.hackster.io/news/espressif-launches-esp32-s3-box-an-all-in-one-esp32-s3-dev-system-for-tinyml-edge-ai-work-89421f602b2d

So I did a bit of searching, and others are also considering it: https://community.rhasspy.org/t/best-esp32-based-hardware-for-satellite/3012

Looks like the chips should have plenty of power for wake word detection, but it would have to be coded. So far, if I understood it correctly, they are just streaming everything to the Rhasspy server full-time. Not that it would be a ton of bandwidth, but I'd rather not be broadcasting all audio in every room of my house 24/7. I think it's just that the very inelegant design hurts my engineer's brain... :)
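The behaviour I'd prefer on a satellite can be sketched as a small gate in Python: nothing leaves the device until the wake word is heard, and streaming stops again after a short run of silence. This is a conceptual sketch only; `is_wake_word` and `is_silence` are hypothetical stand-ins for a real detector, and the frames here are plain strings rather than audio.

```python
def gate_audio(frames, is_wake_word, is_silence, max_silence=3):
    """Yield only the frames that follow a wake word, until the speaker goes quiet."""
    streaming = False
    quiet = 0
    for frame in frames:
        if not streaming:
            if is_wake_word(frame):
                streaming = True
                quiet = 0
            continue  # nothing is forwarded while idle, including the wake word
        if is_silence(frame):
            quiet += 1
            if quiet >= max_silence:
                streaming = False  # back to idle; this frame is dropped
                continue
        else:
            quiet = 0
        yield frame


if __name__ == "__main__":
    frames = ["noise", "WAKE", "turn", "on", "", "", "", "noise", "WAKE", "off"]
    sent = list(gate_audio(frames, lambda f: f == "WAKE", lambda f: f == ""))
    print(sent)
```

On real hardware the yielded frames would be what gets sent to the server, so the room audio stays on the device the rest of the time.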

CrankyCoder

@ejlane Correct. 7 nodes = 7 machines. BUT, the 7 current machines are 7 raspberry pi 4 (8gig Ram) modules. it is 1/2 of what my goal is. Eventually it will be a full 14 node cluster.

I had been doing something similar with the pi satellites. I found that if you tell it to use a UDP broadcast to localhost for the wakeword/recording then it doesn't send the audio frames to the mqtt broker.
