Monday, May 27, 2019

Robots talking out loud to each other: Adventures with the Assistant API

Ever watch a science fiction movie or TV show where, in a near-future setting, two humanoid robots meet up on the street and somehow converse with each other in spoken English?

Robot Hell, as depicted in Futurama

It doesn't make any sense. The first robot takes its state, serializes it through a complicated text-to-speech system, and then converts this already lossy data into actual sound waves. Then the second robot captures the sound waves via a microphone (adding noise in the process), deserializes the waveforms via an even more complicated speech-to-text system, and then somehow parses and comprehends them. It's an incredibly lossy communications path.

The robots are connected to some sort of network! They can just exchange bits directly! Any protocol would be better than this one! Why would anybody do this?

Well, about a year ago (yes, this post is very overdue), I realized that the Google Assistant API is actually the interface that will bring about this robot hellscape of the future.

Find My Device API

A slight aside on why this project might be useful in any sense (spoiler: it isn't).

Android has a nice feature called "Find My Device," which un-mutes your phone and starts ringing it regardless of what state it's in (even silent mode). This feature comes stock as soon as you hook up your Google account and doesn't require you to install any additional apps, which makes it perfect for the first time you lose your phone.

I wanted to hook up one of the spare Amazon WiFi buttons I had lying around to ring my phone, but unfortunately, there's no actual API endpoint for contacting the Find My Device service. However, the Google Assistant can contact the service, and the Google Assistant has an API!

Let's take a looksie at the nice gRPC API that's provided for the Google Assistant. Note that I'll first use the documentation for the deprecated v1alpha1 version of the API, since that's the state of the world I originally found (an update at the end of this post covers the diff from the current version). The API is straightforward: a single rpc with the signature

rpc Converse(ConverseRequest) returns (ConverseResponse)

If you click through to ConverseRequest, you'll notice it has only two possible fields: config, which provides information about how to process the request (such as the audio sample rate), and audio_in, the audio bytes that you'd like the assistant to process for you.

That's it. That's the state of the first rendition of the API. There's no way to pass text in or get text out. The only thing you're allowed to communicate is audio data.
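To make that concrete, building a request with the generated Python bindings looks roughly like the sketch below. This is a minimal illustration assuming the v1alpha1 stubs from the google-assistant-grpc package; treat the exact module path and values as illustrative rather than authoritative. The sample library streams these requests: the first one carries the config, and each subsequent one carries a chunk of audio_in bytes.

from google.assistant.embedded.v1alpha1 import embedded_assistant_pb2

def converse_requests(audio_chunks):
    """Yield ConverseRequests: the config first, then raw audio bytes."""
    config = embedded_assistant_pb2.ConverseConfig(
        audio_in_config=embedded_assistant_pb2.AudioInConfig(
            encoding='LINEAR16',          # 16-bit signed little-endian PCM
            sample_rate_hertz=16000,
        ),
        audio_out_config=embedded_assistant_pb2.AudioOutConfig(
            encoding='LINEAR16',
            sample_rate_hertz=16000,
            volume_percentage=100,
        ),
    )
    # The first request carries only the config...
    yield embedded_assistant_pb2.ConverseRequest(config=config)
    # ...and every subsequent request carries a chunk of audio_in bytes.
    for chunk in audio_chunks:
        yield embedded_assistant_pb2.ConverseRequest(audio_in=chunk)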

What can one do with this API? Well, we'll try the obvious: use eSpeak (or any other speech synthesizer) to create the audio we give to the assistant, and then use any speech-recognition library to parse the output.
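The bookends of that loop can be embarrassingly simple. Here's a sketch of just the text-to-speech and speech-to-text ends, assuming the espeak command-line tool and the SpeechRecognition Python package (query.wav and response.wav are placeholder filenames; the middle step of actually shuttling the audio through the Converse rpc is what the sample library handles for us):

import subprocess
import speech_recognition as sr

# Text -> audio: have eSpeak write our query out as a WAV file.
subprocess.run(['espeak', '-w', 'query.wav', 'where is my phone'], check=True)

# ... feed query.wav to the assistant, save its spoken reply as response.wav ...

# Audio -> text: run the assistant's reply back through a recognizer.
recognizer = sr.Recognizer()
with sr.AudioFile('response.wav') as source:
    audio = recognizer.record(source)
print(recognizer.recognize_google(audio))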

Robots talking to each other

We can quickly hack up a solution around the examples library (link to GitHub commit; the changes aren't very substantial, just a thin wrapper around the already existing audio helper classes) and achieve some modicum of success (the new flags are --use-text-input and --use-text-output):

python3 pushtotalk.py --device-model-id 'model' \
--device-id 'device' --use-text-input --use-text-output

INFO:root:Connecting to embeddedassistant.googleapis.com
: where's my phone
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "worm".
INFO:root:Transcript of user request: "worms".
INFO:root:Transcript of user request: "worms my".
INFO:root:Transcript of user request: "worms tomorrow".
INFO:root:Transcript of user request: "worms  tomorrow".
INFO:root:Transcript of user request: "worms  my phone".
INFO:root:Transcript of user request: "worms my phone".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "where's my phone".
INFO:root:Expecting follow-on query from user.
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: I found a few phones the first listed is a pixel 3XL should I ring it
: no 
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "no".
INFO:root:Transcript of user request: "no".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "no".
INFO:root:Expecting follow-on query from user.
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: how about the CLT l29
: yes
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "new".
INFO:root:Transcript of user request: "yes".
INFO:root:Transcript of user request: "yes".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "yes".
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: alright your phone should be ringing now

Note that sometimes the assistant isn't fast enough to correct the initial "worms my phone" that it gets from our text-to-speech engine into "where's my phone", and it fails. I've found that the query "where is my phone" is a little more effective at making this less flaky (stupid future...).

But now we can script around this and automate ringing our phone(s)! More on the iptables configuration possibly in a future post; this blog post is already more than a year overdue...
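As a sketch of what that automation could look like (assuming the forked pushtotalk.py reads its queries from standard input, as in the transcript above, and optimistically hard-coding the answers even though the phone order isn't deterministic; see the next note):

import subprocess

# Hypothetical end-to-end script: ask for the phone, decline the first offer,
# accept the second, i.e. the same conversation as in the transcript above.
conversation = 'where is my phone\nno\nyes\n'
subprocess.run(
    ['python3', 'pushtotalk.py',
     '--device-model-id', 'model', '--device-id', 'device',
     '--use-text-input', '--use-text-output'],
    input=conversation, text=True, check=True)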

Note that for the specific query Where's My Phone/Tablet?, the Google Assistant will typically answer "I found a few phones; the first listed is an XXX, would you like me to ring that?" (if you're me; if you have a less complicated digital life and don't hate the future as much, you might avoid this problem entirely and your phone will just immediately start ringing, solving the interaction problem outright). The order here seems non-deterministic, but the good news is that the names of the phones seem configurable via Play Store settings, which means that if our speech-to-text engine were having trouble understanding them, we could rename them to difficult-to-confuse words.


Full disclaimer: Due to the long time it took me to actually get around to putting all of these pieces together, some details here aren't fully accurate anymore. A sharp-eyed reader might notice that the current version of the API (and the version that I actually did end up forking from the google-assistant-sdk sample library) supports text input, so we didn't actually have to use eSpeak to create the audio file given to the assistant API.

However, it's not actually usable in this mode: when you give the assistant text input, it only sometimes gives you text output back; most of the time it just streams back an audio response. Specifically, for the query Where's My Phone?, the response never seems to be given as text.
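For reference, here's roughly what the text-input path looks like in the current (v1alpha2) version of the API. This is a sketch based on the published proto, so treat the exact module path and field values as illustrative. The point is that you still have to supply an audio_out_config, and the only place a textual reply can show up is dialog_state_out.supplemental_display_text, which for this query stays stubbornly empty:

from google.assistant.embedded.v1alpha2 import embedded_assistant_pb2

# The first (and, for text input, only content-bearing) request of the stream.
request = embedded_assistant_pb2.AssistRequest(
    config=embedded_assistant_pb2.AssistConfig(
        text_query='where is my phone',
        # An audio_out_config is still mandatory: the reply comes back as audio.
        audio_out_config=embedded_assistant_pb2.AudioOutConfig(
            encoding='LINEAR16',
            sample_rate_hertz=16000,
            volume_percentage=100,
        ),
        dialog_state_in=embedded_assistant_pb2.DialogStateIn(
            language_code='en-US',
        ),
        device_config=embedded_assistant_pb2.DeviceConfig(
            device_id='device',
            device_model_id='model',
        ),
    ),
)
# Any textual reply would appear in
# response.dialog_state_out.supplemental_display_text; for "where's my phone"
# it never seems to be populated.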

Regardless of what you're doing, for some reason, one of the robots on that street will always be speaking back in English via audio. Stupid future.



Futurama is a trademark of Twentieth Century Fox Film Corporation. Google, Play Store and Google Home are trademarks of Google LLC. Wi-Fi is a trademark of the Wi-Fi Alliance. The trademark owners listed (and any others not listed) are not affiliated with and do not endorse this blog post.