Monday, June 29, 2020

Deep Chernoff Faces

One of my favorite¹ concepts for multi-dimensional data visualization is the Chernoff Face. The idea here is that for a dataset with many dependent variables, it is often difficult to immediately understand the influences one variable may have on another. However, humans are great at recognizing small differences in faces, so maybe we can leverage that!

Tired: traditional plotting

The example from Wikipedia plots the differences between a few judges on some rating dimensions:
This already improves on the "traditional" way to present such data, which is something like a line chart:

or a radar chart:


Clearly² the Chernoff faces are the best way to present this data, but they leave some of the gamut unexplored.

Wired: using GANs to synthesize faces

Over the past couple of years, NVIDIA has worked on generating faces using techniques from generative adversarial networks, and has published two papers on its results and improvements to its generation techniques. This is the neural network structure that is responsible for the system behind thispersondoesnotexist.com and myriad other clones such as thiswaifudoesnotexist.net, thisfursonadoesnotexist.com, thiscatdoesnotexist.com, thisrentaldoesnotexist.com, and the list goes on and on. The faces from these networks have even been used in international espionage to create fake social media profiles.

The GAN technique is pretty easily summarized. You set up two neural networks, a generator, which tries to generate realistic images, and a discriminator, which tries to distinguish between the output of the generator and a real corpus of images. By training the networks together, adding more layers, fooling around with hyperparameters and overall "the scientific process", you manage to get results where the generator network is able to fool the discriminator (and humans) with high-quality images of faces that do not actually map to anybody in the real world. StyleGAN improves on some of the basic structure here by using a novel generator architecture that stacks a bunch of layers together and ends up with an intermediate latent space that has "nice" properties.
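To make that summary concrete, here is a toy sketch of the adversarial game (my own illustration, not NVIDIA's code): the "real" data is just a shifted Gaussian standing in for a corpus of face images, and both networks are tiny fully-connected stacks.

import torch
import torch.nn as nn

latent_dim = 16
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))            # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 3.0       # stand-in for "real images"
    fake = G(torch.randn(64, latent_dim))       # generator's attempt

    # Discriminator tries to label real as 1 and fake as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator tries to make the discriminator say 1 on its fakes.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

Replace the toy networks with very large convolutional ones, swap the Gaussian for the FFHQ face dataset, add the tricks from the papers, and photorealistic faces come out the other end.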

Basically how it works.

A trained StyleGAN (1 or 2; the dimensions of the architecture don't change between versions), at the end of the day, takes a 512-element vector in the latent space "Z", then sends it through some nonsense fully-connected layers to form a "dlatent"³ vector of size 18x512. The 18 here indicates how many layers there are in the generator proper: the trained networks that NVIDIA provides produce 1024x1024 images, and there are 2 layers for each of the resolutions from 2^2=4 to 2^10=1024.

What part of STACK MORE LAYERS did you not get?
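In terms of shapes, the pipeline looks roughly like this (a NumPy sketch with made-up random weights standing in for the learned mapping network, just to show the dimensions):

import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(512)                 # a latent vector in Z

# Stand-in for the mapping network: the real one is 8 learned fully-connected
# layers; these random weights are purely illustrative.
w = z
for _ in range(8):
    w = np.tanh(rng.standard_normal((512, 512)) @ w / np.sqrt(512))

dlatent = np.tile(w, (18, 1))                # one copy of w per generator layer
print(dlatent.shape)                         # (18, 512)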

The upshot is that the 18x512 space has nice linearity properties that will be useful to us when we want to generate photorealistic faces that only differ on one axis of importance. The authors of the NVIDIA paper call each of the 18 layers a "style", and the observation is that copying qualities from each style gives qualitatively different results.
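Concretely, "copying qualities from a style" just means splicing rows of two of these dlatent arrays together. Continuing the sketch above (with dlatent_a and dlatent_b being two 18x512 arrays built from different z vectors):

# Take every style from face B, then overwrite the first few with face A's.
mixed = dlatent_b.copy()
mixed[:4] = dlatent_a[:4]
# Feed `mixed` to the generator and compare against B to see what those layers control.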

The styles that correspond to coarse resolutions bring high-level aspects like pose, hair style, face shape; the styles that correspond to fine resolutions bring low-level aspects like color scheme. But they only roughly correspond to these, so it's going to be somewhat annoying to play around with the latent space! What we need is some way to automatically classify these images for the properties we want to use in our Chernoff faces...

Putting it all together

What I did was take an unofficial re-implementation of StyleGAN2 and run it to generate 4096 random images (corresponding to random seeds 0 - 4095 in the original repository). The reason for using an unofficial implementation is that it ports everything to TensorFlow 2, which also enabled me to run it (a) with less GPU RAM (I only have a GTX 1080 at home) and (b) on the CPU if needed for a future project where I serve these images dynamically.

I was also too lazy to train a network to recognize any features, so I fed these images through Kairos's free trial, receiving API responses that roughly looked like this. (There's no code for this part; you can just do it with cURL⁴).

Then, somewhere in this gigantic unorganized notebook, I train some linear SVMs to split our sample of the latent space; eventually I will clean this up so it's a little more automated than the crazy experimentation I was doing. After I train the SVMs and get the normal vector from each separating hyperplane, I manually explore the latent space by considering only one style's worth of elements from each normal vector, plotting the results, and seeing what changes about the photos (which could be different from the feature in the API response!). This could probably be automated: I should likely use the conditional entropy of the Kairos API responses given the SVM result to rank which styles are most important, and then automatically return that instead of manually pruning styles.
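The SVM step, stripped of all the notebook chaos, looks roughly like this (a sketch with placeholder data; in reality the dlatents come from the 4096 generated faces and the labels come from a Kairos attribute, e.g. smiling or not):

import numpy as np
from sklearn.svm import LinearSVC

N = 4096
dlatents = np.random.randn(N, 18, 512)       # placeholder; really the generated dlatents
labels = np.random.randint(0, 2, size=N)     # placeholder; really a Kairos attribute

clf = LinearSVC(C=0.1, max_iter=10000)
clf.fit(dlatents.reshape(N, -1), labels)

# Normal vector of the separating hyperplane, reshaped back to per-style form.
normal = clf.coef_.reshape(18, 512)
normal /= np.linalg.norm(normal)

# Nudge one face along the direction, but only within one style's worth of
# elements (say style 4), then regenerate and see what actually changed.
edited = dlatents[0].copy()
edited[4] += 10.0 * normal[4]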

I ended up with seven usable properties plus one I threw out before I got tired of doing this by hand.
  • yaw
  • eye squint
  • age
  • smile
  • skin tone
  • gender
  • hair length
  • quality of photo
I decided not to use quality of photo (you can see those results in the notebook) because I didn't want half the photos to just look terrible. The good news is that I made a new notebook for generating the actual faces, one that should actually work for you if you clone the repo.

After generating a few images and using the `montage` command from ImageMagick, we get our preliminary results!


Now that's the future!



¹ "Favorite" might be code for "useless," going with the theme of this blog.
² Clearly.
³ It's called a dlatent in the code, but the paper calls it the space W and the vectors w. I don't know.
⁴ for q in `seq 0 4095`; do
    i=$(printf "%04d" $q)
    curl -d '{"image": "http://cantina.patrickxia.com/faces/seed'$i'.png"}' \
         -H "app_id: xxx" -H "app_key: xxx" -H 'store_image: "false"' \
         -H "Content-Type: application/json" \
         http://api.kairos.com/detect > $i.json
    sleep 1
  done
and something similar for the `media` API, which gives you landmarks. If you care enough, I've hosted the images here, which lets you just look through them if you want.

Monday, January 27, 2020

Interfacing with the real world

I moved to a new apartment building recently¹, and have run into the problem of not having a dedicated package room to receive my Amazon packages.

The building is equipped with a call button system, but the buttons don't reliably signal the upstairs unit, and there's no guarantee somebody is at home to let the courier in. The Amazon courier typically attempts to call the phone number that's listed on the package to be let in, but not always. Sometimes they press the button. Mostly, all I get are notifications that look like this, which only get resolved after a bit of back-and-forth regarding "delivery instructions."



Internet of Crap

From the inside, the door control looks like this:


All I need is for some sort of system to automatically press the button that unlocks the door whenever somebody calls the number on the package. Easy enough, right? There are dozens of IoT robotic button pushers, all of which promise to do this sort of thing!

However, experience tells us that this glorious future is in fact entirely crappy. I had previously tried using a Switchmate to control my bedroom light. When I pressed the physical button, it somehow only worked eight times out of ten. Even worse, the Wi-Fi feature worked even less often, about six times out of ten. I hate the future so much. I was incredibly happy to dump that piece of junk on the "free table" at work so it could annoy somebody else instead.

So what's there to do? We can cobble together our own unreliable IoT solution out of parts that we have lying around! The parts I had lying around that ended up being useful were:

  • An old (original!) Raspberry Pi A (here's the A+, the oldest thing that I can find on Amazon, which features a smaller form factor and more pins on the header)
  • A random 802.11n wifi USB dongle
  • COTO 9007-05 Reed Relays

Past experience has indicated that relays are really the best way to go here instead of a transistor-based approach, which often ends up finicky and not pressing the button reliably enough. In fact, these reed relays were ordered "in anger" after incredible frustration with a prior project. Our old angry self comes to save the day for us!

Figuring out what to connect to what

Taking the door unit off the wall, we see:
which makes it clear that the function of the door button is to short the orange wire to the green wire (apologies for the bad picture -- the traces are easier to see in person).

All we need to do, therefore, is attach the actuation points of the relay across a GPIO pin on the Raspberry Pi and its ground -- that way, when the Raspberry Pi asserts that pin, we energize the relay and end up acting as the button.

I ended up doing this in two phases. The door unit is wired to the apartment complex with just standard cat5 cable. I cut up another cat5 cable and screwed the appropriate wires into the same terminals, then used an RJ45 coupler and soldered onto the other end of a new cable. That way, (a) I wasn't restricted to soldering in the same room as the door unit, and (b) I could detach my Raspberry Pi entirely from the door unit as needed.

Here's a picture of that hacked up solder job:


Note we are indeed shorting the green wire to the orange wire from the cat5 bundle. It doesn't help that I also used orange wire to connect to the Raspberry Pi, though. Here's everything shoved into a shoebox to protect it from environmental damage. Super professional setup here.


Raspberry Pi server

We can write this in Python because this whole thing is a dirty hack.

All we need to do is combine http.server with RPi.GPIO. To stop the whole Internet from being able to actuate our door lock, we only respond to one particular URL, which is a UUID that I generated. You might think this is "security by obscurity." I prefer to say that "we have an authentication system that is backed by a securely generated bearer token."

Every time we get a GET on our HTTP server that matches that UUID, we toggle a GPIO pin high, hold it there for a while, then toggle it back low. We can deal with the concurrency problem like we do with any other problem: by pretending it doesn't exist. Luckily, http.server only handles one request at a time, so it is written with our design goals in mind. If we get two requests, one can wait.

Here's a gist of that first version of the code.
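If you don't feel like clicking through, the core of it is roughly this (a minimal sketch rather than the gist verbatim; the pin number and the UUID path are placeholders):

import time
from http.server import BaseHTTPRequestHandler, HTTPServer
import RPi.GPIO as GPIO

DOOR_PIN = 18                                            # placeholder BCM pin number
SECRET_PATH = "/3f8e2c1a-0000-0000-0000-000000000000"    # placeholder UUID

GPIO.setmode(GPIO.BCM)
GPIO.setup(DOOR_PIN, GPIO.OUT, initial=GPIO.LOW)

class DoorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != SECRET_PATH:
            self.send_error(404)
            return
        # Energize the relay long enough to count as a button press.
        GPIO.output(DOOR_PIN, GPIO.HIGH)
        time.sleep(3)
        GPIO.output(DOOR_PIN, GPIO.LOW)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"door opened\n")

HTTPServer(("", 8000), DoorHandler).serve_forever()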

Phone interface 

Now that we have the Raspberry Pi capable of controlling the apartment lock, we need some way to be able to call it from any phone.

I used to use Tropo for all of my CPaaS (... which is apparently "Communications Platform as a Service"; I hate the future) needs because they offered free development accounts that let you experiment all you wanted as long as you were okay with no SLA. Unfortunately, they have since been acquihired by Cisco and their product is no more. It's a shame, because that API was pretty great -- you would get a no-nonsense REST call against your webserver whenever an incoming call came in, which is all we really wanted.

I tried out a couple of demos, and Plivo offered me $10 of free credit, so I figured it'd be the fastest way to get started. However, the console (a) kept bugging out on me and kicking me into creating a new incoming phone number, and (b) was super confusing about how to hook all the parts together.

I think my original confusion was that all of the demo applications on Plivo are just statically-hosted XML files, and I didn't think to change the endpoint to access my home web server instead. The fact that the workflow kept kicking me out to "Welcome to Plivo" because it was a demo account also did not help.

All roads in the console seemed to lead to hitting the "PHLO" button. This is apparently a flow chart-based version of "programming a phone system without knowing how to program," so I figured I'd try my hand at it. After struggling a bit, I managed to create this "program" (with secret tokens and phone numbers redacted):



There's actually a bug in this program, but I can't seem to fix it. Bonus points if you figure out what it is! Oh well, it sort of works, which is good enough for me.

Productionizing the solution

After having this system up for a few weeks, it worked "fine." However, there are a few issues with it. We will only fix some of them.
  1. Authentication between the CPaaS and the door opening system is done via a bearer token sent in cleartext.
  2. There is no authentication at all on the phone number side.
  3. The electronics are super janky. The relay is rated for 5 volts and we're actuating it with a 3.3V pin. The solder job is completely hacked together, though the shoebox does offer great protection against the environment.
  4. The system is open loop; if disconnected from the door, the calling user still gets a "success" message.
  5. There's no monitoring of uptime of the system.
We will not deal with deficiencies (1) and (2) at all. If we cared enough:
  • We can use TLS to encrypt the interaction between Plivo and us to prevent the bearer token from being stolen. Even better: Plivo cryptographically signs some headers to indicate that they are legitimately from Plivo, so we can authenticate the client without the added TLS layer and certificate headache. One wrinkle: the documentation and examples given by Plivo are insecure. They don't verify that any nonce is only used once, which opens up any system that uses their example code to replay attacks. This isn't just a problem with Plivo. Twilio doesn't even offer a nonce for this feature. Something something stupid future something.
  • We can demand that the users who call the number type a PIN in, and rotate the PIN per delivery. However, not many people actually randomly call the number, and the threat model isn't great. Just hitting the call buttons at random in an apartment building often gets you somebody who's willing to buzz you in if you claim to have a delivery.
We will deal with deficiency (3) at the same time as deficiency (4). 

Our current system has the great property that we have Galvanic Isolation from the apartment's door mechanism itself. This is another reason why using a simple transistor to allow current to flow through the door button terminals doesn't work well; you'd end up coupling the two systems if they share a common ground. We'd like to maintain the isolation, as it's also not great to couple your own electronics with shared infrastructure for the apartment complex.

What we need to be able to do is observe current going through the relay². One way to do this while maintaining galvanic isolation is to use one of my favorite devices, the optoisolator. The optoisolator is a combination of an LED and a phototransistor in the same package. If current flows on the LED side, it lights up and allows current to flow through the transistor. If no current flows, the LED stays dark, and the phototransistor blocks the current path. I ordered a few PS2501-1 optoisolators from eBay for this purpose.

Here's the updated circuit diagram. We've added an NPN transistor (I grabbed a 2N2222, but the exact characteristics aren't super important; the important thing is that we're now driving the relay with more voltage than the 3.3V pin could supply before). The 200 Ω base resistor is chosen to limit current from the pin to less than 16 mA (with the ~0.7 V base-emitter drop, that works out to (3.3 V - 0.7 V) / 200 Ω ≈ 13 mA).



Note that the optoisolator is in one package, although it's drawn as a separate LED and phototransistor here. GPIO14 is configured with a pull-up resistor, which means that if current is flowing through the LED, reading from that pin should indicate a "low" value, whereas if current is not flowing through the LED, the phototransistor blocks current, which means that we would see a "high" value.
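In code, the closed-loop button press ends up looking roughly like this (again a sketch with a placeholder relay pin; the real version is in the gist linked below):

import time
import RPi.GPIO as GPIO

RELAY_PIN = 18        # placeholder; drives the transistor/relay as before
SENSE_PIN = 14        # GPIO14, pulled up; reads low while the optoisolator LED is lit

GPIO.setmode(GPIO.BCM)
GPIO.setup(RELAY_PIN, GPIO.OUT, initial=GPIO.LOW)
GPIO.setup(SENSE_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def press_and_verify(hold_seconds=3):
    GPIO.output(RELAY_PIN, GPIO.HIGH)
    time.sleep(0.1)
    # While the relay is energized, the same current lights the optoisolator LED,
    # which lets the phototransistor pull the sense pin low.
    pressed = GPIO.input(SENSE_PIN) == GPIO.LOW
    time.sleep(hold_seconds)
    GPIO.output(RELAY_PIN, GPIO.LOW)
    return pressed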

Deficiency (5) is easily dealt with using third-party tools. Since I already have a Google Cloud account for other miscellaneous things, we can use Uptime Checks within that cloud project to configure notifications whenever the system is down. Uptime Checks are available for free; Stackdriver logging is billed, but there's an always-free threshold of 50 GiB.

Wrapping it all up: the new code, including verifying that the door is actuated before returning and the `healthz` endpoint, is available in this gist.

Note that the healthz endpoint doesn't actually verify that the relay can be successfully energized; however, failures at that point are logged and notified from the CPaaS layer.

Here's an image of the finished hardware setup. I've put the components on some prototype board instead of letting them dangle.


Conclusions

After setting this system up, it still doesn't work right, because many couriers just refuse to read the delivery instructions before texting via the Amazon app. Argh! The next step is to put a little notice on the door to remind the couriers to check the delivery instructions.

The good news is that the system works well for guests; whenever people need access to get to our apartment, we can have them let themselves in without having to use the finicky door buzzer.



¹ Yeah, this blog post, like all posts, is super late.
² Actually, it might be better to see whether or not there's voltage across the door button terminals. Unfortunately, experiments indicated that enough current to light the LED inside the optoisolator would also trigger the door mechanism, so we have this solution instead: we can only observe the circuit when we are actively changing it. It still solves the open-loop problem.

Monday, May 27, 2019

Robots talking out loud to each other: Adventures with the Assistant API

Ever watch a science fiction movie or TV show where, in a near-future setting, two humanoid robots meet up on the street and somehow converse with each other in spoken English?

Robot Hell, as depicted in Futurama

It doesn't make any sense. The first robot takes its state, serializes it through a complicated text-to-speech system, and then converts this already lossy data into actual sound waves. Then the second robot captures the sound waves via a microphone (adding noise to the problem), deserializes the waveforms via an even more complicated speech-to-text system, and then somehow parses and comprehends the result. It's an incredibly lossy communications path.

The robots are connected to some sort of network! They can just exchange bits directly! Any protocol would be better than this one! Why would anybody do this?

Well, about a year ago (yes, this post is very overdue), I realized that the Google Assistant API is exactly this interface, and that it will bring about this robot hellscape of the future.

Find My Device API

A slight aside for why this project might be useful in any sense (however, it isn't).

Android has a nice feature called "Find My Device," which un-mutes your phone (if it is in silent mode) and starts ringing it regardless of what state it's in. This feature comes stock as soon as you hook up your Google account, and doesn't require you to install any additional apps, which makes it perfect for the first time you lose your phone.

I wanted to hook this up to one of the spare Amazon WiFi buttons that I had lying around to ring my phone, but unfortunately, there's no actual API endpoint to contact the Find My Device service. However, the Google Assistant is able to contact the service and the Google Assistant has an API!

Let's take a looksie at the nice gRPC API that's provided for the Google Assistant. Note that I will use the documentation for the deprecated version of the API, v1alpha1, to first discuss the initial state I found the world in (an update with the diff from the current version is at the end of this post). The API is straightforward: a single rpc with the signature

rpc Converse(ConverseRequest) returns (ConverseResponse)

If you click through to ConverseRequest, you'll notice it has only two possible fields: config, which provides information about how to process the request (such as the audio sample rate), and audio_in, the audio bytes that you'd like the assistant to process for you.

That's it. That's the state of the first rendition of the API. There's no way to pass text in or get text out. The only thing you're allowed to communicate is audio data.

What can one do with this API? Well, we'll try the obvious: we can use eSpeak or any other synthesizer to create audio to give to the assistant, and then we can use any sort of speech recognition library to parse the output.
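Here's a sketch of just those two lossy ends of the pipe (not the code I ended up using, which comes next): it assumes espeak is installed, uses the SpeechRecognition package with the offline PocketSphinx backend, and simply round-trips the synthesized audio to show how much gets lost before the Assistant is even involved.

import subprocess
import speech_recognition as sr

# Text -> speech: eSpeak writes a WAV file.
subprocess.run(["espeak", "-w", "request.wav", "where is my phone"], check=True)

# In the real pipeline, request.wav gets streamed to the Assistant and the audio
# reply is written to another file; here we just transcribe our own synthesized audio.
recognizer = sr.Recognizer()
with sr.AudioFile("request.wav") as source:
    audio = recognizer.record(source)
print(recognizer.recognize_sphinx(audio))   # with luck "where is my phone"; sometimes "worms"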

Robots talking to each other

We can quickly hack up a solution around the examples library (link to GitHub commit; the changes aren't actually very substantial, just a thin wrapper around the already existing audio helper classes) and get some modicum of success (the new flags are --use-text-input and --use-text-output):

python3 pushtotalk.py --device-model-id 'model' \
--device-id 'device' --use-text-input --use-text-output

INFO:root:Connecting to embeddedassistant.googleapis.com
: where's my phone
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "worm".
INFO:root:Transcript of user request: "worms".
INFO:root:Transcript of user request: "worms my".
INFO:root:Transcript of user request: "worms tomorrow".
INFO:root:Transcript of user request: "worms  tomorrow".
INFO:root:Transcript of user request: "worms  my phone".
INFO:root:Transcript of user request: "worms my phone".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "where's my phone".
INFO:root:Expecting follow-on query from user.
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: I found a few phones the first listed is a pixel 3XL should I ring it
: no 
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "no".
INFO:root:Transcript of user request: "no".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "no".
INFO:root:Expecting follow-on query from user.
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: how about the CLT l29
: yes
INFO:root:Recording audio request.
INFO:root:Transcript of user request: "new".
INFO:root:Transcript of user request: "yes".
INFO:root:Transcript of user request: "yes".
INFO:root:End of audio request detected.
INFO:root:Stopping recording.
INFO:root:Transcript of user request: "yes".
INFO:root:Playing assistant response.
INFO:root:Finished playing assistant response.
INFO:root:Recognized response from Assistant: alright your phone should be ringing now

Note that sometimes the assistant isn't actually fast enough to correct the initial "worms my phone" that it gets from our text-to-speech engine into "where's my phone", and it fails. I've found that the query "where is my phone" is a little more effective at making this not flaky (stupid future...).

But now we can script around this and automate ringing our phone(s)! More on the iptables configuration there possibly in a future post; this blog post is already more than a year overdue...

Note that for the specific query Where's My Phone/Tablet?, the Google Assistant will typically answer "I found a few phones; the first listed is an XXX, would you like me to ring that?" (if you're me; if you have a less complicated digital life and don't hate the future as much, you might avoid this problem entirely and your phone will just immediately start ringing, solving the interaction problem). The order here seems non-deterministic, but the good news is that the names of the phones seem configurable via Play Store settings, which means that if our speech-to-text engine were having trouble understanding them, we could rename them into difficult-to-confuse words.


Full disclaimer: Due to the long time it took me to actually get around to putting all of these pieces together, some details here aren't fully accurate anymore. A sharp-eyed reader might notice that the current version of the API (and the version that I actually did end up forking from the google-assistant-sdk sample library) supports text input, so we didn't actually have to use eSpeak to create the audio file given to the assistant API.

However, it's not actually usable in this mode: when you give the assistant a text input, it only sometimes gives you text output back; most of the time it just streams you back an audio response. Specifically, for the query Where's My Phone?, the response query seems never to be given via text.

Regardless of what you're doing, for some reason, one of the robots on that street will always be speaking back in English via audio. Stupid future.



Futurama is a trademark of Twentieth Century Fox Film Corporation. Google, Play Store and Google Home are trademarks of Google LLC. Wi-Fi is a trademark of the Wi-Fi Alliance. The trademark owners listed (and any others not listed) are not affiliated with and do not endorse this blog post.