Since Siri was introduced in 2010, the world has been increasingly enamored with voice interfaces. When we need to adjust the thermostat, we ask Alexa. If we want to put on a movie, we ask the remote to search for it. According to some estimates, 33 million voice-enabled devices will be in American homes by the end of this year.
But there are limitations to voice-enabled interactions. They’re slow, embarrassing when other humans are around, and require awkward trigger phrases like “Okay, Google” or “Hey, Siri.” Thankfully, though, they’re no longer our only—or best—option. A sea change is coming to the cameras in our pockets. The new iPhone introduced a camera that can perceive three dimensions and record a depth for every pixel, and home devices like the Nest IQ and Amazon’s Echo Look now have cameras of their own. Combined with neural nets that learn and improve with more training data, these new cameras create a point cloud or depth map of the people in a scene, how they are posing, and how they are moving. The nets can be trained to recognize specific people, classify their activities, and respond to gestures from afar. Together, neural nets and better cameras open up an entirely new space for gestural design and gesture-based interaction models.
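To make the "depth for every pixel" idea concrete, here's a toy sketch (not any product's real pipeline) of what these cameras enable: because each pixel carries a distance, a frame can be lifted into a 3-D point cloud with simple pinhole-camera math. The intrinsics (`fx`, `fy`, `cx`, `cy`) below are made-up values, not those of any real device.

```python
import numpy as np

def depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    """Back-project a depth image (meters) into an (N, 3) point cloud.
    fx, fy, cx, cy are assumed pinhole intrinsics, not real device values."""
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    v, u = np.indices(depth.shape)       # pixel row/column grids
    z = depth
    x = (u - cx) * z / fx                # back-project along camera rays
    y = (v - cy) * z / fy
    cloud = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return cloud[cloud[:, 2] > 0]        # drop pixels with no depth reading

# Fake frame: a "person" 1.5 m away against an empty (zero-depth) background.
frame = np.zeros((4, 4))
frame[1:3, 1:3] = 1.5
cloud = depth_to_point_cloud(frame)
print(cloud.shape)   # (4, 3) — one 3-D point per valid pixel
```

Point clouds like this are what the neural nets consume when they learn to recognize people, poses, and gestures.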
These new options raise the question: Of the existing interaction modalities—haptics (touch), sound (voice), and vision (gesture)—which is better to use when, and why?
Some of the answers lie in the context of use. Sometimes, certain modalities aren’t available for communication, or are saturated by other tasks. When you’re SCUBA diving, or water skiing, or directing traffic on the deck of an aircraft carrier, the auditory channel isn’t available, so gesture or touch become essential. In an operating room, a surgeon’s hands are sterile; she can’t flip through radiology scans—only speech and gesture are available. If you’re conducting an orchestra or on a military raid you can’t call out commands, so we’re back with gesture.
To dig into it further, our team at the Cambridge studio snagged a camera like the one in the new iPhone and performed a series of experiments to figure out when gesture might be the best choice.
First we gave pairs of people an idea, then asked them to make a four-handed pose to express that idea, to understand if we could train a neural network to recognize personal expressive gestures.
Then we recorded stories and tracked people’s hands using computer vision to study when we naturally deploy gestures to amplify emotion or explain a concept.
Third, we explicitly asked people to invent their own gesture for some common actions that happen around the house, office, and while driving.
Lastly, we trained a neural network to recognize a small set of gestures, and used these to control a Philips HUE light set and a Spotify station to create an installation for the office.
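The control loop for an installation like that can be quite small: a classifier emits gesture labels, and a dispatch table maps each label to a device action. The sketch below is a hypothetical illustration of that pattern; the gesture names and the light/music helpers are invented stand-ins, not the real Philips Hue or Spotify APIs.

```python
# Hypothetical device actions — stand-ins for real Hue/Spotify calls.
def lights_on():  print("hue: lights on")
def lights_off(): print("hue: lights off")
def next_track(): print("spotify: next track")

# Dispatch table: gesture label -> action. Labels are invented examples.
ACTIONS = {
    "palm_up":    lights_on,
    "palm_down":  lights_off,
    "swipe_left": next_track,
}

def handle(label, confidence, threshold=0.8):
    """Fire an action only for known gestures above a confidence floor."""
    if confidence >= threshold and label in ACTIONS:
        ACTIONS[label]()
        return True
    return False

handle("palm_up", 0.93)     # prints "hue: lights on"
handle("swipe_left", 0.55)  # below threshold: ignored
```

The confidence threshold matters in practice: a classifier watching a room all day will emit plenty of low-confidence guesses, and you don't want the lights flickering every time someone stretches.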
In messing around with these exercises, we discovered that gestures need to be either sequential, like a sentence: noun then verb, object then operation (for example, "speaker, on"); or two-handed, where one hand designates the noun and the other the verb (in other words, point to the speaker with the left hand and raise the right hand to turn the volume up).
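The sequential "noun then verb" grammar can be sketched as a tiny state machine: the system buffers a recognized object gesture, then waits for an operation gesture to complete the command. The gesture and object names below are invented for illustration.

```python
# Invented vocabulary: which gestures name objects, which name operations.
NOUNS = {"point_speaker": "speaker", "point_lamp": "lamp"}
VERBS = {"palm_up": "on", "palm_down": "off"}

class GestureSentence:
    """Accumulate gestures until a full noun+verb command is formed."""
    def __init__(self):
        self.noun = None

    def feed(self, gesture):
        if gesture in NOUNS:
            self.noun = NOUNS[gesture]       # remember the object
        elif gesture in VERBS and self.noun:
            command = (self.noun, VERBS[gesture])
            self.noun = None                 # sentence complete; reset
            return command
        return None                          # no full command yet

parser = GestureSentence()
parser.feed("point_speaker")          # noun buffered, no command yet
print(parser.feed("palm_up"))         # ('speaker', 'on')
```

The two-handed variant collapses this into a single frame: one hand's pose selects the noun while the other's supplies the verb, so no buffering is needed.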
We also noticed that gestures are generation-specific. When asked to invent a gesture to turn up the volume, some people used a knob-rotating gesture, but people under 30 tended to use a more generic raised-palm lifting gesture, or even a pinch.
After analyzing our results, we boiled our thoughts down to four reasons to opt for gesture over voice or touch, a little rubric to help us figure out which to use when:
- Speed: If it needs to be fast, gestures are much quicker than speaking sentences.
- Distance: If you need to communicate from across the room, gesture is easier than dealing with volume.
- Limited lexicon: If you don’t have a thousand things to say, gestures work well. The smaller the gesture set for a given context, the easier it is to remember. (Thumbs-up/thumbs-down, for example.)
- Expressiveness over precision: Gestures are well-suited to expressing emotional salience. A musical conductor communicates a downbeat and tempo, but also so much more: dolce, marcato, confidence, sadness, longing.
Now that voice commands don’t have to be the dominant interaction model, I’m curious to see which product categories will lead the way to take advantage of gesture’s natural advantages—subtlety, expressiveness, and speed—and how we might use gesture in unexpected ways. What kind of experience do you want to prototype with gesture? I’d love to hear your thoughts.
Many thanks to the rest of the team: Lisa Tacoronte, Todd Vanderlin, Jason Robinson, Danny DeRuntz, Brian Standeford, Eric Chan, and Ari Adler.