Recently, an IDEO colleague suggested that the teams that laugh the most are also the teams that are the most successful. I loved this idea, and realized that it might be possible to build an actual laugh detector to test the hypothesis. I decided to go for it.
For me, the first step in any machine learning (ML) project is conducting a literature review, which is just a fancy academic term for “Googling stuff.” ML problems can be finicky, and it’s difficult to predict ahead of time which approach will be successful. Learning from the work of others is a massive boost, particularly when the goal is to move rapidly. When I was in graduate school, a friend of mine would always say, “A month working in the lab can save you a day in the library.”
My search turned up a lot of great work, including a company that lets you use emotion to control software, a room that monitors your mood and reacts to it, and a technique that recognizes emotion using wireless signals (seriously). I also discovered more than a few laughter detection techniques, including some that required the application of electrodes to the chest and face, and an advanced system that fuses several input data streams to provide real-time detection.
These all provided some pretty great inspiration, but the best resource I discovered was not actually related to laughter or emotion. Audioset is one of the largest publicly available datasets of labeled audio clips, with more than two million clips annotated with labels from more than 500 categories. In addition to the data, the Audioset team has also released a pre-trained model and architecture for generalized audio classification that can be used as a feature vectorizer for raw audio input.
At their best, machine learning algorithms are only capable of reproducing patterns that exist in the training set, so it’s important to consider any existing structure or bias that could fundamentally characterize the way that the trained algorithm behaves during inference.
I decided to use Audioset as the basis of my laughter detection algorithm, largely because of its accessibility and size. However, the entire dataset was too large and diverse for my problem; I really only cared about detecting one kind of noise. So I created my own subset of the Audioset data that contained all the examples of laughter, along with an equal number of examples that did not contain laughter. I limited the non-laughter examples to human sounds. I skipped sounds like musical instruments, farm animals, and explosions, since we’re unlikely to record them in a project space. Including those sounds in the training set might lead to an algorithm that achieves a high accuracy during training, but underperforms during inference.
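As a rough sketch of that subsetting step (not the code from this project), the selection could look like the following. The label names, the `HUMAN_SOUNDS` set, and the `(clip_id, labels)` layout are assumptions made for illustration; the real Audioset metadata uses its own identifiers.

```python
import random

# Hypothetical label names, chosen for illustration only.
LAUGHTER = "Laughter"
HUMAN_SOUNDS = {"Speech", "Cough", "Sneeze", "Whispering"}


def build_balanced_subset(clips, seed=0):
    """Split clips into laughter examples and sampled human-sound negatives.

    `clips` is a list of (clip_id, set_of_labels) pairs. All laughter clips
    are kept. Negatives are sampled to match their number, or fewer if there
    are not enough candidates.
    """
    positives = [clip for clip in clips if LAUGHTER in clip[1]]
    # Assumption: a negative must carry only human-sound labels, so clips
    # tagged with instruments, animals, or explosions are skipped.
    candidates = [
        clip for clip in clips
        if clip[1] and clip[1] <= HUMAN_SOUNDS
    ]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    count = min(len(positives), len(candidates))
    negatives = rng.sample(candidates, count)
    return positives, negatives


if __name__ == "__main__":
    demo = [
        ("a", {"Laughter"}),
        ("b", {"Speech"}),
        ("c", {"Guitar"}),
        ("d", {"Cough"}),
    ]
    pos, neg = build_balanced_subset(demo)
    print(len(pos), len(neg))
```

Keeping the two classes the same size means a model that always guesses "no laughter" scores about 50% accuracy, which makes the accuracy numbers easier to interpret.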
After creating a training dataset, I got started on the machine learning challenge. The Audioset data does not contain the extracted raw audio, only a 128-length feature vector for each second of input, which is created using a convolutional network operating on the raw audio spectrogram. These feature vectors will be the input for the machine learning algorithm, and the output will be a binary label of whether the input contained laughter. Since I am working with a relatively large dataset, I wanted to use some deep learning methods, so I turned to Keras, an API that makes creating and training neural networks extremely simple, while still leveraging the computational efficiency of optimized tensor computation libraries like TensorFlow and Theano.
Since the model input is sequential, I wanted to try a recurrent neural network first. I started with a single-layer LSTM model that quickly converged to 87% accuracy. Learning from the code in a very similar project, I found that applying batch normalization to the LSTM input was very important for getting the model to converge. One of the headaches of deep learning is that seemingly trivial details like this can have a large impact on convergence. I was happy with the performance of my first LSTM model, but I decided to try out a couple more options before moving on. I tried a 3-layer LSTM (because if one layer is good, three must be better), then I tried a simple logistic regression model to see if the fancy RNN architecture made a difference. The models performed with 88% and 86% accuracy, respectively, which showed me that, in this case, more layers did not equate to more power.
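To make the batch normalization step more concrete, here is a minimal sketch of the core computation in plain Python: each of the feature dimensions is rescaled to zero mean and unit variance across every second of every clip in the batch. This is an illustration of the idea only, not the project's code. A full batch normalization layer (such as the one in Keras) also learns a scale and an offset and tracks running statistics for use at inference, all of which are left out here. The function name and the value of `eps` are assumptions.

```python
import math


def batch_normalize(batch, eps=1e-3):
    """Standardize each feature over all frames of all clips in a batch.

    `batch` is a list of clips, each clip a list of per-second feature
    vectors (lists of floats). The nested shape is preserved.
    """
    # Pool every frame from every clip so statistics cover batch and time.
    frames = [frame for clip in batch for frame in clip]
    columns = list(zip(*frames))  # one tuple of values per feature
    means = [sum(col) / len(col) for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / len(col)
        for col, m in zip(columns, means)
    ]
    scales = [math.sqrt(var + eps) for var in variances]
    return [
        [
            [(v - m) / s for v, m, s in zip(frame, means, scales)]
            for frame in clip
        ]
        for clip in batch
    ]


if __name__ == "__main__":
    # Two clips, two seconds each, two features on very different scales.
    demo = [[[1.0, 100.0], [3.0, 300.0]], [[5.0, 500.0], [7.0, 700.0]]]
    for clip in batch_normalize(demo):
        print(clip)
```

After this step both features span a similar range, which is one plausible reason the recurrent model trains more easily on normalized input.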
Ultimately, I ended up choosing the single-layer LSTM, even though it was larger than the logistic regression model, because it was able to handle variable-length input. The training data was all sequences that were 10 seconds long, but I wanted the laughter detector to be able to respond more quickly than that. While the logistic regression could only operate on input that was the same length as the training data, the LSTM model can still operate on any input length, as long as it is split into one-second chunks.
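The difference comes down to where the parameters live. The toy sketch below uses a deliberately simplified recurrent cell with a single scalar state (not an LSTM, and not trained) next to a logistic regression over the flattened clip. All names and weights are hypothetical, and the feature size is shrunk from 128 to 4 to keep the example short.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def recurrent_score(frames, w_in, w_rec, w_out):
    """Toy recurrent scorer: the same weights are reused for every frame,
    so the number of one-second frames can be anything."""
    state = 0.0
    for frame in frames:
        state = math.tanh(dot(w_in, frame) + w_rec * state)
    return sigmoid(w_out * state)


def logistic_score(frames, weights):
    """Logistic regression on the flattened clip: one weight per
    (second, feature) pair, so the clip length is baked into the model."""
    flat = [value for frame in frames for value in frame]
    if len(flat) != len(weights):
        raise ValueError("input length does not match the trained weights")
    return sigmoid(dot(weights, flat))


if __name__ == "__main__":
    dim = 4  # stand-in for the 128-length feature vectors
    ten_seconds = [[0.1] * dim for _ in range(10)]
    three_seconds = ten_seconds[:3]

    w_in = [0.5] * dim
    print(recurrent_score(ten_seconds, w_in, 0.3, 1.0))
    print(recurrent_score(three_seconds, w_in, 0.3, 1.0))  # still works

    lr_weights = [0.01] * (10 * dim)  # sized for 10-second clips
    print(logistic_score(ten_seconds, lr_weights))
    try:
        logistic_score(three_seconds, lr_weights)
    except ValueError as err:
        print("logistic regression rejected the short clip:", err)
```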
To take the trained model and run it as a live laugh detector, I put together a Python script that pulled together a few different steps. First, the audio is captured from the microphone using pyaudio and chunked into three-second clips. (The length of these clips is an adjustable parameter.) The raw audio is then fed into the pre-trained VGGish network and converted to a sequence of feature vectors with the same parameters I used for creating the Audioset training data. The sequence is then fed into the trained LSTM model, and the laughter prediction score is returned and written to a timestamped .csv file. By default, the raw audio is then discarded, and only the score is saved. That way, there are no concerns about storing or transmitting anyone’s private conversations. By running the audio capture and processing in separate threads, I was able to run this continuously without any lag on my laptop.
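The two-thread structure can be sketched with the standard library alone. In this illustration the microphone is replaced by a list of pre-made chunks and the VGGish-plus-LSTM scoring is replaced by a placeholder function, so `fake_score`, the function names, and the CSV column names are all assumptions and not taken from the project's code.

```python
import csv
import io
import queue
import threading
import time


def capture_loop(chunks, audio_queue):
    """Producer: stand-in for the microphone capture thread."""
    for chunk in chunks:
        audio_queue.put(chunk)
    audio_queue.put(None)  # sentinel: no more audio


def process_loop(audio_queue, score_fn, out_file):
    """Consumer: score each chunk and log only the timestamp and score."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "laugh_score"])
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        score = score_fn(chunk)
        # Only the score is written; the raw audio chunk is never saved.
        writer.writerow([f"{time.time():.3f}", f"{score:.3f}"])


def run_pipeline(chunks, score_fn, out_file):
    """Run capture and processing in separate threads until audio ends."""
    audio_queue = queue.Queue()
    producer = threading.Thread(target=capture_loop, args=(chunks, audio_queue))
    consumer = threading.Thread(
        target=process_loop, args=(audio_queue, score_fn, out_file)
    )
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()


def fake_score(chunk):
    """Placeholder for feature extraction plus the trained model:
    mean absolute amplitude, capped at 1.0."""
    return min(1.0, sum(abs(sample) for sample in chunk) / len(chunk))


if __name__ == "__main__":
    buffer = io.StringIO()
    run_pipeline([[0.0, 0.0], [0.5, -0.5], [2.0, 2.0]], fake_score, buffer)
    print(buffer.getvalue())
```

The queue decouples the two loops: capture never waits on a slow model call, and the processing thread simply works through whatever has accumulated.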
I then prototyped two different ways to visualize the output of the laugh detector. First, I built a dashboard inspired by Fitbit (using the Dash library) that aggregates and plots the output from the .csv file. I also added an option to the live processing script that can control a Philips Hue bulb using the phue library.
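The aggregation behind a dashboard like this can be quite small. The sketch below counts above-threshold scores per hour from rows of the score log. The column names `timestamp` and `laugh_score`, the 0.5 threshold, and the function name are assumptions for illustration, and the plotting itself is left out.

```python
import csv
from collections import Counter
from datetime import datetime, timezone


def laughs_per_hour(rows, threshold=0.5):
    """Count rows whose score meets the threshold, grouped by UTC hour.

    `rows` is any iterable of dicts with `timestamp` (Unix seconds) and
    `laugh_score` keys, such as a csv.DictReader.
    """
    counts = Counter()
    for row in rows:
        if float(row["laugh_score"]) >= threshold:
            moment = datetime.fromtimestamp(
                float(row["timestamp"]), tz=timezone.utc
            )
            counts[moment.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


if __name__ == "__main__":
    log_lines = [
        "timestamp,laugh_score",
        "0,0.9",
        "1800,0.2",
        "3000,0.7",
        "3600,0.8",
    ]
    print(laughs_per_hour(csv.DictReader(log_lines)))
```

Note that this counts scored clips, not distinct laughs; one long laugh spanning several clips would be counted more than once.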
The only thing that was left was to actually run the detector and count up the laughs. I left it running in a project space for a few days (I let everyone know I was recording first) and was able to see how often we actually laughed. Turns out, we weren’t having a funny week, so the laugh count was pretty low.
If you want to run your own laugh detector or use this as a starting point, I’ve put all my code online. If you use it to build something cool, or if you just want to chat about it, feel free to reach out and say hi!
Read more about this project on the IDEO blog. Header illustration by Brian Standeford.