You’ve probably never heard of a GAN, but you’ve likely read stories about what they can do. From generating images of fake celebrities to creating training data for self-driving cars, and even recreating suggestive (and weird) nude portraits, GANs are everywhere right now. But what exactly is this bleeding-edge technology, and what can we do with it?
A GAN (Generative Adversarial Network) is a type of AI that pits two neural networks against each other, each one pushing the other to improve. Unlike other types of AI, GANs don’t just recognize objects—they can create them. One network, the “generator,” produces synthetic images, while the other, the “discriminator,” tries to guess whether the images it sees are synthetic or real. The discriminator learns from both synthetic and real images, getting better over time. As it improves, it challenges the generator to produce ever more realistic synthetic images.
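That alternating competition can be sketched in a few lines of Python. The toy example below is purely illustrative and not the network from this post: a linear generator and a logistic-regression discriminator play the adversarial game over a one-dimensional Gaussian, just to show how the two updates take turns.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data: samples from N(4, 1). The generator must learn to mimic this.
def real_batch(n):
    return rng.normal(4.0, 1.0, n)

# Generator: x = a*z + b (linear map on noise z).
# Discriminator: D(x) = sigmoid(w*x + c), its guess that x is real.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for _ in range(3000):
    # --- Discriminator step: raise D on real samples, lower it on fakes ---
    xr = real_batch(batch)
    z = rng.normal(0.0, 1.0, batch)
    xf = a * z + b
    sr, sf = sigmoid(w * xr + c), sigmoid(w * xf + c)
    w += lr * np.mean((1 - sr) * xr - sf * xf)
    c += lr * np.mean((1 - sr) - sf)

    # --- Generator step: move fakes toward where D currently says "real" ---
    z = rng.normal(0.0, 1.0, batch)
    xf = a * z + b
    sf = sigmoid(w * xf + c)
    a += lr * np.mean((1 - sf) * w * z)
    b += lr * np.mean((1 - sf) * w)

# After training, fake samples should cluster near the real mean of 4.
fakes = a * rng.normal(0.0, 1.0, 1000) + b
```

With images instead of scalars, both players become deep convolutional networks, but the back-and-forth structure of the loop is the same.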
I recently took a bite-sized sabbatical from my day job as the data-driven intelligence portfolio lead at IDEO CoLab to dive headfirst into building GANs and exploring their visual potential. I study data and machine learning to understand their potential for humans, design, and business, so naturally I wanted to get better acquainted with GANs and all they might do. In this post, I’ll walk you through what I learned in the process of building with them, and how to set up your first GAN.
The importance of your dataset
GANs can generate convincing but fake pictures of almost anything: faces, cats, and even anime characters. How do they do it? It all depends on the dataset. GANs need to be fed a large set of images (think tens of thousands or more) so they can learn patterns. For example, GANs that produce faces learn that faces tend to be roughly round, have two eyes, one nose, one mouth, and possibly hair on top. They learn pixel-level patterns, too: the textures that make up skin, eyes, and hair.
The trick with GANs is finding sets of images large enough and diverse enough for a network to pick out patterns. There are plenty out there—like CelebA (hundreds of thousands of celebrity faces) or LSUN (images of scenes like rooms and buildings)—but these are all meant for research, and training on them mostly reproduces results that have already been demonstrated.
Instead of going the easy route, what if we stretch GANs to their limits to better understand how they work, using a practically endless alternative source of training data: video? In this case, we’re using frames from The Simpsons.
Here’s how we did it.
1. Set up the machine to run TensorFlow on a GPU
TensorFlow is an open-source machine learning framework that can use a machine’s GPU to accelerate training. I followed this guide to install TensorFlow with GPU support on a Linux machine with a hefty Nvidia GeForce GTX graphics card originally bought for virtual reality and gaming. TensorFlow also supports Mac, but unfortunately no longer supports GPU acceleration there, which ruled out using my laptop.
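Before kicking off a multi-day training run, it’s worth confirming that TensorFlow actually sees the GPU. A small check like the one below works with TensorFlow 2.x (the try/except just keeps it from crashing on machines without TensorFlow installed):

```python
def gpu_status():
    """Report whether TensorFlow can see a GPU, without crashing if it can't."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        return f"found {len(gpus)} GPU(s), e.g. {gpus[0].name}"
    return "no GPU visible; training will fall back to the (much slower) CPU"

print(gpu_status())
```

If this reports no GPU, check your CUDA/cuDNN installation before blaming the model—silent CPU fallback is a common way to lose a night of training time.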
2. Turn video into still frames
I used a command-line tool called ffmpeg to convert mp4 video into png images, and another tool called imagemagick to resize the images in batches. (You can find handy code snippets for using these here and here.) GANs are large neural networks, and the higher the resolution of your images, the more GPU memory you need (and the slower training progresses). I found that images whose long dimension is around 512 pixels strike a good balance between resolution and training speed. GPUs with less memory may need a lower resolution.
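The exact invocations aren’t shown in this post, so here is one hypothetical way to script the extraction step from Python using standard ffmpeg filters. As an assumption for this sketch, the resize is folded into ffmpeg’s `scale` filter instead of a separate imagemagick pass; `scale=512:-1` makes the frames 512 pixels wide and lets the height follow automatically, preserving aspect ratio.

```python
import subprocess

def extract_frames_cmd(video, out_dir, fps=2, width=512):
    """Build an ffmpeg command that samples `fps` frames per second from
    `video` and writes them as PNGs resized to `width` pixels wide
    (the -1 lets ffmpeg pick the height to keep the aspect ratio)."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps},scale={width}:-1",
        f"{out_dir}/frame%06d.png",
    ]

cmd = extract_frames_cmd("simpsons.mp4", "frames")
# subprocess.run(cmd, check=True)  # uncomment on a machine with ffmpeg installed
```

Sampling a couple of frames per second, rather than every frame, keeps the dataset diverse—adjacent frames are nearly identical and add little for the network to learn from.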
3. Train and collect the network’s progress
I experimented with a number of different GAN architectures, loss functions, and implementations on GitHub. Ultimately I created my own GitHub implementation that’s easy to use and has tons of comments in the code. It’s based on the popular and helpful DCGAN repository.
I trained the network overnight (sometimes over multiple days) and saved examples of images created by the generator network. These images document the generator’s learning process. The generator is first initialized with random weights (the strengths of the connections between neurons in the network), so it outputs random, noisy images. With time, the images look more and more like The Simpsons.
Finally, I again used ffmpeg to turn the output still frames back into a movie, to really see the progress of training over time.
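Going the other direction can be scripted the same way. This hypothetical sketch uses ffmpeg’s `-framerate` input option to stitch numbered stills into an mp4; `-pix_fmt yuv420p` is included because many video players can’t decode the pixel format ffmpeg would otherwise pick for PNG input.

```python
import subprocess

def frames_to_movie_cmd(frames_pattern, out_file, fps=24):
    """Build an ffmpeg command that assembles numbered PNG frames
    (e.g. 'samples/frame%06d.png') into an mp4 at `fps` frames per second."""
    return [
        "ffmpeg", "-framerate", str(fps),
        "-i", frames_pattern,
        "-pix_fmt", "yuv420p",
        out_file,
    ]

cmd = frames_to_movie_cmd("samples/frame%06d.png", "training_progress.mp4")
# subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```

Playing the generator’s saved samples back at 24 frames per second compresses days of training into a few minutes of video, which makes the learning progression easy to see at a glance.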
Scrub through the video above to see the network’s progress. At 1:10 (about 13 hours of training), the first outlines of Simpsons eyes appear. At 2:00 (about 23 hours), the color scheme becomes noticeably brighter and figures start to emerge. At 3:00 (about 34 hours), there is less variation as shapes solidify and the networks settle on what counts as “real.” At 5:00 (about 57 hours) and beyond, the shapes start to vaguely resemble Simpsons characters, though the result is still too psychedelic and abstract to pass for real Simpsons frames.
Making sense of the output
The GAN manages to generate very Simpsons-esque images, but they’re definitely not going to fool anyone. What’s going on? GANs (and any neural network or machine learning method, for that matter) only learn patterns that are in the training data. As humans, we have a deep understanding of what people and objects should look like, including Simpsons characters and scenes in the show. The GAN, however, doesn’t bring any preexisting knowledge. It has to learn everything from the raw pixel values of the images it sees, and more complex patterns take more examples to learn.
In this case, the GAN easily learns the color palette of The Simpsons and its simple, flat illustration style, and starts to learn the features of each character. With enough training data, it may learn more complex patterns, like complete images of characters, or which characters show up in which settings (like the Simpson family at home, or Mr. Burns at the power plant).
The video below is an example of the film Spirited Away run through the GAN. Notice how the colors, textures, shapes, and evolution are completely different from The Simpsons output.
Running other films through a GAN gives an interesting sense of each one’s visual complexity. 2001: A Space Odyssey, for example, has much higher visual complexity and far fewer training images (since it’s a three-hour movie rather than a TV show with hundreds of hours of episodes). The GAN output is much less coherent than The Simpsons output, but still retains some very 2001-like features. From time to time, you can spot a HAL 9000 eye, spacecraft instrument panels, and glimpses of the desert opening sequence. What’s interesting about this network is that we even start to see a bit of mode collapse at 9:08, where the generator network stops producing a wide variety of samples.
What’s next for GANs
The varied and strange outputs of these networks are a succinct example of both the promise and risks of building products, services, or systems with machine learning inside. They’re seemingly magical, learning from patterns in data without being told what to look for, but they’re also limited by that data. Imagine you came into this world yesterday and had to make sense of it all just from still frames of The Simpsons—how would that inform your thinking? Biases in training data, intended or not, will be reflected in the output of the network, which, if embedded into a product or service, can codify the bias at an unprecedented scale. As designers, we are uniquely positioned to help with this problem. As Kate Crawford, co-founder of the AI Now Institute, has said, the problem is neither a purely social nor a purely technical one: it’s a socio-technical problem. Designers have a responsibility to understand the unique role data plays in machine learning, so that we can create networks that equitably serve human needs.
Special thanks to CoLab’s Adam Lukasik for his help wrangling GANs for this post.