The music video for “Take On Me” by A-ha features a mix of sketched animation and live action. It was made by hand-drawing 3000 frames, which took 16 weeks to complete. The video ended up winning 6 awards at the 1986 MTV Video Music Awards.
This effect is pretty striking. But who wants to draw 3000 frames by hand? What if we could automate this process? And what if we could summon the ghost of Pablo Picasso to draw the frames for us? It turns out that we can.
About three months ago, Gatys, Ecker and Bethge from the University of Tübingen published A Neural Algorithm of Artistic Style. Their algorithm is able to transfer the visual style of one image onto another, by means of a pretrained deep convolutional neural network. The output images are phenomenal – in my opinion, this is one of the coolest things to come out of the field of machine learning in a while.
Convolutional Neural Networks decompose images into a hierarchy of image filters, where each image filter can be seen as detecting the presence or absence of a particular feature in the image. The lowest level filters detect things like edges and color gradients at different orientations, while the higher level filters detect compositions of the filters below, thus detecting more complex features.
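To make the idea of a low-level filter concrete, here is a minimal sketch of a single hand-written edge filter sliding over a toy image. The Sobel kernel and the 6x6 image are my own illustrative choices; a trained ConvNet learns its filters from data rather than having them written by hand.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the filter responses."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Response = how strongly this patch matches the filter pattern
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-written vertical-edge filter (Sobel), used here purely for illustration
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Toy 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

response = correlate2d_valid(image, sobel_x)
print(response[0])  # nonzero only where the window straddles the edge
```

The filter stays silent over the flat regions and fires only where dark meets bright, which is the "presence or absence of a feature" idea in miniature.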
The neural style transfer algorithm starts from the idea of texture generation. Gatys and gang had previously found that if you compute the correlations between feature maps, calculated as the dot product of each filter's activations with every other filter's activations, you get a matrix that says which features are in an image, but ignores where in the image they are. Two images have similar textures if their corresponding texture matrices are similar.
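The texture matrix described above can be sketched in a few lines of NumPy. This is only an illustration: the random `features` array stands in for real ConvNet activations, and dividing by the number of spatial positions is a normalization I picked for convenience.

```python
import numpy as np

def gram_matrix(features):
    """Texture matrix for one layer.

    features: array of shape (channels, height, width) holding the
    activations of each filter at each spatial position.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per filter
    # Entry (i, j) is the dot product of filter i's and filter j's activations.
    # Dividing by h * w is an assumed normalization, not part of the definition.
    return flat @ flat.T / (h * w)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))   # stand-in for real activations

# Scramble the spatial positions (the same way for every filter)
perm = rng.permutation(8 * 8)
shuffled = features.reshape(4, -1)[:, perm].reshape(4, 8, 8)

print(np.allclose(gram_matrix(features), gram_matrix(shuffled)))
```

Scrambling where the activations occur leaves the matrix unchanged, which is exactly the "what, but not where" property that makes it a texture descriptor.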
To synthesize a texture from an image, you run backpropagation all the way down to the pixels: starting with random pixels, you adjust the pixels (rather than the network's weights) so as to minimize the squared difference between the texture matrices of the generated image and those of the source image at each layer.
The next thing to do is to figure out a way to synthesize the content of a given image. This is simpler: just adjust the pixels so as to minimize the squared difference between the image filter activations themselves.
We now have a measure of texture similarity and a measure of content similarity. In order to do style transfer, we want to generate an image that has similar textures to one image, and similar content to another image. Since we know how to do that with the two squared differences we just defined, we can just minimize their sum. That is the neural style transfer algorithm.
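Here is a rough sketch of the two squared differences and their sum. The feature arrays are random stand-ins for real ConvNet activations, and the function names and the `content_weight` / `style_weight` knobs are my own additions for illustration (with both set to 1.0 you get the plain sum described above).

```python
import numpy as np

def gram_matrix(features):
    """Texture matrix of one layer; features has shape (channels, h, w)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)   # assumed normalization

def content_loss(gen_feats, content_feats):
    """Squared difference between the filter activations themselves."""
    return np.sum((gen_feats - content_feats) ** 2)

def style_loss(gen_layers, style_layers):
    """Squared difference between texture matrices, summed over layers."""
    return sum(np.sum((gram_matrix(g) - gram_matrix(s)) ** 2)
               for g, s in zip(gen_layers, style_layers))

def style_transfer_loss(gen_content_feats, content_feats,
                        gen_style_layers, style_layers,
                        content_weight=1.0, style_weight=1.0):
    """Weighted sum of the two measures (weights are hypothetical knobs)."""
    return (content_weight * content_loss(gen_content_feats, content_feats)
            + style_weight * style_loss(gen_style_layers, style_layers))

rng = np.random.default_rng(0)
content_feats = rng.standard_normal((3, 4, 4))
style_layers = [rng.standard_normal((3, 4, 4)), rng.standard_normal((5, 2, 2))]
gen_feats = rng.standard_normal((3, 4, 4))
gen_layers = [rng.standard_normal((3, 4, 4)), rng.standard_normal((5, 2, 2))]

loss = style_transfer_loss(gen_feats, content_feats, gen_layers, style_layers)
perfect = style_transfer_loss(content_feats, content_feats,
                              style_layers, style_layers)
print(loss, perfect)
```

In the real algorithm this number is what gets minimized by adjusting the pixels of the generated image; the sketch only shows how the objective is put together.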
The hierarchical image filters of a ConvNet have been shown in various ways to be similar to how vision works in human beings. Thus an appealing aspect of the style transfer algorithm is that it makes a quite concrete connection between our perception of artistic style, and the neurons in our brain.
Applying it to Video
Gene Kogan was the first I saw to have the idea of applying this style transfer algorithm to video. The most straightforward way would be to just run the algorithm on each frame separately. One problem with this is that the style transfer algorithm might end up styling successive frames in very different ways. In order to create smoother transitions between frames, Kogan blended in the stylized version of the previous frame at a low opacity. Check out his awesome rendering of a scene from Alice in Wonderland.
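The blending step itself is just a weighted average of two images. Below is a minimal sketch of that; the function name and the opacity of 0.25 are hypothetical, and it makes no claim about exactly where in Kogan's pipeline the blend happens.

```python
import numpy as np

def blend_previous(frame, prev_stylized, opacity=0.25):
    """Mix the previous stylized frame into `frame` at the given opacity.

    Both images are uint8 arrays of the same shape. The default opacity
    is an illustrative guess, not a value taken from Kogan's work.
    """
    mixed = ((1.0 - opacity) * frame.astype(np.float32)
             + opacity * prev_stylized.astype(np.float32))
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

# Two flat gray test images standing in for real frames
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
prev_stylized = np.full((4, 4, 3), 200, dtype=np.uint8)

blended = blend_previous(frame, prev_stylized)
print(blended[0, 0])
```

A higher opacity gives smoother video but more ghosting, since the old frame lingers in places where the scene has already moved on.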
One thing we can do to improve the blending of frames is to calculate the optical flow between the frames. That is, we try to figure out how things move between two frames in the video. We can then take this estimate of motion, and use it to bend and smudge the stylized image with the same motion before blending it in.
Luckily, such an optical flow calculation is included in OpenCV. I’ve uploaded some code on GitHub that takes care of computing flow between frames, morphing the stylized image and blending it in.
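To give a feel for the "bend and smudge" step, here is a small NumPy-only sketch that moves a stylized image along a given flow field. It is not the code from my repository: the function name is made up, it uses crude nearest-neighbour sampling, and it assumes the flow field is handed in with one (dx, dy) vector per pixel pointing back to where that pixel came from in the previous frame. In practice such a field could come from an estimator like OpenCV's `cv2.calcOpticalFlowFarneback`.

```python
import numpy as np

def warp_with_flow(prev_stylized, backward_flow):
    """Move `prev_stylized` along a flow field.

    prev_stylized: (H, W, C) image.
    backward_flow: (H, W, 2) array; for each pixel of the current frame,
        (dx, dy) points to where it came from in the previous frame
        (an assumed convention for this sketch).
    """
    h, w = backward_flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped at the image borders
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]), 0, h - 1).astype(int)
    return prev_stylized[src_y, src_x]

# Toy example: a single bright pixel, and a scene that moved 2 pixels right
prev_stylized = np.zeros((5, 6, 3), dtype=np.uint8)
prev_stylized[2, 1] = 255

backward_flow = np.zeros((5, 6, 2), dtype=np.float32)
backward_flow[..., 0] = -2.0   # every pixel came from 2 pixels to its left

warped = warp_with_flow(prev_stylized, backward_flow)
print(warped[2, 3])   # the bright pixel has followed the motion
```

The warped image can then be blended into the new frame, so the brush strokes travel with the objects instead of staying glued to the screen.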
The optical flow morphing takes a little bit of computation (~1 second), but it is absolutely dwarfed by the run time of the style transfer algorithm. Currently, it takes about 3 minutes to render a single frame at maximum resolution on a Titan X GPU. It uses all of the card's 12GB of memory to do so, and that is at sub-HD resolution. In any case, the effect is sweet enough to make it worth the wait.
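For a sense of scale, here is the back-of-the-envelope arithmetic, using the rough per-frame timings above. The 3000-frame count is borrowed from the “Take On Me” video purely as a hypothetical workload; it is not the length of any clip I rendered.

```python
STYLE_SECONDS_PER_FRAME = 3 * 60   # "about 3 minutes" per frame
FLOW_SECONDS_PER_FRAME = 1         # "~1 second" per frame

def render_hours(num_frames):
    """Total render time in hours for a clip of `num_frames` frames."""
    total_seconds = num_frames * (STYLE_SECONDS_PER_FRAME + FLOW_SECONDS_PER_FRAME)
    return total_seconds / 3600

hours = render_hours(3000)   # hypothetical 3000-frame clip
print(f"{hours:.1f} hours, or {hours / 24:.1f} days")
```

That works out to a bit over six days of GPU time, which still beats 16 weeks of drawing by hand.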
This is what it looks like when applied to what is a cult classic in certain circles, Bobby Meeks’ part from the 2003 snowboard movie “Lame”:
Here it is applied to the music video for “Islands”, by The xx:
Isn’t that just damn cool?