Painting Video with Neural Networks

[Image: take-on-me.jpg]

The music video for “Take On Me” by A-ha features a mix of sketched animation and live action. The animation was made by hand-drawing roughly 3,000 frames, a process that took 16 weeks to complete. The video went on to win six awards at the 1986 MTV Video Music Awards.

This effect is pretty striking. But who wants to draw 3000 frames by hand? What if we could automate this process? And what if we could summon the ghost of Pablo Picasso to draw the frames for us? It turns out that we can.

[Image: https://i.imgur.com/sb8dHcY.png]

A photo styled after various famous paintings, by a neural style transfer algorithm.

About three months ago, Gatys, Ecker and Bethge from the University of Tübingen published A Neural Algorithm of Artistic Style. Their algorithm transfers the visual style of one image onto another by means of a pretrained deep convolutional neural network. The output images are phenomenal – in my opinion, this is one of the coolest things to come out of the field of machine learning in a while.

Convolutional Neural Networks decompose images into a hierarchy of image filters, where each filter can be seen as detecting the presence or absence of a particular feature in the image. The lowest-level filters detect things like edges and color gradients at different orientations, while the higher-level filters respond to compositions of the filters below them, and thereby to more complex features.


Visualization of what features the image filters look for in the different layers of a ConvNet, along with images that cause their activation. See the paper for details: Visualizing and Understanding Convolutional Networks, Zeiler & Fergus 2013.
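
To make this concrete, here is a small sketch (my own illustration, not code from either paper) of pulling filter activations out of a few layers of a pretrained VGG-19 with PyTorch and torchvision. The layer indices and informal names are my own choices, and the torchvision weights API is assumed to be available.

```python
import torch
from torchvision import models

# Pretrained VGG-19, the kind of network used in the style transfer paper.
# We only need the convolutional part, and we never train it.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

# A few layers at increasing depth (indices into vgg19().features; names are informal).
LAYERS = {0: "conv1_1", 5: "conv2_1", 10: "conv3_1", 19: "conv4_1"}

def extract_features(image):
    """image: (1, 3, H, W) tensor, already normalized. Returns feature maps per layer."""
    features = {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            features[LAYERS[i]] = x
        if i >= max(LAYERS):
            break
    return features

# Deeper layers have more filters but lower spatial resolution.
for name, f in extract_features(torch.randn(1, 3, 224, 224)).items():
    print(name, tuple(f.shape))
```
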

Generating Textures

The idea behind the neural style transfer algorithm starts with texture generation. Gatys and gang had previously found that if you compute the correlations between a layer's feature maps – the dot products of each pair of image filter activations – you get a matrix that captures which features are present in an image while ignoring where in the image they appear. Two images have similar textures if their corresponding texture matrices are similar.
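
As an illustration (my own sketch; the paper calls this the Gram matrix), here is what that texture matrix looks like in code for a single layer's feature maps:

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: (C, H, W) array of one layer's filter activations.
    Returns a (C, C) matrix of dot products between pairs of flattened feature maps.
    Summing over spatial positions is what throws away the 'where' information."""
    C, H, W = feature_maps.shape
    F = feature_maps.reshape(C, H * W)
    return F @ F.T

def texture_distance(feats_a, feats_b):
    """Squared difference between two images' texture matrices at one layer."""
    return np.sum((gram_matrix(feats_a) - gram_matrix(feats_b)) ** 2)
```
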

To synthesize a texture from an image, you turn the usual training procedure on its head: instead of adjusting the network's weights, you start with random pixels and backpropagate all the way to the input, adjusting the pixels so as to minimize the squared difference between these texture matrices at each layer.


Textures synthesized to match the images in the bottom row. From top to bottom, as an increasing number of ConvNet layers is taken into account, the structure of the generated image matches its inspiration more and more closely. From Texture Synthesis Using Convolutional Neural Networks, Gatys, Ecker & Bethge 2015.
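
Here is a minimal sketch of that optimization loop in PyTorch, reusing extract_features from the VGG sketch above. texture_image is assumed to be a preprocessed (1, 3, H, W) tensor, and the learning rate and step count are arbitrary.

```python
import torch

def gram(f):
    """Gram matrix of a (1, C, H, W) feature tensor."""
    _, C, H, W = f.shape
    F = f.reshape(C, H * W)
    return F @ F.T

# Texture matrices of the image whose texture we want to reproduce.
target_grams = {name: gram(f).detach() for name, f in extract_features(texture_image).items()}

# Start from random pixels and adjust the pixels themselves by gradient descent.
x = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    opt.zero_grad()
    feats = extract_features(x)
    loss = sum(((gram(f) - target_grams[name]) ** 2).sum() for name, f in feats.items())
    loss.backward()  # the gradient flows through the network and back to the pixels
    opt.step()
```
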


The next thing to do is to figure out how to synthesize the content of a given image. This is simpler: just adjust the pixels so as to minimize the squared difference between the image filter activations themselves.

We now have a measure of texture similarity and a measure of content similarity. To do style transfer, we want to generate an image that has textures similar to one image and content similar to another. Since we know how to measure both with the two squared differences we just defined, we can simply minimize a weighted sum of the two. That is the neural style transfer algorithm.
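
Putting the two losses together, a sketch of the whole algorithm might look like this – again my own illustration, reusing extract_features and gram from the sketches above. content_image and style_image are assumed to be preprocessed tensors, and the layer choice and weights are illustrative, not the paper's exact settings.

```python
import torch

content_feats = {k: f.detach() for k, f in extract_features(content_image).items()}
style_grams = {k: gram(f).detach() for k, f in extract_features(style_image).items()}

# Starting from the content image (rather than noise) tends to converge faster.
x = content_image.clone().requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)
alpha, beta = 1.0, 1e3  # relative weight of content vs. style

for step in range(500):
    opt.zero_grad()
    feats = extract_features(x)
    content_loss = ((feats["conv4_1"] - content_feats["conv4_1"]) ** 2).sum()
    style_loss = sum(((gram(f) - style_grams[k]) ** 2).sum() for k, f in feats.items())
    (alpha * content_loss + beta * style_loss).backward()
    opt.step()
```
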

The hierarchical image filters of a ConvNet have been shown in various ways to be similar to how vision works in human beings. Thus an appealing aspect of the style transfer algorithm is that it makes a quite concrete connection between our perception of artistic style, and the neurons in our brain.

Applying it to Video

Gene Kogan was the first person I saw apply this style transfer algorithm to video. The most straightforward way would be to just run the algorithm on each frame separately. One problem with this is that the style transfer algorithm might end up styling successive frames in very different ways. In order to create smoother transitions between frames, Kogan blended the stylized version of the previous frame into each new frame at a low opacity. Check out his awesome rendering of a scene from Alice in Wonderland.
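
In code, that blending step is just a weighted average of two frames. A minimal sketch with OpenCV (my own reading of the approach, not Kogan's actual code; current_frame and prev_stylized are assumed to be same-sized uint8 images, and the opacity value is arbitrary):

```python
import cv2

alpha = 0.3  # opacity of the previous stylized frame
# Blend the previous frame's stylized output into the current frame before styling it,
# so that successive frames start from similar images.
init_frame = cv2.addWeighted(current_frame, 1.0 - alpha, prev_stylized, alpha, 0)
```
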

One thing we can do to improve the blending of frames is to calculate the optical flow between them. That is, we try to figure out how things move from one frame to the next. We can then take this estimate of motion and use it to bend and smudge the previous stylized frame with the same motion before blending it in.


Two successive video frames, and the optical flow between them. The bottom image shows the direction and magnitude of the motion from the first picture to the second. The color indicates direction, and the color saturation indicates magnitude.

Luckily, such an optical flow calculation is included in OpenCV. I’ve uploaded some code on GitHub that takes care of computing flow between frames, morphing the stylized image and blending it in.
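
The repo has the real implementation; as a rough sketch of the idea (not the repo's exact code), the flow computation and warping with OpenCV look something like this. The Farneback parameters are just the usual values from the OpenCV examples.

```python
import cv2
import numpy as np

def warp_with_flow(prev_stylized, prev_frame, cur_frame):
    """Warp the previous stylized frame so it lines up with the current frame.
    All inputs are assumed to be same-sized uint8 BGR images."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)

    # Dense flow from the current frame back to the previous one:
    # for each pixel in the current frame, where did it come from?
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_stylized, map_x, map_y, cv2.INTER_LINEAR)

# The warped frame can then be blended into the current frame, as before:
# init = cv2.addWeighted(cur_frame, 1 - alpha, warp_with_flow(prev_stylized, prev_frame, cur_frame), alpha, 0)
```
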

The optical flow morphing takes a bit of computation (~1 second per frame), but that is absolutely dwarfed by the run time of the style transfer algorithm itself. Currently, it takes about 3 minutes to render a single frame at the maximum resolution that fits on a Titan X GPU – and that uses all 12GB of its memory while still being sub-HD. In any case, the effect is sweet enough to make it worth the wait.

This is what it looks like when applied to what is a cult classic in certain circles, Bobby Meeks’ part from the 2003 snowboard movie “Lame”:

Here it is applied to the music video for “Islands”, by The xx:

Isn’t that just damn cool?

12 thoughts on “Painting Video with Neural Networks”

  1. Is there a way to approximate the effect so it can be done in real time? For example, you're familiar with the pixelization effect used to obscure the nasty private bits in video streams? Instead of a color-averaged large block, one might use a ‘neural style’ thumbnail whose color average matches the calculated average of an 8×8 pixel sample in the image. So instead of getting a solid blob of color you get a neural pattern.

  2. The short answer is yes, and you don’t need to approximate. Since I wrote this, a lot of work has been published on how to speed up neural style transfer, to the extent that it can now run in real time on a phone. People have implemented better uses of optical flow as well – see Manuel Ruder’s work, for example: https://github.com/manuelruder/artistic-videos

  3. Wow, that such progress was made in so little time. If AI exceeds human cognition, that kind of progress might take a second. Anyway, I was wondering how to use the tiles generated by the hidden layers as components. Something like what this fellow achieved

