Mona Lisa’s video

For centuries, people have wondered about Mona Lisa’s smile. Now they can stop wondering and just watch her videos.

A group of AI researchers published a paper titled “Few-Shot Adversarial Learning of Realistic Neural Talking Head Models”, which describes a new algorithm for generating videos of people’s heads (talking head models). Methods for producing talking head models with generative adversarial networks (GANs) had been published before. A GAN is essentially two neural networks combined into one system: a generator trained to produce realistic samples, and a discriminator trained to distinguish real samples from generated ones.
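To make the generator/discriminator split concrete, here is a minimal GAN training loop, sketched in PyTorch. The layer sizes and data shape are hypothetical placeholders for illustration; this is not the paper’s architecture.

```python
# Minimal GAN training loop sketch (illustrative only, not the paper's model).
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # hypothetical sizes, e.g. flattened 28x28 images

# Generator: maps random noise to a synthetic sample.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
# Discriminator: scores how "real" a sample looks.
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    noise = torch.randn(batch_size, latent_dim)

    # 1) Train the discriminator to tell real samples from generated ones.
    fake_batch = G(noise).detach()  # detach: don't update G on this pass
    d_loss = bce(D(real_batch), torch.ones(batch_size, 1)) + \
             bce(D(fake_batch), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    g_loss = bce(D(G(noise)), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The two losses pull against each other: the discriminator gets better at spotting fakes, which forces the generator to produce ever more realistic samples.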

However, the existing GAN-based methods required long videos or large sets of photographs of each subject for training, and they relied on various warping techniques; for an overview, read the introduction of the “Facial Animation System Based on Image Warping Algorithm” study.

The paper above describes a new way of producing talking heads from just a few training examples, possibly even a single photograph. Instead of warping, it synthesizes frames directly. This approach is called few-shot learning, and it relies on models pre-trained on a large number of videos of different people in different situations. A critical part of that training is the identification of facial landmarks: eyes, nose, mouth, and chin.
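As an illustration of what landmark identification looks like in practice, here is a short sketch using the dlib library and its standard 68-point shape predictor. The model file name is dlib’s usual pre-trained download, and the image path is an assumption; the paper itself does not prescribe this tooling.

```python
# Sketch: extracting 68 facial landmarks with dlib (assumes the standard
# pre-trained "shape_predictor_68_face_landmarks.dat" model file is available).
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image):
    """Return a list of (x, y) landmark points for the first detected face."""
    faces = detector(image, 1)          # upsample once to find smaller faces
    if not faces:
        return None
    shape = predictor(image, faces[0])  # 68 points: jaw, brows, nose, eyes, mouth
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]

# Example usage ("photo.jpg" is a hypothetical path):
points = extract_landmarks(dlib.load_rgb_image("photo.jpg"))
```

These sparse points capture pose and expression while discarding identity, which is exactly what makes them useful as a driving signal.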

The results of the new research are summarized in a five-minute video that shows how a properly trained GAN can produce short talking head videos from still images. A talking head created from the Mona Lisa painting is particularly impressive: it was driven by three different human models, and the resulting differences in facial expression are easy to recognize. Synthesizing a video of one person from the facial landmarks of another is called puppeteering.
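In pseudocode, the puppeteering pipeline boils down to three steps. The function names below are hypothetical placeholders standing in for the paper’s components (an identity embedder and a landmark-conditioned generator), not its actual API; `extract_landmarks` is the sketch from above.

```python
# Conceptual puppeteering loop (hypothetical function names, illustrative only).
def puppeteer(target_photos, driver_frames):
    # 1) Embed the target's identity from a handful of photos (few-shot).
    identity = embed_identity(target_photos)            # hypothetical embedder

    output = []
    for frame in driver_frames:
        # 2) Extract the driver's facial landmarks for this frame.
        landmarks = extract_landmarks(frame)
        # 3) Synthesize the target's face in the driver's pose and expression.
        output.append(generate_frame(identity, landmarks))  # hypothetical generator
    return output
```

Because the identity and the motion come from different people, swapping the driver video is enough to give the same target a completely different “personality.”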

Mona Lisa animated using three different driving videos, giving her three distinct personalities. Courtesy of Zakharov et al.

Talking head videos could be combined with the latest NLP improvements that I described in an earlier post, producing highly realistic fake videos and text. If you were concerned about the proliferation of deepfakes before reading this post, this will only heighten your fears. How are the authors of the few-shot adversarial learning algorithm responding to those concerns? The statement below their YouTube video (linked above) reads: “Shifting a part of human life-like communication to the virtual and augmented worlds will have several positive effects. It will lead to a reduction in long-distance travel and short-distance commute. It will democratize education, and improve the quality of life for people with disabilities.” Noble enough. But considering that the researchers are from Russia, and given Russia’s proven track record of meddling in recent US and EU elections, it is not far-fetched to assume that high-quality deepfakes will be common soon.

How soon? Let’s look at the progress of images generated by GANs over the past five years.

GAN results throughout the years. Note that none of the faces above belongs to a real person. Courtesy of Gidi Shperber.

I’ll let you extrapolate the progress into the future.
