Mona Lisa’s video

For centuries, people have wondered about Mona Lisa’s smile. Now they can stop wondering and just watch her videos.

A group of AI researchers published a paper titled “Few-Shot Adversarial Learning of Realistic Neural Talking Head Models”, where they describe a new algorithm for generating videos of people’s heads (talking head models). Methods for producing talking head models using generative adversarial networks (GANs) had been published before. A GAN is essentially two neural networks combined into one system: one network (the generator) is trained to produce samples, while the second (the discriminator) is trained to distinguish generated samples from real ones.
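
To make the generator/discriminator split concrete, here is a minimal GAN training step in TensorFlow. It is only a sketch: the layer sizes, the flattened 28x28 image shape, and the optimizer settings are illustrative assumptions, not the architecture from the paper.

```python
import tensorflow as tf

LATENT_DIM = 64  # assumption: size of the random noise vector

# Generator: maps random noise to a sample (here, a flattened 28x28 image).
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])

# Discriminator: scores a sample; higher logit means "looks real".
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: label real samples 1, generated samples 0.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        # Generator: fool the discriminator into labeling fakes as 1.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```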

However, these existing methods required long videos or large sets of photographs of each subject to train the GAN. They also relied on various warping techniques; for an overview, read the introduction of the “Facial Animation System Based on Image Warping Algorithm” study.

The paper above describes a new way of producing talking heads from just a few training examples, possibly even a single photograph. Instead of warping, it synthesizes frames directly. This approach is called few-shot learning and relies on models pre-trained on a large number of videos of different people in different situations. In those models, a critical part of training is the identification of face landmarks: eyes, nose, mouth, chin, and so on.
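
Landmark extraction itself is a well-established step. The sketch below uses the dlib library’s 68-point predictor as a stand-in; the paper uses its own landmark tracker, so treat the library choice and the model file name as assumptions for illustration.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# 68-point landmark model, downloadable from the dlib model zoo.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image):
    """Return a (68, 2) array of (x, y) landmark coordinates, or None.

    `image` is a grayscale or RGB numpy array.
    """
    faces = detector(image, 1)  # upsample once to catch smaller faces
    if not faces:
        return None
    shape = predictor(image, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```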

The results of the new research are summarized in a five-minute video that shows how the properly trained GAN can produce short talking-head videos from still images. A talking head created from the Mona Lisa painting is particularly impressive: it was driven by three different human models, and the differences between the three resulting facial expressions are easily recognizable. Synthesizing a video of one person based on the face landmarks of a different person is called puppeteering.

Mona Lisa animated with three different driving videos, giving her three distinct personalities. Courtesy of Zakharov et al.
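
Conceptually, the puppeteering pipeline can be sketched in a few lines. The three helpers below are hypothetical placeholders standing in for the paper’s trained networks (the landmark extractor could be the dlib helper sketched earlier); they return dummy values only to keep the sketch self-contained.

```python
import numpy as np

def embed_identity(photos):
    """Placeholder embedder: in the paper, a network maps 1..K photos of the
    target person to an identity embedding."""
    return np.zeros(512)  # dummy embedding vector

def extract_landmarks(frame):
    """Placeholder tracker: returns face landmark coordinates for a frame."""
    return np.zeros((68, 2))  # dummy 68-point landmark set

def render_frame(identity, landmarks):
    """Placeholder generator: synthesizes the target person posed according
    to the driver's landmarks."""
    return np.zeros((256, 256, 3))  # dummy RGB frame

def puppeteer(target_photos, driver_frames):
    # Few-shot step: build the target identity from as little as one photo.
    identity = embed_identity(target_photos)
    # Each driver frame steers the target's expression and pose.
    return [render_frame(identity, extract_landmarks(f)) for f in driver_frames]
```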

Talking-head videos could be combined with the latest NLP improvements that I described in an earlier post, creating highly realistic fake videos and text. If you were concerned about the proliferation of deepfakes before reading this post, this will only heighten your fears. And how are the authors of the few-shot adversarial learning algorithm described above responding to those concerns? The statement below their YouTube video that I linked above reads: “Shifting a part of human life-like communication to the virtual and augmented worlds will have several positive effects. It will lead to a reduction in long-distance travel and short-distance commute. It will democratize education, and improve the quality of life for people with disabilities.” Noble enough. But considering that the researchers are from Russia, and Russia’s proven track record of meddling in recent US/EU elections, it is not far-fetched to assume that high-quality deepfakes will be common soon.

How soon? Let’s look at the progress of images generated by GANs over the past five years.

GAN results over the years. Please note that none of the above images shows a real person. Courtesy of Gidi Shperber.

I’ll let you extrapolate the progress into the future.

Dangers of NLP

Natural language processing (NLP) continues its rapid advance, leading some people to fear its latest results.

The research organization OpenAI published a blog post titled “Better Language Models and Their Implications” summarizing its progress on “predicting the next word, given all of the previous words within some text”. OpenAI calls its latest model GPT-2. Some samples of the generated text are of very high quality, while others show that further improvements are needed. What makes GPT-2 unique is that it was not trained on domain-specific datasets, as is typical for most NLP models. Rather, GPT-2 was trained on text from outbound Reddit links that readers rated highly, as measured by karma. OpenAI scraped 40GB of text from the Internet this way to use as training and testing data.

Human-written text (first two lines) and GPT-2-generated continuation. Sample published in OpenAI’s blog post “Better Language Models and Their Implications”.
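
The quoted phrase “predicting the next word, given all of the previous words” translates almost literally into a loop. The sketch below runs it with the small, publicly released GPT-2 via the Hugging Face transformers library; the library choice and the greedy decoding are my assumptions, not OpenAI’s setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # small released model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Encode a short human-written prompt.
ids = tokenizer.encode("The Mona Lisa is famous because", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):                      # generate 20 more tokens
        logits = model(ids).logits           # scores for every vocabulary token
        next_id = logits[0, -1].argmax()     # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Sampling from the distribution instead of always taking the argmax (as OpenAI’s published samples do) produces more varied text, but the core loop is the same.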

OpenAI was founded by some of the biggest names in the tech industry (Elon Musk and Sam Altman) to “freely collaborate” with others. And it is not only OpenAI that follows this ethos; the whole AI community has been built on the premise of sharing research and patents. The community was therefore surprised when OpenAI decided not to publish its full GPT-2 model. OpenAI claims that if the full GPT-2 were published, it could easily be misused, for example to create high-quality fake news. Instead, OpenAI released smaller (and less accurate) versions of GPT-2 to the public.

However, the original paper “Language Models are Unsupervised Multitask Learners” from OpenAI may contain enough detail to replicate the full GPT-2 model, given enough time and money. Training GPT-2 is estimated to cost about $40K in computing resources.

An AI student from Germany claims he did just that: he replicated GPT-2 using data he scraped with the same methodology as OpenAI. He wrote a blog post explaining that he will release the full model in a few weeks, unless somebody provides arguments that sway him from publishing. One may dismiss this as a publicity stunt, until one notices the GitHub repo containing the whole NLP pipeline. It includes Python code for loading the data and for training TensorFlow models at various quality levels (see the sketch after the sample below). The author has also published trained smaller models that are not as accurate as the large one, and he is kind enough to include examples of his predictions for the same samples published by OpenAI. He claims that, unlike OpenAI’s blog, he does not cherry-pick examples.

Text generated by the non-OpenAI model described above, using the same leading lines.
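
For a sense of what such a pipeline involves, here is a skeletal data-loading and training loop in TensorFlow. The file format (one token id per line), the function names, the tiny LSTM stand-in model (the real repo trains Transformers), and the hyperparameters are all hypothetical; this is not the author’s actual repo code.

```python
import tensorflow as tf

VOCAB_SIZE = 50000  # assumption: size of the byte-pair-encoding vocabulary

# Tiny stand-in language model; logits over the vocabulary at every position.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE),
])

def make_dataset(token_file, seq_len=128, batch_size=8):
    """Assumes a text file containing one token id per line."""
    tokens = tf.data.TextLineDataset(token_file).map(
        lambda line: tf.strings.to_number(line, tf.int32))
    windows = tokens.batch(seq_len + 1, drop_remainder=True)
    # Next-word prediction: inputs are tokens 0..n-1, targets are tokens 1..n.
    return windows.map(lambda w: (w[:-1], w[1:])).batch(batch_size)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(1e-4)

def train(dataset, epochs=1):
    for _ in range(epochs):
        for inputs, targets in dataset:
            with tf.GradientTape() as tape:
                loss = loss_fn(targets, model(inputs, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
```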