Natural language processing (NLP) continues to advance rapidly, and some of its latest results have raised concerns.
The research organization OpenAI published a blog post titled “Better Language Models and Their Implications” summarizing its progress on “predicting the next word, given all of the previous words within some text”. OpenAI calls its latest model GPT-2. Some of the generated text samples are of very high quality, while others show that further improvements are needed. What sets GPT-2 apart is that it was not trained on a domain-specific dataset, as is typical for most NLP models. Instead, GPT-2 was trained on text from outbound links shared on Reddit, filtered by the karma those links received. In total, OpenAI scraped 40GB of text from the Internet to use as training and testing data.
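To make the training objective concrete, the sketch below illustrates “predicting the next word, given all of the previous words” in generic form: a cross-entropy loss over next tokens and a sampling loop that feeds each predicted token back into the model. The `model` callable and token IDs are hypothetical placeholders standing in for GPT-2’s Transformer; this is not OpenAI’s code.

```python
import numpy as np

def next_token_loss(model, token_ids):
    """Average cross-entropy for predicting each token from all previous tokens.

    `model(prefix)` is a hypothetical callable returning a probability
    distribution over the vocabulary for the next token.
    """
    losses = []
    for t in range(1, len(token_ids)):
        probs = model(token_ids[:t])            # P(next token | all previous tokens)
        losses.append(-np.log(probs[token_ids[t]]))
    return float(np.mean(losses))

def generate(model, prompt_ids, n_tokens):
    """Sample a continuation one token at a time, feeding each choice back in."""
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        probs = model(ids)
        ids.append(int(np.random.choice(len(probs), p=probs)))
    return ids
```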
OpenAI was founded by some of the biggest names in the tech industry (Elon Musk and Sam Altman) to “freely collaborate” with others. It is not only OpenAI that follows this ethos: the whole AI community has been built on the premise of sharing research and patents. The community was therefore surprised when OpenAI decided not to release its full GPT-2 model. OpenAI argues that a published GPT-2 could easily be misused, for example to generate convincing fake news. Instead, OpenAI released smaller (and less accurate) versions of GPT-2 to the public.
However, the original paper from OpenAI, “Language Models are Unsupervised Multitask Learners”, may contain enough detail to replicate the full GPT-2 model, given enough time and money. Training GPT-2 is estimated to cost around $40K in computing resources.
An AI student from Germany claims to have done just that, replicating GPT-2 with data he scraped using the same methodology as OpenAI. He wrote a blog post explaining that he will release the full model in a few weeks, unless somebody presents arguments that sway him from publishing. One might dismiss this as a publicity stunt, were it not for the GitHub repository containing the entire NLP pipeline. It includes Python code for loading the data and for training TensorFlow models of various quality levels. The author has also published trained smaller models, which are not as accurate as the large one. He is kind enough to include his model’s output for the same samples published by OpenAI, and claims that, unlike OpenAI’s blog, he does not cherry-pick examples.
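The exact entry points of the student’s repository are not described here, so the sketch below only shows the general idea of sampling from one of the publicly released smaller GPT-2 checkpoints, using the Hugging Face transformers package rather than the repository discussed above. The model name "gpt2" (the released 117M-parameter checkpoint), the prompt, and the sampling settings are illustrative assumptions, not the author’s code.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load one of the publicly released smaller GPT-2 checkpoints.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode an arbitrary prompt (illustrative, not one of OpenAI's samples).
prompt = "Recent advances in natural language processing"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; top-k truncation is a common sampling choice
# for GPT-2-style models.
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_k=40,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```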