{"id":233,"date":"2019-06-09T14:12:51","date_gmt":"2019-06-09T21:12:51","guid":{"rendered":"http:\/\/codingrestart.com\/home\/?p=233"},"modified":"2019-06-14T04:37:23","modified_gmt":"2019-06-14T11:37:23","slug":"dangers-of-nlp","status":"publish","type":"post","link":"https:\/\/codingrestart.com\/home\/dangers-of-nlp\/","title":{"rendered":"Dangers of NLP"},"content":{"rendered":"\n<p>Natural language processing (NLP) continues its rapid advance, leading some people to fear its latest results. <\/p>\n\n\n\n<p>The research organization <a href=\"https:\/\/en.wikipedia.org\/wiki\/OpenAI\">OpenAI<\/a> published a blog post titled &#8220;<a href=\"https:\/\/openai.com\/blog\/better-language-models\/\">Better Language Models and Their Implications<\/a>&#8221; summarizing its progress on &#8220;predicting the next word, given all of the previous words within some text&#8221;. OpenAI calls its latest model GPT-2. Some samples of generated texts are very high quality, while some texts show that future improvements are needed. The uniqueness of GPT-2 is the fact that it was not trained on domain-specific datasets as is typical for most NLP models. Rather, GPT-2 was trained on popular links from readers on Reddit, as measured by <a href=\"https:\/\/www.quora.com\/What-is-Reddit-karma-and-how-do-people-benefit-from-having-more-of-it\">karma<\/a> ratings on the outbound links. OpenAI scraped 40GB of text from the Internet to use as training and testing data. <\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"236\" data-permalink=\"https:\/\/codingrestart.com\/home\/dangers-of-nlp\/gpt-2\/\" data-orig-file=\"https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?fit=852%2C463&amp;ssl=1\" data-orig-size=\"852,463\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"GPT-2\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?fit=840%2C456&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?resize=597%2C324\" alt=\"\" class=\"wp-image-236\" width=\"597\" height=\"324\" srcset=\"https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?w=852&amp;ssl=1 852w, https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?resize=300%2C163&amp;ssl=1 300w, https:\/\/i0.wp.com\/codingrestart.com\/wp-content\/uploads\/2019\/06\/GPT-2.png?resize=768%2C417&amp;ssl=1 768w\" sizes=\"auto, (max-width: 597px) 85vw, 597px\" \/><figcaption>Human text (first two lines) and GPT-2 generated text. Sample published in an OpenAI blog titled &#8220;Better Language Models&#8221;.<\/figcaption><\/figure>\n\n\n\n<p>OpenAI was <a href=\"https:\/\/www.bbc.com\/news\/technology-35082344\">founded<\/a> by some of the biggest names in the tech industry (Elon Musk and Sam Altman) to &#8220;freely collaborate&#8221; with others. But it is not only OpenAI that follows this ethos. 
OpenAI was [founded](https://www.bbc.com/news/technology-35082344) by some of the biggest names in the tech industry (Elon Musk and Sam Altman) to "freely collaborate" with others. And it is not only OpenAI that follows this ethos: the whole AI community has been built on the premise of openly sharing research and patents. It therefore came as a surprise when OpenAI decided not to publish the full GPT-2 model. OpenAI argues that if the full model were released, it could easily be misused, for example to mass-produce high-quality fake news. Instead, it released smaller (and less accurate) versions of GPT-2 to the public.

However, the original paper, "[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)", may contain enough detail to replicate the full GPT-2 model, given enough time and money. Training it is estimated to cost about $40K in computing resources.

An AI student from Germany [claims](https://medium.com/@NPCollapse/replicating-gpt2-1-5b-86454a7f26af) to have done just that: he replicated GPT-2 with data he scraped using the same methodology as OpenAI. In a [blog post](https://medium.com/@NPCollapse/gpt2-counting-consciousness-and-the-curious-hacker-323c6639a3a8) he explains that he will release the full model in a few weeks, unless somebody presents arguments that sway him from publishing. One might dismiss this as a publicity stunt, until one notices the [GitHub](https://github.com/ConnorJL/GPT2) repo containing the whole NLP pipeline: Python code for loading the data and for training TensorFlow models of various sizes. He has also published his trained smaller models, which are not as accurate as the large one, along with his model's outputs for the same prompts OpenAI published. He claims that, unlike OpenAI's blog, he does not cherry-pick examples.
![Text generated by the replicated model](https://i0.wp.com/codingrestart.com/wp-content/uploads/2019/06/GPT-2-2.png)
*Text generated by the non-OpenAI replica described above, using the same leading lines.*
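The scraping step that both OpenAI and the replica rely on is conceptually simple: Reddit karma serves as a free, crowd-sourced quality filter for web pages. The sketch below is a hypothetical reconstruction of that idea, not code from either project; `submissions` and its `score`/`url` fields are assumptions standing in for a public Reddit data dump (the GPT-2 paper reportedly kept links from posts with at least 3 karma):

```python
import requests

KARMA_THRESHOLD = 3  # minimum post karma; the threshold the GPT-2 paper reportedly used

def collect_corpus(submissions):
    """Hypothetical WebText-style collection: keep outbound links from
    sufficiently upvoted Reddit posts and download their text.
    `submissions` is assumed to be an iterable of dicts with
    'score' (post karma) and 'url' (outbound link) keys."""
    corpus = []
    for post in submissions:
        if post["score"] < KARMA_THRESHOLD:
            continue                      # karma acts as a cheap human quality filter
        try:
            page = requests.get(post["url"], timeout=10)
            page.raise_for_status()
        except requests.RequestException:
            continue                      # skip dead, slow, or broken links
        corpus.append(page.text)          # a real pipeline would also strip HTML
    return corpus
```

Even a filter this crude, applied at Reddit's scale, yields the 40 GB of text mentioned above, far more than the curated, domain-specific corpora most NLP models are trained on.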