‘If journalism is going up in smoke, I might as well get high off the fumes’: confessions of a chatbot helper

Without better language data, these language models simply cannot improve. Their world is our word. Hold on. Aren’t these machines trained on billions and billions of words and sentences? What would they need us fleshy scribes for? Well, for starters, the internet is finite. And so too is the sum of every word on every page of every book ever written. So what happens when the last pamphlet, papyrus and prolegomenon have been digitised and the model is still not perfect? What happens when we run out of words? The date for that linguistic apocalypse has already been set. Researchers announced in June that we can expect this to take place between 2026 and 2032 “if current LLM development trends continue”. At that point, “Models will be trained on datasets roughly equal in size to the available stock of public human text data.” Note the word human. […]

If technology companies can throw huge amounts of money at hiring writers to create better training data, it does slightly call into question just how “artificial” current AIs really are. The big technology companies have not been “that explicit at all” about this process, says Chollet, who expects investment in AI (and therefore annotation budgets) to “correct” in the near future. Manthey suggests that investors will probably question the “huge line item” taken up by “hefty data budgets”, which cover licensing and human annotation alike.
