
AI firms will soon exhaust most of the Internet’s data

Can they create more?

Published Thu, Jul 25, 2024 · 05:00 AM
    In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o are now let loose on image, video and audio files as well as text during their self-supervised learning. PHOTO: REUTERS


    IN 2006, Fei-Fei Li, then at the University of Illinois, now at Stanford University, saw how mining the Internet might help to transform artificial intelligence (AI) research. Linguistic research had identified 80,000 “noun synonym sets”, or synsets: groups of synonyms that described the same sort of thing. The billions of images on the Internet, Dr Li reckoned, must offer hundreds of examples of each synset. Assemble enough of them and you would have an AI training resource far beyond anything the field had ever seen. “A lot of people are paying attention to models,” she said. “Let’s pay attention to data.” The result was ImageNet.

    The Internet provided not only the images, but also the resources for labelling them. Once search engines had delivered pictures of what they took to be dogs, cats, chairs or whatever, these images were inspected and annotated by humans recruited through Mechanical Turk, a crowdsourcing service provided by Amazon which allows people to earn money by doing mundane tasks. The result was a database of millions of curated, verified images. It was through using parts of ImageNet for its training that, in 2012, a program called AlexNet demonstrated the remarkable potential of “deep learning” – that is to say, of neural networks with many more layers than had previously been used. This was the beginning of the AI boom, and of a labelling industry designed to provide it with training data.

    The later development of large language models (LLMs) also depended on Internet data, but in a different way. The classic training exercise for an LLM is not predicting which word best describes the contents of an image; it is predicting a word that has been cut out of a piece of text, on the basis of the words around it.
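
    To make that idea concrete, here is a minimal Python sketch of how such a self-supervised example can be built from raw text, with no human labelling involved. The function name, mask token and sample sentence are invented for illustration; real training pipelines differ in detail (many current LLMs predict the next word rather than a hidden one), but the principle of deriving both the input and the answer from the text itself is the same.

        import random

        def make_masked_example(sentence, mask_token="[MASK]"):
            # Hide one word from the sentence; the hidden word becomes the label
            # the model must predict from the surrounding context.
            words = sentence.split()
            target_index = random.randrange(len(words))
            target_word = words[target_index]
            masked = words.copy()
            masked[target_index] = mask_token
            return " ".join(masked), target_word

        # The raw text supplies both the training input and the correct answer.
        context, label = make_masked_example("The cat sat on the mat")
        print(context)  # e.g. "The cat sat on the [MASK]"
        print(label)    # e.g. "mat"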

