
AI firms will soon exhaust most of the Internet’s data

Can they create more?

Published Thu, Jul 25, 2024 · 05:00 AM
    In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o are now let loose on image, video and audio files as well as text during their self-supervised learning. PHOTO: REUTERS


    IN 2006, Fei-Fei Li, then at the University of Illinois, now at Stanford University, saw how mining the Internet might help to transform artificial intelligence (AI) research. Linguistic research had identified 80,000 “noun synonym sets”, or synsets: groups of synonyms that described the same sort of thing. The billions of images on the Internet, Dr Li reckoned, must offer hundreds of examples of each synset. Assemble enough of them and you would have an AI training resource far beyond anything the field had ever seen. “A lot of people are paying attention to models,” she said. “Let’s pay attention to data.” The result was ImageNet.

    The Internet provided not only the images, but also the resources for labelling them. Once search engines had delivered pictures of what they took to be dogs, cats, chairs or whatever, these images were inspected and annotated by humans recruited through Mechanical Turk, a crowdsourcing service provided by Amazon which allows people to earn money by doing mundane tasks. The result was a database of millions of curated, verified images. It was through using parts of ImageNet for its training that, in 2012, a program called AlexNet demonstrated the remarkable potential of “deep learning” – that is to say, of neural networks with many more layers than had previously been used. This was the beginning of the AI boom, and of a labelling industry designed to provide it with training data.

    The later development of large language models (LLMs) also depended on Internet data, but in a different way. The classic training exercise for an LLM is not predicting which word best describes the contents of an image; it is predicting a word that has been cut out of a piece of text, on the basis of the words around it.
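
    To make that idea concrete, here is a minimal Python sketch of how such a self-supervised example can be built from raw text, with no human labelling involved. The function name, mask token and sample sentence are invented for illustration; real training pipelines differ in detail (many current LLMs predict the next word rather than a hidden one), but the principle of deriving both the input and the answer from the text itself is the same.

        import random

        def make_masked_example(sentence, mask_token="[MASK]"):
            # Hide one word from the sentence; the hidden word becomes the label
            # the model must predict from the surrounding context.
            words = sentence.split()
            target_index = random.randrange(len(words))
            target_word = words[target_index]
            masked = words.copy()
            masked[target_index] = mask_token
            return " ".join(masked), target_word

        # The raw text supplies both the training input and the correct answer.
        context, label = make_masked_example("The cat sat on the mat")
        print(context)  # e.g. "The cat sat on the [MASK]"
        print(label)    # e.g. "mat"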

