Last updated Nov 29, 2025
ai, markets
Within approximately one year from March 1, 2024, the market and industry practices for valuing content used to train large language models will become significantly clearer and more settled than they were at the time of the conversation.
Over the next year this will all start to become clearer.
Explanation

Within roughly a year of March 1, 2024, there was a clear shift from mostly unpriced, implicit scraping of web content toward a recognizable market for training data, with repeated, high-profile licensing deals that put concrete dollar values and contractual patterns around content used for large language models.

From spring 2024 onward, OpenAI and other AI firms signed a series of explicit, paid content-licensing agreements with major news organizations and platforms: Le Monde and Spain’s Prisa Media in March 2024; the Financial Times in April 2024; and a large, multi‑year News Corp deal in May 2024 that press reports said could exceed $250 million over five years, following earlier agreements with the Associated Press and Axel Springer. These contracts typically cover both training use and in-product display, and industry reporting indicates that OpenAI and peers now routinely pay anywhere from low single‑digit millions per year for smaller publishers up to nine‑figure packages for marquee groups, establishing de facto price bands for premium text data. (openai.com)

In parallel, platform data deals such as Reddit’s reported $60 million‑per‑year agreement with Google for API access, plus subsequent AI partnerships, reinforced the notion that large conversational datasets have substantial, quantifiable market value. (time.com)

Broader industry analysis by mid‑ to late‑2024 described a “data gold rush” in which big tech firms were systematically hedging legal and supply risk by buying or licensing training data (e.g., Shutterstock image/video/music libraries, Defined.ai brokered datasets) at increasingly standardised per‑asset or per‑word prices, further clarifying commercial norms for AI training data. (wifc.com)

At the same time, the legal framework around copyright and fair use for training remained unresolved. Major lawsuits such as The New York Times and other publishers vs. OpenAI and Microsoft were allowed to proceed into discovery in 2024–2025 rather than being definitively settled, and proposed legislation such as the U.S. Generative AI Copyright Disclosure Act had only begun to address transparency obligations. (apnews.com) That ongoing litigation shows the law is not fully settled, but it does not negate the prediction: by early 2025, there was a much clearer, widely reported pattern of how AI companies and rights‑holders were valuing and transacting over training data (license-or-litigate, with well‑understood deal structures and price ranges), compared with the far murkier situation in early 2023–early 2024. The claim was that things would start to become clearer over the following year, not that every legal and economic question would be fully resolved, so the prediction is best judged as having come true.