Last updated Nov 29, 2025
Prediction
Chamath
As training data pipelines become more curated and refined, the cost of training competitive frontier models will rise to levels that are untenable for most companies, leaving only a few large tech firms able to afford state‑of‑the‑art model training over the next several years.
And it looks like one of those variables that people are looking at is how you basically take the internet not as raw data, but then you actually kind of refine it and refine it some more and then use that as the basis of learning. And what that does is it drives up model costs to a degree that are probably untenable for most companies except, but for a few.
View on YouTube
Explanation

Available evidence as of late 2025 shows that the cost of training truly frontier models has risen extremely fast, and that the ability to fund such training is now concentrated in a very small set of very large or very heavily funded firms.

Analyses of 40+ frontier models find that amortized hardware and energy costs for the largest training runs have grown about 2–3× per year since 2016, with projected single‑run costs exceeding $1B by around 2027; the authors explicitly conclude that such runs will be affordable only to the most well‑funded organizations. (epoch.ai) Public cost estimates for specific 2023–2025 frontier models such as GPT‑4, Gemini Ultra, Llama 3.1‑405B, Mistral Large and xAI’s Grok‑2 put training costs in the tens to hundreds of millions of dollars per model. (visualcapitalist.com) Anthropic CEO Dario Amodei has stated that training current frontier systems typically costs around $100M, with some models already "more like a billion" to train, and has noted that most startups cannot realistically enter this race. (entrepreneur.com)
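To make the growth rate concrete, here is a minimal back-of-the-envelope sketch, assuming a roughly $100M frontier run in 2024 (Amodei's rough figure above) compounded over three years at the 2–3× annual multipliers reported in the Epoch analysis; the specific pairing of base year and multiplier is illustrative, not taken from any single cited source.

```python
# Illustrative projection only: assumes a ~$100M-class frontier training run in 2024
# and a 2-3x annual cost multiplier, roughly matching the growth rates cited above.

def projected_cost(base_cost_usd: float, annual_multiplier: float, years: int) -> float:
    """Compound the per-run training cost forward by `years` at `annual_multiplier`."""
    return base_cost_usd * annual_multiplier ** years

base_2024 = 100e6  # ~$100M frontier run in 2024 (assumed base)
for mult in (2.0, 2.5, 3.0):
    cost_2027 = projected_cost(base_2024, mult, years=3)
    print(f"growth {mult}x/yr -> projected 2027 run cost ~ ${cost_2027 / 1e9:.1f}B")
```

Under these assumptions the projected 2027 cost lands between roughly $0.8B and $2.7B per run, which is consistent with the Epoch projection of single-run costs exceeding $1B by around 2027.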

At the same time, actual frontier‑model training is dominated by a narrow group: OpenAI (with Microsoft), Google/DeepMind, Meta, Anthropic (backed by Amazon, Google, and now Microsoft/Nvidia), xAI (backed by Elon Musk and large GPU commitments), and a few well‑funded newcomers like Mistral; more recently, Microsoft itself has begun building massive in‑house frontier models on clusters of tens of thousands of top‑end GPUs. (theverge.com) The scale of required compute is illustrated by deals such as Anthropic’s commitment to spend ~$30B on Azure compute, something far beyond the reach of ordinary companies. (reuters.com) Most other firms either fine‑tune or deploy smaller models, or buy access to these frontier models via APIs, rather than training competitive SOTA systems from scratch.

Chamath’s causal story, that more refined and curated pipelines (including sophisticated data selection, filtering, and alignment/feedback stages) drive up total development cost, fits with how these labs now emphasize extensive data engineering and RLHF‑style processes on top of enormous compute budgets, even if compute remains the dominant cost line item. The exact contribution of data curation versus raw compute is hard to separate, but the overall effect he predicted is clearly borne out by the 2024–2025 evidence: frontier training costs have risen to levels “untenable for most companies,” and state‑of‑the‑art training is de facto restricted to a few giant, well‑funded players. Given that this dynamic was already evident less than two years after the prediction was made, the forecast is best scored as right, even though the phrase “over the next several years” extends beyond 2025.