The other crazy thing that he said is that subsequent versions of Grok are not going to be trained on any traditional data set that exists in the wild.
Chamath is paraphrasing Elon Musk’s claim on the All-In podcast that later Grok models would move away from “traditional datasets that exist in the wild” and instead be trained via agents generating synthetic data.
Public evidence about Grok’s evolution shows that xAI has not stopped using conventional human-created corpora:
- Regulators in Europe opened an investigation into X’s use of public posts from EU users to train Grok’s LLMs, explicitly describing Grok as trained on large scraped online datasets (articles, blog posts, and social‑media content). This is classic “in‑the‑wild” human data, and there has been no indication that later Grok versions abandoned such sources globally; the reported remedy was limited to EU user data. (apnews.com)
- Reporting on Grok 3 emphasizes synthetic data as an important component, but describes it as in addition to a larger, more diverse dataset rather than a complete replacement of real‑world text. The model is portrayed as mixing synthetic data with real data to improve reasoning and reduce hallucinations, not as being trained solely on synthetic corpora. (rdworldonline.com)
- Musk later said xAI would retrain Grok on a “revised base of human knowledge,” i.e., a re‑edited corpus meant to remove “garbage” and add missing information. That still implies reliance on large human‑authored text collections, just more curated, rather than a pure agent‑generated synthetic dataset. (businessinsider.com)
- Coverage of Grok 5’s planned training states that it will incorporate real‑time data from the X platform to improve relevance and accuracy, which directly continues the use of live, user‑generated social‑media content in the training or fine‑tuning pipeline. (grokmag.com)
- xAI’s own materials for Grok 4 and Grok 4.1 talk about large‑scale reinforcement learning and frontier “agentic reasoning” reward models, but they do not claim that the underlying pretraining data has stopped coming from internet and document corpora, and no independent technical source reports such a drastic shift. (x.ai)
Musk has indeed argued that human‑generated data is becoming “exhausted” and that future progress will lean more on synthetic data, which matches the spirit of what Chamath repeated. (theguardian.com) However, available reporting shows that as of late 2025, the later Grok models (Grok 4.1, Grok 4.1 Fast, and the in‑training Grok 5) still depend significantly on conventional human‑authored text from the web, legal documents, and especially X posts, with synthetic data layered on top rather than used exclusively.
Because the prediction was categorical (“not going to be trained on any traditional dataset”), and the ongoing use of real‑world internet and social‑media data is well documented, it has not come true based on what is publicly known, even allowing for some uncertainty about xAI’s proprietary datasets.