I think they're all going to converge to the same quality in the next, probably 12 to 18 months.
By the end of the 12–18 month window (roughly Feb–Aug 2025), the gaps between major labs’ models had narrowed, but the models had not fully converged to “about the same quality,” and some still held large, clear leads over others.
Key points:
- Meta’s Llama 3.1 405B did reach near-parity with earlier GPT‑4-class models on many classic benchmarks, e.g. MMLU, GSM8K, HumanEval, and MGSM, often matching or slightly beating GPT‑4o and Claude 3.5 Sonnet on individual tests.(llamaai.online) This is strong evidence for partial convergence between Meta and OpenAI on older, static benchmarks.
- However, on widely used third‑party human‑preference benchmarks like LMSYS Chatbot Arena, Llama and Mistral still trailed the frontier. Llama 3.1 405B’s text‑arena Elo is around 1260–1270,(rankedagi.com) whereas frontier closed models sit much higher (e.g. GPT‑4.5 and later GPT‑5 variants in the mid‑1400s, Gemini 2.5/3 Pro and Anthropic Claude 4.x similarly high).(analyticsvidhya.com) That ~150–200 Elo gap corresponds to a large win‑rate difference (see the worked sketch after this list), contradicting the idea that all models are at roughly the same level.
- Mistral’s best general models also remained noticeably weaker than top OpenAI/Google/Anthropic/xAI models on aggregate benchmarks. Independent leaderboards and evaluations put Mistral Large 2 at about 81% MMLU and a substantially lower Arena Elo than GPT‑4‑class systems, which score ~88–89% on MMLU and are rated much higher in human preferences.(trustbit.tech) This again suggests a clear, measurable quality gap rather than full convergence.
- xAI’s Grok 3, released Feb 2025, is explicitly reported as surpassing other frontier models on several hard benchmarks (AIME, GPQA, LiveCodeBench) and holding the top or near‑top Elo on Chatbot Arena (≈1400+), ahead of GPT‑4o and other leading systems.(twitter.com) That gives xAI a clear lead over Meta’s Llama and Mistral’s models on third‑party, human‑preference metrics, directly contradicting the claim that no model would have a large, clear advantage.
- New reasoning‑centric benchmarks introduced in 2025 show substantial spread, not tight clustering. For example, AI4Math finds OpenAI’s o3‑mini and DeepSeek R1/V3 above 70% accuracy on challenging university‑level math, while Llama 3.3 70B and GPT‑4o‑mini are below 40%.(arxiv.org) A separate cross‑lingual study on Cantonese/Japanese/Turkish reports that GPT‑4o and Claude 3.5 lead, while Llama 3.1 and Mistral Large 2 lag significantly in fluency and accuracy.(arxiv.org) These independent academic benchmarks show that models from these labs do not sit at a single, indistinguishable quality level.
- Methodology critiques (e.g., of Chatbot Arena) and claims that open models “are catching up” do not erase the observed quantitative gaps. Papers and articles note that Arena can be gamed and that open vs. closed performance gaps have shrunk to roughly a one‑year lag, with Llama 3.1 reaching parity with earlier GPT‑4 variants.(time.com) But they still describe a meaningful frontier edge for the very best proprietary models over Llama and Mistral, and a strong lead for certain reasoning models (OpenAI’s o‑series, Grok) in 2025.
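To make the Arena gap concrete: assuming Chatbot Arena’s ratings sit on the standard 400‑point logistic Elo scale (its Bradley‑Terry‑style scoring is reported on an Elo‑like scale), a 150–200 point gap implies the higher‑rated model is preferred roughly 70–76% of the time in head‑to‑head comparisons. A minimal sketch of that conversion, using the standard Elo expected‑score formula rather than any Arena‑specific code:

```python
# Sketch: translate an Elo-style rating gap into an expected head-to-head win rate,
# assuming the standard 400-point logistic Elo scale. The 150- and 200-point gaps
# are the approximate Arena differences cited above.

def expected_win_rate(elo_gap: float) -> float:
    """Probability the higher-rated model is preferred, given a rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

for gap in (150, 200):
    print(f"{gap}-point gap -> ~{expected_win_rate(gap):.0%} expected win rate")

# 150-point gap -> ~70% expected win rate
# 200-point gap -> ~76% expected win rate
```

A ~70–76% expected preference rate is far from the near‑50/50 split that “about the same quality” would imply.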
Taken together, the evidence shows directional convergence (gaps shrank, especially Meta vs. OpenAI on older benchmarks), but not the full convergence with no clear leader that Chamath predicted. There remained sizeable, well‑documented quality differences among OpenAI, Meta’s Llama line, Mistral, and xAI’s Grok as of mid‑ to late‑2025, so the prediction is best judged as wrong.