Last updated Nov 29, 2025
Prediction
Chamath
By sometime between August 2024 and November 2024, OpenAI’s frontier model and open‑source models like Llama and Mistral (trained on the open internet) will have reached essentially the same quality level on common evaluation benchmarks, eliminating OpenAI’s meaningful quality advantage that exists in February 2024.
“they're all getting to the same quality code point and they will be there within the next 6 to 9 months.”
Explanation

By the August–November 2024 window, open‑source frontier models had essentially matched OpenAI’s then‑frontier model GPT‑4o on the main academic benchmarks, so Chamath’s claim about benchmark quality parity is broadly borne out (even though OpenAI still held a modest edge in user‑preference tests and product polish).

Timing and models in scope

  • In February 2024, OpenAI’s top public model was GPT‑4 / GPT‑4 Turbo, clearly ahead of open‑source models like Llama 2 and Mixtral on standard benchmarks and in Chatbot Arena rankings.
  • OpenAI released GPT‑4o in May 2024; through mid‑2024 it topped the LMSYS Chatbot Arena leaderboard, beating prior GPT‑4 variants and other proprietary models by a noticeable Elo margin (see the short Elo sketch after this list), reinforcing OpenAI’s lead at that time. (arstechnica.com)
  • Meta released the Llama 3.1 family, including the 405B‑parameter model, on July 23–24, 2024, explicitly positioning it as a frontier‑scale open model. (radicaldatascience.wordpress.com) This is within the 6–9 month window from early February 2024 (and certainly in place by August–November 2024).
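
For a rough sense of what an Elo margin on a preference leaderboard implies, the sketch below applies the standard Elo expectation formula used by Arena‑style rankings; the rating gaps are illustrative assumptions for scale, not figures taken from the leaderboard itself.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Standard Elo expectation: probability that the higher-rated model
    is preferred in a single head-to-head comparison."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

# Illustrative rating gaps only (assumptions for scale, not leaderboard data).
for gap in (10, 30, 60, 100):
    print(f"{gap:>3}-point Elo gap -> ~{expected_win_rate(gap):.1%} expected win rate")
```

Under this formula, even a gap of several dozen Elo points corresponds to a head‑to‑head preference only modestly above 50%, which is the sense in which GPT‑4o’s mid‑2024 lead was noticeable on the leaderboard without implying a different quality tier.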

Benchmark parity: Llama 3.1 vs GPT‑4/4o

  • Meta’s own write‑ups and independent analyses report that Llama 3.1 405B’s base or chat variants match or slightly surpass GPT‑4/4o on many standard text benchmarks:
    • MMLU: Llama 3.1 405B ≈88.6 vs GPT‑4/4o ≈85–88.7 depending on setup. (unfoldai.com)
    • GSM8K (math): Llama 3.1 405B ≈96.8 vs GPT‑4o ≈94–96.1. (unfoldai.com)
    • IFEval, ARC, and several other knowledge/reasoning benchmarks show Llama 3.1 405B at or above GPT‑4/4o on many tasks, while GPT‑4o retains a small edge on others like HumanEval and some social‑science MMLU subsets. (unfoldai.com)
  • Meta’s own human evals vs GPT‑4o show roughly comparable quality: Llama 3.1 405B wins ~19% of comparisons, ties ~52%, and loses ~29% against GPT‑4o—i.e., the majority of interactions are ties, with only a modest edge for GPT‑4o (a quick arithmetic check on these figures follows this list). (unfoldai.com)
  • Multiple analyses and news pieces at the time described Llama 3.1 405B as “competitive with the best closed‑source models” and noted that it met or exceeded GPT‑4o on several headline benchmarks, calling it “one of the best and largest publicly available foundation models” and “the first open‑source frontier model” that could beat closed models on various metrics. (aws.amazon.com)
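
As a quick arithmetic check (a sketch using only the figures quoted in the list above, not any new measurements), the snippet below computes the benchmark gaps and what the reported win/tie/loss split implies:

```python
# Benchmark scores as quoted above (approximate; exact values depend on eval setup).
benchmarks = {
    "MMLU":  {"llama_3_1_405b": 88.6, "gpt_4_4o_range": (85.0, 88.7)},
    "GSM8K": {"llama_3_1_405b": 96.8, "gpt_4_4o_range": (94.0, 96.1)},
}
for name, scores in benchmarks.items():
    low, high = scores["gpt_4_4o_range"]
    llama = scores["llama_3_1_405b"]
    print(f"{name}: Llama 3.1 405B gap vs GPT-4/4o ranges {llama - high:+.1f} to {llama - low:+.1f} points")

# Meta's reported human evals vs GPT-4o (win/tie/loss shares quoted above).
win, tie, loss = 0.19, 0.52, 0.29
decided_win_share = win / (win + loss)   # Llama's share of non-tied comparisons
preference_half_ties = win + tie / 2     # ties counted as half a win for each side
print(f"Llama 3.1 405B wins {decided_win_share:.0%} of decided comparisons; "
      f"overall preference roughly {preference_half_ties:.0%} vs {1 - preference_half_ties:.0%} in GPT-4o's favor")
```

On these figures the benchmark gaps sit within roughly a point to a few points in either direction, and the human‑eval split works out to about a 45/55 preference in GPT‑4o’s favor, which matches the reading of benchmark parity with a small product‑level edge for OpenAI.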

Mistral and other open models

  • Mistral Large (API‑hosted rather than open‑weights, but likewise trained on web data) launched in February 2024 and was already close to GPT‑4 on MMLU and other benchmarks, though typically a few points behind GPT‑4 on broad general‑knowledge tests (e.g., MMLU ~81 vs GPT‑4 ~86). (dailyai.com) This supports the broader pattern Chamath described: non‑OpenAI models rapidly closing the gap on standard evaluations.

Does this eliminate a “meaningful quality advantage”?

  • On common academic benchmarks that dominated 2022–2023 discourse (MMLU, GSM8K, HumanEval, etc.), the gap between OpenAI’s frontier model (GPT‑4o) and top open models (especially Llama 3.1 405B) had shrunk to low‑single‑digit percentage points by late July 2024, with leadership flipping back and forth depending on the specific test. (unfoldai.com)
  • Industry reporting at the time explicitly framed Llama 3.1 as “on par with” or “competing head‑to‑head with” GPT‑4o and Claude 3.5 Sonnet, rather than a clear tier below, which is consistent with Chamath’s claim that the former OpenAI advantage on those benchmarks had largely disappeared. (aws.amazon.com)

Caveats

  • User‑preference leaderboards like LMSYS Chatbot Arena continued to show OpenAI models (GPT‑4o and successors) at or near the top, with open models slightly lower—indicating that in real‑world “vibes‑based” comparisons, OpenAI still retained a modest edge, especially in overall polish and non‑text modalities. (arstechnica.com)
  • OpenAI later released GPT‑4.1 in April 2025, re‑establishing a clearer performance lead over Llama 3.1 405B on several newer benchmarks like SWE‑Bench and global MMLU, but that falls outside the August–November 2024 window. (aimodels.fyi)

Given that by late 2024 open‑source frontier models (most notably Llama 3.1 405B) were benchmark‑competitive with GPT‑4o—often within a point or two, sometimes ahead, sometimes behind—Chamath’s specific prediction that OpenAI’s earlier benchmark advantage would be gone within 6–9 months is best judged as right, with the nuance that OpenAI still held a small but real edge in many human‑preference and product‑level evaluations.