All-In Predictions

By late February 2025, there was substantial but conflicting evidence about whether Grok 3 had truly become the state‑of‑the‑art general‑purpose LLM relative to OpenAI/Microsoft’s best publicly available frontier models.

Evidence that supports the prediction

xAI launched Grok 3 around February 17–20, 2025, describing it as its new flagship model, trained with roughly 10× the compute of Grok 2, and claiming that it surpassed OpenAI’s models on benchmarks such as AIME (math) and GPQA (PhD‑level science). (es.wikipedia.org)
xAI’s published benchmarks and coverage in outlets like Beebom and ZeroHedge report Grok 3 outperforming GPT‑4o, Claude 3.5 Sonnet, Gemini 2.0 Pro, and DeepSeek V3 on AIME 2024, GPQA Science, and LiveCodeBench, and also beating OpenAI’s o3‑mini on some reasoning benchmarks. (beebom.com)
Grok 3 (under the alias “chocolate”) reached the #1 position on the LMSYS Chatbot Arena with an Elo score around 1400–1402, ahead of GPT‑4o and DeepSeek R1, which xAI and supporters framed as evidence it was the top chatbot overall. (twitter.com)

Evidence that cuts against the prediction

Independent commentary after launch emphasized that Grok 3 was competitive but not clearly dominant. Ethan Mollick (Wharton) described Grok 3 as a “very solid frontier model” but not a clear leader and “not one you would stop using your current frontier model for,” adding that while it beats some OpenAI models on selected benchmarks, it does not clearly surpass OpenAI’s o3. (aol.com)
Gary Marcus similarly argued that Elon Musk’s promise that Grok 3 would be “the smartest AI ever” was not borne out, calling the launch “no game changer” relative to OpenAI’s best models. (aol.com)
Comparative write‑ups on Grok 3 vs. OpenAI’s o3 report a mixed picture: Grok 3 slightly leads on some math benchmarks (e.g., AIME 2025 under heavy consensus sampling), while o3 (and related OpenAI reasoning models) lead on others, such as Codeforces coding Elo and certain software‑engineering tasks. These articles also note concerns that xAI’s benchmark setups (e.g., very expensive consensus sampling for Grok 3) aren’t perfectly comparable to how OpenAI models are typically evaluated, making it hard to declare an overall winner. (portotheme.com)
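
To illustrate why consensus sampling complicates comparisons, here is a minimal sketch of the idea: the model is sampled several times per problem and the most frequent answer is scored, which can report a higher accuracy than scoring each sample on its own. The answers, the sample count, and the function names below are hypothetical illustrations; they are not xAI’s or OpenAI’s data or evaluation code.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the samples (plurality vote)."""
    return Counter(samples).most_common(1)[0][0]


def single_sample_accuracy(runs, truth):
    """Fraction of all individual samples that match the reference answer."""
    total = sum(len(samples) for samples in runs)
    correct = sum(ans == t for samples, t in zip(runs, truth) for ans in samples)
    return correct / total


def consensus_accuracy(runs, truth):
    """Fraction of problems whose plurality-vote answer matches the reference."""
    correct = sum(consensus_answer(samples) == t for samples, t in zip(runs, truth))
    return correct / len(truth)


# Hypothetical reference answers and five sampled answers per problem.
truth = ["204", "025", "113"]
runs = [
    ["204", "204", "210", "204", "198"],  # 3 of 5 samples correct, vote correct
    ["025", "030", "025", "025", "041"],  # 3 of 5 samples correct, vote correct
    ["117", "113", "117", "113", "117"],  # 2 of 5 samples correct, vote wrong
]

print(f"single-sample accuracy: {single_sample_accuracy(runs, truth):.3f}")
print(f"consensus accuracy:     {consensus_accuracy(runs, truth):.3f}")
```

On this toy data the same set of samples yields about 0.533 when scored one sample at a time and about 0.667 under plurality voting, so a consensus score for one model and a single-attempt score for another are not directly comparable.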

Why this is rated ambiguous

The prediction effectively claims that by January/February 2025 Grok 3 would surpass OpenAI/Microsoft’s best publicly available frontier model and be the state‑of‑the‑art general‑purpose LLM. By late February 2025, Grok 3:

Was clearly a top‑tier frontier model and #1 on one prominent user‑preference leaderboard (Chatbot Arena). (twitter.com)
But faced credible, well‑publicized expert assessments and technical comparisons saying it did not clearly surpass OpenAI’s leading models overall.

“State‑of‑the‑art general‑purpose LLM” is not defined by a single universally accepted metric, and high‑quality sources disagree: some frame Grok 3 as SOTA, while others explicitly say OpenAI remained ahead or at least not clearly behind. The outcome of the prediction therefore cannot be determined in a definitive, objective way, and the result is best characterized as ambiguous rather than clearly right or clearly wrong.