in a year or so we're going to be at the next gen version of all the models... At that point, it's going to be an A plus.
The quote and context check out: in the July 12, 2024 All‑In episode, Sacks says Glue’s LLM‑powered results are “kinda like a B plus right now” and adds that “in a year or so… the next gen version of all the models… At that point, it’s going to be an A plus.”(podscripts.co)
Within roughly a year, his “next gen” condition was met. OpenAI released several successor models to GPT‑4, including GPT‑4.5 in February 2025 and GPT‑4.1 in April 2025, followed by GPT‑5 and GPT‑5.1 later in 2025; these were marketed and benchmarked as substantially more capable and reliable than GPT‑4/4o, with improved coding, better instruction following, and lower hallucination rates on certain tests.(en.wikipedia.org)
However, whether that constitutes “A+” practical quality for Glue‑style applications is inherently subjective and not directly measurable from public data:
- No objective Glue metrics – Public coverage of Glue describes its AI as a virtual employee layered on top of Slack/Teams‑like chat, powered by models such as ChatGPT and Claude, but there are no later public benchmarks or statements from Sacks quantifying that its outputs have become “A+” versus the earlier “B+.”(techcrunch.com)
- Hallucinations and reliability remain open issues – Through late 2024 and 2025, peer‑reviewed work and industry analysis continued to describe hallucinations as a “critical barrier” to enterprise use, with state‑of‑the‑art mitigation methods often failing to exceed ~80% factual faithfulness on some benchmarks.(arxiv.org) Studies and reporting in 2025 note that newer reasoning models such as OpenAI’s o3 and o4‑mini can hallucinate in 30–50% of cases on certain tasks, and major outlets emphasize that leading LLMs still generate incorrect or fabricated content often enough to require human verification.(livescience.com)
- Mixed real‑world sentiment – Aggregated user‑review data shows that mentions of hallucinations in reviews of leading chatbots (ChatGPT, Gemini, Claude, etc.) dropped markedly between March and October 2025, but did not disappear; non‑trivial shares of users still report accuracy and hallucination concerns.(learn.g2.com)
Because “B+ vs. A+” is Sacks’s own informal grading scale with no standardized threshold, and because there is no transparent data on how Glue’s actual output quality changed relative to that scale, the prediction cannot be judged definitively true or false from the available evidence: frontier models clearly did improve, yet they still exhibit meaningful errors. That fits best under “ambiguous” rather than “right” or “wrong.”