"If you had a bunch of associates [just] bang on some law model for a year. Again, that's that reinforcement learning we just talked about. I think you'd get precision recall off the charts and it would be perfect."
Chamath’s claim was that, with about a year of intensive associate feedback (“bang on some law model for a year”), a legal-domain model’s precision and recall would be “off the charts” and effectively perfect, suitable for high‑stakes legal use.
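On a standard reading (the transcript does not spell this out), precision and recall take their usual definitions, and an "effectively perfect" model would need both to sit near 1.0:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$

In a legal-research setting, a hallucinated or wrongly sourced answer maps roughly onto a false positive (FP) and a missed relevant authority onto a false negative (FN), so the hallucination and accuracy figures cited below put a direct ceiling on how close these tools come to that standard.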
Since early 2023, the closest real‑world tests of this hypothesis have been specialized legal LLMs (Harvey, Lexis+ AI, Westlaw AI tools, etc.) that have indeed had heavy use and feedback from thousands of lawyers over well more than a year.
- Specialized legal tools are far from near-perfect.
  - A preregistered Stanford study of leading legal research tools (Lexis+ AI, Westlaw AI-Assisted Research, Ask Practical Law AI, GPT‑4) found that even the best system (Lexis+ AI) hallucinated 17–33% of the time and answered only about 65% of queries accurately, nowhere near "perfect" precision/recall. (arxiv.org)
  - A 2025 comparative evaluation likewise found Lexis+ AI to have 58% accuracy and ~20% fabricated responses, with other tools doing worse, again inconsistent with "off-the-charts" reliability suitable for unsupervised high-stakes use. (cambridge.org)
- Legal-specialized vendors acknowledge substantial remaining error.
  - Harvey's own BigLaw Bench results show its assistant model completing about 74% of a lawyer-quality work product on complex legal tasks and achieving a 68% "Source Score" for correctly sourced answers, with the company explicitly noting "substantial room for improvement," not perfection. (harvey.ai)
  - In a follow-up post, Harvey reports a low but non-zero hallucination rate (~1 in 500 claims, or 0.2%) on its internal benchmark; impressive, but still not "perfect," and limited to its own task distribution. (harvey.ai)
- The "one-year of associates" condition is effectively met in practice.
  - Since late 2022, firms such as Allen & Overy (now A&O Shearman) have had thousands of lawyers using Harvey, with over 3,500 lawyers making ~40,000 queries in the early beta alone, providing exactly the kind of intensive, expert feedback Chamath described. (arstechnica.com)
  - Harvey then collaborated with OpenAI on a custom case-law model, tested with 10 major law firms and tuned specifically to reduce hallucinations; lawyers preferred its outputs 97% of the time to baseline GPT‑4, yet even Harvey presents this as a major improvement in reliability and relevance, not as achieving perfection. (openai.com)
In other words, the industry has roughly executed the scenario Chamath imagined—sustained legal‑expert RL on top models for more than a year—without reaching anything close to universally “perfect” legal performance.
- Courts, bar associations, and vendors still treat AI as untrustworthy for unsupervised high-stakes work.
  - Stanford and ABA-linked work on "Hallucinating Law" finds that general LLMs frequently hallucinate in core legal reasoning tasks, with hallucination rates of 69–88% on certain benchmarks and at least 75% when asked about a court's holding, concluding that current models are "not yet capable of the nuanced legal reasoning required" for tasks like evaluating precedential relationships. (americanbar.org)
  - Professional guidance and commentary from Thomson Reuters, law firms, and ethics authors consistently stress that AI systems are tools, "not a substitute for a lawyer," and that human oversight is "critical" because outputs remain fallible and prone to hallucinations. (thomsonreuters.com)
  - Courts across multiple jurisdictions have sanctioned lawyers for filings containing fake AI-generated citations, with judges explicitly warning that "no reasonably competent attorney should outsource research and writing" to AI without verification. (markets.chroniclejournal.com)
Given that:
- Highly‑invested, domain‑specialized systems have had more than a year of intensive feedback from lawyers,
- Rigorous empirical studies show substantial error and hallucination rates, far from near‑perfect precision/recall, and
- The legal profession and courts still require human review and explicitly warn against relying on these tools in high‑stakes matters,
Chamath’s prediction—that a year of associate‑driven RL would yield effectively perfect, high‑stakes‑grade legal performance—has not materialized. The technology has advanced dramatically, but empirical results and professional practice clearly contradict the level of reliability he forecast.
So the forecast is wrong rather than merely “inconclusive.”