Last updated Nov 29, 2025
The next major advance in AI, likely within the next several years, will be the emergence of powerful multimodal models (combining video, audio, text, and other data) from a big tech company or OpenAI, enabling solutions to more substantive, complex problems than current single-mode models.
"The next big leap, and I think it will come from one of the big tech companies or from OpenAI, is... a multimodal model, which then allows you to actually bring together and join video, voice data in a unique way to answer real, substantive problems."
Explanation

As of late 2025, the prediction has clearly materialized.

OpenAI released GPT‑4V in 2023, adding image understanding to GPT‑4 and explicitly positioning multimodal LLMs, rather than text-only models, as a major frontier in AI, with vision tightly integrated into the language model rather than bolted on as a separate tool. (openai.com) In May 2024, OpenAI launched GPT‑4o, a large multimodal model that natively accepts text, image, and audio input and produces text, image, and audio output, aimed at more natural, interactive voice and video experiences. (openai.com) OpenAI has since built real-time voice and agent products (e.g., the Realtime API and next‑generation audio models) on this architecture, enabling practical, complex applications such as voice agents that understand speech, see the screen, and act. (openai.com)

Other big tech companies have independently taken the same path: Google DeepMind’s Gemini family, announced in December 2023, is natively multimodal across text, images, code, audio, and video; Google positioned it as its most capable AI model, integrated it into the Bard/Gemini apps, and planned rollouts across Search and Ads. (time.com) The broader ecosystem now routinely deploys vision‑language and multimodal models (e.g., in Microsoft Copilot and similar tools) on substantive, real‑world tasks that single‑mode text models could not handle. (en.wikipedia.org)

Given that (1) within a few years of the 2022 podcast, major advances did come from OpenAI and other large tech companies, (2) these advances are explicitly multimodal across text, audio/voice, images, and in some cases video, and (3) they are used to solve more complex, real‑world problems, Chamath’s prediction is best classified as right.