Last updated Nov 29, 2025
The next major advance in AI, likely within the next several years, will be the emergence of powerful multimodal models (combining video, audio, text, and other data) from a big tech company or OpenAI, enabling solutions to more substantive, complex problems than current single-mode models.
"The next big leap, and I think it will come from one of the big tech companies or from OpenAI, is... a multimodal model, which then allows you to actually bring together and join video, voice data in a unique way to answer real, substantive problems."
Explanation

As of late 2025, the prediction has clearly materialized.

OpenAI released GPT‑4V in 2023, adding image understanding to GPT‑4 and explicitly positioning multimodal LLMs, rather than text-only models, as a major frontier in AI, with vision tightly integrated into the language model rather than bolted on as a separate tool. (openai.com) In May 2024, OpenAI launched GPT‑4o, a large multimodal model that natively accepts text, image, and audio input and produces text, image, and audio output, aimed at more natural, interactive voice and video experiences. (openai.com) OpenAI has since built real-time voice and agent products (e.g., the Realtime API and next‑generation audio models) on this architecture, enabling practical, complex applications such as voice agents that understand speech, see the screen, and act. (openai.com)

Other big tech companies have independently taken the same path: Google DeepMind’s Gemini family, announced in December 2023, is natively multimodal across text, images, code, audio, and video; Google positioned it as its most capable AI model, integrated it into the Bard/Gemini apps, and planned rollouts across Search and Ads. (time.com) The broader ecosystem now routinely deploys vision‑language and multimodal models (e.g., in Microsoft Copilot and similar tools) on substantive, real‑world tasks that single‑mode text models could not handle. (en.wikipedia.org)

Given that (1) within a few years of the 2022 podcast, major advances did come from OpenAI and other large tech companies, (2) these advances are explicitly multimodal across text, audio/voice, images, and in some cases video, and (3) they are used to solve more complex, real‑world problems, Chamath’s prediction is best classified as right.