Last updated Nov 29, 2025
aitech
OpenAI (or comparable leading AI providers) will, in the foreseeable future, add video as a supported modality for their models, following text, images, and audio, enabling multimodal input and output that includes video.
The other thing is multimodal. I mean, so they're really stressing the idea of combining text with photos. I guess videos will eventually come later.
Explanation

Sacks predicted that OpenAI (or other leading AI labs) would eventually extend multimodality beyond text, images, and audio to include video.

That has happened:

  • OpenAI developed Sora, a text‑to‑video model that generates realistic short videos from text prompts. OpenAI first previewed Sora in February 2024, then moved it out of research preview and launched it publicly on December 9, 2024 as a product for ChatGPT Plus/Pro users, who can generate videos up to 1080p and 20 seconds long. (openai.com)
  • The Sora interface explicitly supports prompting with text, images, and videos, and can extend or remix existing clips, making video both an input and an output modality within OpenAI’s ecosystem—exactly the kind of next‑step multimodality Sacks anticipated. (openai.com)
  • OpenAI has since released Sora 2 (September 30, 2025), which adds synchronized dialogue and sound effects, further cementing video (with audio) as a core modality alongside text and images. (openai.com)
  • Beyond OpenAI, other top labs followed the same trajectory: Google integrated its Veo video‑generation models (Veo 2, later Veo 3) into Gemini/Gemini Advanced, letting users generate short, high‑resolution videos from text (and sometimes images), which confirms the broader industry move Sacks was pointing to. (blog.google)

Because OpenAI and comparable leading providers did, in fact, add video as a supported modality within roughly 1–2 years of the November 2023 podcast, the prediction is right.