Google debuts Gemini Omni: Multimodal AI that creates anything from any input

By admin | May 19, 2026 | 4 min read

When Google introduced Gemini three years ago, the ambition was to create a multimodal large language model—a unified neural network trained on text, images, audio, and video, capable of generating content in any of these formats. Today, at the Google I/O developer conference, the company took a concrete step toward that vision with Gemini Omni, a new family of multimodal models. According to CEO Sundar Pichai, Omni will be able to "create anything from any input."

The rollout begins with video capabilities. Users can now combine images, audio, video, and text, and rather than simply stitching these inputs together, Omni reasons across all of them to produce a cohesive output. The result is high-quality videos that demonstrate an understanding of physics, culture, history, and science. Omni also allows users to edit photos using plain text commands instead of complex editing software, similar to Google’s Nano Banana. While Google already has a dedicated video model called Veo, which lets users turn text and images into videos and customize avatars, DeepMind’s director of product management, Nicole Brichtova, emphasizes that today’s release is more than just an update to Veo. "It’s the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models," she said.

During a media briefing on Monday, DeepMind’s chief technologist, Koray Kavukcuoglu, offered an example: when Omni was given a simple prompt like "a claymation explainer of protein folding," it quickly rendered a stop-motion video with a voice-over explaining, "Proteins start as chains of amino acids. They fold into patterns like the alpha helix and flat sections called beta sheets, forming a perfect three-dimensional shape."

The long-term vision for Omni is even broader, involving the model generating images from audio or audio from video. "When we first announced Gemini, it was our first AI model to be natively multimodal," Pichai said during the briefing. "We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction."

As part of the release, users will also be able to create videos featuring their own digital avatars, a feature popularized by OpenAI’s now-defunct Sora app through Cameos. To prevent deepfakes, users must go through a dedicated onboarding process, which involves recording themselves and reciting a series of numbers, according to Brichtova. The avatar is then stored for future use. Additionally, all videos created with Omni will include Google’s SynthID digital watermark, enabling users to verify whether videos were generated using Gemini products.

The first model in the family, Gemini Omni Flash, will roll out today to the Gemini app, YouTube Shorts, and the AI creative studio Flow. Flash can render up to 10 seconds of video. Brichtova notes that this isn’t a model limitation but a deliberate decision to make it more accessible, anticipating that most users won’t want to create longer videos just yet. However, longer durations are planned for the near future.

Google appears to be positioning Omni Flash as a consumer tool. Barth-Maron put it simply: "They’re like personalized memes."

"We definitely did focus on making this easy to use for consumers," Brichtova said. "Not many video models have breached that chasm with consumers, so this is our play to do that."

The ease of use comes with a caveat: Brichtova and Barth-Maron noted that editing prompts will need to be highly specific. Otherwise, Omni risks over-editing or unintentionally altering elements the user wanted to keep, a problem Nano Banana users are likely to have encountered.

Despite the near-term consumer focus, Omni’s enterprise and creative implications are clear. Google will make Omni available via API in the coming weeks. The avatar-generating tool—already available today on Shorts—is expected to be adopted by content creators. More broadly, an end-to-end multimodal workflow could be transformative for advertisers and filmmakers. Startup Luma AI is developing a similar agentic tool that can generate an entire ad campaign from a short brief and a product image, powered by its own "unified" model. "We’re actually pretty proud of the model’s text-rendering capabilities, which is really useful for things like advertising," Brichtova said. "If you want a product somewhere, or even just a slogan, it needs to be accurate… We definitely anticipate filmmakers and other kinds of creators are going to be using this model as well."

The more professional use cases might be better served by the Omni Pro model, which should perform better across all Omni tasks. Google hasn’t announced a release date for Pro yet, but Brichtova said it will happen when "we feel like we’re at a point where we have a step change above Flash."