Gemini Omni Flash Brings Multimodal AI Video Creation To Google

Published by Connect2Connect

May 21, 2026

Google has introduced Gemini Omni, a new multimodal AI model capable of creating and editing videos by combining text, images, audio, and video prompts.

Google announced the model at Google I/O 2026, emphasizing Omni's ability to combine multiple media types seamlessly.

Google presented Omni as a significant step toward turning Gemini into a fully creative AI system that can both understand and generate many forms of media.

Gemini Omni Flash Powers Next-Gen Video Creation

The initial version of the model, called Gemini Omni Flash, is currently rolling out through the Gemini app, Google Flow, and YouTube Shorts.

The model combines Gemini's advanced reasoning capabilities with AI-powered content creation tools, allowing users to create cinematic-quality videos from simple natural-language prompts.

One of the standout features of Gemini Omni is its conversational video editing capability.

Instead of relying on traditional editing tools or complex timelines, users simply describe the changes they want in natural language.

This makes video editing more intuitive and accessible, since creators can revise their content by conversing with the AI.

Realistic AI Video Editing

Google demonstrated examples in which users transformed sculptures into bubbles, turned mirrors into liquid, added animations, and modified environments within video clips.

The AI kept characters consistent, maintained realistic physics, and preserved overall scene continuity while applying these changes.

According to Google, each new command builds upon previous edits, allowing users to refine videos across multiple prompts without losing consistency.

The company also stated that Gemini Omni has a stronger understanding of movement, lighting, gravity, fluid dynamics, and object interactions.

This improved understanding helps the model generate scenes that appear more lifelike, visually coherent, and physically accurate.

Multimodal Creative Tools

Google states that Gemini Omni can process multiple input types simultaneously. Users can upload photos, existing videos, drawings, voice references, and text prompts and combine them into a single, unified output.

For example, users can apply the visual style of one image to a video or synchronize visuals with music.

The model can also generate cinematic clips from rough sketches combined with written instructions.

In addition, the system can create educational explainers and animated sequences from simple prompts.