About Movie Gen
What is Movie Gen?
Movie Gen is a next-generation generative AI foundation model from Meta that spans multiple modalities: video generation, audio generation, and video editing. The system pairs a 30-billion-parameter model for text-to-video and text-to-image generation with a 13-billion-parameter model for audio, enabling creation of short, realistic videos (up to ~16 seconds) with synchronized sound effects and music. Beyond straightforward generation, Movie Gen supports personalization from a user's photo (so that person can appear in the video with their identity preserved), editing of existing footage via text instructions (e.g., inserting objects or changing scene elements), and generation of audio tracks matched to the content (e.g., background music and sound effects).
Although the work has been released as research, Meta has said that public developer access is not currently planned and that any future rollout will focus on collaborations with creative and entertainment industry partners. Use cases span creative content generation, filmmaking prototyping, advertising, immersive media, and internal media-production workflows, giving creators a way to produce audio-visual content from simple prompts without extensive traditional filming or editing infrastructure.
How to use Movie Gen?
Movie Gen is not yet open to the public, so there is currently no site where you can create an account and start generating. If and when Meta opens access, expect the workflow to center on the capabilities described below: text-to-video generation, audio generation and synchronization, and video editing via text. A hypothetical sketch of what such a workflow could look like in code follows.
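Because no public API or client library exists, any code is necessarily speculative. The following is a minimal sketch of what a text-to-video call could look like; every name here (MovieGenClient, VideoRequest, generate_video, and each parameter) is invented for illustration and does not correspond to a real Meta interface.

```python
# Hypothetical sketch only: Movie Gen has no public API, so every name here
# (MovieGenClient, VideoRequest, generate_video, each parameter) is invented
# to illustrate the workflow described above.
from dataclasses import dataclass


@dataclass
class VideoRequest:
    prompt: str                 # natural-language scene description
    duration_s: float = 16.0    # clips run up to ~16 seconds
    aspect_ratio: str = "16:9"  # 1:1 and 9:16 are also supported
    with_audio: bool = True     # synchronized sound effects / music


class MovieGenClient:
    """Placeholder client; a real one would call Meta's (unreleased) service."""

    def generate_video(self, request: VideoRequest) -> str:
        # A real implementation would submit the request and return a video
        # URL or file; this stub just echoes what would be sent.
        audio = "with" if request.with_audio else "without"
        return (f"[stub] {request.duration_s:.0f}s {request.aspect_ratio} "
                f"clip {audio} audio for: {request.prompt!r}")


client = MovieGenClient()
print(client.generate_video(VideoRequest(
    prompt="A golden retriever surfing a wave at sunset",
)))
```

The dataclass simply bundles the parameters the article describes (prompt, ~16-second duration cap, aspect ratio, synchronized audio) into one request object, which is how many generation APIs are shaped today.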
What Are the Key Features of Movie Gen?
Generate high-definition video clips (up to ~16 seconds long) from natural language prompts, with reasoning about object motion, camera movement, and subject-object interaction.
Create background music, sound effects, and audio tracks synchronized with the generated video content (up to ~45 seconds of audio) to enhance immersion.
Edit existing video clips with text instructions (e.g., insert objects, change surfaces, modify scene elements) to alter footage in a controllable way.
Upload a photo of a person to generate a video where that person appears in the scene, preserving identity and motion while placing them in new scenarios.
Built on a transformer-based architecture (≈30B parameters for video, ≈13B for audio) trained on licensed and publicly available datasets and optimized via joint text-image and text-video objectives; a toy sketch of this joint setup appears after this list.
Support for multiple aspect ratios (1:1, 9:16, 16:9) and resolutions up to full HD (~1080p) while maintaining coherent motion and audio; the dimension arithmetic is sketched below the list.
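The "joint text-image and text-video objectives" above amount to training one backbone on both media types. As a toy illustration only (not Movie Gen's actual design: the real model, shapes, loss, and text conditioning all differ and are omitted here), an image can be treated as a one-frame video so that both modalities flatten into token sequences served by the same backbone:

```python
import numpy as np

# Toy stand-in for a shared transformer backbone: a single linear map.
# Everything here (shapes, the regression loss, zero targets) is illustrative;
# it only shows how one backbone can serve both image and video batches.
rng = np.random.default_rng(0)
D = 64                              # token embedding size (toy value)
W = 0.01 * rng.normal(size=(D, D))  # "backbone" weights

def backbone(tokens: np.ndarray) -> np.ndarray:
    """Map an (n_tokens, D) token sequence to predictions of the same shape."""
    return tokens @ W

def objective(tokens: np.ndarray, target: np.ndarray) -> float:
    """Generic regression-style loss; the real training objective differs."""
    return float(np.mean((backbone(tokens) - target) ** 2))

# An image is a one-frame video: both flatten to token sequences, so the
# joint loss is simply the sum of the per-modality losses.
image_tokens = rng.normal(size=(256, D))       # 1 frame  x 256 patch tokens
video_tokens = rng.normal(size=(16 * 256, D))  # 16 frames x 256 patch tokens
joint_loss = (objective(image_tokens, np.zeros_like(image_tokens))
              + objective(video_tokens, np.zeros_like(video_tokens)))
print(f"toy joint image+video loss: {joint_loss:.4f}")
```

The design point this illustrates is weight sharing: image data is far more plentiful than video, so letting one backbone learn from both batch types is a common way to transfer image-level knowledge into video generation.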
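For the aspect ratios listed above, "full HD" pins the shorter side at roughly 1080 pixels, so the output dimensions follow from simple arithmetic. A small helper makes this concrete (the rounding to even numbers is an assumption borrowed from common video-codec constraints, not something Meta documents):

```python
def hd_dims(aspect_w: int, aspect_h: int, short_side: int = 1080) -> tuple[int, int]:
    """Scale an aspect ratio so its shorter side equals `short_side` pixels.

    Dimensions are rounded to the nearest multiple of 2, a typical codec
    requirement (an assumption here, not a documented Movie Gen constraint).
    """
    scale = short_side / min(aspect_w, aspect_h)
    w = round(aspect_w * scale / 2) * 2
    h = round(aspect_h * scale / 2) * 2
    return w, h

for aw, ah in [(1, 1), (9, 16), (16, 9)]:
    w, h = hd_dims(aw, ah)
    print(f"{aw}:{ah} -> {w}x{h}")
# 1:1 -> 1080x1080, 9:16 -> 1080x1920, 16:9 -> 1920x1080
```

So a 16:9 clip at full HD is 1920x1080, the vertical 9:16 variant is 1080x1920, and the square 1:1 variant is 1080x1080.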
