VideoPoet can generate videos from text descriptions. Unlike most video generators, it is not based on a diffusion model: it is itself an LLM that can understand and process multimodal information and incorporate it directly into the video generation process.
Not only can it generate videos, it can also apply stylized effects to them, repair and extend them (video inpainting and outpainting), and even generate audio from video.
One-stop service…
For example, VideoPoet can generate a video from a text description or turn a still image into a moving clip. It can also understand and generate audio, and can even write code for media processing tasks.
This multimodal learning capability makes VideoPoet more flexible and powerful in video generation, able to handle more complex and diverse tasks.
Demonstration video:
By default, the VideoPoet model generates videos in portrait orientation, mainly to suit short-form video content. To demonstrate its capabilities, the Google Research team produced a short film made up of many clips generated by VideoPoet.
To make the short film, the team first used Bard to write a short story about a traveling raccoon. Bard provided not only a scene-by-scene breakdown of the story but also a prompt for each scene. These prompts were then used to guide VideoPoet in generating video clips that match the story.
This process demonstrates VideoPoet's versatility and creativity in video content creation. By combining different tools, such as Bard's story-writing ability and VideoPoet's video generation, it is possible to create imaginative and engaging visual content.
This approach opens up new possibilities for video production and storytelling, and is especially well suited to short-form video and social media content.
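In code terms, that workflow is a simple loop: one prompt per scene, one generated clip per prompt. This is only a hypothetical sketch; neither Bard nor VideoPoet exposes a public API for this, and all names below are invented.

```python
# Scene prompts as Bard might produce them for the raccoon story (invented examples).
scene_prompts = [
    "a raccoon packs a tiny suitcase, warm morning light",
    "the raccoon rides a train, countryside rolling past the window",
    "the raccoon dances on a beach at sunset",
]

def make_short_film(prompts, generate_clip, concatenate):
    """Generate one clip per scene prompt, then stitch the clips together."""
    clips = [generate_clip(p) for p in prompts]  # one VideoPoet call per scene
    return concatenate(clips)
```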
The main features of VideoPoet:
1. A wide range of video generation tasks: VideoPoet handles many video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting (repair and extension), and video-to-audio.
2. Multimodal learning: unlike video generation models built mainly on diffusion, VideoPoet, as a large language model, shows strong learning ability across multiple modalities, including language, code, and audio.
3. Integrated video generation capabilities: VideoPoet combines many video generation capabilities within a single large language model, rather than relying on components trained separately for each task.
4. Task design: VideoPoet adjusts its generation process to the task at hand (text-to-video, image-to-video, and so on). Each task type is indicated by a specific task tag that steers the model toward the corresponding kind of generation (see the sketch after this list).
5. Long video generation: through continuous prediction, VideoPoet can generate longer videos. At each step it conditions only on the tail of the video so far (for example, the last second) and predicts what comes next.
6. Interactive video editing: users can edit videos interactively, for example changing the actions or behavior of objects in a video, by adding new text prompts to the input video.
7. Image-to-video control: an input image can be animated and its content edited according to text prompts.
8. Camera motion control: adding specific camera-motion descriptions (such as zoom, pan, or arc shots) to the text prompt makes those camera motions appear in the generated video.
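To make points 4 and 8 concrete, here is a minimal sketch of how task tags and camera-motion hints might be composed into a prompt. VideoPoet has no public API, and the tag strings and function below are invented for illustration; the post says only that each task is indicated by a dedicated tag.

```python
# Hypothetical task tags; the actual tokens VideoPoet uses are not published.
TASK_TAGS = {
    "text_to_video": "<t2v>",
    "image_to_video": "<i2v>",
    "video_inpainting": "<inpaint>",
    "video_to_audio": "<v2a>",
}

def build_prompt(task, description, camera_motion=None):
    """Prefix a task tag and optionally append a camera-motion hint."""
    prompt = f"{TASK_TAGS[task]} {description}"
    if camera_motion:
        prompt += f", {camera_motion}"
    return prompt

print(build_prompt("text_to_video",
                   "a raccoon backpacking through a neon-lit city",
                   camera_motion="slow zoom out"))
# -> <t2v> a raccoon backpacking through a neon-lit city, slow zoom out
```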
Working principle:
VideoPoet is built on a large language model (LLM) and combines multimodal learning with autoregressive generation. Like other LLMs it processes and generates text, but it is additionally trained to understand and generate video and audio.
Multimodal learning: VideoPoet can process multiple types of input and output (text, images, video, and audio) and can combine different kinds of information, such as a text description and an image, to create new video content.
Autoregressive model: each step in generating a video is conditioned on the steps before it. In this way the model builds the video gradually, ensuring the coherence and consistency of the content.
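As a rough illustration of this step-by-step process (and of the long-video extension in point 5 above), the loop below grows a token sequence by repeatedly conditioning on its tail. The model object and its predict_next method are hypothetical; VideoPoet's real interface is not public.

```python
def extend_video(model, tokens, n_steps, context_len):
    """Grow the token sequence step by step, conditioning only on its tail."""
    for _ in range(n_steps):
        context = tokens[-context_len:]        # e.g. the tokens for roughly the last second
        tokens += model.predict_next(context)  # hypothetical autoregressive call
    return tokens
```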
Encoding and decoding of video and audio: to process video and audio, VideoPoet uses special encoders (MAGVIT V2 for video and SoundStream for audio) to convert this content into discrete tokens the model can understand, and the corresponding decoders to turn the generated tokens back into visual or audible form.
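Put together, the overall flow is encode, generate, decode. The sketch below shows only that pipeline shape: MAGVIT V2 and SoundStream are the real tokenizers named above, but the wrapper objects and method names are invented for illustration.

```python
def generate_clip(text_prompt, llm, video_codec, audio_codec):
    """Tokenize the prompt, generate discrete tokens, then decode them."""
    prompt_tokens = llm.tokenize(text_prompt)         # text -> discrete tokens
    video_tokens, audio_tokens = llm.generate(prompt_tokens)
    video = video_codec.decode(video_tokens)          # MAGVIT V2-style video decoder
    audio = audio_codec.decode(audio_tokens)          # SoundStream-style audio decoder
    return video, audio
```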
Detailed introduction: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
Demonstration: https://sites.research.google/videopoet/