
A Hitchhiker's Guide to AI

Part 4. Video Palette: Next Scene Drawn by AI (T2V, I2V, V2V)

2024.08.11

👨‍🎨 Video Palette: Next Scene Drawn by AI (T2V, I2V, V2V)

You have arrived at the video constellation.

Dear hitchhikers, our AI journey has arrived at the video content constellation. From the moment we open our eyes and reach for our smartphones in the morning to the YouTube Shorts we watch before falling asleep, we are riding a spaceship called video, traveling through a space of information and emotion.
In the past, movies and TV programs were the domain of broadcasting stations, directors, and experts, and the same was true of newspapers. There was a time when the whole family gathered in the living room for the 9 p.m. news, and everyone at school or work talked about the weekend dramas. It is a scene that is hard to explain to today's MZ generation.
But what about now? Your smartphone is already a movie theater and, at the same time, a movie production studio. Can you imagine? Whether you feel it or not, video generation AI is at the center of this change.
Now, let’s take a look at how AI video technology is creating a new constellation of content. Don’t Panic!

[1] T2V, I2V, V2V: Three Engines of AI Video Production

Our spacecraft is equipped with three very powerful engines: Text-to-Video (T2V), Image-to-Video (I2V), and Video-to-Video (V2V).
To bring the movie in your imagination to life, you can describe it with text prompts, provide reference images, or start from existing footage you have already prepared. Each of these three engines has its own strengths, but they are most powerful when they work together as a synergistic drive.

The Three Roles of the Video Team

From another perspective, T2V, I2V, and V2V each play their own role, much like the members of a film production crew working toward a single goal.
Screenwriter, T2V (Text-to-Video): T2V, which creates video from text, is like a screenwriter. It draws the big picture and builds the skeleton of the story. It cannot express every detail perfectly, but it is excellent at setting the overall direction. T2V's strength lies in the rapid implementation of ideas and the visualization of concepts: you condense a complex narrative into a simple description, and the AI unfolds it into video. However, it has limitations in accurately expressing detailed visual elements or complex changes; it is difficult to convey subtle nuances or a specific visual style with text alone. It therefore needs to be supplemented with the other technologies.
Art Director, I2V (Image-to-Video): I2V, which creates video from images, acts like an art director or cinematographer. It provides concrete visual references and determines the look and feel of the video. Preparing each scene in detail is a hassle, but it is essential for creating a unique visual identity. I2V's biggest advantage is that it can accurately reflect the creator's visual intent, because detailed visual elements that are hard to express with T2V can be presented directly as images. It can convey a particular video style, specific scenes and compositions, and key effect points quite accurately. On the other hand, producing a long video in a consistent style may require a large number of reference images, and the process of extracting dynamic motion from still images can produce unexpected results.
Editor and VFX Expert, V2V (Video-to-Video): V2V is the final-stage video editor and special effects expert, with the ability to reinterpret and transform existing footage. It is still in its early stages, but it has the potential to completely restructure footage or transform it in real time. Currently, V2V is mainly used to change the style of footage or modify specific elements: it can turn live-action footage into animation or change the weather conditions. But V2V's potential goes far beyond that. Judging from Sora's preview materials, it should become possible to swap characters or backgrounds while preserving the narrative of the overall footage, or to expand and composite the style and content of short clips into completely different footage.
When these capabilities are properly implemented and work in synergy, they will create a new paradigm in video production: a workflow in which T2V lays out the initial idea and storyline, I2V defines the specific visuals and style, and V2V refines and expands the finished video.
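To make that workflow concrete, here is a purely hypothetical sketch, not a recipe from any specific tool: you might start with a T2V prompt such as "a lighthouse keeper discovers a glowing whale at dawn" to rough out the scene, then use I2V with a concept image painted in your target style to lock down the look of the keeper and the coastline, and finally run the assembled cut through V2V to change the weather, swap a background, or restyle the whole sequence into a consistent finish.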
Hitchhikers, if you've made it this far, it's probably time for us to go spaceship shopping.

[2] New Spaceships: A Feast of AI Video Tools

This time, we introduce the spaceships that will make our video journey even more exciting. These video generation tools are equipped with the T2V and I2V engines we covered earlier; each has its own distinctive capabilities, and all of them amplify our creativity.
Then, shall we go spaceship shopping together?
🍀
You'll need to get used to the language of video going forward, so even if you come across unfamiliar terms below, please read on to see which tools are worth paying attention to.

Boardable spaceships

First, let me introduce Runway's Gen-3. Gen-3 is a versatile tool that supports both text and image prompts, and it conveniently offers keyframes and a range of other functions. Once the camera movement and Motion Brush features from the previous version, Gen-2, arrive in an update, it should become even more powerful. Users can produce striking camera moves and special effects (VFX) and create stylized videos with a simple prompt. For example, a prompt such as "fast drone tracking shot" will have Gen-3 generate a video that looks like it was shot from a fast-moving drone, and complex effects and transitions can be applied with equally simple commands. As of this writing, it is regarded as the most powerful tool available to the public.
Next is the Luma Dream Machine, which aims to implement a 3D world. Luma also creates videos from text and images, and it is quite good at presenting subjects cinematically from various angles and under various lighting conditions. However, its relatively flat, muted color tone and a CG-like look that does not quite escape the uncanny valley are seen as its weak points.
Another powerful tool, Kling, trained on a huge amount of data, is Gen-3's strongest competitor. It already supports keyframes, camera movements, and master shots, and a motion brush feature is reportedly on the way, so further growth is expected. Within the domains it has learned, its quality already surpasses Gen-3, its consistency is good, and its support for negative prompts is a real strength. The dark tone of most of its output is a little disappointing, though. Even so, as of this writing, I can recommend Kling and Gen-3 as the most effective tools.
But that doesn't mean these spaceships guarantee a perfect trip. Problems remain, such as inconsistency, distorted subjects, failure to follow real-world physics, and failure to properly reflect user intent. It is as if the problems image generation tools faced a year ago are being replayed in the video domain, only at a faster pace: video tools are clearing the hurdles that image tools cleared, a little more quickly.
So, I think hitchhikers can now prepare themselves mentally.

Spare spaceships

Here are two next-generation spacecraft scheduled for launch in the second half of the year. First, Google's VEO. VEO promotes video generation technology based on world simulation and can generate videos up to one minute long. The released clips show it maintaining a consistent storyline from short prompts, simulating difficult physical situations, and composing complex scenes naturally.
OpenAI's Sora is also in preparation. Works by several artists have been released, and it too boasts world-simulation-based video generation. Sora's strength is that, thanks to large-scale training data, it can naturally sustain complex changes while keeping subjects consistent. What sets these tools apart is that they simulate and reproduce reality, based on real-world physical laws, inside the virtual space we call video. The virtual world we have talked about for so long is being realized in the form of video content.
Much about Sora and VEO is still shrouded in mystery, from the camera movement features they ought to support to the creation UI, asset management, and user pricing. Gen-3 is not cheap either, so it will be interesting to see whether Sora and VEO reach the general public at a reasonable price. Before you go shopping, these spaceships are worth checking for both their impressive features and their cost-effectiveness.

Before boarding the spaceship

Hitchhikers, these new AI video spaceships, each with its own distinctive visual style and its own pros and cons, are opening entirely new markets within the YouTube and mass video content constellations you love to watch.
The reason I'm introducing these spaceships, or tools, full of unfamiliar terms is that we have to choose between being a 'YouTube consumer in bed' and a 'content creator.' In a way, we are being offered a red pill and a blue pill.
Soon, the time will come when everyone uses video language in their daily lives without difficulty. Just as the end of illiteracy let everyone enjoy the written word, and then images, the next step is video.
The countdown has begun.

[3] The Birth of the Digital Double: The Online Self Created by AI

Hitchhikers, we now enter the mystical space called 'Digital Persona', the core area of the content constellation.

Choice Paralysis Zone: The Avatar Factory

Shopping for spaceships for your Avatar in this area can be a bit confusing.
Last year, avatar creation, face-swap, and lip-sync tools started with HeyGen and D-ID, and have been followed by Alibaba's EMO, Microsoft's VASA-1 technology, and more recently, Synthesia, Hedra, and Live Portrait, creating a huge battlefield. Video lip-syncing capabilities have also been introduced in Gen-3.
The functions of these spaceships are so diverse that they are hard to sum up in a single phrase. In general, they take text, images, or video as input and create a similar avatar, or generate a talking-face video from an image. The core of these services is that the avatar reads the input message in a variety of voices, with mouth shapes and facial movements that look remarkably natural.
For example, HeyGen lets you give multilingual presentations using your own face and voice, while Live Portrait can turn old family photos or works of art into living people who speak and smile.
This field is hot because, in the content market, the 'persona' that fronts your content and delivers it to consumers, as its point of contact and its speaker, matters so much. On the other hand, I think it is also another expression of the desire to live as someone else. It's really interesting.

A world where choices are possible

Now you can choose. You can have the sentences you want read out loud, in the voice you want, with the face you want, with natural mouth shapes and movements, and in different languages from all over the world. This is the core function of the Avatar ships in this area.
I no longer have to be only me; I can become someone else and tell a story. Doesn't it feel as though the individuality and uniqueness of 'me' that we humans have maintained for thousands of years is being liberated, and at the same time completely dismantled?
For more interesting and safer hitchhiking ahead, we can check the following items in our manual.
Expansion and redefinition of the identity of 'me'
Overcoming space and time constraints
The changing value and role of labor
Ethical challenges and legal issues
Digital literacy and critical thinking
New horizons of creation
Reconstructing human relationships and trust
In this video journey that simulates reality, it is not only the 'content' that is being simulated, but also 'us.' It could be me as I am, a sub-persona that carries a part of me, or an entirely new persona created from scratch.
AI video technology seems to be asking us to go beyond simple video generation and redefine our online identities.

[4] Video Prompt: New Grammar Drawn by AI

Humans have continued to express their ideas in richer and more vivid ways. From cave paintings to creating paints, establishing painting styles, documenting the present through photographs, recreating reality through film, and now through AI technology, our communication methods have constantly evolved.

Evolution of Expression: From Text to Image to Video

Over time, human expression has become more intuitive and powerful, with each language carrying a slightly different kind of power.
[Written language] It is the most powerful tool for recording and conveying our thoughts. It is said that the eradication of illiteracy made the masses the protagonists of history. Barriers between cultures and languages remained, but in the AI era those boundaries are gradually blurring. We have recently entered the era of real-time translation and interpretation, the starting point of a kind of global communication mankind has never experienced before.
[Image language] Sometimes a single picture can move and persuade more people than a thousand words, and even change history, and it works across language barriers. A single photograph in journalism helped stop a terrible war, and the photograph known as the Pale Blue Dot made us reflect on our existence in the vast universe. Now everyone captures everyday moments with their smartphones and tells their stories through those images. The advent of image-generating AI has lowered the threshold for artistic expression, and more people are taking part in image creation.
[Video language] It is the densest form of expression: a three-dimensional language that places both written and image language on a timeline and reproduces reality most vividly. Now anyone can experiment with and make use of the high-level video language that was once reserved for professionals in the video industry. I think the ground has been laid for our small stories to be expressed and shared in richer, more colorful ways.

Video Prompts: Grammar of Video Language

A video prompt is another name for video language: it compresses the complex video production process into simple text. To convey video language to an AI, the following core elements are added on top of the usual image prompt structure.
Narrative: Describes the skeleton of the video story.
Camera Work: Specifies the point of view, frame, and movement.
Special Effects: Enables changes and expressions beyond reality.
Timing and Pacing: Controls the rhythm and flow of the video.
The language of images is the capture of a static moment. And the language of video is the innumerable written and image languages placed on a dynamic timeline.
So the camera, which had been still, starts moving back and forth, up and down, and fast and slow. Then, the subject moves and changes within the frame. Sometimes it flashes momentarily or has special effects from a fantasy world, and sometimes a new narrative is created through the connection of completely unrelated scenes.
We call this AI language, which adds dynamic, changing elements on top of static image prompts to generate video, the video prompt.
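As a purely illustrative sketch, and not an official template from any particular tool, a single video prompt might weave these four elements together like this: "A lighthouse keeper walks along a cliff at dawn (narrative), slow drone tracking shot rising into a wide aerial view (camera work), fog glowing with bioluminescent light (special effects), the motion starting gently and accelerating over the final seconds (timing and pacing)."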
So next time, we're going to explore more deeply how to effectively combine and utilize different video prompts, and what new expressive possibilities this opens up.
For those of you who are curious, you may want to first check out Mintbear's [Video Prompt Book] prepared for Gen-3.

[5] Infinite Palette: The next scene we will draw

Advances in AI video technology are quietly but surely changing our daily lives. From video generation using T2V, I2V, and V2V to digital personas, these technologies are increasingly penetrating our lives. What will the next scene of this sci-fi movie-like reality look like?

The popularization of video language

Filmmaking, once the domain of experts and capital, is becoming an everyday activity that anyone can do. The elements of filmmaking that used to be expensive, from actors and tens of thousands of extras, trained film crews and editing rooms, large-scale studios, and overseas location scouting to high-end VFX and CG, drones, and underwater cameras, can now be tested with AI. With simple text prompts instead of complex editing software, and with AI's visualization ability instead of expensive camera equipment, we can easily turn the ideas in our imagination into video.
At the same time, the spread of video language will greatly expand the public's visual literacy; I feel this strongly whenever I run training sessions. From short videos for social media to promotional videos for personal branding, more people will tell their own stories in their own video language.

Efficient experimental spirit

Experimentation usually involves a certain amount of trial and error and inefficiency. But AI-powered video prompting can dramatically increase the speed and efficiency of creation. With a single prompt, you can create multiple versions of a video simultaneously, testing out different styles, narratives, camera positions, and special effects. It allows you to visualize your ideas instantly, giving all creators unprecedented freedom and possibilities.
This efficiency fundamentally changes the creative process, reducing the cost of trial and error and allowing for bolder, more innovative attempts.

Personal persona

AI avatar technology gives us the opportunity to create digital avatars. These digital personas can replicate our appearance and voice, speak in multiple languages, and be active 24 hours a day. They can be used in a variety of fields, such as personal branding, customer service, and remote education. When combined with real-time response technology, they are even more powerful.
This technology is extending our presence into the digital world. As individuals can assume multiple roles through multiple personas, our ideas about identity and self are changing. At the same time, we are also thinking about issues such as authenticity, trust, and digital ethics.

Reproducing and extending reality

I think the medium of video reproduces reality and, at the same time, extends it with new interpretations, just as what is an ordinary video log to one person can, seen through a new interpretation, suggest a different life to someone else.
What I shared today was mainly about creating compelling videos with video prompts, and about the literacy around technology that can transform personas across languages. In the future, however, AI vision capabilities will reproduce and expand everyday video in real time, and the day when it is endlessly multiplied and consumed in multiverse-like virtual spaces is not far away.
In an age where anyone can easily speak the language of video, we think more about reality, virtuality, and ourselves.
What will the next scene be like with this new palette? Hitchhikers, let's explore together and open up another world.

Video prompt

In the next part, Part 5, we'll dive deeper into video prompts to see how to use this technique in practice. We'll analyze real videos, categorize different types of prompts, and look at key keywords.
Until we meet again, Don't Panic!
🍀 Mintbear, an AI visual director who visualizes imagination and ideas
Gen-3 Video Prompt Book ➡️ https://slashpage.com/gen3
A Hitchhiker's Guide to AI: 1️⃣ / 2️⃣ / 3️⃣ / 4️⃣ / 5️⃣ / 6️⃣