
A Hitchhiker's Guide to AI

Part 1. Don't panic, hitchhiker

🇺🇸 EN / 🇰🇷 KR / 2024.05.29

Thanks to AI, the planet is really busy

With the announcements of GPT-4o, Google I/O, and MS Copilot+PC, humanity has had a busy schedule this month (May 2024). Thanks to generative AI, we now live in a world where we can create songs, hold voice conversations with AI, and easily create videos from images. Let's take a look at where this exciting world is headed.

To the hitchhikers traveling through AI

Living through 2023 and now 2024, I happened to meet AI. AI was never the subject of my life's research, but I have been having fun talking about it and studying it with enthusiasm. It was an unexpected turn of events: I caught a spaceship called AI that happened to be passing by, and it felt a lot like hitchhiking.
While thinking about what to call us, I decided that since we met AI by chance and set off on this journey together, 'The Hitchhiker's Guide to the Galaxy' (2005) would be a fitting story.

Multimodal and Omni AI

Unlike last year, most major AI services in 2024 are oriented toward 'multimodal' and 'omni'. Multimodal means that AI can process various types of data, such as text, images, sound, and video, at the same time; omni is GPT-4o's approach of handling all of these data types comprehensively on a single platform and responding in various forms at once.
Responding to text is now routine; AI can also analyze images, sound, video, programming code, and data in real time, whatever you ask of it, and give feedback in a variety of forms.
You can even make requests out loud (GPT-4o's voice interface, Voice Mode).
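If you would like to see what a single 'multimodal' request looks like in practice, here is a minimal sketch using the official openai Python package: one message that mixes a text question with an image. The model name, the question, and the image URL are placeholder assumptions for illustration, not taken from any particular demo.

```python
# A minimal multimodal request: text + image in a single message.
# Assumes the official `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this photo?"},
                {
                    "type": "image_url",
                    # Placeholder URL; any publicly reachable image works.
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)  # a text answer about the image
```

The point is simply that the two modalities travel in one request and come back as one answer; you no longer run a separate 'image tool' and 'text tool'.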
This change is a completely new way of working, and it can be quite confusing for our hitchhikers. For immigrants from the keyboard-and-mouse era, it can be a bit overwhelming to figure out how to make themselves heard in this new space; even with the microphone right there, they wonder when it is supposed to be switched on.
That is exactly why we start from the basics of AI today: how data is converted between different forms such as text, images, sound, and video.

A Hitchhiker's Guide to AI

Generative AI works with various data types. If you belong to the 'native AI generation', you may already be using them naturally, without the distinctions below. But if you belong to the 'hitchhiker generation', more at home with analog or plain digital, it is worth becoming a little more conscious of them.
Here's your first guide to understanding multimodality.
Hitchhiker's Guide 1: List of AI Multimodal Transformations
Image
1. Text-to-Image (T2I): Generates an image from a text prompt. (Examples: DALL·E 3, Midjourney; see the code sketch just below this list)
2. Image-to-Text (I2T): Analyzes an image to generate a text description. (Example: Midjourney's /describe function)
3. Image-to-Image (I2I): Transforms or stylizes an existing image into a new one. (Examples: Stable Diffusion, Midjourney Style Reference)

Video
4. Text-to-Video (T2V): Generates video from a text prompt. (Examples: Gen-2, Pika, Sora, Veo)
5. Image-to-Video (I2V): Generates continuous video from a source image. (Examples: Gen-3, Pika, EMO, MS VASA-1)
6. Video-to-Video (V2V): Converts or auto-edits the style of an existing video to create a new one. (Examples: HeyGen, A1111, Domo)
7. Video-to-Text (V2T): Analyzes the content of a video to generate a text description.

Sound
8. Sound-to-Text (S2T): Generates a text description or transcript by analyzing sound or speech. (Examples: Clova Notes, ChatGPT Voice Mode)
9. Text-to-Sound (T2S): Generates sound, voice, or music from a text description. (Examples: Suno, Udio, ElevenLabs)
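To make row 1 (T2I) concrete, here is a minimal text-to-image sketch using DALL·E 3 through the official openai Python package; the prompt and the image size are illustrative assumptions.

```python
# A minimal text-to-image (T2I) sketch with DALL·E 3.
# Assumes the official `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# A text prompt goes in; a link to a generated image comes out.
result = client.images.generate(
    model="dall-e-3",
    prompt="A hitchhiker waiting for a passing spaceship, watercolor style",
    size="1024x1024",
    n=1,
)

print(result.data[0].url)  # URL of the generated image
```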
Have you tried them? How many of the nine AI features above have you used?
If you have created images with DALL·E 3 or Midjourney, you have experienced step 1; if you have asked a chatbot to analyze an image, or used Midjourney's /describe function, that is step 2; and if you have used Stable Diffusion or Midjourney's Reference feature, you have touched step 3, the forefront of image creation.
Moving on to video: anyone who has used Gen-2 or Pika has experienced steps 4 and 5, and anyone who has restyled a video with HeyGen, Domo, and the like has tried the early stage of step 6. Video analysis for step 7 is being prepared in GPT-4o and its peers. Turning meeting recordings into text with Clova Notes is step 8, and creating your own music or voices with Suno, Udio, ElevenLabs, and others is step 9.
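As a hands-on illustration of step 8 (S2T), here is a minimal sketch that uses OpenAI's Whisper transcription endpoint as a stand-in for the tools named above; the filename meeting.mp3 is a hypothetical placeholder.

```python
# A minimal sound-to-text (S2T) sketch using the Whisper transcription endpoint.
# Assumes the official `openai` package, an OPENAI_API_KEY, and a local recording.
from openai import OpenAI

client = OpenAI()

# An audio file goes in; a plain-text transcript comes out.
with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # the meeting as text, ready to summarize or search
```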
These multimodal capabilities are now being combined, entering a stage where they produce much stronger synergy. GPT-4o is said to understand the user's emotions and intentions from the data provided and to deliver real-time analysis combined with online search results. In practice, we will be experiencing the next stage of the technology before we even get used to the one in the guide above.
So I don't think you need to understand every single element or try every single technology. Everything is becoming more convenient, so easy that you no longer have to be conscious of the boundaries.
But I think it's good to be aware of change.

The boundaries have become colorless

In fact, this article started as a simple plan to document, as an 'AI tool guide', the multimodal process of converting one type of data into another, so that hitchhikers could get more out of multimodal AI.
But as I write, I realize that such a guide is already becoming a bit of a relic. We are moving toward a stage where we no longer have to be conscious of those boundaries.
We used to create carefully named folders on our computers and store mutually incompatible JPG images, HWP word-processor files, and PPT presentations separately inside them, and those files kept the same extensions wherever we first saved them.
Now, however, MS Copilot+PC promises that we will be able to recall everything we have ever done on the computer, and GPT-4o can convert between different types of data with ease. To some extent we may no longer need to organize our files, or worry about their formats at all.

How is it evolving?

Let’s take ‘image generation AI’ as an example.
The familiar text-to-image (T2I) technology, which generates images from prompts, has already reached its next stage. At first you had to type complex prompts yourself to get an image; since last year we have been getting help with prompts from ChatGPT, using image and character references in Midjourney, and, with Stable Diffusion XL Turbo, reaching the early stage of real-time image-to-image (RI2I), where hand drawings are reflected in generated images in real time.
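To make that early 'real-time' stage a bit more tangible, here is a minimal sketch of Stable Diffusion XL Turbo image-to-image using the Hugging Face diffusers library, roughly following its published usage; the input file sketch.png, the prompt, and the step/strength values are illustrative assumptions, and a CUDA GPU is assumed.

```python
# A minimal SDXL Turbo image-to-image sketch with Hugging Face diffusers.
# Assumes a CUDA GPU and a local input drawing "sketch.png" (placeholder name).
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

# The hand drawing that should steer the output.
init_image = load_image("sketch.png").resize((512, 512))

# Very few steps and no classifier-free guidance, which is what makes
# near-real-time feedback possible with the Turbo model.
image = pipe(
    prompt="a watercolor landscape in soft morning light",
    image=init_image,
    num_inference_steps=2,
    strength=0.5,       # steps * strength must be >= 1 for img2img
    guidance_scale=0.0,
).images[0]

image.save("result.png")
```

Because each image takes only a couple of denoising steps, the loop can rerun every time the sketch changes, which is what turns generation into something closer to drawing with live feedback.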
GPT-4o, however, is said to analyze the real world coming in through the camera with its Vision capability, accept multi-layered prompts of intent, including emotion read from the user's facial expressions and voice, and generate data, images, video, or sound as desired.
Not only will users be able to bring their imagination directly into the world as text and images, but AI will also read the emotions and intentions in the user's face and voice to help with creation, and the result may be video or music rather than a still image.
As summer rolls on and Sora and Veo launch, followed by Midjourney Video in the second half of the year, the way we create videos will change dramatically.
Where once artists, designers, directors, and producers led the way in making images and videos, the time is coming when anyone, be it a teacher, a self-employed person, a consumer, or a volunteer, can 'easily pull out and use' images, sound, and video as needed.

Don't panic, hitchhiker

Many of you have probably seen the GPT-4o demo videos, where it acts as eyes for the visually impaired, tutors students in math, interprets in real time, and picks up on hidden emotions through conversation.
How did you feel? Did you feel like Marvin, the gloomy robot in the movie?
In 'The Hitchhiker's Guide to the Galaxy', the signature running joke "Don't Panic" keeps appearing in the middle of crises, usually with no actual countermeasure on offer, or because whatever was going to happen was going to happen anyway.
I would like to open this 'Hitchhiker's Guide to AI' with the same message.
“Don’t panic. We will adapt.”
🍀 Mintbear, an AI visual director who visualizes imagination and ideas
A Hitchhiker's Guide to AI: 1️⃣ / 2️⃣ / 3️⃣ / 4️⃣ / 5️⃣ / 6️⃣