
A Hitchhiker's Guide to AI

Part 3. Telling the Story: Working with AI and Images

🇺🇸 EN / 🇰🇷 KR / 2024.07.22

Speak with pictures: Working with AI and images

Hello, hitchhikers. What surprises has this summer brought you? Lately, the field of image generation has been heating up with talk of using image references instead of text, which is why a creator's ability to collect and combine good references is drawing attention.
Welcome to the third chapter of our AI journey. Today we'll learn how to talk to AI using images.

Hitchhiker Meets Image Language

The Limits of Text: T2I, the Wall We Face

In the previous article, we looked at T2I, the technology that generates images from text. T2I has its limits, though: complex compositions and fine details are hard to express, and it is not easy to convey our intent accurately.
Even if you imagine a 'prince on a white horse', the prince T2I draws may not be in a style we like. We are a little picky: skin that is too pale, double eyelids that are too heavy, and sometimes even buttons that are too flashy can feel like a bit much.
Writing an image prompt with concise keywords and an effective structure is not easy. If you had to fit a fantastic story from your dreams into 280 characters on Twitter, would that be art or writing? T2I is a genuinely good tool, but it comes with an ironic catch:
if you want to create an image, you have to be good at writing!

New Languages, I2T & I2I

Today, we will look at two technologies that complement the limitations of T2I: I2T (Image-to-Text) and I2I (Image-to-Image). I2T lets AI interpret an image and explain it to us in text; I2I creates a new image based on an existing image.
These technologies give us a new way to communicate with AI through images. We speak directly in pictures, without passing through the intermediate step of 'text'.

Reading the World Through AI’s Eyes (I2T)

AI that 'reads' images

I2T technology is like giving AI 'eyes'. An AI that already had a great brain now has eyes to see the world, so it can recognize, analyze, understand, and describe images in text. It is no longer confined to the world of text, and because it can read images, it can also help create them.

The first tool to read images, GPT-4o

Most chatbots these days ship with a Vision function that covers I2T: the AI processes and analyzes visual information directly and gives feedback on it. It is already available in GPT-4o, Claude, Gemini, and others, and in GPT-4o even free users can use it.
In the past, the Vision function was mainly loved as an OCR feature that recognized and extracted the 'text' in images and documents, but it now does much more: analyzing and explaining the image itself (I2T), creating an image from an image (I2I), and, soon, analyzing video (V2T).
For example, once you upload an image to GPT-4o, you can use it in the following ways; a scriptable example follows after the two lists below.

Basic usage

1. Image Recognition: Identify famous or labeled images and provide relevant information. For example, recognize a painting by Van Gogh or a photo of a famous place and provide background information.
2. Image Analysis: Identify and analyze specific elements of an image, such as its type, structure, and texture. For example, analyze in detail what kind of image it is, its overall structure, the facial expressions of people in the photo, and the texture of objects.
3. Identifying Style: Recognize and describe the style of a particular painting or artist. For example, identify Banksy's stencil-based street art style, or the signature style of a particular photographer.
4. Understanding Context: Recognize the relationships between image elements and the overall scene to reconstruct a story. Example: from an image of two people standing under an umbrella on a rainy street, infer that a couple is on a date, and note that their closeness and the light reflecting off the wet pavement create a romantic atmosphere.
5. Text Generation: Express the analyzed information in natural language. Example: "This image shows an old man and children reading a book in a city park."
6. Prompt Suggestions: Provide effective prompts to help you create an image when needed. Example: "Prompt: City park, old man sitting on bench with children gathered around, warm sunlight, peaceful atmosphere, Banksy's stencil technique on concrete wall."
More complex requests, such as those below, are also possible if needed.

Advanced Usage

1. Please find an element in the image that symbolizes 'hope', or a story that can be inferred from it.
2. Analyze the mood and emotional tone of the image and suggest an artistic style that would convey it more effectively.
3. Analyze the fashion or architectural style depicted in the image and explain it by connecting it to historical context and current trends.
4. Compare the composition, color, and subject matter of the attached images, and explain the differences in the messages each image conveys.
5. Please arrange the attached images in chronological order and suggest a story.
6. Title Academy: Please give a title to the cat picture below.
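If you would rather script these requests than type them into the chat window, the same Vision capability is exposed through the OpenAI API. Below is a minimal I2T sketch, assuming you have the openai Python package installed and an OPENAI_API_KEY set; the image file name and the question are placeholders you would replace with your own.

```python
# Minimal I2T sketch: ask GPT-4o to describe an image and suggest a prompt.
# Assumes: `pip install openai` and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local image as a data URL (a hosted image URL also works).
with open("city_park.jpg", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image, identify its style, and suggest "
                     "a text-to-image prompt that would recreate it."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```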
Until now, we have been 'conversing with AI blindfolded, in chat windows, using text'. Now we can 'converse in images while looking at the world the way AI does'. So go ahead and show the AI the images you have.
For now we still have to 'upload' prepared image files to the AI, but they say GPT-4o will get eyes this fall so that it can talk in real time. Through the camera lens of our smartphones, AI will look at the world we live in alongside us, and in the not-too-distant future we will be able to look at the same place and talk about it in real time through live devices such as Apple Vision Pro.
For a long time we have appreciated images alone, interpreting works of art with amateur eyes. Of course, that is also great. But hitchhikers can now enjoy images and works of art together with a friend and colleague called AI: have more honest conversations, analyze them with an expert's eye, and try approaches and challenges we had never thought of.
Some people draw the line and say that art appreciation is a human domain, but that assumes humans will give up appreciating art because of AI's development. Of course we won't. Art is only getting closer.

A second tool for reading images, Midjourney's Describe

Midjourney, an image creation tool, has a feature called Describe. As the name says, it describes and explains an image as text (I2T). To use it, type /describe in the Discord environment and upload an image. It will suggest prompts for creating that image, and you can generate a similar image right away with one of them.
Chatbots like GPT-4o, which we looked at above, can give us genuinely diverse and thorough guidance and prompts. However, the prompts Midjourney returns, built from styles and keywords it has already learned, are more useful for actually generating images, which also makes Describe a really good I2T tool for learning.
In general, the prompts provided by Midjourney Describe follow this order: a summary of the whole image, a description of details, composition, color and lighting, special effects, and style and mood. From this we can infer that our own Midjourney prompts can be structured the same way.
It also surfaces the names of learned styles and artists, and you can pick up interesting expressions that describe color names, materials, and abstract imagery in text. The only regret is that it is currently available only in the Discord environment, though it is expected to come to the web service in the near future.
If you don't use Midjourney, you can try Leonardo.ai's free Describe feature, which is almost identical (R1).
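Since Describe itself only runs inside Discord, there is nothing to script there directly. If you want a local, scriptable stand-in for the same I2T idea, though, an open image-captioning model such as BLIP can turn a reference image into a rough text description. To be clear, this is not Midjourney's feature, just a minimal sketch using the Hugging Face transformers library; the model id and file name are placeholders you can swap.

```python
# Minimal open-source I2T sketch (a stand-in, not Midjourney Describe).
# Assumes: `pip install transformers torch pillow`.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("reference.jpg").convert("RGB")  # placeholder file name
inputs = processor(image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=40)

print(processor.decode(caption_ids[0], skip_special_tokens=True))
```

The caption will be much terser than what GPT-4o or Describe returns, but it illustrates the same image-to-text direction in a form you can automate.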

Drawing with AI (I2I)

Ending of writing class

We have been talking about images using 'text'. We talked about creating images with text (T2I) or getting text hints from images (I2T). The common problem here is that we use text in the process of creating images. We are writing for art.
So from now on, we're going to treat images as images. Writing class is on hold for a moment, and now it's art class.

I2I : Image-to-Image

I2I (Image-to-Image) is an AI technology that takes an image as input and creates a new image. When I use I2I, I picture the creative process of a skilled artist taking inspiration from an image and amplifying it into a new work. Sometimes it feels more like putting an image into a vending machine and getting another image out right away.
When I say 'insert an image' here, I mean using a 'reference image' you already have. In other words, alongside the chat window where you type text prompts, there is an input interface where you upload an image, and you can apply reference images for sketches, styles, characters, and actions.
Looking at the current technology trend, the most notable major tools, Midjourney, Stable Diffusion, Leonardo, Adobe Firefly, Adobe Photoshop, and others, have all started supporting an 'image reference' function. That is because I2I is the most effective way to overcome the limitations of the existing T2I technology.
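To make the vending-machine feeling concrete, here is roughly what I2I looks like at the library level. This is a minimal sketch using the open-source diffusers library and a Stable Diffusion img2img pipeline; the checkpoint name, file names, prompt, and strength value are assumptions you would tune for your own work, not a recipe from any particular tool above.

```python
# Minimal I2I sketch: generate a new image from a reference image plus a prompt.
# Assumes: `pip install diffusers transformers torch pillow` and a CUDA GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # any img2img-capable checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("sketch_reference.jpg").convert("RGB").resize((768, 768))

result = pipe(
    prompt="a prince on a white horse, watercolor style, soft morning light",
    image=init_image,
    strength=0.6,        # how far to move away from the reference (0 to 1)
    guidance_scale=7.5,  # how strongly to follow the text prompt
).images[0]

result.save("i2i_result.png")
```

Lowering strength keeps more of the reference (closer to a pure sketch reference), while raising it lets the prompt and the model's own style take over.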

I2I Process

This is the part where I define concepts in every lecture: the AI image process starts with the following broad division.
Creating an image: referring to a reference image, or creating a new image.
Image editing: modifying, expanding, or increasing the resolution of a created image.

I2I classification

And each stage can be further subdivided as follows:

Creating an image

1. Sketch Reference: elements that determine the story - composition, placement, shape and outline, subject and background, etc.
2. Style Reference: elements that determine the beauty - style, color, light and lighting, special effects, etc.
3. Character Reference: elements that keep a character consistent - character appearance, face, clothing, hair, movements, etc.

Image Editing

1. Inpainting: modify or replace parts or the style inside a generated image (see the sketch after this list).
2. Outpainting: expand the canvas of a generated image up, down, left, or right to create a larger image.
3. Upscaling: increase the resolution of the finished image at the end.
There are more layering possibilities, but the above should be enough as a baseline for explaining I2I.
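As referenced in the inpainting item above, here is what that operation looks like at the library level. This is a minimal sketch using the open-source diffusers inpainting pipeline; the checkpoint name, file names, and prompt are placeholders, and the mask is a black-and-white image in which white marks the region to repaint.

```python
# Minimal inpainting sketch: repaint only the masked region of an image.
# Assumes: `pip install diffusers transformers torch pillow` and a CUDA GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("generated.png").convert("RGB").resize((512, 512))  # placeholder
mask = Image.open("mask.png").convert("RGB").resize((512, 512))        # white = repaint

result = pipe(
    prompt="a red umbrella",  # what to draw inside the masked area
    image=image,
    mask_image=mask,
).images[0]

result.save("inpainted.png")
```

Outpainting works the same way conceptually: the canvas is enlarged first and the newly added border area is treated as the masked region to fill.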

Image reading practice, layering

Now, let's practice looking at images by layering them according to the criteria above.
In one image, we should be able to separate the 'sketch reference' that determines the story, the 'style reference' that determines the beauty, and the 'character reference'.
Just as in the restoration of a work of art, we will peel the image apart layer by layer. First, peel off only the parts of your image where the character is drawn. Then peel off the layer where the color is beautifully painted. Finally, only the sketch lines outlining each element remain. Every image in the world can be divided into at least three layers like this.
Let's try it the other way around this time. It's really easy if you think back to drawing and painting in a fun art class.
We sketch our own story onto a blank canvas, drawing the outlines and basic shapes with a 4B pencil, then paint over it in various colors to complete a beautiful piece. This is the easiest way to distinguish a 'sketch reference' from a 'style reference'. And if a character appears on your canvas, that part becomes the 'character reference'. Come to think of it, we have been drawing in layers since we were young.

Image creation techniques, references

We can borrow this traditional drawing and painting process for AI image generation. We can create an image that corresponds to the original sketch, carrying only the story. Or we can create images specialized for style, carrying only the beauty of the painting: the style, light and lighting, color, and special effects. We can also create reference character images to keep a character with a distinct personality consistent.
You can also combine all three reference images into one image.
Or you can tell your own story while borrowing someone else's style or character references, recast someone else's story in your own style reference, or recreate a famous character with your own story and style.
Midjourney officially promotes this style reference feature and encourages using ready-made sref codes and various mixes, for example by appending --sref with a style code or image URL (and --cref with a character image) to a prompt. There is also an active community sharing those codes and the know-how.
You can subdivide the layering further, or mix only what you need. The important thing is that, alongside text prompts, we now have the freedom to choose image references.

Advantages of Image References

When we use image references, we gain the following advantages:
You don't have to cram everything into the text prompt.
Images can guide sketch elements, such as composition or arrangement, that are hard to describe in text.
Images can guide style elements, such as color or mood, that are hard to describe in text.
You can keep a character consistent across multiple images without describing the character's traits in text every time.
By combining various references, you have more freedom in creation.

Why share references?

Rather than putting complex content into a single text prompt, you can create it more effectively by dividing the content corresponding to 'story', 'style', and 'character' into references. It's like dividing the work into an overall director, story team, style team, and character team for content production, and adding each team's expertise.
On the other hand, it also plays a role in improving work efficiency. When creating only with text prompts, there are many cases where you have to modify the entire prompt or start over again when there is a problem with a specific keyword. If the prompt is long, it is also difficult to determine which keyword is the problem.
However, by using references, it is easy to check what part of the work process is problematic, and the reusability of references allows for very effective and quick work on utilizing similar styles or composing stories featuring the same characters.

Tools that currently support I2I

The I2I technologies currently supported by major AI tools are as follows. Although the names are slightly different, they all aim for similar functions.

Image Creating: with Reference

Midjourney: Image Prompt, Style Reference (sref), Character Reference (cref)
Leonardo: Image Input (Style, Content, Character, Depth, Edge, Pose, ...)
StableDiffusion: Depth, Canny Edge, Style Transfer, Pose, FaceSwap, ...
Adobe Firefly: Composition/Structure Reference, Style Reference, Effects, Camera Angle
Adobe Photoshop: Reference Image & use of Library

Image Editing: In/Out-painting

Midjourney: Repaint, Reframe (Zoom, Pan)
Leonardo: Edit in Canvas (Inpainting, Outpainting)
StableDiffusion: Inpainting, Outpainting
Adobe Firefly: Generative Fill, Generative Expand
Adobe Photoshop: Generative Fill, Generative Expand

Real I2I

Future Technology of I2I

Although it is not yet mainstream, the future of I2I technology is already here. It comes down to adding two keywords to I2I: 'real-time' and 'reality-based'.

Real-time I2I technology

Currently, most AI image generation takes roughly 10 seconds to a minute. There are services that update the image in real time as you change prompt keywords (R2, R3), but most workflows still use pre-prepared prompts and image references and generate images one at a time.
However, real-time services are also steadily being developed in which you draw with a mouse or pencil on the left of the screen and the image on the right changes accordingly. They still need polish, but the technology reflects text prompts and image references almost in real time (R4~R6).
In addition, an approach that connects ComfyUI to Adobe Photoshop as a plug-in for real-time rendering is being tested experimentally: as you place drawing elements on the left Photoshop canvas, an image is generated on the right. If it is introduced as an official plug-in or becomes widespread, it will maximize the efficiency of the design process. Being able to keep changing the composition and elements in the left editing panel while checking and correcting the result in the right rendering panel in real time makes this the most effective and intuitive form of I2I.
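If you want a feel for near-real-time I2I on your own machine, few-step models make generation fast enough to rerun on every brush stroke. The sketch below is only an assumption-laden example using the open-source diffusers library with the SDXL-Turbo checkpoint, not the ComfyUI-Photoshop setup described above; the checkpoint, file names, and parameter values are placeholders.

```python
# Near-real-time I2I sketch: few-step img2img with SDXL-Turbo.
# Assumes: `pip install diffusers transformers torch pillow` and a CUDA GPU.
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

sketch = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))  # placeholder

# Few steps and no classifier-free guidance keep each update fast enough
# that a "redraw on every stroke" loop starts to feel interactive.
result = pipe(
    prompt="cozy cabin in a snowy forest, warm window light",
    image=sketch,
    strength=0.5,           # SDXL-Turbo needs num_inference_steps * strength >= 1
    guidance_scale=0.0,     # SDXL-Turbo is trained to run without guidance
    num_inference_steps=2,
).images[0]

result.save("live_preview.png")
```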

Reality-based I2I technology

Currently, most generated images are not grounded in real-world measurements. Even interior images that look very realistic cannot be applied directly to actual architectural spaces or interiors, because the images are not tied to the dimensions of the real space and the AI simply draws pictures without any technical understanding of the built environment.
So AI is being developed that automatically converts simple hand drawings into 3D models based on real measurements and renders them as images. Measurement-based technology focuses on increasing the accuracy of design. The same goes for fashion design AI, which has to match models to consumers, and product design AI, which works with prototypes and mockups.
The next step for I2I in image generation tools will be to make them real by connecting them with real-world data from each industry.

Image Language: The Power of Reference

AI image generation, which began with T2I (Text-to-Image), has recently taken a big step forward as I2T (Image-to-Text) and I2I (Image-to-Image) have matured. In particular, now that image references can be used in the most notable major AI environments, the paradigm of image generation is shifting significantly and taking hold.
Using image references lets us convey to AI, intuitively, the composition, style, and mood that were hard to express with text prompts alone. The key point is that this helps the AI understand and realize the user's intent more accurately.
Advances in I2T have also enabled AI to 'read' and interpret images, allowing more efficient feedback and iteration in the image creation process. Where T2I was an uncontrolled roll of the dice, I2I is becoming an efficient process as the tools mature.
Together, these technologies enable a new creative workflow: draft with text (T2I), get the AI's interpretation (I2T), revise, and then refine the details with specific image references (I2I). This direction is likely to continue for the foreseeable future.
In the next episode, we will start talking about 'video AI' through T2V (Text-to-Video) and I2V (Image-to-Video). Videos that used to move messily or awkwardly have recently started to move properly.
Here too, 'image references' are very helpful in video generation AI, so I hope you become familiar with I2I in many ways.
I'm spending this summer with Midjourney's style reference (sref), the video creation tool Luma Dream Machine, and Runway's Gen-3. The tools may change, but I think they're the ones that are creating the most important trends right now.
Here's a hint for the next installment: the basic learning patterns for images and videos are similar. Build a foundation with prompts for still images, get used to image references, and add prompts for timelines, and you'll get closer to video AI.
I'll get back to you with some useful articles after organizing my thoughts and skills. Until then, Don't Panic! Have a great summer with AI!
🍀 AI visual director who visualizes imagination and ideas, Mintbear
Reference
R1. bit.ly/LeonardoDescribe (free before creation)
R2. https://clipdrop.co/instant-text-to-image (paid)
R3. bit.ly/LeonardoRealtimeGen (free before creation)
R4. bit.ly/LeonardoRealtimeCanvas (free before creation)
R5. www.krea.ai (paid)
R6. https://github.com/comfyanonymous/ComfyUI (local installation)
A Hitchhiker's Guide to AI: 1️⃣ / 2️⃣ / 3️⃣ / 4️⃣ / 5️⃣ / 6️⃣