
A Hitchhiker's Guide to AI

Part 2. Imagine the Hitch Hiker: How to Turn Your Imagination into Images

2024. 07. 01
Today, I'm going to tell you about the second guide for AI hitchhikers, Text-to-Image (T2I).

Humanity, let go of your imagination

The starting point of every story and image in the world is our head, our imagination. We have always taken the stories and imagined images floating around in our minds and conveyed them to the world in various ways: telling them in words, drawing them, writing them down, or turning them into videos.
Human history has developed in this same order: from expressing emotions and intentions with simple sounds, to drawing cave paintings, to writing in hieroglyphs and ideograms, to developing language and culture, to sophisticated styles of painting, and then, a little later, to making movies. And what comes next?
To summarize briefly, the stories and images of humanity's imagination have expanded and evolved from simple sound (I2S: Imagine-to-Sound) to ever more detailed expression through pictures (I2I: Image), letters (I2T: Text), and video (I2V: Video).

The Hitchhiker's Technique

We, their descendants, also express our imagined images in various ways. AI image technology, however, is developing in a slightly different order, because it is easier to generate images from the specific, learned information implicit in text keywords than to extract that information from complex human voice conversations.
So the first method to be implemented was generating images from text prompts (T2I), and more recently methods such as real-time drawing or using other images as references (I2I) have followed. Technology that generates images in real time while you talk to the AI is gradually approaching as well.
The way humans bring imagination into reality has evolved steadily since the Lascaux cave paintings, but recent AI technology has arrived like a hitchhiker's ride: so suddenly that no one was properly prepared.
So let's look at these techniques one by one, like stretching before a full workout. (Hitchhikers, remember that our AI journey is just beginning.)

Text-to-Image (T2I) Technology

The first basic technology to introduce is Text-to-Image (T2I), in which AI generates images based on text prompts entered by the user.
For example, you can generate a beautiful image by entering a text prompt like "flying dolphin". This is used in almost all image generation platforms like DALL·E 3 or Midjourney, and thanks to this, many people have easily experienced this singularity technology.

T2I principle

How exactly does one create an image from text?
1. Starting from noise: Every image starts from a state of 'noise' made of random colored dots. A random base image is prepared that looks like the 'static' on a television with no signal; it is also called the SEED, because it is the starting point (seed) of every image.
2. Input prompt: Now let's add your imagination. How about typing "flying dolphin"?
3. Understanding the prompt: First, the AI works out what images the keywords "flying" and "dolphin" in your prompt correspond to. Then, based on its training data, it analyzes what each word represents in the overall context and figures out the user's intent.
4. Image creation begins: The AI gradually changes the colors of the dots in the noisy image. At first it is just noise, but through the process it takes on increasingly specific shapes and colors, as if paint were being added to a canvas.
5. Adding details: The AI refines the image gradually, over several stages. Once the outline is established, it fills in the details, adding color and texture to complete the style and mood.
6. Completed image: Eventually the AI completes an image that matches the prompt. As if by magic, the random colored dots become a beautiful picture, and you see a dolphin flying through the sky.
[GIF] The process of converting noise into an image by receiving a text prompt (Midjourney)
Your imagination changes the meaningless dots of the world one by one and completes them into an image with a beautiful story, just as countless stars gathering by chance in the night sky form a picture.
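The denoising steps above can be sketched in toy form. This is not a real diffusion model: the `denoise_step` function below simply blends noise toward a fixed target image, standing in for the learned, prompt-conditioned denoiser, and every name and number in it is made up for illustration.

```python
import numpy as np

# Toy illustration of the T2I denoising idea (not a real diffusion model):
# start from pure noise (the "SEED") and, over several steps, nudge each
# pixel toward a target image, which here stands in for what a trained
# model would infer from the text prompt.

rng = np.random.default_rng(42)          # the SEED: same seed, same start
target = np.zeros((8, 8))                # pretend this is "flying dolphin"
target[2:6, 2:6] = 1.0                   # a simple shape as the "subject"

image = rng.normal(size=(8, 8))          # step 1: pure noise, like TV static

def denoise_step(image, target, strength=0.3):
    """One refinement step: blend the current image toward the target."""
    return image + strength * (target - image)

for step in range(20):                   # steps 4-5: gradual refinement
    image = denoise_step(image, target)

error = np.abs(image - target).mean()    # step 6: almost the final image
print(f"mean distance from target after 20 steps: {error:.4f}")
```

Each pass keeps 70% of the previous state, so after 20 passes almost nothing of the original static remains; a real model runs a similar loop, except the "target" direction is predicted fresh at every step from the prompt.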

T2I Advantages

Have you ever imagined an image more beautiful than the one generated above? Or, after seeing the generated image, found yourself imagining something new? For me, the most amazing part of T2I was the visual inspiration that went beyond my own imagination.
But beyond just providing inspiration, this technology offers real-world opportunities for our hitchhikers.
1. High-quality images: Even if you can't draw well, you can easily create professional-quality images simply by entering prompts. No long training required.
2. Cost-effective: Non-experts can get much of the benefit of a professional designer at low cost. This is especially attractive for budget-conscious startups, small businesses, and individual creators.
3. Diverse content: In the process of transforming and reproducing various images, we are naturally exposed to many artistic ideas and forms of content creation. The moment we start using it, we become creators.
4. Quick visualization: Complex ideas can be visualized quickly, and in many variations, with prompts. This is useful for creative, fast-moving users who want rapid iteration and rearrangement.
5. Pre-training: After using T2I, you will move on to creating 3D images and videos. The experience of using T2I is itself basic training for all future visual content creation.
This technology, which turns imagined images into reality, is not just a tool but a powerful means of reconstructing reality. It is the most practical supporter, the one that best understands my imagination and presents it visually, and a partner that quietly keeps me company. What a friend to have.

T2I Limits

However, expressing an 'imaginary image' in the form of a 'text prompt' or a 'clear keyword' is actually a difficult task.
Usually, the images in our imagination are unclear and blurry. In fact, when I talk with hitchhikers, many of them don't know what image they actually want to create. One reason is that educational programs for visual imagination are far less common than writing training.
Even if you are a hitchhiker who can imagine clear images, you need another ability to express them as effective text prompts. You need both visual imagination and verbal expression, and you even need to use the specific keywords labeled in AI training images, arranged in an efficient prompt structure.
For example, “vivid colors” or “rainbow colors” might be better than “colorful,” and keywords like “aqua-blue, lemon-yellow, coral-pink” might be clearer instructions to AI. Use keywords that are as specific as possible. However, if you can’t think of a specific color, delegating the aesthetics to AI using the keyword “colorful” is also a great methodology.
But even when precise keywords are used: keywords can be ambiguous, function differently in context, and even conjure up completely different images across cultures.
For example, “apple” can mean a fruit, or it can mean the corporate brand known for the iPhone. If you use a poetic prompt like “red like a ripe apple,” you are more likely to get an apple than a deep shade of red. And the prompt “traditional wedding” will conjure up very different images depending on context and culture.
Writing a long prompt does not necessarily produce a better image. Most AI image models have a maximum input length, so long prompts can de-emphasize important words or make unimportant ones more prominent. Words at the beginning of a prompt are generally weighted more heavily, so place descriptions of the important elements first. If possible, short and clear prompts are more effective.
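As a concrete illustration of the length limit: CLIP-based models such as Stable Diffusion truncate the prompt at 77 tokens, so everything past that point is silently ignored. A rough check might look like this (word-splitting here is only a crude stand-in for the model's real tokenizer, which counts differently):

```python
# Rough prompt-length check. CLIP-based models such as Stable Diffusion
# truncate input at 77 tokens; words only approximate tokens.
MAX_TOKENS = 77

def check_prompt(prompt: str, max_tokens: int = MAX_TOKENS) -> str:
    words = prompt.split()  # crude stand-in for a real tokenizer
    if len(words) > max_tokens:
        # Keep the front of the prompt: early words carry more weight,
        # and everything past the limit would be dropped anyway.
        print(f"warning: {len(words)} words, trimming to {max_tokens}")
        words = words[:max_tokens]
    return " ".join(words)

short = check_prompt("flying dolphin, vivid colors, soft clouds")
long_prompt = check_prompt("dolphin " * 100)
print(len(long_prompt.split()))  # 77
```

Trimming from the back mirrors the advice above: if the model is going to drop anything, it should be the least important trailing descriptions, not the subject at the front.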
On the other hand, there are abstract and emotional elements that cannot be explained in text, and keywords the model has not yet learned. T2I is an imperfect tool, so we should give up the desire to explain everything in text. There are clear limits to describing delicate color tones or subtle emotions in words.
It is genuinely difficult to direct background elements beyond the subject in detail, or to explain the arrangement of dozens of chairs in a classroom one by one with prompts. Specifying object positions or coordinates individually sounds plausible, but such functions are not yet supported, and describing every element of an image is close to manual labor.
So we also need the know-how to delegate to the AI with simple directional keywords, such as 'well-organized chairs' or 'chairs arranged around a round table'.
Finally, there is also the problem of AI generating biased images. If the dataset that the AI model learns from contains human biases toward certain cultures, races, or genders, then the images it generates will reflect those biases.
For example, keywords such as “doctor” or “professor” may generate images of mostly white men, or keywords such as “beauty” may only reflect women or certain concepts. Most platforms are adjusting the bias of images, but there is still a long way to go. Of course, strategies that reversely utilize these biased or typical keywords to generate images can also be very effective methods.
If you are having trouble getting the results you want on platforms like DALL·E or Midjourney that generate images from text (T2I), try finding a hint in the story above.

T2I Running

So, to effectively utilize text-to-image (T2I) technology, we can use the following methods:
1. Prompt sentence placement: Write a sentence that summarizes the whole image first, then sentences about the subject and background, in that order. Then add details: color and lighting, special effects, mood, and so on. Use no more than seven sentences if possible.
2. Prompt structure: Use adjective+noun or subject+verb+object English sentence structures that the AI can easily understand. A patterned sentence structure not only helps create intuitive images, it also makes prompts easier to edit and reuse later.
3. Describe in order: If you have trouble describing a scene, try describing it in order, from the top left to the bottom right.
4. Use specific keywords: Look for specific, detailed keywords instead of vague words. There is almost always a keyword that creates a clearer image; you just may not know it yet.
5. Keep it short and clear: Keep prompts concise and express the important elements clearly. Long sentences cause the AI to miss important information.
6. Emphasize key elements: Place important elements at the beginning of the prompt, add specific descriptions, and list similar keywords to guide the AI's focus. Conversely, avoid describing unimportant elements.
7. Clarify context: If you must use an ambiguous keyword, disambiguate it clearly through context.
8. Use directional keywords: When there is a lot to describe, don't explain everything one by one; give comprehensive, clear direction like a film director. And trust the AI.
9. Use culture-specific keywords: Try keywords that visually capture diverse cultural backgrounds.
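The placement rules above can be folded into a small prompt builder. This is a hypothetical sketch: the part names and their order simply mirror tip 1 (summary first, then subject, background, details), with the most important elements up front as in tip 6.

```python
# Hypothetical prompt builder following the placement tips above:
# summary first, then subject, background, and optional details,
# with the most important elements at the front of the prompt.

def build_prompt(summary, subject, background, details=()):
    """Assemble prompt parts in the recommended order, skipping blanks."""
    parts = [summary, subject, background, *details]
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    summary="a dolphin flying through a sunset sky",
    subject="sleek silver dolphin, mid-leap",
    background="soft orange clouds, calm sea below",
    details=("vivid colors", "cinematic lighting"),
)
print(prompt)
```

Because each part is a named slot, swapping the background or the lighting keywords later means editing one argument rather than rewriting the whole prompt, which is exactly the reuse benefit tip 2 describes.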
If you are considering long-term learning, there are three competencies you can consider:
1. Make your imagination concrete: Practice visualizing the image in your mind clearly.
2. Express imagined images in text: Practice describing the imagined images clearly in written language.
3. Develop a visual eye: Develop an eye for selecting the best results, or the best candidates for editing, from the generated images.
Since the images generated by T2I can never perfectly match our imagination, a regeneration process is needed: modifying the prompt and searching for the desired image. A visual eye is therefore essential for quickly picking the best image out of the countless 'B-cuts' generated along the way and making effective decisions about how to develop it.
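That regenerate-and-select loop can be sketched as follows. Both `generate` and `score` are placeholders I made up: in a real workflow, `generate` would call an image platform with a different seed each time, and `score` would be your own visual eye (or a learned aesthetic scorer).

```python
import random

# Toy regenerate-and-select loop. `generate` and `score` are stand-ins:
# in practice you would call a T2I platform and judge results yourself.

def generate(prompt: str, seed: int) -> dict:
    """Pretend to render an image; really just a record with a random quality."""
    rng = random.Random(seed)  # deterministic per seed, like a T2I SEED
    return {"prompt": prompt, "seed": seed, "quality": rng.random()}

def score(image: dict) -> float:
    """Stand-in for the visual eye that separates A-cuts from B-cuts."""
    return image["quality"]

candidates = [generate("flying dolphin, vivid colors", seed) for seed in range(8)]
best = max(candidates, key=score)  # keep the A-cut, discard the B-cuts
print(best["seed"], round(score(best), 3))
```

The structure is the point: generate several candidates cheaply, then spend your judgment on selection, rather than trying to write one perfect prompt up front.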
Long-term learning goals are not easy, but I think it is a skill that I would definitely recommend to hitchhikers who see far ahead.
In addition, the prompt-and-image samples available on various websites, including Midjourney's, are, I think, the best learning material: they teach the correlation between the prompts actually used and the images they produced. The ability to make use of a chatbot like GPT-4o, which helps us in all kinds of technical and emotional areas, is also really important.
These short-term methods and long-term learning strategies can help you use text-to-image (T2I) technology more effectively, and you’ll get slightly more sophisticated and satisfying results.

T2I, the next technology

That's it for today. In the next episode, we'll cover the latest technologies after Text-to-Image (T2I). We have technology ready to describe and analyze images (I2T), and technology ready to create new images by referencing existing images (I2I).
Just as breaking a large workload into pieces and handing them over one at a time helps a struggling new employee work more efficiently, AI also creates images much better when you divide the task into stages and request them in turn. It is the same principle as writing concise text prompts in short sentences.
While we relied solely on text prompts to generate images in T2I, in the future we will provide a more intuitive and effective way to generate images by utilizing various image references.
In addition to text prompts, you can also set up sketch references, style references, and character references to provide clear creation guides. Let’s learn about how to request AI to create images in a way that divides up the work.
So, until next time, when things will get a little easier: to anyone struggling with writing image prompts, Don't Panic.
🍀 AI visual director who visualizes imagination and ideas, Mintbear