
Understanding image and video creation

First, note that the word "diffusion" itself simply means spreading.
Unsurprisingly, image and video generation is more expensive than text generation. More precisely, it consumes a far larger number of tokens. Tokens are roughly the fragments of words or characters that a model processes; going deeper only complicates things, so you don't need the details here. The general reasons are as follows.
Complexity and size of data: Images and videos contain far more data than text. For example, a single image is made up of hundreds of thousands to millions of pixels (a 512×512 image alone has 262,144), each carrying information about color and brightness. A video is a sequence of such images over time. Text, by contrast, has a much simpler structure: a sequence of characters.
Processing time and cost: Generating and editing images and videos takes a large amount of computation. That calls for high-performance hardware, which raises the cost. Text generation involves comparatively light computation per output, so it can run on fewer computing resources.
Complexity of the learning process: Image and video generation models must recognize and understand various shapes and patterns. This requires a much more complex learning process than text. Text generation is mainly focused on learning the rules and structure of language, which is relatively simple compared to visual data.
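The data-size point above can be made concrete with some rough arithmetic. The sketch below compares raw, uncompressed byte counts only; it is not how any particular model measures tokens, and many image models work on a compressed representation rather than raw pixels. The 512×512 resolution, 24 frames per second, 5-second clip, and the sample sentence are illustrative assumptions, and the function names are hypothetical.

```python
def raw_image_bytes(width: int, height: int, channels: int = 3,
                    bytes_per_channel: int = 1) -> int:
    """Uncompressed size of one image (default: 8-bit RGB)."""
    return width * height * channels * bytes_per_channel


def raw_video_bytes(width: int, height: int, fps: int, seconds: int,
                    channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Uncompressed size of a video, treated as a series of images."""
    frames = fps * seconds
    return frames * raw_image_bytes(width, height, channels, bytes_per_channel)


def text_bytes(text: str) -> int:
    """Size of a piece of text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from any model).
    sentence = "a horse running on a beach"
    image = raw_image_bytes(512, 512)
    video = raw_video_bytes(512, 512, fps=24, seconds=5)

    print(f"text : {text_bytes(sentence):>12,} bytes")
    print(f"image: {image:>12,} bytes")
    print(f"video: {video:>12,} bytes")
```

Even at this modest resolution, one image is tens of thousands of times larger than the short sentence, and a few seconds of video multiplies that again by the number of frames.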
There are other reasons, but the main point is that image generation inevitably consumes a lot of computing power. So while small and medium-sized language models can run fairly well even on old computers, image generation is hard to use smoothly unless you either pay a high cost to process it in the cloud (AI Image Profile) or have a high-performance processor (GPU) of your own.
If you want to know more about the underlying principles, I recommend watching the video below or looking up keywords such as CNNs and GANs.
ⓒ 2023. Haebom, all rights reserved.
It may be used for commercial purposes with permission from the copyright holder, provided the source is cited.