
Multimodal CoT prompt

Multimodal CoT prompting was introduced in the 2023 paper "Multimodal Chain-of-Thought Reasoning in Language Models" by Zhuosheng Zhang and colleagues. The work applies CoT to multimodal settings, responding to the growing demand for models that handle multimodal (image, video, audio, etc.) input and output.
Multimodal CoT prompting is an approach that lets language models reason over both text and visual information. The framework consists of two stages: rationale generation and answer inference. In the first stage, the model processes the text and visual inputs together to generate a rationale, an intermediate chain of reasoning grounded in the visual evidence; in the second stage, it infers the answer to the question based on that rationale.
An example prompt:
"이 두 생명체가 공통으로 가진 속성은 무엇입니까?"
Rationale generation: The model observes each object and determines which properties it has. For example, given a picture of a cat and a dog, the model notes that both creatures are mammals and have eyes, a nose, a mouth, fur, teeth, and so on.
Answer inference: Based on the generated rationale, the model concludes which properties the two creatures share.
🤖
The two creatures appear to be a dog and a cat, and they share the common attribute of being mammals.
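
The two-stage flow above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: `call_vlm` is a hypothetical placeholder for whatever vision-language model API you use, and the prompt wording is an assumption.

```python
def call_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical helper: send an image plus a text prompt to a
    vision-language model and return its text output. Replace with
    a real API call (HTTP request, local model, etc.)."""
    raise NotImplementedError("Wire this to your VLM of choice.")


def multimodal_cot(image_path: str, question: str) -> str:
    # Stage 1: rationale generation -- the model describes the visual
    # evidence relevant to the question before answering.
    rationale_prompt = (
        f"Question: {question}\n"
        "Look at the image and describe, step by step, the properties "
        "of each object that are relevant to the question."
    )
    rationale = call_vlm(image_path, rationale_prompt)

    # Stage 2: answer inference -- the rationale is appended to the
    # original question so the final answer is conditioned on it.
    answer_prompt = (
        f"Question: {question}\n"
        f"Rationale: {rationale}\n"
        "Based on the rationale above, state the final answer."
    )
    return call_vlm(image_path, answer_prompt)


# Example matching the cat-and-dog question above (file name is illustrative):
# answer = multimodal_cot(
#     "cat_and_dog.jpg",
#     "What attribute do these two creatures have in common?",
# )
```

Keeping the two stages as separate model calls mirrors the paper's design: the answer is conditioned on an explicitly generated rationale rather than produced in a single pass.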
This approach enables a richer and more nuanced understanding of problems where visual context is important than traditional text-only CoT. Multimodal CoT allows models to handle more complex and diverse tasks, opening up new possibilities in areas where visual information matters.