
Multimodal CoT Prompting

Multimodal CoT prompting was first introduced in the 2023 paper "Multimodal Chain-of-Thought Reasoning in Language Models" by Zhuosheng Zhang and colleagues. As demand for multimodal input and output (images, video, audio, etc.) has grown, CoT has naturally been extended to multimodal settings.
Multimodal CoT prompting is an approach that enables language models to reason over both text and visual information. The framework consists of two stages: rationale generation and answer inference. In the first stage, the model processes both textual and visual inputs to produce a rationale, a written-out reasoning path; in the second stage, it infers the final answer to the question from that rationale.
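
The two-stage flow is easy to sketch in code. Below is a minimal Python sketch: `vlm_generate` is a hypothetical stand-in for whatever vision-language model call you actually use (any API that accepts text plus images and returns text). The point is the shape of the two calls, with the stage-1 rationale fed back in as context for stage 2.

```python
# A minimal sketch of two-stage multimodal CoT prompting.
# vlm_generate() is a hypothetical stand-in for any vision-language
# model call that takes text plus images and returns text.

def vlm_generate(prompt: str, images: list[str]) -> str:
    """Hypothetical VLM call: send text and image paths, get text back."""
    raise NotImplementedError("Replace with your model's actual API call.")

def multimodal_cot(question: str, images: list[str]) -> str:
    # Stage 1: rationale generation. The model inspects the images
    # and writes out its reasoning before committing to an answer.
    rationale = vlm_generate(
        f"Question: {question}\n"
        "Look at the images and explain, step by step, "
        "what you observe that is relevant to the question.",
        images,
    )

    # Stage 2: answer inference. The rationale is fed back in as
    # context, and the model now produces the final answer.
    answer = vlm_generate(
        f"Question: {question}\n"
        f"Reasoning: {rationale}\n"
        "Based on this reasoning, give the final answer.",
        images,
    )
    return answer
```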

Example prompt:

"이 두 생명체가 공통으로 가진 속성은 무엇입니까?"
Rationale generation: The model observes each object and judges whether it has certain characteristics. For example, given pictures of a cat and a dog, it might note that both are mammals and that both have eyes, noses, mouths, fur, and teeth.
Answer inference: From the rationale it generated, the model deduces which attributes the two objects share.
🤖
These two creatures appear to be a dog and a cat, and they both share the attribute of being 'mammals'.
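Plugged into the sketch above, the cat-and-dog example becomes a single call (the image file names are illustrative):

```python
# Illustrative file names; any two animal photos would do.
answer = multimodal_cot(
    "What attribute do these two creatures have in common?",
    ["cat.jpg", "dog.jpg"],
)
print(answer)  # e.g. "Both are mammals."
```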
Compared with traditional text-only CoT, this method gives the model a richer, more nuanced understanding of problems where visual context matters. Multimodal CoT lets models tackle more complex and diverse tasks, opening up new possibilities in fields where visual information is critical.
ⓒ 2023. Haebom, all rights reserved.
May be used for commercial purposes with the copyright holder's permission and with proper source attribution.