
Multimodal CoT Prompting

Multimodal CoT prompting was first introduced in the 2023 paper "Multimodal Chain-of-Thought Reasoning in Language Models" by Zhuosheng Zhang and colleagues. As demand for multimodal input and output (images, video, audio, etc.) has grown, CoT has naturally been extended to multimodal settings.
Multimodal CoT prompting is an approach that enables language models to reason over both text and visual information. The framework consists of two stages: rationale generation and answer inference. In the first stage, the model processes both textual and visual inputs to produce a rationale, a written-out reasoning path; in the second stage, it infers the final answer to the question from that rationale.
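
The two-stage flow is easy to sketch in code. Below is a minimal Python sketch: `vlm_generate` is a hypothetical stand-in for whatever vision-language model call you actually use (any API that accepts text plus images and returns text). The point is the shape of the two calls, with the stage-1 rationale fed back in as context for stage 2.

```python
# A minimal sketch of two-stage multimodal CoT prompting.
# vlm_generate() is a hypothetical stand-in for any vision-language
# model call that takes text plus images and returns text.

def vlm_generate(prompt: str, images: list[str]) -> str:
    """Hypothetical VLM call: send text and image paths, get text back."""
    raise NotImplementedError("Replace with your model's actual API call.")

def multimodal_cot(question: str, images: list[str]) -> str:
    # Stage 1: rationale generation. The model inspects the images
    # and writes out its reasoning before committing to an answer.
    rationale = vlm_generate(
        f"Question: {question}\n"
        "Look at the images and explain, step by step, "
        "what you observe that is relevant to the question.",
        images,
    )

    # Stage 2: answer inference. The rationale is fed back in as
    # context, and the model now produces the final answer.
    answer = vlm_generate(
        f"Question: {question}\n"
        f"Reasoning: {rationale}\n"
        "Based on this reasoning, give the final answer.",
        images,
    )
    return answer
```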

Example prompt:

"이 두 생명체가 공통으로 가진 속성은 무엇입니까?"
Rationale generation: The model observes each object and judges whether it has certain characteristics. For example, given pictures of a cat and a dog, it might note that both are mammals and that both have eyes, noses, mouths, fur, and teeth.
Answer inference: From the rationale it generated, the model deduces which attributes the two objects share.
🤖
These two creatures appear to be a dog and a cat, and they both share the attribute of being 'mammals'.
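Plugged into the sketch above, the cat-and-dog example becomes a single call (the image file names are illustrative):

```python
# Illustrative file names; any two animal photos would do.
answer = multimodal_cot(
    "What attribute do these two creatures have in common?",
    ["cat.jpg", "dog.jpg"],
)
print(answer)  # e.g. "Both are mammals."
```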
Compared with traditional text-only CoT, this method gives the model a richer, more nuanced understanding of problems where visual context matters. Multimodal CoT lets models tackle more complex and diverse tasks, opening up new possibilities in fields where visual information is critical.
ⓒ 2023. Haebom, all rights reserved.
May be used for commercial purposes with the copyright holder's permission and with proper source attribution.