SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Posted by
  • Haebom

Authors

Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu

💡 Overview

This paper proposes directional information diffusion to address two problems caused by uncontrolled information diffusion in diffusion generative models: interference between image regions and contextual inconsistency. The two proposed methods, Cyclic One-Way Diffusion (COW) and Selective One-Way Diffusion (SOW), improve visual and semantic coherence across the whole image while preserving pixel-level fidelity to the given condition. SOW in particular uses a multimodal large language model (MLLM) to dynamically adjust the direction and intensity of diffusion according to contextual relationships, which enables adaptive, general-purpose generation without any training.
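To make the idea of "one-way" information flow more concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a 1-D list of floats standing in for an image latent, a 3-tap smoothing filter standing in for the denoiser, and a hypothetical `strength` argument standing in for the intensity that SOW would obtain from an MLLM. The names `one_way_step`, `toy_denoise`, and `generate` are invented for illustration. After every denoising step, a noised copy of the condition is re-planted in the masked region, so the condition can influence its surroundings but is not overwritten by them.

```python
import random


def one_way_step(latent, condition, mask, strength, noise_level, rng):
    """Blend a noised copy of the condition into the masked region.

    strength=1.0 fully overwrites the region (strict one-way flow);
    strength=0.0 leaves the latent untouched. `strength` is a hypothetical
    stand-in for an MLLM-chosen diffusion intensity.
    """
    out = []
    for x, c, m in zip(latent, condition, mask):
        noised_c = c + noise_level * rng.gauss(0.0, 1.0)
        w = strength * m  # blend weight is zero outside the mask
        out.append((1.0 - w) * x + w * noised_c)
    return out


def toy_denoise(latent):
    """Assumed stand-in for a denoiser: neighbours exchange information."""
    n = len(latent)
    return [
        (latent[max(i - 1, 0)] + latent[i] + latent[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def generate(condition, mask, strength, steps=20, seed=0):
    """Run the toy loop: denoise, then re-plant the condition, each step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in condition]  # start from noise
    for t in range(steps, 0, -1):
        noise_level = (t - 1) / steps  # reaches 0 at the final step
        latent = toy_denoise(latent)
        latent = one_way_step(latent, condition, mask, strength, noise_level, rng)
    return latent


if __name__ == "__main__":
    condition = [0.0, 0.0, 5.0, 5.0, 0.0, 0.0]
    mask = [0, 0, 1, 1, 0, 0]  # 1 marks the region tied to the condition
    print(generate(condition, mask, strength=1.0))
```

With `strength=1.0` the masked entries end up equal to the condition values, because the final step re-plants the condition with zero noise, while the unmasked entries are pulled toward it by the smoothing. Lowering `strength` weakens that one-way influence; at `0.0` the condition is ignored entirely.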

🔑 Implications and Limitations

• The paper overcomes problems caused by disordered information diffusion in existing diffusion models and presents a new approach that achieves both pixel-level condition fidelity and global contextual coherence in text-vision-to-image (TV2I) generation.
• Selective One-Way Diffusion (SOW), which uses an MLLM to identify semantic and spatial relationships within the image and controls the diffusion process accordingly, is shown to be a promising way to improve performance without training.
• The proposed method has the advantage of being applicable without training, but further research is needed on how the MLLM's comprehension ability affects SOW's performance, and there may be room for improvement in handling more complex or abstract contexts.