Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Posted by
  • Haebom
Authors

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

💡 Overview

This paper proposes Circle-RoPE to address problems caused by the spurious coupling of text and image position information in large vision-language models (VLMs). Circle-RoPE remaps the 2D coordinates of image tokens onto a circular space orthogonal to the text position axis, producing a cone-like geometry in which every text token stays equidistant from all image tokens while the spatial structure within the image is preserved. The paper also introduces AGE, a technique that alternates Circle-RoPE's decoupled geometry with the grid-based prior of conventional RoPE from layer to layer, so that cross-modal positional decoupling and preservation of fine-grained spatial structure within the image are achieved together.
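The equidistance property described above can be checked with a small geometric sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `circle_positions` and `text_position` are hypothetical, the text axis is taken to be the z axis, and image tokens are spaced evenly around the circle in raster order, which is a placeholder for whatever angle assignment the paper actually uses.

```python
import math


def circle_positions(height, width, radius=1.0):
    """Place height*width image tokens on a circle in the z = 0 plane.

    Assumption: tokens are spaced evenly by raster index. This is a
    placeholder, not the paper's mapping from 2D grid coordinates to angles.
    """
    n = height * width
    points = []
    for row in range(height):
        for col in range(width):
            k = row * width + col              # raster index of the token
            theta = 2.0 * math.pi * k / n      # angle on the circle
            points.append((radius * math.cos(theta),
                           radius * math.sin(theta),
                           0.0))
    return points


def text_position(t):
    """Text token t sits on the z axis, orthogonal to the circle's plane."""
    return (0.0, 0.0, float(t))


image_pts = circle_positions(height=3, width=4, radius=2.0)
for t in (1, 5):
    dists = [math.dist(text_position(t), p) for p in image_pts]
    # Every distance equals sqrt(radius**2 + t**2), so the spread is ~0.
    print(t, max(dists) - min(dists))
```

Each text token is the apex of a cone whose base is the circle of image tokens, which is why its distance to every image token is the same regardless of where that token sat in the original grid.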

🔑 Implications and Limitations

• Cross-modal positional decoupling: Circle-RoPE effectively separates the positional information of text and image tokens, removing the spurious relative-position bias that can arise with conventional RoPE.
• Use of geometric priors: Through circular remapping and the alternating-layer use of geometric information (AGE), the model can improve its spatial understanding while retaining fine-grained spatial relationships within the image.
• Experimental validation and performance gains: Applying Circle-RoPE and AGE across a range of VLM architectures and benchmark datasets consistently improved spatial grounding and visual reasoning performance.
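The layer-wise alternation behind AGE might be sketched as a simple schedule. The summary only states that the two geometries are alternated across layers; the even/odd split, the function name `geometry_for_layer`, and the labels `"circle"` and `"grid"` are assumptions made for illustration.

```python
def geometry_for_layer(layer_index):
    """Pick the positional geometry for one transformer layer.

    Assumption: even layers use the decoupled circular geometry and odd
    layers use the grid-based RoPE prior. The real schedule may differ.
    """
    return "circle" if layer_index % 2 == 0 else "grid"


schedule = [geometry_for_layer(i) for i in range(6)]
print(schedule)  # ['circle', 'grid', 'circle', 'grid', 'circle', 'grid']
```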