Decoding Ambiguous Emotions with Test-Time Scaling in Audio-Language Models

Created by
  • Haebom

Authors

Hong Jia, Weibin Li, Jingyao Wu, Xiaofeng Yu, Yan Gao, Jintao Cheng, Xiaoyu Tang, Feng Xia, Ting Dang

💡 Overview

This paper points out the limits of categorical classification for recognizing emotion in human speech and proposes a new approach that accounts for the ambiguity and context dependence of real emotions. It uses large audio-language models (ALMs) and test-time scaling (TTS) techniques to evaluate how well models recognize ambiguous emotions, and it analyzes how these techniques affect model generalization and adaptation. The work lays a foundation for socially aware conversational AI and helps narrow the gap between model assumptions and the complexity of human emotion.
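The overview does not say which test-time scaling strategies the paper evaluates. As one illustrative possibility only, the sketch below shows repeated sampling: a model is queried several times for the same utterance, and the answers are aggregated into a distribution over emotions rather than a single label. The function `predict_once` is a hypothetical stand-in for one stochastic ALM call, and the label set is an assumption, not taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

# Hypothetical label set, chosen for illustration only.
EMOTIONS = ("angry", "happy", "neutral", "sad")


def sample_emotion_distribution(
    predict_once: Callable[[], str],
    n_samples: int,
    labels: Sequence[str] = EMOTIONS,
) -> Dict[str, float]:
    """Query a stochastic predictor n_samples times and return the
    empirical distribution over the known labels.

    predict_once is an assumed placeholder for a single sampled model
    response; answers outside `labels` are ignored.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")

    # Spend extra inference compute: one call per sample.
    counts = Counter(predict_once() for _ in range(n_samples))

    # Keep only answers that map onto a known label.
    valid_total = sum(counts[label] for label in labels)
    if valid_total == 0:
        raise ValueError("no sampled answer matched a known label")

    # Normalise the vote counts into a probability distribution.
    return {label: counts[label] / valid_total for label in labels}
```

With four samples of which three are "sad" and one is "neutral", the function returns 0.75 for "sad" and 0.25 for "neutral", which keeps the ambiguity visible instead of collapsing it into one class.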

🔑 Implications and Limitations

• Presents a new benchmark that captures the complexity and ambiguity of real human emotions more effectively.
• Shows that combining large audio-language models with test-time scaling techniques can improve ambiguous emotion recognition performance.
• The benchmark and analysis presented in this study offer important guidance for developing more sophisticated, context-aware emotion recognition AI systems.
• Provides an in-depth understanding of ambiguous emotion recognition, but further research is needed on the data bias and ethical considerations that may arise in real-world use.