Toward Scalable Audio Description Quality Control: A Workflow for Evaluating Human and VLM Raters

Created by
  • Haebom

Authors

Lana Do, Gio Jung, Juvenal Francisco Barajas, Andrew Taylor Scott, Shasta Ihorn, Alexander Mario Blum, Vassilis Athitsos, Ilmi Yoon

💡 Overview

This paper points out the limitations of existing approaches to evaluating audio description (AD) quality at scale and proposes a new methodology to address them. Using Item Response Theory, the authors developed a workflow that assesses the proficiency of VLM (Vision-Language Model) raters and human raters against expert standards. The results show that state-of-the-art VLMs can rate AD quality at a level comparable to human raters, but that the VLMs' reasoning is less reliable than that of humans.
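As a rough illustration of how Item Response Theory can score rater proficiency, the sketch below uses the one-parameter (Rasch) model. This summary does not say which IRT variant the paper uses, so the model choice, the function names, and the idea of coding each rating as 1 (matches the expert judgment) or 0 (does not) are all assumptions made for the example.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability that a rater matches the expert judgment on one item
    under the Rasch (1PL) model. Assumed model, not taken from the paper."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, difficulties, steps=200, lr=0.1):
    """Maximum-likelihood estimate of a rater's ability by gradient ascent.

    responses:    1 if the rater agreed with the expert on the item, else 0
    difficulties: item difficulties, assumed known (hypothetical values)
    """
    theta = 0.0
    for _ in range(steps):
        # Gradient of the Rasch log-likelihood with respect to ability.
        grad = sum(r - rasch_probability(theta, b)
                   for r, b in zip(responses, difficulties))
        theta += lr * grad
    return theta


if __name__ == "__main__":
    item_difficulties = [-1.0, 0.0, 1.0, 2.0]  # hypothetical items
    rater_a = estimate_ability([1, 1, 1, 0], item_difficulties)
    rater_b = estimate_ability([1, 0, 0, 0], item_difficulties)
    print(f"rater A ability: {rater_a:.2f}")
    print(f"rater B ability: {rater_b:.2f}")
```

Under this sketch, human and VLM raters are placed on the same ability scale, so a VLM could be compared with human raters simply by comparing estimated abilities.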

🔑 Implications and Limitations

• VLMs can perform at the level of human raters in AD quality evaluation, which opens the door to automated quality control systems.
• A hybrid evaluation system that combines the strengths of VLM and human raters could make AD quality control more efficient.
• VLM decision-making is less transparent and less interpretable than human decision-making, which may limit how much actionable feedback it can provide.
• Further research is needed on how well the proposed workflow generalizes and whether it applies to different types of AD.