FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Posted by
  • Haebom

Authors

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

💡 Overview

This paper argues that existing Vision-Language Models (VLMs) struggle with fine-grained understanding of human activity, and proposes FineBench, a large-scale, human-centric video question-answering benchmark, to measure that gap. FineBench pairs long-form videos with a large set of question-answer items focused on fine-grained actions, interactions, and object manipulation. Evaluating existing VLMs on it shows that they fall short in spatial reasoning and in understanding subtle movements. To address these weaknesses, the authors also propose FineAgent, a framework built around a Localizer and a Descriptor, and report that it improves the performance of a range of open-source VLMs.
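The summary names the two FineAgent modules but does not say how they are wired together. As a rough illustration only, one plausible reading is that the Localizer picks out relevant regions, the Descriptor turns each region into text, and that text is handed to the VLM along with the question. Everything in the sketch below is an assumption made for illustration: the `Region` type, the callable signatures, the multiple-choice prompt format, and the `answer_question` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    """A person or object region in one frame (hypothetical structure)."""
    frame_index: int
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


# Hypothetical module interfaces; the source names the modules but not their APIs.
Localizer = Callable[[str, str], List[Region]]  # (video_path, question) -> regions
Descriptor = Callable[[str, Region], str]       # (video_path, region) -> description
VLM = Callable[[str, str], str]                 # (video_path, prompt) -> answer text


def build_prompt(question: str, options: Sequence[str],
                 descriptions: Sequence[str]) -> str:
    """Combine region descriptions, the question, and lettered options."""
    lines = ["Context from localized regions:"]
    lines += [f"- {d}" for d in descriptions]
    lines.append(f"Question: {question}")
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)


def answer_question(video_path: str, question: str, options: Sequence[str],
                    localizer: Localizer, descriptor: Descriptor, vlm: VLM) -> str:
    """Localize, describe each region, then query the VLM with the added context."""
    regions = localizer(video_path, question)
    descriptions = [descriptor(video_path, r) for r in regions]
    return vlm(video_path, build_prompt(question, options, descriptions))
```

Because each module is just a callable, a different detector, captioner, or VLM can be swapped in without touching the others, which is the kind of modularity the summary attributes to FineAgent.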

🔑 Implications and Limitations

• FineBench provides a robust benchmark for evaluating fine-grained understanding of human activity in long-form video.
• FineAgent offers a modular framework that substantially improves the fine-grained human activity understanding of current VLMs.
• Existing open-source VLMs still show a large performance gap in spatial reasoning within complex multi-person scenes and in distinguishing subtle human movements.
• Keeping FineBench's large-scale dataset construction and annotation accurate and consistent, and improving FineAgent's generalization and efficiency, remain open problems for future work.