HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Created by
  • Haebom

Authors

Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen

💡 Overview

Existing benchmarks have largely overlooked nuanced, human-centric aspects of video understanding such as emotion, behavior, and cross-modal alignment. To evaluate these abilities, this paper proposes HumanVBench, a benchmark covering 16 fine-grained tasks. It is built with a new methodology that automatically synthesizes high-quality video annotations and challenging multiple-choice questions with minimal human effort. Using the benchmark, the study finds that MLLMs are markedly deficient in human-centric video understanding.

🔑 Implications and Limitations

• HumanVBench provides a new benchmark for systematically evaluating human-centric video understanding in MLLMs, particularly emotion perception and cross-modal alignment.
• The automated synthesis pipeline offers a scalable framework for generating fine-grained, complex evaluation data while minimizing human effort.
• Even state-of-the-art MLLMs fall well short of human performance on nuanced emotion recognition and on aligning speech with visual information.