
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Created by
  • Haebom

Authors

Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Rottger

💡 Overview

This paper proposes SimBench, a standardized benchmark for evaluating how well large language models (LLMs) simulate human behavior. SimBench unifies 20 diverse datasets covering tasks that range from moral decision-making to economic choice, and it provides a foundation for systematically evaluating the fidelity of LLM simulations. The best current LLMs can simulate human behavior to a limited degree (a score of 40.80 out of 100), which leaves substantial room for improvement. They have particular difficulty simulating specific demographic groups.

🔑 Implications and Limitations

• SimBench provides the first large-scale standardized benchmark for evaluating, objectively and reproducibly, how well LLMs simulate human behavior.
• LLM simulation ability increases log-linearly with model size and correlates strongly with knowledge-intensive reasoning ability.
• Instruction tuning involves a trade-off: it helps on low-entropy questions, where human responses largely agree, but degrades performance on high-entropy questions, where responses are diverse. Simulation of specific demographic groups also needs improvement.
• The work aims to make progress in LLM simulation ability measurable, and thereby to accelerate the development of more faithful LLM simulators.
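
Two terms in the points above, "low-entropy versus high-entropy questions" and "log-linear in model size", can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation code: the function names `shannon_entropy` and `fit_log_linear` are hypothetical, the response shares and (size, score) pairs are made up, and measuring entropy in bits with a base-10 logarithm for model size are assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete response distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def fit_log_linear(sizes, scores):
    """Ordinary least-squares fit of score = a + b * log10(size)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical human answer shares for two four-option questions.
near_consensus = [0.94, 0.02, 0.02, 0.02]   # low entropy: most people agree
evenly_split = [0.25, 0.25, 0.25, 0.25]     # high entropy: answers are diverse
print(shannon_entropy(near_consensus))
print(shannon_entropy(evenly_split))

# Made-up (parameter count, score) pairs, NOT results from the paper.
sizes = [1e9, 1e10, 1e11]
scores = [10.0, 20.0, 30.0]
a, b = fit_log_linear(sizes, scores)
print(a, b)  # b is the score gained per tenfold increase in model size
```

Under a log-linear relationship each tenfold increase in parameters adds a roughly constant number of points, which is what the slope `b` captures in this toy fit.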