DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Created by
  • Haebom

Authors

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

💡 Overview

This paper addresses the difficulty of verifying the factuality of deep research reports (DRRs) generated by search-augmented LLM agents. Existing fact verifiers are tuned to general, clear-cut claims, so they are not effective on DRRs, and no benchmark exists for this setting. The authors propose a dynamic Audit-then-Score (AtS) approach: disagreements that arise during verification are audited, and the outcome is used to keep improving the benchmark. With this process, expert accuracy rose from 60.8% to 90.9%.
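The audit loop described above can be pictured with a minimal sketch. This is not the authors' implementation: the `Claim` type, the `verifier` and `auditor` callables, and the rule that a label is revised only when the auditor overturns it are all hypothetical, chosen only to illustrate "audit the disagreements, update the benchmark, then score".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Claim:
    claim_id: str
    text: str


# Hypothetical interfaces (assumptions, not from the paper):
# a verifier returns (predicted_label, rationale) for a claim;
# an auditor sees the claim, the current benchmark label, the verifier's
# proposed label and rationale, and returns the label it judges correct.
Verifier = Callable[[Claim], Tuple[bool, str]]
Auditor = Callable[[Claim, bool, bool, str], bool]


def audit_then_score(
    claims: List[Claim],
    labels: Dict[str, bool],
    verifier: Verifier,
    auditor: Auditor,
) -> Tuple[Dict[str, bool], float, List[str]]:
    """Run one round: audit disagreements, revise labels, then score.

    Returns the revised labels, the verifier's accuracy against the
    revised labels, and the ids of claims whose label was changed.
    """
    updated = dict(labels)  # leave the caller's benchmark untouched
    predictions: Dict[str, bool] = {}
    revised: List[str] = []

    for claim in claims:
        predicted, rationale = verifier(claim)
        predictions[claim.claim_id] = predicted
        current = updated[claim.claim_id]
        if predicted != current:
            # Only disagreements are sent to the auditor.
            final = auditor(claim, current, predicted, rationale)
            if final != current:
                updated[claim.claim_id] = final
                revised.append(claim.claim_id)

    correct = sum(predictions[c.claim_id] == updated[c.claim_id] for c in claims)
    accuracy = correct / len(claims) if claims else 0.0
    return updated, accuracy, revised
```

Running the function repeatedly with new verifiers would let the label set change over time, which is the sense in which the benchmark is dynamic rather than static in this sketch.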

🔑 Implications and Limitations

• For verifying the factuality of complex text such as DRRs, the work points to the importance of building benchmarks that evolve dynamically rather than relying on static ones.
• Even human experts are limited when they evaluate only once; auditing and re-review can substantially improve the reliability of their judgments.
• The proposed DeepFact-Bench and DeepFact-Eval provide a new benchmark and an effective verification methodology for DRR factuality verification.
• Further research may be needed on automating and scaling dynamic benchmark construction and the auditing process.