
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Created by
  • Haebom

Authors

Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han

💡 Overview

Existing agent benchmarks suffer from high environment-interaction overhead and unbalanced task difficulty. To address both problems, AgentCE-Bench proposes a unified grid-based planning task in which the agent fills in hidden slots. The benchmark exposes two independent axes: the number of hidden slots ($H$) scales the task horizon, and a budget ($B$) that controls the number of misleading candidates adjusts the difficulty. Its lightweight environment design removes environment setup overhead and enables fast, reproducible evaluation.
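To make the two axes concrete, here is a toy sketch of how a task generator with independent $H$ and $B$ knobs might look. It is not the benchmark's actual implementation: the `ToyTask` structure, the `make_task` function, the integer-valued grid, and the round-robin placement of decoys are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyTask:
    grid: list        # rows of values; None marks a hidden slot
    hidden: list      # (row, col) positions of the hidden slots
    candidates: dict  # position -> shuffled candidate values
    answers: dict     # position -> the single correct value


def make_task(rows, cols, H, B, seed=0):
    """Hypothetical generator: H hidden slots, B misleading candidates in total."""
    if not 0 < H <= rows * cols:
        raise ValueError("H must be between 1 and rows * cols")
    if B < 0:
        raise ValueError("B must be non-negative")
    rng = random.Random(seed)  # a fixed seed keeps the task reproducible

    # Fill the grid with distinct integers so each slot has one right answer.
    values = rng.sample(range(1000), rows * cols)
    grid = [values[r * cols:(r + 1) * cols] for r in range(rows)]

    # Horizon axis: choose H positions and blank them out.
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    hidden = rng.sample(positions, H)
    answers, candidates = {}, {}
    for r, c in hidden:
        answers[(r, c)] = grid[r][c]
        candidates[(r, c)] = [grid[r][c]]
        grid[r][c] = None

    # Difficulty axis: spread B decoys round-robin over the hidden slots
    # (an assumed placement rule). Decoys start at 1000, so they can never
    # collide with a true value.
    for i in range(B):
        candidates[hidden[i % H]].append(1000 + i)
    for pos in hidden:
        rng.shuffle(candidates[pos])

    return ToyTask(grid, hidden, candidates, answers)


task = make_task(rows=3, cols=4, H=5, B=7, seed=1)
```

In this sketch, raising `H` with `B` fixed lengthens the sequence of decisions, while raising `B` with `H` fixed only adds distractors, which is the sense in which the two axes are independent here.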

🔑 Implications and Limitations

• AgentCE-Bench reliably controls both task horizon and difficulty, and it effectively differentiates models.
• Extensive experiments on 13 models clearly expose performance differences between models and provide an interpretable, controllable evaluation of agent reasoning.
• Experiments across 6 domains, covering models of various sizes and families, demonstrate the usefulness of AgentCE-Bench.