Daily Arxiv

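This page curates artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply credit the source.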

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Created by

Haebom

Authors

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

Overview

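This paper introduces CodeJudgeBench, a new benchmark for evaluating large language models (LLMs) used as code judges (LLM-as-a-Judge). CodeJudgeBench is designed to assess LLM-as-a-Judge models across three coding tasks: code generation, code repair, and unit test generation. A comprehensive benchmark of 26 LLM-as-a-Judge models shows that recent thinking models substantially outperform non-thinking models. Even a relatively small thinking model such as Qwen3-8B can outperform specially trained LLM-as-a-Judge models of up to 70B parameters. However, every model showed considerable randomness when judging coding tasks: in pairwise comparison, merely changing the order in which the responses are presented had a substantial effect on accuracy. Judge performance also varied depending on which LLM had written the code or unit tests being judged. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Finally, the authors study prompting strategies for LLM-as-a-Judge and find that pairwise comparison outperforms scalar point-wise judging, and that keeping the comments and reasoning in the full, unprocessed LLM response improves judge performance.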
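The order sensitivity described above can be illustrated with a small sketch. This is not the paper's evaluation code; it is a minimal example under stated assumptions: the `Judge` signature, the `order_sensitivity` function, and the `(task, good, bad)` example format are all hypothetical names introduced here. The idea is to query a pairwise judge twice per example, once with the correct response first and once with it second, then compare accuracy under each ordering and count how often the judge picks the same underlying response both times.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes (task, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def order_sensitivity(
    judge: Judge, examples: List[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Measure how a pairwise judge reacts to swapping response order.

    Each example is (task, good_response, bad_response).
    """
    if not examples:
        raise ValueError("need at least one example")

    correct_good_first = 0   # good response shown in position A
    correct_good_second = 0  # good response shown in position B
    consistent = 0           # same underlying response chosen in both orders

    for task, good, bad in examples:
        picked_good_1 = judge(task, good, bad) == "A"
        picked_good_2 = judge(task, bad, good) == "B"
        correct_good_first += picked_good_1
        correct_good_second += picked_good_2
        consistent += picked_good_1 == picked_good_2

    n = len(examples)
    return {
        "acc_good_first": correct_good_first / n,
        "acc_good_second": correct_good_second / n,
        "consistency": consistent / n,
    }


# Toy data and toy judges, for illustration only.
toy_examples = [
    ("reverse a string", "return s[::-1]", "return s"),
    ("add two numbers", "return a + b", "return a - b"),
]
good_answers = {good for _, good, _ in toy_examples}


def always_first(task: str, a: str, b: str) -> str:
    """A fully position-biased judge: always prefers the first response."""
    return "A"


def oracle(task: str, a: str, b: str) -> str:
    """An idealized judge that always identifies the good response."""
    return "A" if a in good_answers else "B"


biased_report = order_sensitivity(always_first, toy_examples)
oracle_report = order_sensitivity(oracle, toy_examples)
```

A large gap between `acc_good_first` and `acc_good_second`, or a low `consistency` value, would indicate the kind of position sensitivity the summary describes. In practice the judge would wrap an LLM call; the toy judges here only exercise the two extremes.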

Implications and Limitations

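• Implications:
  ◦ CodeJudgeBench provides a standard benchmark for evaluating the performance of LLM-as-a-Judge models.
  ◦ LLMs with thinking capabilities perform better on code judging tasks.
  ◦ Relatively small models can outperform larger ones.
  ◦ Pairwise comparison, and prompts that keep the comments and reasoning in the response, are effective prompting strategies.

• Limitations:
  ◦ All LLM-as-a-Judge models still show considerable randomness.
  ◦ Judgments can change substantially depending on the order in which responses are presented.
  ◦ Judgments are inconsistent across code generated by different LLMs.
  ◦ These issues raise concerns about the reliability and consistency of LLM-as-a-Judge.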
