Daily Arxiv

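This page curates artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing a summary, simply cite the source.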
This service is supported by Google Gemini.

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

Created by

Haebom

Authors

Yuran Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, Benoit Boulet

Overview

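This paper proposes a three-stage meta-judge selection pipeline to improve the efficiency of evaluation with large language models (LLMs). It targets two limitations of prior work: the bias and mistakes of human evaluators, and the problem of selecting appropriate judgments from among multiple LLM responses. The pipeline works as follows: a comprehensive rubric is developed with GPT-4 and human experts, three advanced LLM agents score each judgment against that rubric, and a threshold filters out low-scoring judgments. In experiments on the JudgeBench dataset, the method improves by about 8.37% over the existing single-LLM approach and by about 15.55% over raw judgments. The results show the potential of using LLMs as meta-judges and lay the groundwork for research on building preference datasets for LLM-based reinforcement learning.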
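The scoring and filtering stages described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how the three agents' scores are combined or what threshold is used, so the simple average, the `threshold` parameter, the `MetaJudge` callable type, and the `select_judgments` function name are all assumptions made for this example. Each agent is modeled as a plain function so the sketch runs without any LLM API.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type: a meta-judge agent takes (judgment, rubric) and returns a score.
# In a real system this would wrap an LLM call; here it is any callable.
MetaJudge = Callable[[str, str], float]


@dataclass
class ScoredJudgment:
    judgment: str
    scores: list[float]   # one score per meta-judge agent
    aggregate: float      # combined score used for filtering


def select_judgments(
    judgments: Sequence[str],
    rubric: str,
    agents: Sequence[MetaJudge],
    threshold: float,
) -> list[ScoredJudgment]:
    """Score every judgment with every agent, then drop low-scoring ones."""
    kept: list[ScoredJudgment] = []
    for judgment in judgments:
        scores = [agent(judgment, rubric) for agent in agents]
        # Assumption: scores are combined by a simple average.
        aggregate = mean(scores)
        if aggregate >= threshold:
            kept.append(ScoredJudgment(judgment, scores, aggregate))
    return kept


if __name__ == "__main__":
    # Toy stand-ins for three LLM agents: fixed, made-up scores per judgment.
    toy_scores = {
        "A is better: it derives the result step by step.": [0.9, 0.8, 0.7],
        "B is better.": [0.4, 0.5, 0.3],
    }
    agents = [lambda j, r, i=i: toy_scores[j][i] for i in range(3)]
    for item in select_judgments(list(toy_scores), "rubric text", agents, 0.6):
        print(item.judgment, item.scores)
```

With the toy scores in the usage block, only the first judgment passes the filter, since its three scores average above the assumed threshold of 0.6 while the second judgment's do not.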

Implications and Limitations

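• Implications:
  ◦ Presents a new way to make LLM performance evaluation more efficient by using LLMs as meta-judges.
  ◦ Achieves better performance than the existing single-LLM approach through collaboration among multiple LLM agents and a comprehensive rubric.
  ◦ Contributes to research on building preference datasets for LLM-based reinforcement learning.
  ◦ Shows potential to reduce the bias and mistakes of human evaluators.

• Limitations:
  ◦ The performance gains of the proposed pipeline may be specific to one dataset (JudgeBench).
  ◦ The process of developing the rubric with GPT-4 and human experts is not described in detail.
  ◦ Information on the specific LLM agents used and their parameter settings is lacking.
  ◦ No clear criterion is given for setting the threshold.
  ◦ Further research is needed on how well the approach generalizes to other types of evaluation tasks.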

