Daily Arxiv

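This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.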

BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents

FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network

Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach

Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining

HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment

TraceTrans: Translation and Spatial Tracing for Surgical Prediction

GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora

Your Dense Retriever is Secretly an Expeditious Reasoner

FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation

Context-level Language Modeling by Learning Predictive Context Embeddings

Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark

MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving

The Formalism-Implementation Gap in Reinforcement Learning Research

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Cross-Scenario Unified Modeling of User Interests at Billion Scale

DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

SEER: The Span-based Emotion Evidence Retrieval Benchmark

Distilled Protein Backbone Generation

Untargeted Jailbreak Attack

AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM

Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning

ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models

The human-machine paradox: how collaboration creates or destroys value, and why augmentation is key to resolving it

Reproducible workflow for online AI in digital health

Pre-trained knowledge elevates large language models beyond traditional chemical reaction optimizers

MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Revision

Robustness is Important: Limitations of LLMs for Data Fitting

DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

FoGE: Fock Space inspired encoding for graph prompting

PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning

GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning

Thermometry of simulated Bose--Einstein condensates using machine learning

LittleBit: Ultra Low-Bit Quantization via Latent Factorization

BNMusic: Blending Environmental Noises into Personalized Music

Evaluating AI-Powered Learning Assistants in Engineering Higher Education: Student Engagement, Ethical Challenges, and Policy Implications

Mixture-of-Experts Meets In-Context Reinforcement Learning

Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings

FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

STree: Speculative Tree Decoding for Hybrid State-Space Models

Do Language Models Use Their Depth Efficiently?

A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation

The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics

Group-in-Group Policy Optimization for LLM Agent Training

BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Offline Learning and Forgetting for Reasoning with Large Language Models

Multimodal 3D Genome Pre-training

Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching

ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning

$\beta$-DQN: Improving Deep Q-Learning By Evolving the Behavior

Provable Scaling Laws for the Test-Time Compute of Large Language Models

Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions

One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration

GRS: Generating Robotic Simulation Tasks from Real-World Images

Navigation with VLM framework: Towards Going to Any Language

Retrieval-Augmented Generation-based Relation Extraction

Diffusion Models Meet Contextual Bandits

Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments

Multi-Agent Evolve: LLM Self-Improve through Co-evolution

ReCode: Unify Plan and Action for Universal Granularity Control

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports

Huxley-G\"odel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles

PanicToCalm: A Proactive Counseling Agent for Panic Attacks

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

Co-TAP: Three-Layer Agent Interaction Protocol Technical Report

Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research

MathBode: Understanding LLM Reasoning with Dynamical Systems

Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem

Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem

A Neuroscience-Inspired Dual-Process Model of Compositional Generalization

Memory Mosaics at scale

The Confidence Paradox: Can LLM Know When It's Wrong

VIRAL: Vision-grounded Integration for Reward design And Learning

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning

TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models

Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Created by

Haebom

Authors

Haizhong Zheng, Jiawei Zhao, Beidi Chen

Overview

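Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training, which needs fresh rollouts at every update and therefore limits efficiency and scalability. Asynchronous RL systems ease this by decoupling rollout generation from training, yet they only work if training tolerates large staleness in the rollout data, a setting in which existing methods either degrade in performance or collapse. This work revisits the problem and identifies a "prosperity before collapse" phenomenon: when exploited properly, stale data can be as informative as on-policy data. Building on this, the authors propose M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of the importance weights so that only extreme outliers are suppressed while informative updates are preserved. Under high staleness, M2PO sharply reduces the fraction of clipped tokens, precisely masking high-variance tokens while keeping optimization stable. Extensive evaluation across six models and eight benchmarks shows that M2PO delivers stable off-policy training even with data that is stale by at least 256 model updates, and that it matches on-policy performance.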
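The idea of constraining the second moment of the importance weights can be illustrated with a small sketch. This summary does not state the exact rule M2PO uses, so everything below is an assumption made for illustration: the second moment is taken as the mean squared log importance ratio over the kept tokens, tokens are masked greedily from the largest squared log-ratio downward, and the function name `second_moment_mask` and the threshold value are hypothetical. It is not the authors' implementation.

```python
from typing import List


def second_moment_mask(log_ratios: List[float], threshold: float) -> List[bool]:
    """Return a keep-mask over tokens (True = keep, False = masked out).

    Assumption (not taken from the paper): the "second moment" is the mean of
    squared log importance ratios over the kept tokens, and tokens are dropped
    largest-first until that mean is at or below `threshold`.
    """
    squared = [x * x for x in log_ratios]
    keep = [True] * len(log_ratios)

    # Visit tokens from the most extreme log-ratio to the least extreme.
    order = sorted(range(len(squared)), key=lambda i: squared[i], reverse=True)

    total = sum(squared)  # running sum of squared log-ratios over kept tokens
    kept = len(squared)   # number of tokens still kept

    for i in order:
        if kept == 0 or total / kept <= threshold:
            break  # constraint satisfied; keep all remaining tokens
        keep[i] = False
        total -= squared[i]
        kept -= 1

    return keep


# Hypothetical per-token log(pi_new / pi_old) values with one extreme outlier.
example_log_ratios = [0.01, -0.02, 0.03, 1.5, -0.05]
example_mask = second_moment_mask(example_log_ratios, threshold=0.04)
print(example_mask)
```

In this toy batch only the single outlier token is masked, while the four tokens with small log-ratios still contribute to the update. A per-token clipping rule would instead judge each token in isolation; a batch-level constraint of this kind only intervenes when the batch as a whole is too far from the behavior policy.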

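Implications and Limitations

• Implications:

◦ M2PO improves the efficiency of off-policy reinforcement learning by making use of stale data.

◦ M2PO enables stable training even under high staleness.

◦ M2PO performs well across models of various sizes and across benchmarks.

• Limitations:

◦ M2PO's performance may depend on setting the second-moment constraint appropriately.

◦ Detailed information on specific hyperparameter settings is not presented in the paper.

◦ No experimental results are provided for the offline-dataset setting.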
