Daily Arxiv
This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of the papers belongs to the authors and their institutions; please cite the source when sharing.
Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout
Energy Efficiency in AI for 5G and Beyond: A DeepRx Case Study
A PBN-RL-XAI Framework for Discovering a "Hit-and-Run" Therapeutic Strategy in Melanoma
(Almost) Free Modality Stitching of Foundation Models
Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning
Tree-Structured Parzen Estimator Can Solve Black-Box Combinatorial Optimization More Efficiently
EXPO: Stable Reinforcement Learning with Expressive Policies
Reinforcement Learning with Action Chunking
On the Effect of Instruction Tuning Loss on Generalization
Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models
Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why
DRAGON: Dynamic RAG Benchmark On News
Solar Flare Prediction Using Long Short-term Memory (LSTM) and Decomposition-LSTM with Sliding Window Pattern Recognition
Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching
RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence
Stylometry recognizes human and LLM-generated texts in short samples
QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration
Evaluating Multimodal Large Language Models on Educational Textbook Question Answering
FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation
Alleviating User-Sensitive bias with Fair Generative Sequential Recommendation Model
MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
DeInfoReg: A Decoupled Learning Framework for Better Training Throughput
FLAME: Towards Federated Fine-Tuning Large Language Models Through Adaptive SMoE
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products
The Limits of Tractable Marginalization
A quantum semantic framework for natural language processing
ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols
Deepfake Technology Unveiled: The Commoditization of AI and Its Impact on Digital Trust
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Matrix Is All You Need
Temporal Chunking Enhances Recognition of Implicit Sequential Patterns
Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems
PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models
Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media
Rethinking the Foundations for Continual Reinforcement Learning
Compositional Flows for 3D Molecule and Synthesis Pathway Co-design
Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Embedding
Speculative Automated Refactoring of Imperative Deep Learning Programs to Graph Execution
Test-time Adaptation for Foundation Medical Segmentation Model without Parametric Updates
Style over Substance: Distilled Language Models Reason Via Stylistic Replication
AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization
Multi-View Node Pruning for Accurate Graph Representation
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Voting or Consensus? Decision-Making in Multi-Agent Debate
Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support
A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens
Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs
Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction
Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors
Few-Shot Radar Signal Recognition through Self-Supervised Learning and Radio Frequency Domain Adaptation
Transfer Learning Analysis of Variational Quantum Circuits
Plancraft: an evaluation dataset for planning with LLM agents
Fully Data-driven but Interpretable Human Behavioural Modelling with Differentiable Discrete Choice Model
A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
Searching Latent Program Spaces
The Pragmatic Frames of Spurious Correlations in Machine Learning: Interpreting How and Why They Matter
ComFairGNN: Community Fair Graph Neural Network
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Large Language Models Engineer Too Many Simple Features For Tabular Data
Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control
IdeaSynth: Iterative Research Idea Development Through Evolving and Composing Idea Facets with Literature-Grounded Feedback
SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning
Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy
SA-GDA: Spectral Augmentation for Graph Domain Adaptation
The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances
State-Constrained Offline Reinforcement Learning
SimAD: A Simple Dissimilarity-based Approach for Time Series Anomaly Detection
Unified ODE Analysis of Smooth Q-Learning Algorithms
FairTargetSim: An Interactive Simulator for Understanding and Explaining the Fairness Effects of Target Variable Definition
Fine-grained Stateful Knowledge Exploration: Effective and Efficient Graph Retrieval with Large Language Models
Learning Safe Numeric Planning Action Models
Augmenting End-to-End Steering Angle Prediction with CAN Bus Data
EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning
GRAPES: Learning to Sample Graphs for Scalable Graph Neural Networks
Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures
VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation
Working with AI: Measuring the Occupational Implications of Generative AI
Establishing Best Practices for Building Rigorous Agentic Benchmarks
An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design
Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models
BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking
The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
Created by
Haebom
Authors
Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang
Overview
This paper addresses the problem of verification for large language models (LLMs) whose reasoning abilities are improved through reinforcement learning. Checking the consistency between a model-generated response and a reference answer is difficult because of response length, diversity, and nuance. Rule-based verifiers struggle with this complexity, so model-based verifiers are used instead; however, specialized verifiers lack flexibility, and general-purpose LLM judges are inconsistent. Prior work has focused on building better verifiers, but a systematic cross-domain comparison of different verifier types is still missing, limiting the reliable development of reinforcement learning with verifiable rewards (RLVR). To address this, the paper proposes VerifyBench, a comprehensive cross-domain benchmark for systematically evaluating verifiers. It comprises 4,000 expert-level questions covering mathematics, physics, chemistry, and biology, each paired with a reference answer and diverse candidate responses. A rigorous annotation process carried out by a multidisciplinary expert team ensures the reliability of the evaluation. The authors design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers versus complete responses and short versus long outputs. The evaluation reveals a fundamental trade-off: specialized verifiers achieve high precision but lack recall, while general models are more inclusive but have unstable precision. More importantly, the study uncovers verifiers' high sensitivity to input structure and inherent limits to cross-domain generalization, offering important insights into the bottlenecks of current verifier technology.
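The precision/recall trade-off described above can be made concrete by scoring a verifier's binary judgments against expert labels, broken down by the benchmark's evaluation conditions. The sketch below is illustrative only: the record fields, condition names, and toy data are assumptions, not VerifyBench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Record:
    domain: str           # e.g. "math", "physics" (hypothetical field names)
    condition: str        # e.g. "extracted+short" vs "complete+long"
    human_label: bool     # expert judgment: is the candidate response correct?
    verifier_label: bool  # the verifier's judgment of the same response

def precision_recall(records):
    """Precision and recall of verifier labels against human labels."""
    tp = sum(r.human_label and r.verifier_label for r in records)
    fp = sum((not r.human_label) and r.verifier_label for r in records)
    fn = sum(r.human_label and (not r.verifier_label) for r in records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy records illustrating a "specialized" verifier that rarely accepts:
# it never flags a wrong answer (high precision) but misses correct ones (low recall).
records = [
    Record("math", "extracted+short", True, True),
    Record("math", "extracted+short", True, False),   # missed correct answer
    Record("physics", "complete+long", True, False),  # missed correct answer
    Record("physics", "complete+long", False, False),
]
p, r = precision_recall(records)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.33
```

Grouping records by `condition` before scoring would expose the input-structure sensitivity the paper reports, since the same verifier can score very differently on extracted answers versus full responses.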
Takeaways and Limitations
•
Takeaways:
With the cross-domain VerifyBench benchmark, the work lays a foundation for systematically comparing and evaluating LLM verifier performance. By clearly exposing the performance gaps and limits of specialized verifiers versus general LLM verifiers, it points out directions for future verifier development, and it highlights input structure and cross-domain generalization as key focuses for future research.
•
Limitations:
Although VerifyBench consists of 4,000 questions, its comprehensiveness should be broadened with a wider variety of question and answer types. Further work is needed to minimize the subjectivity of the expert evaluation the benchmark currently relies on. And while the study reveals limits to cross-domain generalization, it does not offer concrete solutions for overcoming them.
View PDF