Daily Arxiv

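This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.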

Language Models are Injective and Hence Invertible

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Latent Diffusion Model without Variational Autoencoder

Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Phenome-Wide Multi-Omics Integration Uncovers Distinct Archetypes of Human Aging

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

A Vision for Access Control in LLM-based Agent Systems

Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Formally Verified Certification of Unsolvability of Temporal Planning Problems

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

Synthetic Series-Symbol Data Generation for Time Series Foundation Models

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Online automatic code generation for robot swarms: LLMs and self-organizing hierarchy

A New Digital Divide? Coder Worldviews, the Slop Economy, and Democracy in the Age of AI

Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Creative synthesis of kinematic mechanisms

Market-Driven Subset Selection for Budgeted Training

Mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

TimeEmb: A Lightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting

Learning Generalizable Shape Completion with SIM(3) Equivariance

Dolphin v1.0 Technical Report

A Measurement Study of Model Context Protocol Ecosystem

Diffusion Models are Kelly Gamblers

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Semantic Representation Attack against Aligned Large Language Models

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Accurate and Efficient Low-Rank Model Merging in Core Space

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

Graph Coloring for Multi-Task Learning

Robust LLM Training Infrastructure at ByteDance

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Communications to Circulations: Real-Time 3D Wind Field Prediction Using 5G GNSS Signals and Deep Learning

Why and How Auxiliary Tasks Improve JEPA Representations

Creativity Benchmark: A benchmark for marketing creativity for large language models

SpikingBrain: Spiking Brain-inspired Large Models

Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection

Epistemic Trade-Off: An Analysis of the Operational Breakdown and Ontological Limits of "Certainty-Scope" in AI

ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Interpretable Decision-Making for End-to-End Autonomous Driving

A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs

Limitations of Normalization in Attention Mechanism

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

The GPT-4o Shock Emotional Attachment to AI Models and Its Impact on Regulatory Acceptance: A Cross-Cultural Analysis of the Immediate Transition from GPT-4o to GPT-5

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models

VGGSounder: Audio-Visual Evaluations for Foundation Models

Evolution of AI Agent Registry Solutions: Centralized, Enterprise, and Distributed Approaches

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models

ReDi: Rectified Discrete Flow

Adaptive Policy Synchronization for Scalable Reinforcement Learning

From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation

Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

AI-Generated Video Detection via Perceptual Straightening

From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning

ADA-DPM: A Neural Descriptors-based Adaptive Noise Filtering Strategy for SLAM

GeNIE: A Generalizable Navigation System for In-the-Wild Environments

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling

PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Code Execution as Grounded Supervision for LLM Reasoning

Denoising the Future: Top-p Distributions for Moving Through Time

HauntAttack: When Attack Follows Reasoning as a Shadow

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

VERINA: Benchmarking Verifiable Code Generation

RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

The quest for the GRAph Level autoEncoder (GRALE)

Efficient Large Language Model Inference with Neural Block Linearization

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

Created by

Haebom

Authors

Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

Abstract

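Large language models (LLMs) can answer a simple question such as "When was Einstein born?", yet fail to give the same date when asked to write about Einstein's life, which reveals a fundamental inconsistency. LLMs show impressive accuracy on simple factual question-answering benchmarks, but the reliability gap between simple and complex queries remains poorly understood, which undermines trust in these models. This work introduces SLAQ (Short-Long Form Alignment for Factual Question Answering), an evaluation framework that compares an LLM's answers to the same factual question when it is asked (a) in isolation (short) and (b) integrated into a complex query (long). Analyzing 16 LLMs on 600 queries, the authors find a systematic misalignment between the answers to short and long queries. They further uncover position-dependent accuracy loss and momentum effects, in which consecutive correct or incorrect answers create self-reinforcing patterns. Mechanistic analysis shows that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. The work establishes factual consistency across query complexity as an important aspect of LLM trustworthiness, and it challenges current evaluation practices, which implicitly assume that good performance on simple factual queries implies reliability in more complex knowledge-seeking tasks as well.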
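The kinds of quantities described in the abstract can be illustrated with a minimal Python sketch. It is not the SLAQ implementation: the `FactResult` record layout, the function names, and the exact metric definitions below are assumptions made for illustration, since this summary does not state how the paper computes alignment, position-dependent accuracy, or momentum.

```python
from dataclasses import dataclass


@dataclass
class FactResult:
    """Outcome for one fact asked both ways (hypothetical record layout)."""
    short_correct: bool  # correct when the question is asked in isolation
    long_correct: bool   # correct when the fact is embedded in a complex query
    position: int        # 0-based position of the fact within the long query


def alignment_rate(results):
    """Fraction of facts whose short and long outcomes agree (assumed definition)."""
    if not results:
        return 0.0
    agree = sum(r.short_correct == r.long_correct for r in results)
    return agree / len(results)


def accuracy_by_position(results):
    """Long-form accuracy grouped by the fact's position in the long query."""
    totals, hits = {}, {}
    for r in results:
        totals[r.position] = totals.get(r.position, 0) + 1
        hits[r.position] = hits.get(r.position, 0) + int(r.long_correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def momentum(long_outcomes):
    """Return (P(correct | previous correct), P(correct | previous incorrect))
    for one ordered sequence of long-form outcomes; None if a case never occurs."""
    pairs = list(zip(long_outcomes, long_outcomes[1:]))
    after_correct = [cur for prev, cur in pairs if prev]
    after_wrong = [cur for prev, cur in pairs if not prev]
    p_after_correct = sum(after_correct) / len(after_correct) if after_correct else None
    p_after_wrong = sum(after_wrong) / len(after_wrong) if after_wrong else None
    return p_after_correct, p_after_wrong
```

Under these assumed definitions, a large gap between the two values returned by `momentum` would correspond to the self-reinforcing pattern the abstract mentions, and a downward trend in `accuracy_by_position` would correspond to position-dependent accuracy loss.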

Takeaways, Limitations

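• Takeaways:
  ◦ Confirms a lack of consistency in how LLMs access factual knowledge.
  ◦ Presents SLAQ, a new methodology for evaluating LLM reliability as query complexity varies.
  ◦ Develops metrics, based on mechanistic analysis, that predict short-long answer alignment.
  ◦ Raises the need to improve current LLM evaluation practices.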

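• Limitations:
  ◦ The range of models and queries used in the study is limited.
  ◦ The mechanistic analysis needs further exploration.
  ◦ The generalizability of the SLAQ framework still needs to be verified.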
