Daily Arxiv

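This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.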

Language Models are Injective and Hence Invertible

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Latent Diffusion Model without Variational Autoencoder

Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Phenome-Wide Multi-Omics Integration Uncovers Distinct Archetypes of Human Aging

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

A Vision for Access Control in LLM-based Agent Systems

Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Formally Verified Certification of Unsolvability of Temporal Planning Problems

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

Synthetic Series-Symbol Data Generation for Time Series Foundation Models

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Online automatic code generation for robot swarms: LLMs and self-organizing hierarchy

A New Digital Divide? Coder Worldviews, the Slop Economy, and Democracy in the Age of AI

Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Creative synthesis of kinematic mechanisms

Market-Driven Subset Selection for Budgeted Training

Mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

TimeEmb: A Lightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting

Learning Generalizable Shape Completion with SIM(3) Equivariance

Dolphin v1.0 Technical Report

A Measurement Study of Model Context Protocol Ecosystem

Diffusion Models are Kelly Gamblers

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Semantic Representation Attack against Aligned Large Language Models

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Accurate and Efficient Low-Rank Model Merging in Core Space

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

Graph Coloring for Multi-Task Learning

Robust LLM Training Infrastructure at ByteDance

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Communications to Circulations: Real-Time 3D Wind Field Prediction Using 5G GNSS Signals and Deep Learning

Why and How Auxiliary Tasks Improve JEPA Representations

Creativity Benchmark: A benchmark for marketing creativity for large language models

SpikingBrain: Spiking Brain-inspired Large Models

Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection

Epistemic Trade-Off: An Analysis of the Operational Breakdown and Ontological Limits of "Certainty-Scope" in AI

ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Interpretable Decision-Making for End-to-End Autonomous Driving

A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs

Limitations of Normalization in Attention Mechanism

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

The GPT-4o Shock Emotional Attachment to AI Models and Its Impact on Regulatory Acceptance: A Cross-Cultural Analysis of the Immediate Transition from GPT-4o to GPT-5

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models

VGGSounder: Audio-Visual Evaluations for Foundation Models

Evolution of AI Agent Registry Solutions: Centralized, Enterprise, and Distributed Approaches

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models

ReDi: Rectified Discrete Flow

Adaptive Policy Synchronization for Scalable Reinforcement Learning

From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation

Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

AI-Generated Video Detection via Perceptual Straightening

From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning

ADA-DPM: A Neural Descriptors-based Adaptive Noise Filtering Strategy for SLAM

GeNIE: A Generalizable Navigation System for In-the-Wild Environments

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling

PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Code Execution as Grounded Supervision for LLM Reasoning

Denoising the Future: Top-p Distributions for Moving Through Time

HauntAttack: When Attack Follows Reasoning as a Shadow

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

VERINA: Benchmarking Verifiable Code Generation

RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

The quest for the GRAph Level autoEncoder (GRALE)

Efficient Large Language Model Inference with Neural Block Linearization

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Created by

Haebom

Authors

Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot

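Abstract

This paper examines how faithfully and consistently large language models (LLMs), when used as role-playing agents, can portray specific versions of a character, such as a superhero as depicted in a particular comic or film universe. Superhero universes such as Marvel and DC offer a rich testbed, because multiple incarnations of the same character have different histories, values, and moral codes. To study this, the authors introduce "Beyond One World", a benchmark for character-grounded role-play covering 30 iconic heroes and 90 universe-specific versions. The benchmark consists of two tasks: (i) "Canon Events", which tests factual recall of pivotal life stages, and (ii) "Moral Dilemmas", which presents models with ethically difficult scenarios. Responses are scored for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward behavior ("acting"). The authors also propose "Think-Act Matching", a metric that quantifies the alignment between reasons and actions and serves as an indicator of model trustworthiness. Experiments on reasoning-oriented and non-reasoning-oriented models show that (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones, (2) generalization across versions of the same character remains a major challenge, and (3) models tend to excel at either thinking or acting, but rarely both. "Beyond One World" exposes important gaps in multiversal consistency and reasoning alignment, and offers a challenging evaluation for role-playing LLMs.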
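The summary does not say how "Think-Act Matching" is computed. As a rough illustration only, one way to read "alignment between reasons and actions" is as a simple agreement rate over responses that a judge has already labeled. Everything in the sketch below is an assumption made for illustration, not the paper's definition: the `JudgedResponse` type, its stance labels, the function name `think_act_match_rate`, and the placeholder hero names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One role-play response, already labeled by a hypothetical judge."""
    character_version: str  # placeholder label, not a real benchmark entry
    thinking_stance: str    # stance read from the "thinking" part
    acting_stance: str      # stance read from the "acting" part


def think_act_match_rate(responses: list[JudgedResponse]) -> float:
    """Fraction of responses whose thinking and acting stances agree.

    Assumed stand-in for a think-act alignment score; the paper's
    actual metric may be defined differently.
    """
    if not responses:
        raise ValueError("need at least one judged response")
    matches = sum(r.thinking_stance == r.acting_stance for r in responses)
    return matches / len(responses)


# Made-up example data: three of four responses act as they reasoned.
sample = [
    JudgedResponse("Hero A (version 1)", "save_civilians", "save_civilians"),
    JudgedResponse("Hero A (version 2)", "save_civilians", "pursue_villain"),
    JudgedResponse("Hero B (version 1)", "refuse_to_kill", "refuse_to_kill"),
    JudgedResponse("Hero B (version 2)", "refuse_to_kill", "refuse_to_kill"),
]
print(think_act_match_rate(sample))  # 0.75
```

A score of this shape only says whether the stated reasoning and the chosen action point the same way. It says nothing about whether either one is faithful to the character, which is why the summary lists canonical accuracy as a separate measure.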

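Takeaways, Limitations

• Takeaways:
  ◦ Presents "Beyond One World", a new benchmark for evaluating the role-play ability of large language models.
  ◦ Highlights the importance of portraying different versions of a character consistently.
  ◦ Proposes "Think-Act Matching", a new metric that measures the alignment between thought and action.
  ◦ Analyzes the effects and limits of chain-of-thought prompting.

• Limitations:
  ◦ Generalization across versions of the same character remains a difficult challenge.
  ◦ Models rarely show strong performance in both thinking and acting.
  ◦ The "Beyond One World" benchmark is limited to specific universes and characters.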
