Daily Arxiv

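This page collects artificial intelligence papers published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.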

Language Models are Injective and Hence Invertible

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Latent Diffusion Model without Variational Autoencoder

Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Phenome-Wide Multi-Omics Integration Uncovers Distinct Archetypes of Human Aging

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

A Vision for Access Control in LLM-based Agent Systems

Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Formally Verified Certification of Unsolvability of Temporal Planning Problems

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

Synthetic Series-Symbol Data Generation for Time Series Foundation Models

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Online automatic code generation for robot swarms: LLMs and self-organizing hierarchy

A New Digital Divide? Coder Worldviews, the Slop Economy, and Democracy in the Age of AI

Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Creative synthesis of kinematic mechanisms

Market-Driven Subset Selection for Budgeted Training

Mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

TimeEmb: A Lightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting

Learning Generalizable Shape Completion with SIM(3) Equivariance

Dolphin v1.0 Technical Report

A Measurement Study of Model Context Protocol Ecosystem

Diffusion Models are Kelly Gamblers

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Semantic Representation Attack against Aligned Large Language Models

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Accurate and Efficient Low-Rank Model Merging in Core Space

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

Graph Coloring for Multi-Task Learning

Robust LLM Training Infrastructure at ByteDance

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Communications to Circulations: Real-Time 3D Wind Field Prediction Using 5G GNSS Signals and Deep Learning

Why and How Auxiliary Tasks Improve JEPA Representations

Creativity Benchmark: A benchmark for marketing creativity for large language models

SpikingBrain: Spiking Brain-inspired Large Models

Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection

Epistemic Trade-Off: An Analysis of the Operational Breakdown and Ontological Limits of "Certainty-Scope" in AI

ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Interpretable Decision-Making for End-to-End Autonomous Driving

A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs

Limitations of Normalization in Attention Mechanism

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

The GPT-4o Shock Emotional Attachment to AI Models and Its Impact on Regulatory Acceptance: A Cross-Cultural Analysis of the Immediate Transition from GPT-4o to GPT-5

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models

VGGSounder: Audio-Visual Evaluations for Foundation Models

Evolution of AI Agent Registry Solutions: Centralized, Enterprise, and Distributed Approaches

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models

ReDi: Rectified Discrete Flow

Adaptive Policy Synchronization for Scalable Reinforcement Learning

From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Multimodal Fusion at Three Tiers: Physics-Driven Data Generation and Vision-Language Guidance for Brain Tumor Segmentation

Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

AI-Generated Video Detection via Perceptual Straightening

From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning

ADA-DPM: A Neural Descriptors-based Adaptive Noise Filtering Strategy for SLAM

GeNIE: A Generalizable Navigation System for In-the-Wild Environments

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling

PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Code Execution as Grounded Supervision for LLM Reasoning

Denoising the Future: Top-p Distributions for Moving Through Time

HauntAttack: When Attack Follows Reasoning as a Shadow

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

VERINA: Benchmarking Verifiable Code Generation

RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

The quest for the GRAph Level autoEncoder (GRALE)

Efficient Large Language Model Inference with Neural Block Linearization

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

VERINA: Benchmarking Verifiable Code Generation

Created by

Haebom

Authors

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

Abstract

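Large language models (LLMs) are widely used in software development, but ensuring the correctness of LLM-generated code remains difficult and often requires costly manual review. Verifiable code generation, which jointly produces code, specifications, and proofs that the code aligns with its specification, offers a promising way to address this problem and to further unlock the benefits of LLMs in coding. However, current benchmarks leave significant gaps in evaluation: they tend to focus on individual components rather than providing a framework that evaluates all tasks comprehensively. This paper introduces Verina (Verifiable Code Generation Arena), a high-quality benchmark that enables comprehensive evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean. An extensive evaluation of state-of-the-art LLMs reveals serious challenges in verifiable code generation, especially in proof generation, and underscores the need to improve LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves 61.4% code correctness, 51.0% specification soundness and completeness, and a 3.6% proof success rate (with one trial per task). By providing a rigorous and comprehensive benchmark, Verina is expected to catalyze progress in verifiable code generation. The dataset is available at https://huggingface.co/datasets/sunblaze-ucb/verina and the evaluation code at https://github.com/sunblaze-ucb/verina.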
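To make the three artifacts of verifiable code generation concrete, here is a minimal Lean 4 sketch of a code, specification, and proof triple. It is a hypothetical toy task and is not taken from Verina; the names `myMax`, `myMaxSpec`, and `myMax_satisfies_spec` are illustrative assumptions, and the benchmark's actual task format may differ.

```lean
-- Hypothetical toy task (not from the Verina dataset).

-- Code: the implementation a model would be asked to generate.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: a property any correct result must satisfy.
-- The result bounds both inputs and is equal to one of them.
def myMaxSpec (a b result : Nat) : Prop :=
  a ≤ result ∧ b ≤ result ∧ (result = a ∨ result = b)

-- Proof: the implementation meets the specification for all inputs.
theorem myMax_satisfies_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMaxSpec myMax
  by_cases h : a ≤ b
  · -- Case a ≤ b: the result is b.
    rw [if_pos h]
    exact ⟨h, Nat.le_refl b, Or.inr rfl⟩
  · -- Case ¬ a ≤ b: the result is a, and b < a gives b ≤ a.
    rw [if_neg h]
    exact ⟨Nat.le_refl a, Nat.le_of_lt (Nat.not_le.mp h), Or.inl rfl⟩
```

Even in a toy case like this, the proof is the only part that Lean checks mechanically end to end, which helps explain why proof generation is reported as the hardest of the three tasks.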

Takeaways, Limitations

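• Takeaways:
◦ The Verina benchmark provides a new tool for comprehensively evaluating code, specification, and proof generation, as well as their compositions.
◦ It exposes the problems of LLM-based verifiable code generation and highlights the need for improvement, particularly in proof generation.
◦ It gives the research community a rigorous and comprehensive benchmark to accelerate progress in the field.
◦ The dataset and evaluation code are open source, supporting reproducibility and extension of the research.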

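• Limitations:
◦ Even the best model (OpenAI o4-mini) has a very low proof success rate (3.6%), showing that advances in proof generation are urgently needed.
◦ The evaluation results depend on the specific models and settings used, which limits how far they generalize.
◦ The benchmark tasks are restricted to Lean, so extension to other programming languages and environments is still needed.
◦ The paper focuses on presenting the Verina benchmark and does not propose concrete methodological improvements for raising LLM performance.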
