Daily Arxiv

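This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.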

Semantic Content Determines Algorithmic Performance

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

Chain Of Thought Compression: A Theoretical Analysis

EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

Meta Context Engineering via Agentic Skill Evolution

ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory

ARGORA: Orchestrated Argumentation for Causally Grounded LLM Reasoning and Decision Making

KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

LLaMEA-SAGE: Guiding Automated Algorithm Design with Structural Feedback from Explainable AI

The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

MAR: Efficient Large Language Models via Module-aware Architecture Refinement

The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus

ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance

LION: A Clifford Neural Paradigm for Multimodal-Attributed Graph Learning

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models

System 1&2 Synergy via Dynamic Model Interpolation

DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis

TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

Hebbian Learning with Global Direction

Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents

Dynamic Framework for Collaborative Learning: Leveraging Advanced LLM with Adaptive Feedback Mechanisms

Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

EHR-RAG: Bridging Long-Horizon Structured Electronic Health Records and Large Language Models via Enhanced Retrieval-Augmented Generation

Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks

Modeling Endogenous Logic: Causal Neuro-Symbolic Reasoning Model for Explainable Multi-Behavior Recommendation

White-Box Op-Amp Design via Human-Mimicking Reasoning

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Position: Certifiable State Integrity in Cyber-Physical Systems -- Why Modular Sovereignty Solves the Plasticity-Stability Paradox

TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Delegation Without Living Governance

Causal Discovery for Explainable AI: A Dual-Encoding Approach

Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning

Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning

Do Reasoning Models Enhance Embedding Models?

Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks

Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving

Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning

BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding

What You Feel Is Not What They See: On Predicting Self-Reported Emotion from Third-Party Observer Labels

Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

CUA-Skill: Develop Skills for Computer Using Agent

Planner-Auditor Twin: Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement

How does information access affect LLM monitors' ability to detect sabotage?

Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve

Responsible AI: The Good, The Bad, The AI

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Multi-modal Imputation for Alzheimer's Disease Classification

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

QUARK: Robust Retrieval under Non-Faithful Queries via Query-Anchored Aggregation

Unplugging a Seemingly Sentient Machine Is the Rational Choice -- A Metaphysical Perspective

Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models

The Epistemic Planning Domain Definition Language: Official Guideline

Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review

Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

CiMRAG: Cim-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs

Structural Compositional Function Networks: Interpretable Functional Compositions for Tabular Discovery

LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

Do we really need Self-Attention for Streaming Automatic Speech Recognition?

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

Probabilistic Sensing: Intelligence in Data Sampling

LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning

NCSAM Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

DecHW: Heterogeneous Decentralized Federated Learning Exploiting Second-Order Information

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Quantifying non deterministic drift in large language models

Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

SDUs DAISY: A Benchmark for Danish Culture

Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study

GTAC: A Generative Transformer for Approximate Circuits

DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs

STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification

Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

Created by Haebom

Authors

Paul He, Yinya Huang, Mrinmaya Sachan, Zhijing Jin

💡 Overview

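This paper presents a new approach to evaluating the causal reasoning ability of large language models (LLMs). To overcome the limitations of existing evaluation methods, it proposes DoVerifier, a symbolic verifier that checks whether a causal expression generated by an LLM can be derived from a given causal graph using the rules of do-calculus and probability theory. DoVerifier recovers answers that are semantically correct but would be marked wrong because of surface-level differences, showing that LLM causal reasoning can be evaluated more rigorously and accurately.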
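To make the idea of checking derivability against a causal graph concrete, here is a minimal Python sketch of one such check. It is not DoVerifier's implementation, which this summary does not describe; it only tests whether the simplest form of do-calculus Rule 2 (action/observation exchange) licenses rewriting P(y | do(x), z) as P(y | x, z). The function names (`ancestors`, `d_separated`, `rule2_applies`) and the edge-list graph encoding are assumptions made for illustration, and d-separation is tested with the standard ancestral moral graph criterion.

```python
from itertools import combinations


def ancestors(edges, nodes):
    """Return `nodes` plus all of their ancestors in a DAG of (parent, child) edges."""
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs (the three sets are assumed disjoint)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Restrict the DAG to the ancestors of all nodes involved.
    keep = ancestors(edges, xs | ys | zs)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    # 2. Moralize: drop edge directions and connect parents that share a child.
    undirected = {frozenset(edge) for edge in sub}
    for child in keep:
        parents = [p for p, c in sub if c == child]
        for a, b in combinations(parents, 2):
            undirected.add(frozenset((a, b)))
    # 3. Delete the conditioning nodes together with their edges.
    undirected = {edge for edge in undirected if not (edge & zs)}
    # 4. xs and ys are d-separated iff no undirected path still links them.
    seen = set(xs)
    frontier = list(xs)
    while frontier:
        node = frontier.pop()
        for edge in undirected:
            if node in edge:
                (other,) = edge - {node}
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return not (seen & ys)


def rule2_applies(edges, y, x, z):
    """Rule 2, simplest case: P(y | do(x), z) = P(y | x, z) holds if
    Y is d-separated from X given Z once X's outgoing edges are removed."""
    pruned = [(p, c) for p, c in edges if p != x]
    return d_separated(pruned, {x}, {y}, set(z))


# Hypothetical confounded graph: Z -> X, Z -> Y, X -> Y.
graph = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
print(rule2_applies(graph, "Y", "X", {"Z"}))  # conditioning on the confounder Z
print(rule2_applies(graph, "Y", "X", set()))  # no adjustment for Z
```

In this toy graph the rewrite is licensed only when the confounder Z is conditioned on. A full verifier would presumably have to search over sequences of such rule applications rather than test a single rule, which this sketch does not attempt.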

🔑 Implications and Limitations

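• It can improve the accuracy and reliability of evaluations of LLM causal reasoning.

• It moves beyond simple string matching and sets out a new standard that verifies the formal validity of causal reasoning.

• The proposed verifier better captures the semantic correctness of LLM outputs on synthetic data and existing benchmarks.

• It currently focuses on verification against a given causal graph; evaluating an LLM's ability to generate or revise causal graphs on its own requires further research.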
