Daily Arxiv

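This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.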

AutoChemSchematic AI: A Closed-Loop, Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams

LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

DLP: Dynamic Layerwise Pruning in Large Language Models

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Mind the Gap: A Practical Attack on GGUF Quantization

Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Cognitive Guardrails for Open-World Decision Making in Autonomous Drone Swarms

SWE-bench Goes Live!

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Matryoshka Model Learning for Improved Elastic Student Models

Context-Robust Knowledge Editing for Language Models

HiLDe: Intentional Code Generation via Human-in-the-Loop Decoding

FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

iDSE: Navigating Design Space Exploration in High-Level Synthesis Using LLMs

Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection

RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Universal Value-Function Uncertainties

How Do Transformers Learn Variable Binding in Symbolic Programs?

Adversarial bandit optimization for approximately linear functions

Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

Homophily Enhanced Graph Domain Adaptation

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

An Interpretable Representation Learning Approach for Diffusion Tensor Imaging

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments

Security Concerns for Large Language Models: A Survey

A Survey of LLM $\times$ DATA

Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

NMCSE: Noise-Robust Multi-Modal Coupling Signal Estimation Method via Optimal Transport for Cardiovascular Disease Detection

FRIREN: Beyond Trajectories -- A Spectral Lens on Time

A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Replay Attacks Against Audio Deepfake Detection

Forensic deepfake audio detection using segmental speech features

A3 : an Analytical Low-Rank Approximation Framework for Attention

A Survey of 3D Reconstruction with Event Cameras

Confabulation dynamics in a reservoir computer: Filling in the gaps with untrained attractors

WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales

Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation

LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection

Motion-compensated cardiac MRI using low-rank diffeomorphic flow (DMoCo)

OODTE: A Differential Testing Engine for the ONNX Optimizer

Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

Depth-Constrained ASV Navigation with Deep RL and Limited Sensing

(Im)possibility of Automated Hallucination Detection in Large Language Models

Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward

Position: An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research

The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs

The Structural Safety Generalization Problem

Parameterized Synthetic Text Generation with SimpleStories

SD$^2$: Self-Distilled Sparse Drafters

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games

Semantic-guided Representation Learning for Multi-Label Recognition

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates

Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models

ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems

A Survey on Event-driven 3D Reconstruction: Development under Different Categories

VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations

REALM: A Dataset of Real-World LLM Use Cases

Opportunities and Challenges of Frontier Data Governance With Synthetic Data

ARFlow: Human Action-Reaction Flow Matching with Physical Guidance

Position: Beyond Assistance - Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care

Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection

A Dual-Directional Context-Aware Test-Time Learning for Text Classification

MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance

ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection

NFIG: Autoregressive Image Generation with Next-Frequency Prediction

GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization

Wanda++: Pruning Large Language Models via Regional Gradients

HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Optimizing Multi-Hop Document Retrieval Through Intermediate Representations

Causally Reliable Concept Bottleneck Models

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models

Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models

POPGym Arcade: Parallel Pixelated POMDPs

Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems

Quantifying First-Order Markov Violations in Noisy Reinforcement Learning: A Causal Discovery Approach

SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases

Faithful Logic Embeddings in HOL -- Deep and Shallow

TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency

Repo2Run: Automated Building Executable Environment for Code Repository at Scale

Created by

Haebom

Authors

Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao

Overview

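This paper argues that scaling up executable code data is important for improving the software engineering capabilities of language models. In prior work, building large collections of executable code repositories validated by running their tests has been labor-intensive, time-consuming, and dependent on expert knowledge. The paper identifies the automated construction of test environments across diverse repositories as the main source of this difficulty and proposes Repo2Run to address it. Repo2Run is an LLM-based agent that aims to automate the construction of executable test environments for arbitrary repositories. It iteratively builds a Docker image, runs unit tests based on the build feedback, and synthesizes a Dockerfile, continuing until the entire pipeline executes successfully. Evaluated on a benchmark of 420 Python repositories, Repo2Run achieves an 86.0% success rate, outperforming the existing SWE-agent by 77.0%. Repo2Run's resources have been released on GitHub.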
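The build, test, and synthesize loop described above can be pictured with a minimal sketch. This is not the Repo2Run implementation: every name below (`BuildLoop`, `build_image`, `run_tests`, `propose_fix`), the `python:3.10` base image, and the round budget are assumptions made for illustration, and the LLM agent is replaced by a plain callable so the example runs without Docker or a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical; none are taken from the Repo2Run code base.
StepResult = Tuple[bool, str]  # (succeeded, log text used as feedback)


@dataclass
class BuildLoop:
    build_image: Callable[[str], StepResult]      # e.g. a wrapper around `docker build`
    run_tests: Callable[[str], StepResult]        # e.g. run unit tests inside the built image
    propose_fix: Callable[[List[str], str], str]  # stand-in for the LLM agent
    max_rounds: int = 10                          # assumed budget, not from the paper
    lines: List[str] = field(default_factory=lambda: ["FROM python:3.10"])

    def dockerfile(self) -> str:
        return "\n".join(self.lines)

    def run(self) -> Optional[str]:
        """Return a Dockerfile that builds and passes tests, or None if the budget runs out."""
        for _ in range(self.max_rounds):
            ok, log = self.build_image(self.dockerfile())
            if ok:
                ok, log = self.run_tests(self.dockerfile())
            if ok:
                return self.dockerfile()
            # Feed the failure log back and append the proposed instruction.
            self.lines.append(self.propose_fix(self.lines, log))
        return None


# Toy stand-ins so the loop can be exercised without Docker or an LLM.
def fake_build(dockerfile: str) -> StepResult:
    return True, "build ok"


def fake_tests(dockerfile: str) -> StepResult:
    if "pip install pytest" not in dockerfile:
        return False, "ModuleNotFoundError: No module named 'pytest'"
    return True, "tests passed"


def fake_agent(lines: List[str], feedback: str) -> str:
    return "RUN pip install pytest"


result = BuildLoop(fake_build, fake_tests, fake_agent).run()
print(result)
```

In this toy run the first round fails at the test step, the stand-in agent appends one instruction, and the second round succeeds. A real agent would choose its next instruction from the failure log rather than return a fixed line.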

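Implications and Limitations

• Implications:
  ◦ Shows that an LLM-based automation agent can substantially improve the efficiency of collecting executable code data at scale.
  ◦ Demonstrates the potential of LLMs in software engineering and contributes to improving the software engineering capabilities of language models.
  ◦ The successful implementation of Repo2Run and its performance gains offer useful guidance for developing similar systems.

• Limitations:
  ◦ The evaluation covers only Python repositories, so generalization to other programming languages requires further study.
  ◦ The benchmark dataset is relatively small, so validation on larger datasets is needed.
  ◦ Handling of repositories with complex dependencies or special environment configuration may need further improvement.
  ◦ Because Repo2Run is LLM-based, LLM limitations (e.g., hallucination) may affect its performance.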
