Daily Arxiv

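This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.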

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Power Stabilization for AI Training Datacenters

A Systematic Study of Deep Learning Models and xAI Methods for Region-of-Interest Detection in MRI Scans

Documenting Deployment with Fabric: A Repository of Real-World AI Governance

Surya: Foundation Model for Heliophysics

Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

MCLPD:Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets

FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering

VerilogLAVD: LLM-Aided Rule Generation for Vulnerability Detection in Verilog

Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair

SecFSM: Knowledge Graph-Guided Verilog Code Generation for Secure Finite State Machines in Systems-on-Chip

Fortifying the Agentic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats

LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Preacher: Paper-to-Video Agentic System

Agoran: An Agentic Open Marketplace for 6G RAN Automation

Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

IBPS: Indian Bail Prediction System

Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

TS-Insight: Visualizing Thompson Sampling for Verification and XAI

When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models

Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Generation of structure-guided pMHC-I libraries using Diffusion Models

Cross-Modality Masked Learning for Survival Prediction in ICI Treated NSCLC Patients

MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation

KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis

Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques

A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis

Deep regularization networks for inverse problems with noisy operators

LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

On the Fundamental Impossibility of Hallucination Control in Large Language Models

Lossless Token Sequence Compression via Meta-Tokens

Versatile Cardiovascular Signal Generation with a Unified Diffusion Transformer

Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning

Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

CaRL: Learning Scalable Planning Policies with Simple Rewards

On the Consistency of GNN Explanations for Malware Detection

Cequel: Cost-Effective Querying of Large Language Models for Text Clustering

Kuwain 1.5B: An Arabic SLM via Language Injection

MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos

TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Embodied Long Horizon Manipulation with Closed-loop Code Generation and Incremental Few-shot Adaptation

Revisiting Out-of-Distribution Detection in Real-time Object Detection: From Benchmark Pitfalls to a New Mitigation Paradigm

A Case for Specialisation in Non-Human Entities

Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language

Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection

Innamark: A Whitespace Replacement Information-Hiding Method

Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering

RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation

Setup Once, Secure Always: A Single-Setup Secure Federated Learning Aggregation Protocol with Forward and Backward Secrecy for Dynamic Users

Self-Supervised Prompt Optimization

Learning to Generate Unit Tests for Automated Debugging

Modeling Discrimination with Causal Abstraction

Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Knowledge-Guided Prompt Learning for Request Quality Assurance in Public Code Review

Fine-tuning foundational models to code diagnoses from veterinary health records

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Continual Learning for Multimodal Data Fusion of a Soft Gripper

BoostTrack++: using tracklet information to detect more objects in multiple object tracking

OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data

CREMA: A Contrastive Regularized Masked Autoencoder for Robust ECG Diagnostics across Clinical Domains

Generating 3D Terrain with 2D Cellular Automata

Unplug and Play Language Models: Decomposing Experts in Language Models at Inference Time

Using a cognitive architecture to consider antiBlackness in design and development of AI systems

ITL-LIME: Instance-Based Transfer Learning for Enhancing Local Explanations in Low-Resource Data Settings

ThinkTuning: Instilling Cognitive Reflections without Distillation

A "good regulator theorem" for embodied agents

Prescriptive Agents based on RAG for Automated Maintenance (PARAM)

One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning

Opus: A Prompt Intention Framework for Complex Workflow Generation

Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy

Automatic Curriculum Design for Zero-Shot Human-AI Coordination

PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data

SycEval: Evaluating LLM Sycophancy

CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks

VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare

On Learning Action Costs from Input Plans

Human-Object Interaction from Human-Level Instructions

Non-linear Welfare-Aware Strategic Learning

CRISPR-GPT for Agentic Automation of Gene-editing Experiments

SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Neural Robot Dynamics

Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

"Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries

End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Numerical models outperform AI weather forecasts of record-breaking extremes

EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

Tutorial on the Probabilistic Unification of Estimation Theory, Machine Learning, and Generative AI

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Learning to Generate Unit Tests for Automated Debugging

Created by

Haebom

Authors

Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal

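Overview

This paper identifies a trade-off between generating unit test (UT) inputs that reveal errors and correctly predicting UT outputs without access to the gold solution. To address it, the authors propose UTGen, which trains LLMs to generate error-revealing UT inputs together with their correct expected outputs from the task description. Because model-generated tests can be noisy, UTDebug (i) uses test-time compute to improve UT output prediction and (ii) validates edits against multiple generated UTs and backtracks to avoid overfitting, which helps the LLM debug effectively. In experiments, UTGen outperforms other LLM-based baselines by 7.59% on a metric that measures both error-revealing UT inputs and correct UT outputs. Used with UTDebug, it improves the pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and on a harder debugging split of MBPP+ by more than 3.17% and 12.35%, respectively, over other LLM-based UT generation baselines. Feedback from the Qwen2.5 32B-based UTGen model also improves debugging with frontier LLMs such as GPT-4o by 13.8%. Finally, UTGen is shown to be a better judge of code correctness: with best-of-10 sampling using Qwen2.5 7B on HumanEval+, it outperforms a state-of-the-art 8B reward model by 4.43%.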
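The two UTDebug ideas summarized above (voting over several sampled output predictions, and keeping an edit only when it helps on multiple generated tests) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `vote_expected_output`, `pass_rate`, and `debug_with_backtracking`, the `propose_edit` callback that stands in for an LLM edit step, and the "accept only if the pass rate strictly improves" rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

UnitTest = Tuple[object, object]  # (test input, expected output)
Program = Callable[[object], object]


def vote_expected_output(sampled_outputs: Iterable[object]) -> object:
    """Return the most frequent prediction among several sampled outputs."""
    return Counter(sampled_outputs).most_common(1)[0][0]


def pass_rate(program: Program, tests: List[UnitTest]) -> float:
    """Fraction of tests passed; a raised exception counts as a failure."""
    passed = 0
    for test_input, expected in tests:
        try:
            if program(test_input) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0


def debug_with_backtracking(
    program: Program,
    propose_edit: Callable[[Program], Program],  # hypothetical stand-in for an LLM edit
    tests: List[UnitTest],
    rounds: int = 3,
) -> Tuple[Program, float]:
    """Keep an edit only if it raises the pass rate; otherwise discard it."""
    best, best_score = program, pass_rate(program, tests)
    for _ in range(rounds):
        if best_score == 1.0:
            break
        candidate = propose_edit(best)
        score = pass_rate(candidate, tests)
        if score > best_score:
            best, best_score = candidate, score
        # Otherwise backtrack: the candidate is dropped and `best` is unchanged.
    return best, best_score


# Toy example: repairing a buggy absolute-value function.
def buggy_abs(x):
    return x


def fixed_abs(x):
    return -x if x < 0 else x


tests = [
    (-2, vote_expected_output([2, 2, -2])),  # one noisy sample is outvoted
    (3, vote_expected_output([3, 3, 3])),
]
repaired, score = debug_with_backtracking(buggy_abs, lambda p: fixed_abs, tests)
```

In this toy run the proposed edit passes both tests and is kept. An edit that lowered the pass rate would be discarded and the previous program retained, which is the sense of "backtracking" assumed here.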

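Implications and Limitations

• Implications:

◦ Presents a new method for resolving the trade-off between generating error-revealing unit test inputs and accurately predicting their outputs.

◦ Improves LLM-based unit test generation and debugging performance through UTGen and UTDebug.

◦ Contributes to improving LLMs' ability to judge code correctness.

◦ Contributes to improving the debugging performance of frontier LLMs.

• Limitations:

◦ The gains from UTGen and UTDebug may depend on the specific LLM (Qwen2.5) and datasets used; further study of generalization to other LLMs and datasets is needed.

◦ Unit test generation and debugging performance on complex code still needs to be evaluated.

◦ The efficiency of UTDebug's overfitting-prevention strategy needs further analysis.

◦ Applicability and scalability to large codebases need to be evaluated.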
