Daily Arxiv

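This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.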

GBPP: Grasp-Aware Base Placement Prediction for Robots via Two-Stage Learning

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers

Physics-informed neural network solves minimal surfaces in curved spacetime

A funny companion: Distinct neural responses to perceived AI- versus human-generated humor

National Running Club Database: Assessing Collegiate Club Athletes' Cross Country Race Results

Online Learning Based Efficient Resource Allocation for LoRaWAN Network

MetaLLMix : An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization

Implicit Neural Representations of Intramyocardial Motion and Strain

MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values

MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

AI Governance in Higher Education: A course design exploring regulatory, ethical and practical considerations

Benchmarking Gender and Political Bias in Large Language Models

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Spiking Neural Networks for Continuous Control via End-to-End Model-Based Learning

ICR: Iterative Clarification and Rewriting for Conversational Search

ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

Keypoint-based Diffusion for Robotic Motion Planning on the NICOL Robot

PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems

OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation

GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

New Kid in the Classroom: Exploring Student Perceptions of AI Coding Assistants

Analysis of Fourier Neural Operators via Effective Field Theory

FCRF: Flexible Constructivism Reflection for Long-Horizon Robotic Task Planning with Large Language Models

PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training

Memorization Sinks: Isolating Memorization during LLM Training

Clue-RAG: Towards Accurate and Cost-Efficient Graph-based RAG via Multi-Partite Graph and Query-Driven Iterative Retrieval

OGF: An Online Gradient Flow Method for Optimizing the Statistical Steady-State Time Averages of Unsteady Turbulent Flows

AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models

Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN

Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights

Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models

MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing

Counterfactual Simulatability of LLM Explanations for Generation Tasks

PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

HiLAB: A Hybrid Inverse-Design Framework

Tuning-Free LLM Can Build A Strong Recommender Under Sparse Connectivity And Knowledge Gap Via Extracting Intent

WaterFlow: Learning Fast & Robust Watermarks using Stable Diffusion

Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding

Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation

Teaching Your Models to Understand Code via Focal Preference Alignment

Investigating the use of terrain-following coordinates in AI-driven precipitation forecasts

SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning

Safe Learning Under Irreversible Dynamics via Asking for Help

Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection

How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild

TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs

Pitfalls of defacing whole-head MRI: re-identification risk with diffusion models and compromised research potential

AI/ML Based Detection and Categorization of Covert Communication in IPv6 Network

Learn from Global Correlations: Enhancing Evolutionary Algorithm via Spectral GNN

Enhancing Automated Loop Invariant Generation for Complex Programs with Large Language Models

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints

Adversarial Prompt Distillation for Vision-Language Models

TrojanRobot: Physical-world Backdoor Attacks Against VLM-based Robotic Manipulation

The Belief State Transformer

A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data

Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

TRANSAGENT: An LLM-Based Multi-Agent System for Code Translation

Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

RingMo-Aerial: An Aerial Remote Sensing Foundation Model With Affine Transformation Contrastive Learning

Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning

Informed Correctors for Discrete Diffusion Models

EMOE: A Framework for Out-of-distribution Uncertainty Based Rejection via Model-Agnostic Expansive Matching of Experts

Empowering Time Series Analysis with Foundation Models: A Comprehensive Survey

Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions

Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation

When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models

Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration

Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture

Explaining Tournament Solutions with Minimal Supports

Neuromorphic Computing with Multi-Frequency Oscillations: A Bio-Inspired Approach to Artificial Intelligence

TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems

Small Language Models are the Future of Agentic AI

Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D

Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success

Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning

Robust Decision-Making Via Free Energy Minimization

CredID: Credible Multi-Bit Watermark for Large Language Models Identification

Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge

Overcoming classic challenges for artificial neural networks by providing incentives and practice

Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity

Concurrent Linguistic Error Detection (CLED): a New Methodology for Error Detection in Large Language Models

Contrastive timbre representations for musical instrument and synthesizer retrieval

HARMONIC: A Content-Centric Cognitive Robotic Architecture

RadGame: An AI-Powered Platform for Radiology Education

JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks

GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Created by Haebom

Authors

Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, Bruno D. Ferreira-Saraiva, Joao P. Matos-Carvalho

Overview

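This paper addresses the fact that, although state-of-the-art large language models (LLMs) are advancing rapidly as tools for automating code generation in scientific research, their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. It systematically benchmarks state-of-the-art LLMs on generating functional Python code in two increasingly difficult scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering with pyclugen and scikit-learn. The study uses structured zero-shot prompts that specify detailed requirements but omit in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance across multiple runs, and qualitatively by analyzing the errors that occur when code execution fails. The results show that only a small subset of models consistently generate correct, executable code. GPT-4.1 achieved a 100% success rate across all runs in both experimental tasks, whereas most other models succeeded in fewer than half of their runs, and only Grok-3 and Mistral-Large came close to comparable performance. Beyond benchmarking LLM performance, the approach also helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, the findings highlight the current limitations of LLMs for end-to-end scientific automation and underscore the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.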
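The multi-run evaluation described above (a per-model success rate plus a tally of failure types) can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the function names `run_snippet` and `summarize` are hypothetical, and a zero exit status stands in for "functional correctness", whereas the paper also checks prompt compliance.

```python
import subprocess
import sys
from collections import Counter


def run_snippet(code, timeout=30.0):
    """Run one generated script in a fresh interpreter.

    Returns (succeeded, error_label). Assumption: exit status 0 counts as
    success; a real benchmark would also validate the script's output.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "Timeout"
    if proc.returncode == 0:
        return True, ""
    # The last traceback line usually looks like "NameError: ...".
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else "UnknownError"
    return False, last.split(":", 1)[0]


def summarize(runs):
    """Map each model name to its success rate and a count of error types.

    `runs` maps a model name to the list of code strings it produced,
    one string per run.
    """
    report = {}
    for model, snippets in runs.items():
        outcomes = [run_snippet(s) for s in snippets]
        successes = sum(1 for ok, _ in outcomes if ok)
        errors = Counter(label for ok, label in outcomes if not ok)
        rate = successes / len(snippets) if snippets else 0.0
        report[model] = {"success_rate": rate, "errors": dict(errors)}
    return report


if __name__ == "__main__":
    # Hypothetical outputs from two placeholder models, two runs each.
    demo = {
        "model_a": ["print(1 + 1)", "print(sum(range(3)))"],
        "model_b": ["print('ok')", "raise ValueError('bad input')"],
    }
    for name, stats in summarize(demo).items():
        print(name, stats)
```

Running each script in a separate interpreter keeps one failing run from affecting the others, and the exception name on the last traceback line gives a coarse starting point for the qualitative error analysis.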

Implications and Limitations

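• Implications:

◦ Some LLMs, including GPT-4.1, show considerable capability in generating code for complex scientific computation.

◦ Benchmarking LLM performance also helps identify documentation and implementation problems in third-party libraries.

◦ The results highlight the current limitations of LLMs for end-to-end scientific automation and the importance of prompt engineering, library documentation, and model improvements.

• Limitations:

◦ The benchmark covers only a limited number of LLMs and libraries.

◦ Only zero-shot prompts were used, so the effect of in-context learning was not considered.

◦ The evaluation focuses on functional correctness; code efficiency and style were not considered.

◦ Further research on more diverse and complex scientific tasks is needed.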
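To make the zero-shot limitation concrete, the sketch below builds a structured prompt from a task description and a numbered requirement list, with in-context examples as an optional extra. The function name `build_prompt` and the section labels are hypothetical; the summary does not reproduce the paper's actual prompts, so this layout is only an assumption.

```python
def build_prompt(task, requirements, examples=None):
    """Assemble a structured prompt.

    With `examples` left as None the prompt is zero-shot: it states the task
    and its requirements but contains no worked examples. The section labels
    used here are illustrative, not taken from the paper.
    """
    parts = ["Task:", task.strip(), "", "Requirements:"]
    parts += [f"{i}. {req.strip()}" for i, req in enumerate(requirements, start=1)]
    if examples:
        parts += ["", "Examples:"]
        parts += [ex.strip() for ex in examples]
    return "\n".join(parts)


if __name__ == "__main__":
    zero_shot = build_prompt(
        "Write a Python script that clusters a synthetic dataset.",
        ["Use only documented library functions.", "Print the cluster labels."],
    )
    print(zero_shot)
```

Passing a list of examples turns the same prompt into a few-shot variant, which is the comparison the study leaves open.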
