Daily Arxiv

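This page curates artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.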

HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Interact-Custom: Customized Human Object Interaction Image Generation

A Self-Supervised Mixture-of-Experts Framework for Multi-behavior Recommendation

MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity

Dynamic Triangulation-Based Graph Rewiring for Graph Neural Networks

STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems

LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Graph-R1: Incentivizing the Zero-Shot Graph Learning Capability in LLMs via Explicit Reasoning

Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework

Humans Perceive Wrong Narratives from AI Reasoning Texts

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services

Learning to Drive Ethically: Embedding Moral Reasoning into Autonomous Driving

Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation

Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Algorithms (FakeIDet2)

Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

Steering Towards Fairness: Mitigating Political Bias in LLMs

Dynamic Context Compression for Efficient RAG

Irredundant $k$-Fold Cross-Validation

Prompt Engineering and the Effectiveness of Large Language Models in Enhancing Human Productivity

A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

The Joys of Categorical Conformal Prediction

Adversarial Manipulation of Reasoning Models using Internal Representations

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

A Hybrid Artificial Intelligence Method for Estimating Flicker in Power Systems (Changes are marked)

GLProtein: Global-and-Local Structure Aware Protein Representation Learning

Program Semantic Inequivalence Game with Large Language Models

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

Improving Quantization with Post-Training Model Expansion

Safe and Efficient Social Navigation through Explainable Safety Regions Based on Topological Features

A Simple Approach to Constraint-Aware Imitation Learning with Application to Autonomous Racing

Federated nnU-Net for Privacy-Preserving Medical Image Segmentation

ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation

Enhancing Automated Loop Invariant Generation for Complex Programs with Large Language Models

RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis

Categorical Data Clustering via Value Order Estimated Distance Metric Learning

Application of AI to formal methods - an analysis of current trends

Reconsidering the Performance of GAE in Link Prediction

See then Tell: Enhancing Key Information Extraction with Vision Grounding

Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

FFHFlow: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference

SoAy: A Solution-based LLM API-using Methodology for Academic Information Seeking

Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study

Rethinking Invariance Regularization in Adversarial Training to Improve Robustness-Accuracy Trade-off

Network Formation and Dynamics Among Multi-LLMs

NetGPT: Generative Pretrained Transformer for Network Traffic

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Explainability of Text Processing and Retrieval Methods: A Survey

The Ramon Llull's Thinking Machine for Automated Ideation

RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

MSARL: Decoupling Reasoning and Tool Use with Multi-Small-Agent Reinforcement Learning

Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search

Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

Technology as uncharted territory: Contextual integrity and the notion of AI as new ethical ground

Possible Principles for Aligned Structure Learning Agents

OptiMUS-0.3: Using Large Language Models to Model and Solve Optimization Problems at Scale

Prompt-to-Product: Generative Assembly via Bimanual Manipulation

OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models

Mixture of Contexts for Long Video Generation

FakeParts: a New Family of AI-Generated DeepFakes

Enabling Equitable Access to Trustworthy Financial Reasoning

Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Understanding, Protecting, and Augmenting Human Cognition with Generative AI: A Synthesis of the CHI 2025 Tools for Thought Workshop

Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance

ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering

Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees

ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts

WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations

ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

Research Challenges in Relational Database Management Systems for LLM Queries

Quantum Verifiable Rewards for Post-Training Qiskit Code Assistant

AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

Multi-Agent Penetration Testing AI for the Web

Uncertainty Aware-Predictive Control Barrier Functions: Safer Human Robot Interaction through Probabilistic Motion Forecasting

Exploring Machine Learning and Language Models for Multimodal Depression Detection

Speech Emotion Recognition via Entropy-Aware Score Selection

Surfel-based 3D Registration with Equivariant SE(3) Features

Evaluating Compositional Generalisation in VLMs and Diffusion Models

Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML

Unleashing Uncertainty: Efficient Machine Unlearning for Generative AI

Signs of Struggle: Spotting Cognitive Distortions across Language and Register

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer

Occlusion Robustness of CLIP for Military Vehicle Classification

SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

Provable Benefits of In-Tool Learning for Large Language Models

${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

EEGDM: Learning EEG Representation with Latent Diffusion Model

Generative Annotation for ASR Named Entity Correction

MobileCLIP2: Improving Multi-Modal Reinforced Training

Task Allocation for Autonomous Machines using Computational Intelligence and Deep Reinforcement Learning

Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired

Created by

Haebom

Authors

Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Anhong Guo

Overview

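This paper presents findings from an exploratory study with eight blind or visually impaired participants using ChatGPT's Advanced Voice with Video, a state-of-the-art live video AI released in late 2024. The study was conducted in real-world scenarios, such as locating objects and recognizing visual landmarks, across a variety of indoor and outdoor environments. The results show that current live video AI provides effective guidance and answers for static visual scenes but falls short of delivering the real-time descriptions needed in dynamic situations. Despite inaccuracies in spatial and distance information, participants used the visual information provided to supplement their mobility strategies. Although high-quality voice interaction led participants to perceive the system as human-like, its assumptions about users' visual abilities, hallucinations, generic responses, and sycophantic tendencies caused confusion and distrust and posed potential risks to blind and visually impaired users. Based on these findings, the paper discusses implications for assistive video AI agents, including integrating additional sensing capabilities for real-world use, determining appropriate moments to intervene beyond turn-taking interaction, and addressing ecological and safety concerns.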

Implications and Limitations

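• Implications:

◦ Confirms that live video AI is effective at providing information about static visual scenes.

◦ Shows that blind and visually impaired users can use the visual information provided to supplement their mobility strategies.

◦ Confirms that high-quality voice interaction can improve the user experience.

◦ Raises the need for additional sensing capabilities, appropriately timed interventions, and solutions to ecological and safety concerns before real-world deployment.

• Limitations:

◦ Insufficient real-time description in dynamic situations.

◦ Inaccurate spatial and distance information.

◦ Confusion, distrust, and potential risk arising from assumptions about users' visual abilities, hallucinations, generic responses, and sycophantic tendencies.

◦ Limited generalizability of the findings due to the small sample (eight participants).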
