Daily Arxiv

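This page curates artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.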

Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Emotionally Vulnerable Subtype of Internet Gaming Disorder: Measuring and Exploring the Pathology of Problematic Generative AI Use

Explaining raw data complexity to improve satellite onboard processing

Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants

Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training

Provable Speech Attributes Conversion via Latent Independence

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Paper2Video: Automatic Video Generation from Scientific Papers

MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models

Generalized Orders of Magnitude for Scalable, Parallel, High-Dynamic-Range Computation

LogAction: Consistent Cross-system Anomaly Detection through Logs via Active Domain Adaptation

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Learning to Reason for Hallucination Span Detection

Panorama: Fast-Track Nearest Neighbors

Feature Identification via the Empirical NTK

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Real-time Noise Detection and Classification in Single-Channel EEG: A Lightweight Machine Learning Approach for EMG, White Noise, and EOG Artifacts

The Sandbox Configurator: A Framework to Support Technical Assessment in AI Regulatory Sandboxes

CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

MORPH: Shape-agnostic PDE Foundation Models

Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control

ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification

From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

Reproducible workflow for online AI in digital health

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

A Survey of Reinforcement Learning for Large Reasoning Models

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

Scaling Performance of Large Language Model Pretraining

Towards Methane Detection Onboard Satellites

AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models

Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

Long Chain-of-Thought Reasoning Across Languages

MAHL: Multi-Agent LLM-Guided Hierarchical Chiplet Design with Adaptive Debugging

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols

MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

CoCoA: Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks

Understanding Teen Overreliance on AI Companion Chatbots Through Self-Reported Reddit Narratives

ERR@HRI 2.0 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Conversations

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Truth, Trust, and Trouble: Medical AI on the Edge

LLMs on a Budget? Say HOLA

The Role of Model Confidence on Bias Effects in Measured Uncertainties for Vision-Language Models

A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

Rethinking Losses for Diffusion Bridge Samplers

Think With Videos For Agentic Long-Video Understanding

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Intention-Conditioned Flow Occupancy Models

Product of Experts for Visual Generation

Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Tug-of-war between idioms' figurative and literal interpretations in LLMs

MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement

GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Inference-time Alignment in Continuous Space

STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution

Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization

Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation

Hakim: Farsi Text Embedding Model

Understanding In-context Learning of Addition via Activation Subspaces

Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Created by

Haebom

Authors

Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Overview

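Long-form video data is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content far more compactly than raw video. Textual representations are also easily processed by state-of-the-art large language models (LLMs), which enables reasoning over video content to answer complex natural language queries. To make long-form video tractable, we rely on the progressive construction of a text-based memory by a video captioner that operates on shorter video chunks, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log, which consists of short video captions. Because video captions tend to focus on human actions while questions may concern other information in the scene, we seek to enrich the memory with static scene descriptions from Vision Language Models (VLMs). Our video understanding system combines the LaViLa video captioner with an LLM to answer questions about videos. We first explore different ways of partitioning the video into meaningful segments that more accurately reflect the structure of the video content. Furthermore, we incorporate static scene descriptions into the captioning pipeline using the LLaVA VLM, which yields a more detailed and complete caption log and expands the range of questions that can be answered from the text memory. Finally, we fine-tune the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared with using separate captioning models for the two tasks. Our model, a controllable hybrid captioner, can alternate between caption types according to special input tokens that signal scene changes detected in the video.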
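The flow described above (chunk the video, caption each chunk, switch caption type with a control token when a scene change is detected, then hand the accumulated log to an LLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token strings `<action>` and `<scene>`, the `Chunk` and `LogEntry` types, the `stub_captioner` function, and the prompt layout are all hypothetical stand-ins, and the choice to emit both a scene caption and an action caption on a scene change is an assumption made here for clarity. In a real system the captioner callable would wrap the fine-tuned video captioner.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical control tokens. The source only states that a special input
# token selects the caption type; these exact strings are assumptions.
ACTION_TOKEN = "<action>"
SCENE_TOKEN = "<scene>"


@dataclass
class Chunk:
    start: float         # chunk start time in seconds
    end: float           # chunk end time in seconds
    scene_changed: bool  # result of some scene-change detector (assumed given)


@dataclass
class LogEntry:
    start: float
    end: float
    kind: str  # "scene" or "action"
    text: str


# A captioner takes a chunk and a control token and returns one caption.
Captioner = Callable[[Chunk, str], str]


def build_activity_log(chunks: Sequence[Chunk], captioner: Captioner) -> List[LogEntry]:
    """Progressively build the text memory, one short chunk at a time."""
    log: List[LogEntry] = []
    for chunk in chunks:
        if chunk.scene_changed:
            # Assumption: describe the static scene once whenever it changes.
            log.append(LogEntry(chunk.start, chunk.end, "scene",
                                captioner(chunk, SCENE_TOKEN)))
        log.append(LogEntry(chunk.start, chunk.end, "action",
                            captioner(chunk, ACTION_TOKEN)))
    return log


def format_prompt(log: Sequence[LogEntry], question: str) -> str:
    """Flatten the log into a text prompt that an LLM could answer from."""
    lines = [f"[{e.start:.0f}s-{e.end:.0f}s] ({e.kind}) {e.text}" for e in log]
    return "Activity log:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


def stub_captioner(chunk: Chunk, token: str) -> str:
    """Placeholder for a real hybrid captioner; returns canned text."""
    if token == SCENE_TOKEN:
        return "a kitchen with a stove and a sink"
    return "a person chops vegetables"


if __name__ == "__main__":
    chunks = [Chunk(0, 4, True), Chunk(4, 8, False)]
    log = build_activity_log(chunks, stub_captioner)
    print(format_prompt(log, "Where is the person?"))
```

Using one callable for both caption types mirrors the efficiency argument in the summary: a single model is loaded, and only the control token changes between calls.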

Implications and Limitations

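• Generates text-based summaries of video content so that an LLM can answer complex queries about the video.

• Captions video with the LaViLa captioner and adds static scene information from a VLM, making the caption log more accurate and complete.

• Fine-tunes the LaViLa captioner to produce both action and scene captions, improving the efficiency of the captioning pipeline.

• Introduces a controllable hybrid captioner that produces different caption types depending on detected scene changes.

• Limitations are not specifically discussed.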
