Daily Arxiv

This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing a summary, simply cite the source.

Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

Author

Haebom

Authors

Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Jun Ming Tan, Wenhe Feng, Seung Ki Moon

Overview

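This paper proposes a new hybrid deep learning framework for accurately extracting key information from 2D engineering drawings. Conventional OCR techniques produce unstructured output on such drawings because of their complex layouts and overlapping symbols. To address this, the framework integrates an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). YOLOv11 detects nine key categories: GD&T, general tolerances, dimensions, materials, notes, radii, surface roughness, threads, and title blocks. The detected OBBs are cropped and used to fine-tune Donut so that it generates structured JSON output. The authors compare two fine-tuning strategies, a single model for all categories versus category-specific models. The single model scored better on every evaluation metric, with higher precision (94.77% for GD&T), recall (100% for most categories), and F1 score (97.3%), and it reduced hallucination (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.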
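The detect, crop, and parse flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `OrientedBox`, `crop_window`, `parse_drawing`, and the `parse_patch` callback are hypothetical names introduced here, the detector output is assumed to arrive as center, size, and angle, and the parsing model is replaced by a stub so the sketch runs without YOLOv11 or Donut installed.

```python
import json
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Window = Tuple[int, int, int, int]


@dataclass
class OrientedBox:
    """Hypothetical detector output: center, size, rotation (radians), category."""
    cx: float
    cy: float
    width: float
    height: float
    angle: float
    category: str

    def corners(self) -> List[Point]:
        # Rotate the four corner offsets around the box center.
        c, s = math.cos(self.angle), math.sin(self.angle)
        hw, hh = self.width / 2, self.height / 2
        offsets = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
        return [(self.cx + dx * c - dy * s, self.cy + dx * s + dy * c)
                for dx, dy in offsets]

    def crop_window(self) -> Window:
        # Axis-aligned window (left, top, right, bottom) enclosing the rotated box.
        xs, ys = zip(*self.corners())
        return (math.floor(min(xs)), math.floor(min(ys)),
                math.ceil(max(xs)), math.ceil(max(ys)))


def parse_drawing(boxes: List[OrientedBox],
                  parse_patch: Callable[[str, Window], Dict]) -> str:
    """Crop each detected region and collect the parser's fields as JSON.

    parse_patch stands in for the fine-tuned document parsing model; its
    signature is an assumption made for this sketch.
    """
    records = []
    for box in boxes:
        window = box.crop_window()
        records.append({
            "category": box.category,
            "window": list(window),
            "fields": parse_patch(box.category, window),
        })
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    detections = [
        OrientedBox(50, 20, 40, 10, 0.0, "dimension"),
        OrientedBox(120, 80, 60, 20, math.radians(30), "gdt"),
    ]
    # Stub parser: a real system would run the parsing model on the cropped patch.
    print(parse_drawing(detections, lambda cat, win: {"text": "<parsed " + cat + ">"}))
```

The sketch takes the axis-aligned window around each rotated box because that is the simplest crop to reason about. A real pipeline might instead rotate the patch upright before parsing; the summary does not say which the authors do.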

Implications and Limitations

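• Implications:
  ◦ Presents a new deep learning framework that extracts key information from 2D engineering drawings accurately and efficiently.
  ◦ Integrating OBB detection with a transformer-based document parsing model improves accuracy and reduces manual effort.
  ◦ Confirms the advantage of the single-model fine-tuning strategy (higher precision, recall, and F1 score, with reduced hallucination).
  ◦ Supports scalable deployment in precision-driven industries.

• Limitations:
  ◦ The performance evaluation relies on a dataset built by the research team itself; generalization across different drawing types and levels of complexity still needs to be verified.
  ◦ Performance was evaluated on nine specific categories, so whether the approach generalizes to other kinds of information extraction requires further study.
  ◦ The framework depends on specific versions of YOLOv11 and Donut; performance may differ when other models are used.
  ◦ Further validation and optimization are needed before the framework is applied in real industrial settings.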
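As a quick consistency check on the reported single-model figures, the short calculation below recomputes F1 from precision and recall. It rests on two assumptions not stated in the summary: that the 97.3% F1 score refers to the GD&T category at 100% recall, and that the hallucination rate is the complement of precision. The function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(precision: float) -> float:
    """Assumed definition: the share of predicted items that are not correct."""
    return 1.0 - precision


# Figures quoted in the overview for GD&T with the single fine-tuned model.
gdt_precision = 0.9477
gdt_recall = 1.0  # assumption: GD&T is among the categories with 100% recall

gdt_f1 = f1_score(gdt_precision, gdt_recall)
gdt_hallucination = hallucination_rate(gdt_precision)

print(f"F1: {gdt_f1 * 100:.1f}%")                        # prints "F1: 97.3%"
print(f"Hallucination: {gdt_hallucination * 100:.2f}%")  # prints "Hallucination: 5.23%"
```

Under these assumptions the quoted precision, recall, F1, and hallucination figures are mutually consistent. That suggests, but does not confirm, that the paper defines hallucination as one minus precision.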
