Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

Avoidance Decoding for Diverse Multi-Branch Story Generation

HydroVision: Predicting Optically Active Parameters in Surface Water Using Computer Vision

HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices

MSA2-Net: Utilizing Self-Adaptive Convolution Module to Extract Multi-Scale Information in Medical Image Segmentation

Q-Learning-Driven Adaptive Rewiring for Cooperative Control in Heterogeneous Networks

Spotlighter: Revisiting Prompt Tuning from a Representative Mining View

Multimodal Iterative RAG for Knowledge Visual Question Answering

Embodied AI: Emerging Risks and Opportunities for Policy Action

Meta-learning ecological priors from large language models explains human learning and decision making

Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

Locus: Agentic Predicate Synthesis for Directed Fuzzing

Network-Level Prompt and Trait Leakage in Local Research Agents

The Information Dynamics of Generative Diffusion

Arbitrary Precision Printed Ternary Neural Networks with Holistic Evolutionary Approximation

Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

LinkAnchor: An Autonomous LLM-Based Agent for Issue-to-Commit Link Recovery

MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems

LanternNet: A Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations

When and Where do Data Poisons Attack Textual Inversion?

Covering a Few Submodular Constraints and Applications

Rethinking Data Protection in the (Generative) Artificial Intelligence Era

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization

Multimodal Medical Image Binding via Shared Text Embeddings

Open-Set LiDAR Panoptic Segmentation Guided by Uncertainty-Aware Learning

Revisiting Clustering of Neural Bandits: Selective Reinitialization for Mitigating Loss of Plasticity

LLM Embedding-based Attribution (LEA): Quantifying Source Contributions to Generative Model's Response for Vulnerability Analysis

A theoretical framework for self-supervised contrastive learning for continuous dependent data

Securing AI Agents with Information-Flow Control

FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

When a Reinforcement Learning Agent Encounters Unknown Unknowns

Group-in-Group Policy Optimization for LLM Agent Training

Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

LawFlow: Collecting and Simulating Lawyers' Thought Processes on Business Formation Case Studies

On Developers' Self-Declaration of AI-Generated Code: An Analysis of Practices

WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada

Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

HDVIO2.0: Wind and Disturbance Estimation with Hybrid Dynamics VIO

TruthLens: Visual Grounding for Universal DeepFake Reasoning

Impoola: The Power of Average Pooling for Image-Based Deep Reinforcement Learning

Efficiently Editing Mixture-of-Experts Models with Compressed Experts

Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Investigating a Model-Agnostic and Imputation-Free Approach for Irregularly-Sampled Multivariate Time-Series Modeling

Rapid Word Learning Through Meta In-Context Learning

FedP$^2$EFT: Federated Learning to Personalize PEFT for Multilingual LLMs

Predict, Cluster, Refine: A Joint Embedding Predictive Self-Supervised Framework for Graph Representation Learning

Survey on Hand Gesture Recognition from Visual Input

Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning

GalaxAlign: Mimicking Citizen Scientists' Multimodal Guidance for Galaxy Morphology Analysis

Soft-Transformers for Continual Learning

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Domain Consistency Representation Learning for Lifelong Person Re-Identification

Aligning Machine and Human Visual Representations across Abstraction Levels

Towards Agentic AI on Particle Accelerators

Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Banishing LLM Hallucinations Requires Rethinking Generalization

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games

Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems

From Metrics to Meaning: Time to Rethink Evaluation in Human-AI Collaborative Design

P2DT: Mitigating Forgetting in task-incremental Learning with progressive prompt Decision Transformer

Towards Agentic OS: An LLM Agent Framework for Linux Schedulers

CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs

ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

AHELM: A Holistic Evaluation of Audio-Language Models

The Ramon Llull's Thinking Machine for Automated Ideation

Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Integrating Activity Predictions in Knowledge Graphs

Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks

ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP

Deep Research Agents: A Systematic Examination And Roadmap

Gradients: When Markets Meet Fine-tuning -- A Distributed Approach to Model Optimization

ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research

Shutdownable Agents through POST-Agency

CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

Can Large Language Models Act as Ensembler for Multi-GNNs?

MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration

Frugal inference for control

On Generating Monolithic and Model Reconciling Explanations in Probabilistic Scenarios

A Survey on Human-AI Collaboration with Large Foundation Models

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

Created by

Haebom

Authors

Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Jun Ming Tan, Wenhe Feng, Seung Ki Moon

Outline

This paper proposes a hybrid deep learning framework for extracting key information from 2D engineering drawings. Conventional OCR techniques produce unstructured output on such drawings because of their complex layouts and overlapping symbols. To address this, the framework combines an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). YOLOv11 detects nine major categories (GD&T, general tolerances, dimensions, materials, annotations, radii, surface roughness, threads, and title blocks), and Donut is fine-tuned to generate structured JSON output. Two fine-tuning strategies are compared: a single model trained on all categories, and separate category-specific models. The single model performs better across all evaluation metrics, with higher precision (94.77% for GD&T), recall (100% for most categories), and F1 score (97.3%), and a lower hallucination rate (5.23%). The proposed framework improves accuracy, reduces manual work, and supports scalable deployment in precision-based industries.
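The two-stage flow described above (detect regions by category, then parse each region into structured JSON) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Detection` record, the `crop_id` handle, the `min_confidence` threshold of 0.5, and the `fake_parser` stand-in are all hypothetical, and the assumption that the parser runs on one cropped region at a time is not stated in the summary. A real system would replace the stubs with a YOLOv11 OBB detector and a fine-tuned Donut model.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# The nine categories named in the summary.
CATEGORIES = [
    "GD&T", "general tolerances", "dimensions", "materials", "annotations",
    "radii", "surface roughness", "threads", "title blocks",
]


@dataclass
class Detection:
    """One detected region (hypothetical record, not from the paper)."""
    category: str      # expected to be one of CATEGORIES
    crop_id: str       # hypothetical handle to the cropped image patch
    confidence: float  # detector score in [0, 1]


def parse_drawing(
    detections: List[Detection],
    parse_crop: Callable[[Detection], dict],
    min_confidence: float = 0.5,  # assumed threshold, not from the paper
) -> Dict[str, List[dict]]:
    """Parse each confident detection and group the results by category."""
    result: Dict[str, List[dict]] = {name: [] for name in CATEGORIES}
    for det in detections:
        # Skip unknown categories and low-confidence detections.
        if det.category not in result or det.confidence < min_confidence:
            continue
        result[det.category].append(parse_crop(det))
    return result


def fake_parser(det: Detection) -> dict:
    """Stand-in for the document parsing model; returns a dummy record."""
    return {"source": det.crop_id, "text": f"parsed {det.category}"}


detections = [
    Detection("dimensions", "crop_0", 0.91),
    Detection("threads", "crop_1", 0.34),      # below threshold, skipped
    Detection("title blocks", "crop_2", 0.88),
]
structured = parse_drawing(detections, fake_parser)
print(json.dumps(structured, indent=2))
```

Keeping the parser behind a plain callable means the same grouping logic works for either strategy compared in the paper: one callable for all categories, or a dispatcher that picks a category-specific model.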

Takeaways, Limitations

• Takeaways:

◦ Presents a deep learning framework for accurately and efficiently extracting key information from 2D engineering drawings.

◦ Improves accuracy and reduces manual effort by integrating OBB detection with a transformer-based document parsing model.

◦ Shows that fine-tuning a single model for all categories outperforms category-specific models, with higher precision, recall, and F1 score and fewer hallucinations.

◦ Supports scalable deployment in precision-based industries.

• Limitations:

◦ The evaluation relies on a dataset built by the research team itself, so generalization across other drawing types and complexities still needs to be verified.

◦ Performance was evaluated on nine specific categories; generalizability to other types of information extraction requires further study.

◦ The framework depends on specific versions of YOLOv11 and Donut, and performance may vary with other models.

◦ Further validation and optimization are required before application in real industrial environments.
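The precision, recall, F1, and hallucination figures cited in the takeaways can be made concrete with a small sketch. The summary does not define how hallucinations are counted, so the code below assumes a hallucination is any predicted field absent from the ground truth, which makes the hallucination rate equal to one minus precision. That assumption is at least consistent with the reported numbers (94.77% precision and 5.23% hallucination sum to 100%). The `extraction_metrics` function and the sample fields are hypothetical.

```python
from typing import Dict, Set, Tuple

Field = Tuple[str, str]  # (key, value) pair extracted from a drawing


def extraction_metrics(predicted: Set[Field], truth: Set[Field]) -> Dict[str, float]:
    """Field-level precision, recall, F1, and an assumed hallucination rate."""
    tp = len(predicted & truth)   # fields extracted correctly
    fp = len(predicted - truth)   # fields the model produced but truth lacks
    fn = len(truth - predicted)   # fields the model missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: share of predicted fields not found in the truth.
    hallucination = fp / len(predicted) if predicted else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination": hallucination,
    }


# Made-up example: three true fields, four predictions, one of them invented.
truth = {("diameter", "12.5"), ("tolerance", "+/-0.1"), ("material", "AL6061")}
predicted = truth | {("thread", "M8x1.25")}

metrics = extraction_metrics(predicted, truth)
print(metrics)
```

Under the same formulas, a precision of 0.9477 with a recall of 1.0 gives an F1 of about 0.973, in line with the 97.3% reported in the summary.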
