Daily Arxiv

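This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.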

ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

Created by

Haebom

Authors

ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang

Overview

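ZonUI-3B is a lightweight Vision-Language Model (VLM) specialized for graphical user interface (GUI) grounding. It achieves performance competitive with much larger VLMs of 7B or more parameters, yet it can be trained entirely on a single RTX 4090 GPU. It uses a multi-resolution dataset of 24K GUI screenshots spanning several platforms (mobile, desktop, web) and adopts a two-stage fine-tuning strategy: initial cross-platform training followed by specialized fine-tuning on high-resolution data. Data curation and redundancy reduction are used to emphasize data diversity, prioritizing data quality over quantity. On benchmarks such as ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro, it reaches high accuracy (84.9% on ScreenSpot, 86.4% on ScreenSpot-v2), outperforming existing models with fewer than 4B parameters. Ablation studies confirm the importance of balanced sampling and two-stage fine-tuning. The model is available at https://github.com/Han1018/ZonUI-3B.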
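To make the reported accuracy figures more concrete, here is a minimal sketch of how a GUI grounding score is often computed. It assumes the point-in-box criterion commonly used for ScreenSpot-style benchmarks (a prediction counts as correct when the predicted click point falls inside the target element's bounding box); the summary above does not state the exact criterion, and the names `Box` and `click_accuracy` are hypothetical, not taken from the ZonUI-3B repository.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target GUI element (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click on the border is treated as a hit (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(points: Sequence[Tuple[float, float]],
                   boxes: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(points, boxes))
    return hits / len(points)


if __name__ == "__main__":
    predictions = [(5.0, 5.0), (50.0, 50.0)]
    targets = [Box(0, 0, 10, 10), Box(0, 0, 10, 10)]
    print(click_accuracy(predictions, targets))
```

Under this criterion, a score such as 84.9% would mean that roughly 85 of every 100 predicted points landed inside the correct element.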

GitHub - Han1018/ZonUI-3B: ZonUI-3B — A lightweight, resolution-aware GUI grounding model trained with only 24K samples on a single RTX 4090.

Implications and Limitations

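• Implications:
  ◦ Enables high-performance GUI grounding with a lightweight VLM.
  ◦ Can be trained on a single GPU, which improves accessibility.
  ◦ Demonstrates the effectiveness of a dataset covering multiple platforms and resolutions combined with a two-stage fine-tuning strategy.
  ◦ Highlights the importance of data diversity.
  ◦ Outperforms other models with fewer than 4B parameters.

• Limitations:
  ◦ The dataset may still be limited in size (24K examples).
  ◦ Further study is needed on generalization to specific GUI types or specific resolutions.
  ◦ Performance and stability still need to be validated in real-world applications.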
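The balanced sampling credited in the ablation study can be pictured with a small sketch. This is only an illustration, assuming "balanced" means drawing an equal number of examples from each platform; the summary does not describe the authors' actual procedure, and `balanced_sample` and the record fields are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def balanced_sample(records: Sequence[Dict[str, Hashable]],
                    key: str,
                    per_group: int,
                    seed: int = 0) -> List[Dict[str, Hashable]]:
    """Draw up to `per_group` records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    picked = []
    for name in sorted(groups):  # sorted for a deterministic group order
        items = groups[name]
        # Small groups contribute everything they have.
        picked.extend(rng.sample(items, min(per_group, len(items))))
    return picked


if __name__ == "__main__":
    data = (
        [{"platform": "mobile", "id": i} for i in range(5)]
        + [{"platform": "web", "id": i} for i in range(2)]
        + [{"platform": "desktop", "id": i} for i in range(3)]
    )
    subset = balanced_sample(data, key="platform", per_group=2)
    print(len(subset))
```

Capping each platform at the same count keeps an over-represented platform from dominating the training mix, which is one simple way to favor diversity over raw volume.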
