Daily Arxiv

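This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.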

CEHR-XGPT: A Scalable Multi-Task Foundation Model for Electronic Health Records

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Adaptive Learning Strategies for Mitotic Figure Classification in MIDOG2025 Challenge

MitoDetect++: A Domain-Robust Pipeline for Mitosis Detection and Atypical Subtyping

Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Fantastic Pretraining Optimizers and Where to Find Them

Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework

TECP: Token-Entropy Conformal Prediction for LLMs

The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management

Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees

Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection

HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Food safety trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database

Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification

BayesSDF: Surface-Based Laplacian Uncertainty Estimation for 3D Geometry with Neural Signed Distance Fields

Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework

The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations

AI-Assisted Rapid Crystal Structure Generation Towards a Target Local Environment

First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning

AutoPDL: Automatic Prompt Optimization for LLM Agents

RailGoerl24: Görlitz Rail Test Center CV Dataset 2024

Revealing higher-order neural representations of uncertainty with the Noise Estimation through Reinforcement-based Diffusion (NERD) model

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Spoof Trace Discovery for Deep Learning Based Explainable Face Anti-Spoofing

The Information Security Awareness of Large Language Models

Automatically Detecting Online Deceptive Patterns

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Automated detection of underdiagnosed medical conditions via opportunistic imaging

Selective Preference Optimization via Token-Level Reward Function Estimation

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

PersonaGym: Evaluating Persona Agents and LLMs

CFaults: Model-Based Diagnosis for Fault Localization in C Programs with Multiple Test Cases

From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Demystifying Chains, Trees, and Graphs of Thoughts

Survival Analysis with Adversarial Regularization

Net2Brain: A Toolbox to compare artificial vision models with human brain responses

The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Dynamic Speculative Agent Planning

AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Graph RAG as Human Choice Model: Building a Data-Driven Mobility Agent with Preference Chain

MHSNet: An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design

Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

Translating Federated Learning Algorithms in Python into CSP Processes Using ChatGPT

ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Epistemic Skills: Reasoning about Knowledge and Oblivion

Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

GUI Agents: A Survey

Neural Network Verification with PyRAT

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation

MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts

WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

SpikingBrain Technical Report: Spiking Brain-inspired Large Models

Scaling Performance of Large Language Model Pretraining

Recomposer: Event-roll-guided generative audio editing

COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Uncertain but Useful: Leveraging CNN Variability into Data Augmentation

CURE: Controlled Unlearning for Robust Embeddings - Mitigating Conceptual Shortcuts in Pre-Trained Language Models

HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks

Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet

AI Agents for Web Testing: A Case Study in the Wild

Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection

Exploring Situated Stabilities of a Rhythm Generation System through Variational Cross-Examination

GenAI-based test case generation and execution in SDV platform

ICR: Iterative Clarification and Rewriting for Conversational Search

ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization

Pointing-Guided Target Estimation via Transformer-Based Attention

Adversarial Augmentation and Active Sampling for Robust Cyber Anomaly Detection

LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration

High-Resolution Global Land Surface Temperature Retrieval via a Coupled Mechanism-Machine Learning Framework

Exploring an implementation of quantum learning pipeline for support vector machines

DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation

Artificial intelligence for representing and characterizing quantum systems

PLaMo 2 Technical Report

SpiderNets: Estimating Fear Ratings of Spider-Related Images with Vision Models

The Paradox of Doom: Acknowledging Extinction Risk Reduces the Incentive to Prevent It

A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing

REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution

AI-Driven Fronthaul Link Compression in Wireless Communication Systems: Review and Method Design

Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images

Graph Unlearning: Efficient Node Removal in Graph Neural Networks

Enhancing Diversity in Large Language Models via Determinantal Point Processes

VARMA-Enhanced Transformer for Time Series Forecasting

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Created by

Haebom

Authors

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

Summary

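This paper introduces GUI-Owl, an open-source GUI agent model, and Mobile-Agent-v3, a general-purpose GUI agent framework built on it. GUI-Owl achieves state-of-the-art performance on ten GUI benchmarks spanning desktop and mobile environments, scoring 66.4 on AndroidWorld and 29.4 on OSWorld. Mobile-Agent-v3 builds on GUI-Owl to improve performance further, reaching 73.3 on AndroidWorld and 37.7 on OSWorld, a new state of the art among open-source GUI agent frameworks. GUI-Owl integrates three core innovations: large-scale environment infrastructure, diverse foundational agent capabilities, and scalable environment reinforcement learning. The large-scale environment infrastructure provides cloud-based virtual environments covering Android, Ubuntu, macOS, and Windows, supports a variety of data pipelines, and reduces manual annotation. The foundational agent capabilities combine UI grounding, planning, action semantics, and reasoning patterns to support end-to-end decision-making. The scalable environment reinforcement learning uses fully asynchronous training to improve alignment with real-world environments, and with Trajectory-aware Relative Policy Optimization (TRPO) it achieves a score of 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.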
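To make the reported numbers easier to compare, the following minimal Python sketch tabulates the four AndroidWorld and OSWorld scores quoted in the summary and computes how much Mobile-Agent-v3 gains over GUI-Owl. The dictionary layout and the function name `framework_gain` are illustrative assumptions, not part of the paper or its repository; only the four scores come from the summary.

```python
# Scores as quoted in the summary (benchmark -> score).
# The data layout and function name are hypothetical, for illustration only.
REPORTED = {
    "GUI-Owl": {"AndroidWorld": 66.4, "OSWorld": 29.4},
    "Mobile-Agent-v3": {"AndroidWorld": 73.3, "OSWorld": 37.7},
}


def framework_gain(scores, base="GUI-Owl", framework="Mobile-Agent-v3"):
    """Return {benchmark: (absolute_gain, relative_gain_percent)}."""
    gains = {}
    for bench, base_score in scores[base].items():
        new_score = scores[framework][bench]
        absolute = round(new_score - base_score, 1)
        relative = round(100 * (new_score - base_score) / base_score, 1)
        gains[bench] = (absolute, relative)
    return gains


if __name__ == "__main__":
    for bench, (absolute, relative) in framework_gain(REPORTED).items():
        print(f"{bench}: +{absolute} points ({relative}% relative)")
```

Under these inputs the framework adds 6.9 points on AndroidWorld and 8.3 points on OSWorld, so the relative improvement is noticeably larger on OSWorld, where the base score is lower.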


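Takeaways, Limitations

• Takeaways:
  ◦ Achieves a new best performance among open-source GUI agent models and frameworks.
  ◦ Demonstrates the effectiveness of large-scale environment infrastructure, diverse foundational agent capabilities, and a scalable reinforcement learning framework.
  ◦ Presents an efficient approach to data collection and training through automated data generation and validation.
  ◦ Supports multiple platforms (Android, Ubuntu, macOS, Windows).
  ◦ The modular design suggests applicability in multi-agent systems.

• Limitations:
  ◦ The range and number of benchmarks may be limited; generalization across a wider variety of GUI environments and tasks still needs to be verified.
  ◦ Robustness to complex real-world GUI interactions needs further evaluation.
  ◦ Analysis of specific algorithms such as TRPO may be insufficient; comparison with other reinforcement learning algorithms is needed.
  ◦ Research on model interpretability and explainability may be lacking.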
