Daily Arxiv
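This page collects artificial intelligence papers published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.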
CEHR-XGPT: A Scalable Multi-Task Foundation Model for Electronic Health Records
Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
Adaptive Learning Strategies for Mitotic Figure Classification in MIDOG2025 Challenge
MitoDetect++: A Domain-Robust Pipeline for Mitosis Detection and Atypical Subtyping
Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Fantastic Pretraining Optimizers and Where to Find Them
Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework
TECP: Token-Entropy Conformal Prediction for LLMs
The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management
Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees
Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
HuggingGraph: Understanding the Supply Chain of LLM Ecosystem
Food safety trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database
Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification
BayesSDF: Surface-Based Laplacian Uncertainty Estimation for 3D Geometry with Neural Signed Distance Fields
Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework
The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
AI-Assisted Rapid Crystal Structure Generation Towards a Target Local Environment
First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning
AutoPDL: Automatic Prompt Optimization for LLM Agents
RailGoerl24: Görlitz Rail Test Center CV Dataset 2024
Revealing higher-order neural representations of uncertainty with the Noise Estimation through Reinforcement-based Diffusion (NERD) model
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Spoof Trace Discovery for Deep Learning Based Explainable Face Anti-Spoofing
The Information Security Awareness of Large Language Models
Automatically Detecting Online Deceptive Patterns
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Automated detection of underdiagnosed medical conditions via opportunistic imaging
Selective Preference Optimization via Token-Level Reward Function Estimation
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
PersonaGym: Evaluating Persona Agents and LLMs
CFaults: Model-Based Diagnosis for Fault Localization in C Programs with Multiple Test Cases
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Demystifying Chains, Trees, and Graphs of Thoughts
Survival Analysis with Adversarial Regularization
Net2Brain: A Toolbox to compare artificial vision models with human brain responses
The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Dynamic Speculative Agent Planning
AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning
Graph RAG as Human Choice Model: Building a Data-Driven Mobility Agent with Preference Chain
MHSNet: An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design
Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning
Translating Federated Learning Algorithms in Python into CSP Processes Using ChatGPT
ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding
Epistemic Skills: Reasoning about Knowledge and Oblivion
Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
GUI Agents: A Survey
Neural Network Verification with PyRAT
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation
MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts
WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
SpikingBrain Technical Report: Spiking Brain-inspired Large Models
Scaling Performance of Large Language Model Pretraining
Recomposer: Event-roll-guided generative audio editing
COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
Uncertain but Useful: Leveraging CNN Variability into Data Augmentation
CURE: Controlled Unlearning for Robust Embeddings - Mitigating Conceptual Shortcuts in Pre-Trained Language Models
HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks
Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet
AI Agents for Web Testing: A Case Study in the Wild
Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection
Exploring Situated Stabilities of a Rhythm Generation System through Variational Cross-Examination
GenAI-based test case generation and execution in SDV platform
ICR: Iterative Clarification and Rewriting for Conversational Search
ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions
Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization
Pointing-Guided Target Estimation via Transformer-Based Attention
Adversarial Augmentation and Active Sampling for Robust Cyber Anomaly Detection
LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration
High-Resolution Global Land Surface Temperature Retrieval via a Coupled Mechanism-Machine Learning Framework
Exploring an implementation of quantum learning pipeline for support vector machines
DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation
Artificial intelligence for representing and characterizing quantum systems
PLaMo 2 Technical Report
SpiderNets: Estimating Fear Ratings of Spider-Related Images with Vision Models
The Paradox of Doom: Acknowledging Extinction Risk Reduces the Incentive to Prevent It
A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing
REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts
PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution
AI-Driven Fronthaul Link Compression in Wireless Communication Systems: Review and Method Design
Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images
Graph Unlearning: Efficient Node Removal in Graph Neural Networks
Enhancing Diversity in Large Language Models via Determinantal Point Processes
VARMA-Enhanced Transformer for Time Series Forecasting
The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
Mobile-Agent-v3: Fundamental Agents for GUI Automation
Created by
Haebom
Authors
Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
Abstract
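This paper introduces GUI-Owl, an open-source GUI agent model, and Mobile-Agent-v3, a general-purpose GUI agent framework built on it. GUI-Owl achieves state-of-the-art performance on ten GUI benchmarks across desktop and mobile environments, scoring 66.4 on AndroidWorld and 29.4 on OSWorld. Mobile-Agent-v3 builds on GUI-Owl to improve performance further, reaching 73.3 on AndroidWorld and 37.7 on OSWorld, a new state of the art among open-source GUI agent frameworks. GUI-Owl integrates three core innovations: large-scale environment infrastructure, diverse foundational agent capabilities, and scalable environment reinforcement learning. The large-scale environment infrastructure provides cloud-based virtual environments covering Android, Ubuntu, macOS, and Windows, supports a variety of data pipelines, and reduces manual annotation. The diverse foundational agent capabilities integrate UI grounding, planning, action semantics, and reasoning patterns to support end-to-end decision-making. Scalable environment reinforcement learning improves alignment with real-world environments through fully asynchronous training, and Trajectory-aware Relative Policy Optimization (TRPO) achieves 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.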
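The gains of Mobile-Agent-v3 over GUI-Owl follow from the quoted scores by simple subtraction. A minimal Python sketch, using only the numbers reported in the abstract; the dictionary layout and variable names are illustrative assumptions and do not come from the paper or its repository.

```python
# Scores quoted in the abstract (benchmark -> score).
# The dict layout and names are illustrative assumptions, not from the paper.
gui_owl = {"AndroidWorld": 66.4, "OSWorld": 29.4}
mobile_agent_v3 = {"AndroidWorld": 73.3, "OSWorld": 37.7}

# Absolute gain of the framework over the base model, rounded to one
# decimal place to suppress floating-point noise.
gains = {
    name: round(mobile_agent_v3[name] - gui_owl[name], 1)
    for name in gui_owl
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Under these figures the framework adds 6.9 points on AndroidWorld and 8.3 points on OSWorld relative to the base model.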
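Takeaways, Limitations
• Takeaways:
◦ Sets a new state of the art among open-source GUI agent models and frameworks.
◦ Demonstrates the effectiveness of large-scale environment infrastructure, diverse foundational agent capabilities, and a scalable reinforcement learning framework.
◦ Presents an efficient approach to data collection and training through automated data generation and validation.
◦ Supports multiple platforms (Android, Ubuntu, macOS, Windows).
◦ Shows that the modular design makes the model usable within multi-agent systems.
• Limitations:
◦ The variety and number of benchmarks may be limited; generalization across a wider range of GUI environments and tasks still needs to be verified.
◦ Robustness to complex real-world GUI interactions needs further evaluation.
◦ Analysis of specific algorithms such as TRPO may be insufficient; a comparison with other reinforcement learning algorithms is needed.
◦ Research on the model's interpretability and explainability may be lacking.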