This paper explores an approach that uses natural language (NL) test cases for GUI application verification, focusing specifically on the potential for LLM agents to directly execute NL test cases. To address the unsoundness and execution-consistency issues of NL test cases, we propose an algorithm that executes NL test cases under a guardrail mechanism, with a specialized agent that dynamically verifies each test step. Furthermore, we present metrics for evaluating test execution performance and execution consistency, and we define weak unsoundness, which characterizes acceptable NL test case execution at an industrial quality level (Six Sigma). Experiments using eight publicly available LLMs ranging from 3B to 70B parameters demonstrate both the potential and the current limitations of LLM agents for GUI testing. The results show that Meta Llama 3.1 70B achieves acceptable performance with high execution consistency (above the 3 Sigma level). A prototype tool, test suite, and results are also provided.
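The guardrail mechanism described above can be illustrated with a minimal sketch: each NL step is executed by an agent and then independently checked by a verifying agent, with bounded retries before the run is aborted. All names and the retry policy here are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepResult:
    """Outcome of one NL test-step execution (hypothetical structure)."""
    success: bool
    detail: str = ""

def run_nl_test_case(
    steps: List[str],
    execute_step: Callable[[str], StepResult],  # LLM agent executes one NL step
    verify_step: Callable[[str], bool],         # specialized agent checks the resulting state
    max_retries: int = 2,
) -> bool:
    """Guardrail loop: every step's outcome must pass independent
    verification; a failed check triggers a bounded retry, and the
    whole run is rejected once retries are exhausted."""
    for step in steps:
        for _attempt in range(max_retries + 1):
            result = execute_step(step)
            if result.success and verify_step(step):
                break  # step verified, proceed to the next one
        else:
            return False  # guardrail: abort an unsound/inconsistent run
    return True

if __name__ == "__main__":
    # Toy stand-ins for the LLM agents:
    ok = run_nl_test_case(
        ["open settings", "toggle dark mode"],
        execute_step=lambda s: StepResult(True),
        verify_step=lambda s: True,
    )
    print(ok)  # → True
```

In this sketch, bounding retries and aborting on repeated verification failure is what makes nondeterministic NL execution yield a definite pass/fail verdict.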