Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

Created by
  • Haebom

Author

Hyunjun Kim, Sejong Kim

MacroBench: a benchmark for LLM-based synthesis of browser automation macros

Outline

This paper introduces MacroBench, a code-first benchmark that evaluates the ability to synthesize reusable browser automation programs (macros) from natural-language goals. Models read the HTML/DOM and generate Selenium code for 681 tasks across seven self-hosted sites, spanning a range of interaction complexity and targeting difficulty. The generated code is validated through static inspection, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and the benchmark also includes safety evaluations covering scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, success rates were 96.8% for GPT-4o-mini, 95.3% for GPT-4o, 89.0% for Gemini, and 83.4% for DeepSeek. The models reliably handle simple tasks but fail on complex workflows, and even when the generated code is functionally correct, it often falls short of production-quality coding standards.
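To make the evaluation pipeline concrete, the sketch below shows the general shape of a Selenium macro a model might synthesize, followed by a simple DOM assertion of the kind the outline describes for result verification. This is an illustrative assumption, not code from the paper: the URL, element IDs, credentials, and expected header text are hypothetical.

```python
# Minimal sketch (not from MacroBench itself): a synthesized Selenium macro
# for a hypothetical self-hosted login task, with a DOM assertion as the
# outcome check. All selectors, URLs, and values are illustrative.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # Navigate to the (hypothetical) target site and fill the login form.
    driver.get("http://localhost:8000/login")
    driver.find_element(By.ID, "username").send_keys("alice")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.ID, "submit").click()

    # Result verification via a DOM assertion: the dashboard header must appear.
    header = driver.find_element(By.CSS_SELECTOR, "h1.dashboard-title")
    assert header.text == "Dashboard", f"unexpected header: {header.text!r}"
finally:
    driver.quit()
```

In the benchmark's setup, such a script would additionally pass static inspection and run inside a sandbox, with database snapshots used alongside DOM assertions to confirm the task's side effects.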

Takeaways, Limitations

Takeaways:
LLMs show strong results in synthesizing browser automation macros.
MacroBench provides a benchmark for evaluating LLM performance on tasks of varying difficulty.
The released benchmark and evaluation framework enable reproducible evaluation.
Limitations:
Model performance degrades on complex workflows.
Generated code does not meet production-quality coding standards.