This paper introduces MacroBench, a code-first benchmark that evaluates whether language models can synthesize reusable browser automation programs (macros) from natural language goals by reading HTML/DOM and emitting Selenium code. The benchmark comprises 681 tasks across seven self-hosted sites, spanning a range of interaction complexity and targeting difficulty. Generated code is validated through static inspection, sandboxed execution, and outcome verification (DOM assertions and database snapshots), and the evaluation includes safety assessments for scraping, spam/exploit, and credential/privacy prompts. Across 2,636 model-task runs, GPT-4o-mini achieved a 96.8% success rate, GPT-4o 95.3%, Gemini 89.0%, and DeepSeek 83.4%. While the models reliably handle simple tasks, they often fail on complex workflows, and even functionally correct solutions frequently fall short of production-quality coding standards.
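
To make the evaluation target concrete, the following is a minimal illustrative sketch (not taken from the paper or its benchmark tasks) of the kind of Selenium macro a model might emit for a hypothetical "log in and post a comment" goal, together with a DOM assertion of the sort an outcome-verification harness could check; the site URL, element selectors, and credentials are assumptions for illustration only.

    # Hypothetical macro: selectors, URL, and credentials are placeholders.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Log in to a (hypothetical) self-hosted site.
        driver.get("http://localhost:8000/login")
        driver.find_element(By.NAME, "username").send_keys("alice")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

        # Perform the task: post a comment on an article.
        driver.get("http://localhost:8000/posts/1")
        driver.find_element(By.NAME, "comment").send_keys("Nice article!")
        driver.find_element(By.CSS_SELECTOR, "form.comment-form button").click()

        # Outcome verification via DOM assertion; a database snapshot check
        # would be performed outside the browser by the harness.
        assert "Nice article!" in driver.find_element(By.ID, "comments").text
    finally:
        driver.quit()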