This paper examines the potential and limitations of large language models (LLMs) in self-driving laboratories (SDLs) for materials research. We introduce AILA, a framework for automating atomic force microscopy (AFM) with LLM-based agents, and develop AFMBench, a comprehensive benchmark for evaluating AI agents across the entire scientific workflow, from experimental design to results analysis. Our evaluations show that even state-of-the-art models struggle with basic tasks and tuning scenarios. Notably, Claude 3.5 performs well on a materials domain question-answering (QA) benchmark yet unexpectedly underperforms when deployed as an agent within AILA, suggesting that domain-specific QA capability does not necessarily translate into effective agentic functionality. We also find that LLMs are prone to deviating from instructions and are vulnerable to prompt sensitivity, where small changes in prompt wording can significantly affect performance, raising safety-alignment concerns for SDL applications. We demonstrate that a multi-agent framework outperforms a single-agent architecture, and we evaluate AILA's effectiveness on experiments of increasing difficulty, including AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection.