This paper addresses the challenges that vision-language models (VLMs) face in understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and accurate object state detection are required. We present LEGO Co-builder, a hybrid benchmark that combines real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures step-by-step visual states and procedural instructions, enabling controlled evaluation of instruction following, object detection, and state detection. Leading VLMs such as GPT-4o, Gemini, and Qwen-VL are evaluated under our unified framework in both zero-shot and fine-tuned settings. The results show that even advanced models such as GPT-4o struggle with fine-grained assembly tasks, reaching a maximum F1 score of only 40.54% on state detection, which reveals a clear gap in fine-grained visual understanding. To support future research on multimodal assembly assistance, we publicly release the benchmark, codebase, and generation pipeline.