To address the challenges that Vision-Language-Action (VLA) models face in complex inference and long-term task planning, we propose ManiAgent, an agent-based architecture that converts task descriptions and environmental inputs into robot manipulation actions end-to-end. The architecture handles complex manipulation scenarios efficiently by leveraging inter-agent communication for environmental perception, subtask decomposition, and action generation. It achieves an 86.8% success rate on the SimplerEnv benchmark and a 95.8% success rate on a real-world pick-and-place task, and it further enables efficient data collection for VLA training, with the resulting models performing comparably to those trained on human-annotated datasets.
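To make the agentic structure concrete, the sketch below shows a minimal perception → subtask-decomposition → action-generation pipeline with message passing between agents. It is an illustrative assumption of how such an architecture could be organized, not the paper's implementation: all names (`PerceptionAgent`, `PlannerAgent`, `ActionAgent`, `run_pipeline`) and the message format are hypothetical, and a real system would back the planner and perception agents with VLM/LLM queries rather than the hard-coded toy logic used here.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    position: tuple  # (x, y, z) in the robot base frame


@dataclass
class Message:
    """Simple inter-agent message: sender name plus an arbitrary payload."""
    sender: str
    payload: dict


class PerceptionAgent:
    """Turns raw environmental input into a structured scene description."""

    def observe(self, detections: list[SceneObject]) -> Message:
        scene = {obj.name: obj.position for obj in detections}
        return Message(sender="perception", payload={"scene": scene})


class PlannerAgent:
    """Decomposes a task description into an ordered list of subtasks."""

    def plan(self, task: str, scene_msg: Message) -> Message:
        scene = scene_msg.payload["scene"]
        # Toy decomposition for a pick-and-place instruction; a real agent
        # would query a language model with the task and scene description.
        target, destination = "red_block", "tray"
        subtasks = [
            {"skill": "move_to", "object": target},
            {"skill": "grasp", "object": target},
            {"skill": "move_to", "object": destination},
            {"skill": "release", "object": target},
        ]
        # Keep only subtasks whose objects were actually perceived.
        subtasks = [s for s in subtasks if s["object"] in scene]
        return Message(sender="planner", payload={"subtasks": subtasks})


class ActionAgent:
    """Maps each subtask to a low-level action command."""

    def act(self, subtask: dict, scene_msg: Message) -> dict:
        position = scene_msg.payload["scene"][subtask["object"]]
        return {"command": subtask["skill"], "target_xyz": position}


def run_pipeline(task: str, detections: list[SceneObject]) -> list[dict]:
    """End-to-end pass: perception -> planning -> per-subtask action generation."""
    perception, planner, actor = PerceptionAgent(), PlannerAgent(), ActionAgent()
    scene_msg = perception.observe(detections)
    plan_msg = planner.plan(task, scene_msg)
    return [actor.act(s, scene_msg) for s in plan_msg.payload["subtasks"]]


if __name__ == "__main__":
    objects = [SceneObject("red_block", (0.4, 0.1, 0.02)),
               SceneObject("tray", (0.5, -0.2, 0.0))]
    for action in run_pipeline("put the red block on the tray", objects):
        print(action)
```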