This paper studies the use of Large Language Model (LLM) agents to solve structured victim-rescue tasks in multi-agent environments. The agents operate in a graph-based environment that requires division of labor, prioritization, and collaborative planning, and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate performance using a range of collaboration-sensitive metrics, including task success rate, duplicate work, room collisions, and urgency-weighted efficiency. This study provides new insights into the strengths and failure modes of LLMs in physically grounded multi-agent collaborative tasks, informing future benchmarks and architectural improvements.