To evaluate the algorithmic reasoning capabilities of large language models (LLMs), we focus on structural reasoning: understanding and manipulating relationships such as order, hierarchy, and connectivity. DSR-Bench is the first benchmark to systematically evaluate the structural reasoning of LLMs through standard data structures, which provide interpretable and algorithmically meaningful abstractions. It comprises 20 data structures, 35 operations, and 4,140 synthetically generated problem instances. Its hierarchical design allows specific failure modes to be pinpointed, while fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs shows that even the best-performing model scores only 0.498 on challenging instances, exposing significant limitations. Additional evaluations reveal weaknesses in reasoning over spatial data, in natural-language scenarios, and on self-generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping to identify reasoning bottlenecks and support the development of more robust and reliable LLMs.
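To make the idea of synthetically generated, automatically gradable instances concrete, the following is a minimal, hypothetical sketch in Python of what a stack-operation instance and its exact-match check might look like. The function names (`make_stack_instance`, `grade`), prompt wording, and answer format are illustrative assumptions, not DSR-Bench's actual generation pipeline or scoring code.

```python
# Hypothetical sketch: a synthetically generated stack instance and its
# automated check. Names and formats are illustrative, not DSR-Bench's API.
import random
from typing import List, Tuple

def make_stack_instance(n_ops: int = 8, seed: int = 0) -> Tuple[str, List[int]]:
    """Generate a random sequence of stack operations plus the ground-truth
    final stack content, so answers can be graded without human judgment."""
    rng = random.Random(seed)
    stack: List[int] = []
    lines = []
    for _ in range(n_ops):
        if stack and rng.random() < 0.4:
            stack.pop()
            lines.append("pop()")
        else:
            v = rng.randint(0, 99)
            stack.append(v)
            lines.append(f"push({v})")
    prompt = (
        "Starting from an empty stack, apply the operations in order and "
        "report the final stack from bottom to top:\n" + "\n".join(lines)
    )
    return prompt, stack  # ground truth enables fully automated scoring

def grade(model_answer: str, ground_truth: List[int]) -> bool:
    """Exact-match check: parse the model's answer into integers and compare."""
    try:
        predicted = [int(tok) for tok in model_answer.replace(",", " ").split()]
    except ValueError:
        return False
    return predicted == ground_truth

if __name__ == "__main__":
    prompt, truth = make_stack_instance(seed=42)
    print(prompt)
    print("Ground truth (bottom to top):", truth)
    print("Example grading:", grade(" ".join(map(str, truth)), truth))
```

Because the ground truth is produced alongside each instance, grading reduces to a deterministic comparison, which is what allows evaluation at the scale of thousands of instances to remain objective and consistent.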