In this paper, we propose TreeBench, a new benchmark for comprehensively evaluating visual evidence-based reasoning models. TreeBench is built on three principles: visual recognition of subtle objects in complex scenes, traceable evidence via bounding-box evaluation, and second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. It consists of 405 challenging visual question-answer pairs, manually annotated by experts from 1,000 high-quality images sampled from the SA-1B dataset. Even state-of-the-art models fall short of 60% accuracy; OpenAI-o3, for example, scores only 54.87%. In addition, we present TreeVGR, a training paradigm that jointly supervises localization and reasoning with reinforcement learning, enabling accurate localization and explainable reasoning paths. TreeVGR yields significant performance improvements on V*Bench, MME-RealWorld, and TreeBench.
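To make the joint supervision concrete, below is a minimal sketch of how a rule-based RL reward might combine answer accuracy with bounding-box overlap, so that the policy is rewarded both for the final answer and for grounding its evidence. This is an illustrative assumption, not the paper's exact reward: the function names, the per-ground-truth-box best-match IoU aggregation, and the `w_loc` weighting are all hypothetical.

```python
# Sketch of a joint answer-accuracy + localization reward, in the spirit of
# TreeVGR's traceable-evidence training. All names and the 0.5 weighting are
# illustrative assumptions, not the paper's exact formulation.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_reward(pred_answer, gold_answer, pred_boxes, gold_boxes, w_loc=0.5):
    """Combine answer correctness with a localization term.

    Each ground-truth box is matched to its best-overlapping predicted box,
    so the model is only credited for evidence it actually localizes.
    """
    accuracy = 1.0 if pred_answer == gold_answer else 0.0
    if gold_boxes:
        loc = sum(
            max((iou(gt, pred) for pred in pred_boxes), default=0.0)
            for gt in gold_boxes
        ) / len(gold_boxes)
    else:
        loc = 0.0
    return accuracy + w_loc * loc


# Example: a correct answer with a well-aligned evidence box earns both terms.
reward = joint_reward(
    "B", "B",
    pred_boxes=[(10, 20, 60, 80)],
    gold_boxes=[(12, 18, 58, 82)],
)
```

Matching each ground-truth box to its best prediction (rather than the reverse) means the localization term cannot be inflated by scattering many spurious boxes, which fits the benchmark's emphasis on traceable evidence.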