This paper introduces HardcoreLogic, a benchmark designed to evaluate whether large reasoning models (LRMs) can flexibly apply rules when confronted with non-canonical game variants rather than only the standard formulations they are likely to have seen during training. To overcome the limitations of existing benchmarks, HardcoreLogic comprises over 5,000 puzzles spanning ten games, created by transforming standard puzzles along three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP). Experimental results show that even models that achieve high scores on existing benchmarks suffer significant performance drops on HardcoreLogic, suggesting a reliance on memorized solution patterns rather than genuine rule understanding.