While OpenAI's o3-preview reasoning model surpassed human accuracy on the ARC-AGI benchmark, we investigate whether state-of-the-art models recognize and reason with the abstractions the tasks are designed to test. We examine the abstraction abilities of models on ConceptARC, evaluating them under settings that vary the input modality (textual vs. visual), whether the model may use external Python tools, and the reasoning effort the model expends. In addition to measuring output accuracy, we systematically evaluate the natural-language rules the models generate to explain their solutions. This dual evaluation lets us assess whether a model solves a task via the abstractions ConceptARC was designed to measure, rather than by exploiting surface-level patterns. Our results show that, while some models using textual representations match human output accuracy, even the best models' rules frequently rely on surface-level "shortcuts" and capture the intended abstractions far less often than humans' rules do. Accuracy alone therefore overestimates their general abstract reasoning ability. In the visual modality, models' output accuracy drops sharply, yet rule-level analysis suggests this underestimates them: a substantial fraction of their rules still capture the intended abstraction, but those rules are often applied incorrectly. In short, our results indicate that models still lag behind humans in abstract reasoning, and that judging abstraction-centric benchmarks like the ARC family by accuracy alone can overestimate abstract reasoning ability in textual settings and underestimate it in visual ones.