This paper improves and applies feature inversion techniques to investigate the operating principles of deep neural networks, focusing on Transformer-based vision models: the Detection Transformer (DETR) and the Vision Transformer (ViT). We propose a novel modular transformation technique that makes existing feature inversion methods more efficient. Through qualitative and quantitative analysis of the reconstructed images, we gain insight into these models' internal representations. Specifically, we analyze how the models encode contextual shape and image details, how representations correlate across layers, and how robust they are to color changes. The code for our experiments is publicly available.
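For readers unfamiliar with feature inversion, the sketch below illustrates the general idea: optimize an input image so that its activations at a chosen layer match those of a target image. This is a minimal illustration, not the paper's method; the model (torchvision's ViT-B/16), the hooked layer, the optimizer settings, and the total-variation weight are all assumptions chosen for the example.

```python
import torch
import torchvision.models as models

# Minimal feature-inversion sketch: recover an image whose activations at a
# chosen ViT encoder layer match those of a target image. All hyperparameters
# here (layer index, learning rate, TV weight, step count) are illustrative.

model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

features = {}
def hook(module, inputs, output):
    features["act"] = output

# Hook an intermediate encoder block (index 6 is an arbitrary choice).
model.encoder.layers[6].register_forward_hook(hook)

# Stand-in for a real, appropriately normalized target image.
target_img = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    model(target_img)
target_act = features["act"].detach()

# Image being reconstructed, optimized directly in pixel space.
x = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    model(x)
    # Match activations at the hooked layer, plus a total-variation
    # smoothness prior to suppress high-frequency artifacts.
    feat_loss = torch.nn.functional.mse_loss(features["act"], target_act)
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
         (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    (feat_loss + 0.01 * tv).backward()
    opt.step()
    x.data.clamp_(0, 1)  # keep pixel values in a valid range
```

Inspecting `x` after optimization shows which aspects of the input (shape, texture, color) the hooked layer retains, which is the kind of evidence the analyses in this paper build on.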