This paper proposes the Multimodality Interactive Transformer (MM-ITF), a model that enables robots to predict target objects from human pointing gestures in human-robot interaction (HRI). MM-ITF maps 2D pointing gestures to candidate object locations and assigns each location a likelihood score, identifying the most likely target. Experiments with the NICOL robot in a controlled tabletop environment using monocular RGB data demonstrate accurate target object prediction. A patch confusion matrix is introduced to evaluate model performance. The code is available on GitHub.
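Since the abstract does not expose implementation details, the following is a minimal Python sketch of the patch-scoring idea it describes: each image patch receives a likelihood score, the highest-scoring patch is taken as the predicted target, and a patch-level confusion matrix tallies predictions against ground-truth patches. All function names, the grid size, and the evaluation scheme below are illustrative assumptions, not MM-ITF's released code or API.

```python
# Illustrative sketch only: assumes the image is divided into a grid of
# patches and that some upstream model (here, random scores) assigns each
# patch a likelihood of being the pointed-at target.
import numpy as np


def most_likely_patch(patch_scores: np.ndarray) -> tuple[int, int]:
    """Return the (row, col) index of the highest-likelihood patch."""
    flat_idx = int(np.argmax(patch_scores))
    row, col = np.unravel_index(flat_idx, patch_scores.shape)
    return int(row), int(col)


def patch_confusion_matrix(true_patches, pred_patches, num_patches):
    """Accumulate a patch-level confusion matrix (assumed evaluation scheme):
    rows are ground-truth patch indices, columns are predicted patch indices."""
    cm = np.zeros((num_patches, num_patches), dtype=int)
    for true_idx, pred_idx in zip(true_patches, pred_patches):
        cm[true_idx, pred_idx] += 1
    return cm


# Example: a 4x4 grid of per-patch likelihoods stands in for model output.
scores = np.random.rand(4, 4)
print("Predicted target patch:", most_likely_patch(scores))
```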