This paper aims to develop a target assignment method that is semantically explicit, spatially sensitive, domain-independent, and intuitive, for guiding human-agent interactions in 3D environments. In particular, we propose a novel cross-view target alignment framework that allows users to assign target objects via segmentation masks drawn in their own camera views rather than in the agent's observations. We show that when the camera views of the human and the agent differ significantly, replicating actions alone fails to align the agent's behavior with human intentions. To address this, we introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, that explicitly strengthen the agent's spatial reasoning ability. Building on these, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that improves inference efficiency by 3x to 6x over ROCKET-1. ROCKET-2 improves human-agent interaction by interpreting targets directly from the human's camera view. Notably, ROCKET-2 exhibits zero-shot generalization: although trained exclusively on Minecraft data, it adapts to other 3D environments such as Doom, DMLab, and Unreal with only a simple action-space mapping.
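As a minimal sketch of how the two auxiliary objectives might enter training (the loss names, the weights $\lambda_1, \lambda_2$, and the behavior-cloning base term are illustrative assumptions, not the paper's exact formulation), the overall objective can be read as

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{BC}} \;+\; \lambda_1\,\mathcal{L}_{\mathrm{consist}} \;+\; \lambda_2\,\mathcal{L}_{\mathrm{vis}},
\]

where $\mathcal{L}_{\mathrm{consist}}$ would encourage the agent to ground the same target consistently across the human's view and its own, and $\mathcal{L}_{\mathrm{vis}}$ would supervise whether the target is visible in the agent's current observation.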