While imitation learning enables skilled robot behavior, it suffers from low sample efficiency and limited generalization, making long-horizon, multi-object tasks difficult to address. Existing methods require numerous demonstrations to cover possible task variations, which makes them costly and impractical for real-world applications. This study introduces oriented affordance frames, a structured representation of the state and action spaces that improves spatial and category-level generalization and allows policies to be trained efficiently from as few as 10 demonstrations. More importantly, this abstraction enables compositional generalization: independently trained subpolicies can be composed to solve long-horizon, multi-object tasks. To facilitate smooth transitions between subpolicies, we introduce self-progress prediction, with supervision derived directly from the duration of the training demonstrations. Experiments on three real-world tasks involving multi-step, multi-object interactions show that the resulting policies generalize robustly to unseen object appearances, geometric shapes, and spatial arrangements, and achieve high success rates despite the small amount of training data.
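To make the two core ideas concrete, here is a minimal sketch under our own assumptions (all names, interfaces, and the handoff threshold are illustrative, not the paper's implementation): observations and actions are re-expressed relative to an object-anchored, oriented affordance frame, and each timestep of a demonstration is labeled with its normalized progress so a subpolicy can predict how close it is to completion.

```python
import numpy as np

def pose_to_matrix(rotation, position):
    """Pack a 3x3 rotation and 3-vector position into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def to_affordance_frame(T_world_ee, T_world_afford):
    """Re-express the end-effector pose in the object's oriented affordance frame.

    T_world_afford is a pose anchored to the manipulated object (e.g. a
    handle's grasp frame). Training and executing in this relative frame
    decouples the policy from absolute object placement, which is the
    intuition behind the spatial generalization claimed above.
    """
    return np.linalg.inv(T_world_afford) @ T_world_ee

def self_progress_labels(num_steps):
    """Label each timestep of a demonstration with its normalized progress.

    The target at step t is t / (T - 1), so progress runs from 0.0 to 1.0
    over the demonstration's duration; the subpolicy regresses this value.
    """
    return np.linspace(0.0, 1.0, num_steps)

# Execution-time handoff (sketch): switch to the next subpolicy once the
# current one predicts it is essentially done. The 0.95 threshold is an
# assumed value chosen for illustration.
PROGRESS_DONE = 0.95

def run_subpolicies(subpolicies, env, predict_progress):
    """Chain independently trained subpolicies, advancing on self-progress."""
    obs = env.reset()  # hypothetical environment interface
    for policy in subpolicies:
        while predict_progress(policy, obs) < PROGRESS_DONE:
            action = policy.act(obs)  # hypothetical policy interface
            obs = env.step(action)
    return obs
```

Under these assumptions, spatial generalization falls out of the relative transform, and the predicted progress gives each subpolicy a natural handoff signal without a separate task-completion detector.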