Understanding fine-grained object affordances is essential for robots to manipulate objects in unstructured environments. Existing methods for visual affordance prediction either rely on manually annotated data or are restricted to a predefined set of tasks. In response, we present Unsupervised Affordance Distillation (UAD), a method that distills affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotation. Leveraging the complementary strengths of large-scale vision models and vision-language models, UAD automatically annotates a large dataset with pairs of task instructions and corresponding visual affordances. By training only a lightweight task-conditioned decoder atop frozen features, UAD demonstrates remarkable generalization to real-world robotic scenes and diverse human activities, despite being trained only on rendered objects in simulation. Using the affordances provided by UAD as the observation space, we propose an imitation learning policy that shows promising generalization to unseen object instances, object categories, and variations in task instructions, even after training on only 10 demonstrations.
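To make the training setup concrete, the following is a minimal sketch of a lightweight task-conditioned decoder operating on frozen visual features, in the spirit of the setup described above. The feature dimensions, the modulation-based fusion of the instruction embedding, and the loss are illustrative assumptions rather than the paper's actual architecture; the frozen vision backbone and language encoder are stubbed with random tensors.

```python
# Illustrative sketch only; architectural details are assumptions, not UAD's implementation.
import torch
import torch.nn as nn


class TaskConditionedAffordanceDecoder(nn.Module):
    """Lightweight trainable head on top of frozen per-pixel visual features."""

    def __init__(self, visual_dim=768, text_dim=512, hidden_dim=256):
        super().__init__()
        # Project the frozen instruction embedding into the visual feature space.
        self.text_proj = nn.Linear(text_dim, visual_dim)
        # Small per-pixel MLP head: the only trainable component in this sketch.
        self.head = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, visual_feats, text_emb):
        # visual_feats: (B, H, W, visual_dim) frozen per-pixel features
        # text_emb:     (B, text_dim) frozen instruction embedding
        cond = self.text_proj(text_emb)                  # (B, visual_dim)
        fused = visual_feats * cond[:, None, None, :]    # modulate features by the task
        return self.head(fused).squeeze(-1)              # (B, H, W) affordance logits


if __name__ == "__main__":
    B, H, W = 2, 32, 32
    decoder = TaskConditionedAffordanceDecoder()
    # Stand-ins for frozen backbone outputs and instruction embeddings.
    feats = torch.randn(B, H, W, 768)
    instr = torch.randn(B, 512)
    # Stand-in for automatically generated affordance pseudo-labels.
    target = torch.rand(B, H, W)
    logits = decoder(feats, instr)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
    loss.backward()
    print(f"affordance map shape: {tuple(logits.shape)}, loss: {loss.item():.3f}")
```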