This paper proposes a novel approach that combines pixel-level analysis with multiple instance learning to overcome the limitations of existing county-level spatial aggregation methods for predicting US corn yields. Specifically, we apply an attention mechanism that automatically assigns pixel-specific weights, mitigating the noise introduced by mixed pixels, which arise from the resolution mismatch between satellite imagery and crop masks. On five years of data from the US Corn Belt, our approach outperforms four existing machine learning models, achieving a coefficient of determination (R²) of 0.84 and a root mean square error (RMSE) of 0.83 in 2022. We demonstrate the advantages of our approach from both spatial and temporal perspectives, and we verify its ability to remove noise and capture important feature information by analyzing the relationship between mixed pixels and the attention mechanism.
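
To make the idea of pixel-specific attention weights concrete, the following is a minimal sketch of attention-based pooling over one "bag" of pixels (for example, all pixels in one county). It assumes a tanh scoring function followed by a softmax, a form common in the attention-based multiple instance learning literature; the abstract does not specify the architecture actually used, and the names `attention_pool`, `V`, and `w` are hypothetical.

```python
import math

def attention_pool(pixel_features, w, V):
    """Pool per-pixel feature vectors into one bag-level vector.

    Assumed scoring form (not taken from the paper):
        score_k = w . tanh(V h_k),  weight_k = softmax(score)_k
    """
    scores = []
    for h in pixel_features:
        # Project the pixel features and squash with tanh.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))

    # Numerically stable softmax over the pixels in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the pixel feature vectors.
    dim = len(pixel_features[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, pixel_features))
              for j in range(dim)]
    return pooled, weights

# Illustrative values only: two similar pixels and one dissimilar pixel.
bag = [[0.8, 0.6], [0.7, 0.5], [0.1, 0.9]]
V = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical projection parameters
w = [2.0, 0.0]                 # hypothetical scoring vector
pooled, weights = attention_pool(bag, w, V)
```

In a trained model, `V` and `w` would be learned jointly with the yield regressor, so that pixels carrying little crop signal (such as mixed pixels) could receive small weights. Here they are fixed by hand purely for illustration.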