This paper investigates whether pre-trained computer vision models can accurately predict fear levels in spider-related images, providing a foundational study for developing an adaptive computerized exposure therapy system. Using transfer learning, three different models were applied to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. Cross-validation yielded a mean absolute error (MAE) of 10.1–11.0 on that scale. Learning curve analysis showed that reducing the dataset size degraded performance, but that further increasing it did not significantly improve performance. Explainability assessment demonstrated that the models' predictions were based on spider-related features, and category-specific error analysis identified visual conditions associated with high error rates, such as distant views and artificial or painted spiders. This study demonstrates the potential of explainable computer vision models for fear rating prediction and highlights the importance of model explainability and sufficient dataset size for the development of effective emotion-recognition therapy.
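To make the reported metric concrete, the following is a minimal sketch of how a cross-validated MAE on a 0-100 rating scale can be computed. It is not the authors' pipeline: the helper name `k_fold_mae`, the interleaved fold assignment, the choice of five folds, and the example ratings are all assumptions made for illustration, and a simple mean-rating predictor stands in for the fine-tuned vision models.

```python
from statistics import mean

def k_fold_mae(ratings, k=5):
    """Cross-validated MAE of a mean-rating baseline (hypothetical helper)."""
    n = len(ratings)
    fold_errors = []
    for fold in range(k):
        # Every k-th item, offset by the fold index, forms the held-out set.
        test_idx = set(range(fold, n, k))
        train = [r for i, r in enumerate(ratings) if i not in test_idx]
        test = [r for i, r in enumerate(ratings) if i in test_idx]
        # Stand-in for a trained model: always predict the training mean.
        prediction = mean(train)
        # MAE on this fold: average absolute gap between rating and prediction.
        fold_errors.append(mean(abs(r - prediction) for r in test))
    # The cross-validated MAE is the average of the per-fold MAEs.
    return mean(fold_errors)

# Made-up ratings on the 0-100 scale, for illustration only.
example_ratings = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
baseline_mae = k_fold_mae(example_ratings, k=5)
```

A baseline of this kind gives a reference point for reading an MAE value: a model is only informative to the extent that its cross-validated error falls below what predicting the average rating would achieve on the same folds.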