This paper compares 25 pre-trained neural network models widely used in chemistry and small-molecule drug design across 25 datasets. Models spanning various modalities, architectures, and pre-training strategies were evaluated within a fair comparison framework. Analysis with a hierarchical Bayesian statistical testing model showed that nearly all of the neural models yielded no significant improvement over the baseline ECFP molecular fingerprint. Only CLAMP, a model that is itself based on molecular fingerprints, performed statistically significantly better than the other models. These results raise concerns about the evaluation rigor of previous studies; we discuss potential causes, solutions, and practical recommendations.