This paper evaluates the performance of small language models (SLMs) and pre-trained large language models (LLMs) for automatically identifying inappropriate use of language (IUL) in medical education materials. Using a dataset of approximately 500 documents (over 12,000 pages), we compared several SLM configurations, including a general IUL classifier, subcategory-specific binary classifiers, a multi-label classifier, and a hierarchical pipeline, against LLMs (Llama-3 8B and 70B) with several prompt variations. The results show that the SLMs significantly outperformed the LLMs, even when the latter were prompted with carefully constructed few-shot examples; in particular, the subcategory-specific binary classifiers, trained with negative examples drawn from sections devoid of inappropriate language, performed best.
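The hierarchical pipeline mentioned above can be sketched as a two-stage gate: a general IUL detector first screens the text, and only flagged passages are routed to subcategory-specific binary classifiers. This is a minimal illustrative sketch, not the paper's implementation; the keyword-based stand-in classifiers and the subcategory names are hypothetical placeholders for trained SLMs.

```python
def hierarchical_classify(text, general_clf, subcategory_clfs):
    """Return the list of IUL subcategories flagged for `text`.

    general_clf: callable text -> bool (any inappropriate language present?)
    subcategory_clfs: dict mapping subcategory name -> binary classifier
                      (callable text -> bool).
    """
    # Stage 1: the general classifier gates the pipeline; clean text
    # skips the subcategory classifiers entirely.
    if not general_clf(text):
        return []
    # Stage 2: run each subcategory-specific binary classifier.
    return [name for name, clf in subcategory_clfs.items() if clf(text)]


# Toy keyword heuristics standing in for trained SLMs (purely illustrative).
general = lambda t: any(w in t.lower() for w in ("crazy", "addict"))
subcategories = {
    "stigmatizing": lambda t: "addict" in t.lower(),
    "derogatory": lambda t: "crazy" in t.lower(),
}

print(hierarchical_classify("The patient is an addict.", general, subcategories))
# -> ['stigmatizing']
print(hierarchical_classify("The patient reports mild pain.", general, subcategories))
# -> []
```

The gating design trades a small risk of stage-1 false negatives for far fewer subcategory inference calls on the bulk of clean text.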