This paper evaluates the performance of Language Model (LM) re-rankers used to improve retrieval results in Retrieval-Augmented Generation (RAG). We compare six LM re-rankers against a BM25 baseline on three datasets: NQ, LitQA2, and DRUID. The experimental results show that the LM re-rankers fail to outperform the BM25 baseline on the DRUID dataset, and we introduce a novel separation metric based on BM25 scores to identify and explain re-ranker errors caused by a lack of lexical similarity. We also investigate several methods for improving LM re-ranker performance, but find that they are effective mainly on the NQ dataset. In conclusion, this study reveals weaknesses of LM re-rankers and highlights the need for evaluation on more adversarial and realistic datasets.