This paper evaluates physics expert recommendation with six open-weight large language models (LLMs): llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, and llama3.1-70b. We examine consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity across five tasks: top-k experts by field, influential scientists by discipline, by era, by seniority, and scholar correspondence. We construct academic benchmarks from real-world data from the American Physical Society (APS) and OpenAlex and compare model outputs against these academic records. Our analysis reveals inconsistencies and biases in all models: mixtral-8x7b produces the most stable output, while llama3.1-70b exhibits the highest variability. Several models return redundant recommendations, and gemma2-9b and llama3.1-8b in particular show a high rate of formatting errors. While the LLMs generally recommend real scientists, their accuracy drops sharply when queried by field, era, or seniority, and they favor senior scholars. Representation biases persist, including a male-dominated gender imbalance, underrepresentation of Asian scientists, and overrepresentation of White scholars. Although recommended scholars span diverse institutions and collaboration networks, the models favor highly cited and productive researchers, reinforcing the rich-get-richer effect while offering limited geographic representation.
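For concreteness, below is a minimal sketch of the kind of factuality check described above, assuming model recommendations are verified against OpenAlex's public author-search endpoint; the `recommended_names` list is a hypothetical stand-in for model output and is not part of the paper's actual pipeline.

```python
import requests

# Hypothetical stand-in for names an LLM might return when asked for
# "top experts in condensed matter physics"; not actual model output.
recommended_names = ["Philip W. Anderson", "A. Made-Up Name"]

for name in recommended_names:
    # Query the OpenAlex author search; results are ranked by relevance.
    resp = requests.get(
        "https://api.openalex.org/authors",
        params={"search": name, "per-page": 1},
        timeout=30,
    )
    results = resp.json().get("results", [])
    if results:
        author = results[0]
        print(f"{name}: matched '{author['display_name']}' "
              f"({author.get('cited_by_count', 0)} citations)")
    else:
        print(f"{name}: no OpenAlex match (possible hallucination)")
```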