This paper systematically evaluates nine existing large language models (LLMs) on 5,000 claims fact-checked by 174 expert fact-checking organizations in 47 languages. The models span a range of categories: open and closed source, varying sizes and architectures, and reasoning-based. To test generalization, we use four prompting strategies that reflect how both citizens and expert fact-checkers interact with the models, together with claims that postdate the models' training data. Drawing on over 240,000 human annotations, we identify a phenomenon akin to the Dunning-Kruger effect: smaller models exhibit high confidence despite lower accuracy, whereas larger models are more accurate but less confident. This poses a risk of systematic bias in information verification, especially when smaller models are deployed by under-resourced organizations. The performance gap is most pronounced for claims in languages other than English and claims from the Global South, potentially exacerbating existing information inequalities. These findings establish a multilingual benchmark for future research and provide a policy rationale for ensuring equitable access to reliable AI-assisted fact-checking.