This paper addresses the problem of "LLM hacking," which arises when large language models (LLMs) are used for data annotation and text analysis in social science research. Annotation results can vary substantially depending on the researcher's implementation choices, such as model selection, prompting strategy, and temperature settings; this introduces both systematic bias and random error, producing Type I, Type II, Sign (S), and Magnitude (M) errors. To measure how strongly these choices affect statistical conclusions, the researchers replicated 37 data annotation tasks from 21 published social science studies using 18 different models, analyzed 13 million LLM labels, and tested 2,361 hypotheses. The results show that conclusions drawn from LLM-annotated data are incorrect for roughly one in three hypotheses with state-of-the-art models, and for roughly half with small language models. Higher task performance and stronger general model capabilities reduce, but do not eliminate, the risk of LLM hacking, and the risk decreases as effect sizes grow. The authors further demonstrate that intentional LLM hacking is strikingly easy: with just a few LLMs and a few prompt paraphrases, almost any result can be presented as statistically significant. The paper concludes by emphasizing the importance of human annotation and careful model selection for minimizing errors in social science research that relies on LLMs.
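
To make the selective-reporting mechanism concrete, below is a minimal, self-contained sketch of how sweeping a handful of (model, prompt) annotation configurations and keeping the first one that clears p < 0.05 can turn a null effect into a "significant" finding. The model and prompt names, the simulated `annotate` helper, and all numbers are illustrative assumptions, not the paper's actual pipeline.

```python
"""Illustrative sketch of 'LLM hacking' via selective reporting.

Hypothetical setup: the two groups have identical true label rates, but each
annotation configuration introduces its own small labeling error. Sweeping
configurations and reporting only the one that reaches p < 0.05 can make a
null effect look statistically significant.
"""
import itertools
import math
import random

random.seed(0)

MODELS = ["model-A", "model-B", "model-C"]          # hypothetical annotators
PROMPTS = ["prompt-v1", "prompt-v2", "prompt-v3"]   # hypothetical paraphrases
N = 200                                             # texts per group


def annotate(model: str, prompt: str, n: int) -> list[int]:
    """Stand-in for an LLM annotation run: returns binary labels with a small
    random bias drawn per call, so labeling errors can differ between groups
    even though the underlying true rate (0.5) is identical."""
    bias = random.uniform(-0.05, 0.05)
    return [1 if random.random() < 0.5 + bias else 0 for _ in range(n)]


def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))


# Sweep configurations and stop at the first "significant" one.
for model, prompt in itertools.product(MODELS, PROMPTS):
    labels_a = annotate(model, prompt, N)           # group A annotations
    labels_b = annotate(model, prompt, N)           # group B annotations
    p = two_proportion_p(sum(labels_a), N, sum(labels_b), N)
    print(f"{model} + {prompt}: p = {p:.3f}")
    if p < 0.05:
        print("Selectively reporting this configuration -> spurious 'finding'.")
        break
```

In practice the same search would run over real LLM annotations rather than a simulation; the random per-configuration bias here only stands in for the configuration-dependent labeling errors the paper documents.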