While large language models (LLMs) enable the automation of social science research, their outputs can vary substantially depending on researcher choices (e.g., model selection, prompt strategy). This variability can distort downstream analyses by introducing systematic bias and random error, leading to Type I, Type II, Type S (sign), and Type M (magnitude) errors; this phenomenon is referred to as LLM hacking. Intentional LLM hacking is straightforward: a replication of 37 data annotation tasks shows that statistically significant results can be produced simply by modifying prompts. Moreover, an analysis of 13 million labels from 18 LLMs across 2,361 realistic hypotheses reveals a high risk of inadvertent LLM hacking even when standard research practices are followed. State-of-the-art LLMs yield incorrect conclusions for approximately 31% of hypotheses, and smaller language models do so for roughly half. Although the risk of LLM hacking decreases as effect sizes increase, human annotation remains critical for preventing false positives. Practical recommendations for mitigating LLM hacking are presented.
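To illustrate the mechanism behind LLM hacking, the following minimal sketch (not taken from the paper) simulates how the choice of annotator configuration can change the verdict of a downstream hypothesis test on the same texts. The model names, prompt labels, error rates, and the two-group comparison are all hypothetical assumptions: each (model, prompt) pair is reduced to a single label-flip rate, standing in for the systematic differences between real LLM setups.

```python
# Minimal sketch (hypothetical): how annotator configuration (model x prompt)
# changes a downstream hypothesis test run on the same underlying documents.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 500                                    # number of documents
group = rng.integers(0, 2, size=n)         # hypothetical binary covariate
# Weak true effect of the covariate on the latent construct being annotated.
true_label = (0.3 * group + rng.normal(0, 1, n)) > 0

# Hypothetical annotator configurations: each (model, prompt) pair is reduced
# to a single label-flip rate standing in for its annotation error.
configs = {
    ("model-small", "prompt_a"): 0.30,
    ("model-small", "prompt_b"): 0.20,
    ("model-large", "prompt_a"): 0.10,
    ("model-large", "prompt_b"): 0.05,
}

for (model, prompt), error_rate in configs.items():
    flip = rng.random(n) < error_rate
    llm_label = np.where(flip, ~true_label, true_label)
    # Test H0: equal positive-label rates across groups (chi-squared test on
    # the 2x2 table); the conclusion can differ purely by annotator choice.
    table = np.array([
        [llm_label[group == g].sum(), (~llm_label[group == g]).sum()]
        for g in (0, 1)
    ])
    chi2, p, _, _ = stats.chi2_contingency(table)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{model:12s} {prompt}: p = {p:.3f} -> {verdict}")
```

Because noisier configurations attenuate the measured effect while cleaner ones preserve it, the same hypothesis can appear significant under one (model, prompt) choice and non-significant under another, which is the variability the abstract describes.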