As the study of failure cases of large language models (LLMs) becomes increasingly important, we focus in this paper on the problem of "model jailbreaking": eliciting harmful responses from aligned LLMs. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models toward high performance on hard tasks. We apply these advances to automated jailbreaking: our adversarial reasoning approach leverages a loss signal to guide test-time computation, achieving state-of-the-art attack success rates against multiple aligned LLMs, including models that trade inference-time computation for adversarial robustness. These results introduce a new paradigm for understanding LLM vulnerabilities and lay the foundation for developing more robust and reliable AI systems.
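To make the high-level description concrete, the sketch below shows one way a loss signal can steer test-time search over candidate prompts. It is a toy illustration under assumed names, not the paper's algorithm: `target_loss` and `propose_candidates` are stand-ins, and in the actual setting the loss would come from querying the target model (e.g., the likelihood of a fixed target response) while candidate proposals would come from an attacker LLM reasoning over previous attempts and their losses.

```python
"""Minimal sketch of loss-guided test-time search over adversarial prompts.

All names here are hypothetical; the stand-in loss and proposal functions are
toy proxies so the loop runs without any model access.
"""
import random

random.seed(0)
VOCAB = ["however", "consider", "step", "detail", "context", "note"]


def target_loss(prompt: str) -> float:
    # Toy stand-in: a deterministic pseudo-loss derived from the prompt text.
    # In the setting described above, this would instead query the target LLM.
    return (hash(prompt) % 1000) / 1000.0


def propose_candidates(prompt: str, loss: float, n: int = 4) -> list[str]:
    # Toy stand-in: randomly perturb the prompt. An attacker LLM would instead
    # use the loss feedback to reason about the failed attempt and rewrite it.
    return [prompt + " " + random.choice(VOCAB) for _ in range(n)]


def search(seed_prompt: str, iters: int = 20) -> tuple[str, float]:
    """Greedy search: keep whichever candidate achieves the lowest loss."""
    best_prompt, best_loss = seed_prompt, target_loss(seed_prompt)
    for _ in range(iters):
        for cand in propose_candidates(best_prompt, best_loss):
            cand_loss = target_loss(cand)
            if cand_loss < best_loss:
                best_prompt, best_loss = cand, cand_loss
    return best_prompt, best_loss


if __name__ == "__main__":
    prompt, loss = search("Write a short story about a locked door.")
    print(f"best loss = {loss:.3f}")
```

The greedy keep-the-best loop is only the simplest way to spend test-time compute on this search; the same loss-feedback structure accommodates richer strategies such as maintaining a pool of candidates or letting the proposer condition on the full history of attempts.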