As LLMs gain societal importance, concerns about their inherent biases have grown. This study proposes a scalable benchmarking framework to assess the robustness of LLMs against adversarial bias induction. We systematically probe models across multiple tasks targeting diverse sociocultural biases, quantify robustness with an LLM-as-a-Judge approach, and apply jailbreaking techniques to expose security vulnerabilities. We release CLEAR-Bias, a curated dataset of bias-related prompts, and identify DeepSeek V3 as the most reliable judge LLM. Our findings show that models are most vulnerable to age, disability, and intersectional (cross-) bias, that some smaller models are more robust than larger ones, and that jailbreaking attacks succeed against every model evaluated.