The safety of large language models (LLMs) is one of the most pressing challenges for their widespread deployment. While prior research has largely focused on general harmfulness, enterprises have a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this gap, we define "operational safety" as the ability of an LLM to appropriately accept or reject user queries when tasked with a specific purpose, and propose "OffTopicEval," an evaluation suite and benchmark for measuring operational safety in both general and specific agent use cases. Evaluations of 20 open-weight LLMs from six model families reveal that none maintains a high level of operational safety. To mitigate this failure, we propose two prompt-based steering methods, query-based (Q-ground) and system-prompt-based (P-ground) steering, which substantially improve rejection of out-of-distribution (OOD) queries.
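For illustration, the following is a minimal sketch of what prompt-based steering of this kind could look like in practice; the function names, guardrail wording, and chat-message format are assumptions for exposition, not the paper's implementation.

# Hypothetical sketch of prompt-based steering for operational safety.
# Names (p_ground, q_ground) and the guardrail text are illustrative only.

def p_ground(system_prompt: str, agent_purpose: str) -> str:
    """P-ground: strengthen the system prompt with an explicit operational-scope guardrail."""
    guardrail = (
        f"You are an agent built exclusively for: {agent_purpose}. "
        "If a user request falls outside this purpose, politely refuse and "
        "redirect the user instead of answering."
    )
    return f"{system_prompt.strip()}\n\n{guardrail}"

def q_ground(user_query: str, agent_purpose: str) -> str:
    """Q-ground: reframe the user query so the model first judges whether it is in scope."""
    return (
        f"Agent purpose: {agent_purpose}\n"
        f"User query: {user_query}\n"
        "First decide whether the query is within the agent's purpose. "
        "If it is not, refuse; otherwise, answer it."
    )

if __name__ == "__main__":
    purpose = "answering questions about a bank's mortgage products"
    messages = [
        {"role": "system", "content": p_ground("You are a helpful assistant.", purpose)},
        {"role": "user", "content": q_ground("How do I file my taxes?", purpose)},
    ]
    print(messages)  # pass to any chat-completion API of choice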