This study is the first to comprehensively evaluate the performance of large language models (LLMs) across three counseling roles in a Japanese therapy setting. We simultaneously evaluated counselor AIs (GPT-4-turbo and Claude-3-Opus, each using either zero-shot prompting or structured multi-step dialogue prompting (SMDP)), client AI simulations, and evaluator AIs (o3, Claude-3.7-Sonnet, and Gemini-2.5-Pro). Human experts (n=15) evaluated the AI-generated conversations using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. SMDP significantly improved counselor AI performance on all MITI global ratings compared with zero-shot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. The evaluator AIs matched human raters on Cultivating Change Talk but systematically overestimated Softening Sustain Talk and overall quality metrics, and they exhibited model-specific biases: Gemini emphasized power sharing, o3 technical proficiency, and Sonnet emotional expression. The client AI simulations showed a limited emotional range and unusually high compliance, indicating a need for greater realism. These findings establish a benchmark for AI-assisted counseling in a non-English language and identify priority areas for improvement, including advanced prompt engineering, retrieval-augmented generation, and goal-oriented fine-tuning, with implications for developing culturally sensitive AI mental health tools.