This study systematically evaluated the ethical decision-making ability and potential bias of large-scale language models (LLMs) using two models, GPT-3.5 Turbo and Claude 3.5 Sonnet, to assess their responses to ethical dilemmas. We analyzed the ethical preferences, sensitivity, stability, and clustering of preferences of the models across 11,200 experiments that included multiple protected attributes, including age, gender, race, appearance, and disability status. The results revealed consistent preferences for certain attributes (e.g., “good-looking”) and systematic disregard for other attributes in both models. GPT-3.5 Turbo showed strong preferences consistent with existing power structures, while Claude 3.5 Sonnet showed a wider range of protected attribute choices. Furthermore, we found that ethical sensitivity decreased significantly in more complex scenarios involving multiple protected attributes. We found that linguistic references significantly influenced the models’ ethical evaluations, as evidenced by their different responses to racial descriptors (“Yellow” vs. “Asian”). This study highlights important concerns about the potential impact of LLM bias in autonomous decision-making systems and emphasizes the need to carefully consider protective properties in AI development.