In this paper, we present the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework for assessing the new risks that arise as AI systems composed of multiple interacting agents become more widespread. Using MAEBE together with the Greatest Good Benchmark and a novel double-reversed questioning technique, we show that (1) LLM moral preferences, especially preferences for instrumental harm, are fragile and shift significantly with question framing, both in single agents and in ensembles; (2) the moral reasoning of LLM ensembles cannot be predicted from the behavior of isolated agents alone, owing to emergent group dynamics; and (3) ensembles converge, driven in part by phenomena such as peer pressure, even under supervisory guidance. These findings underscore the need to evaluate AI systems in interacting, multi-agent settings.