This paper addresses the problem of distinguishing between human-written and LLM-generated texts, motivated by the risks associated with the misuse of large language models (LLMs). To this end, we investigate the detection and explanation capabilities of current LLMs in two settings: binary classification (human vs. LLM-generated) and ternary classification (adding an undetermined class). We evaluate six open-source and closed-source LLMs of varying sizes and find that self-detection, in which LLMs identify their own outputs, consistently outperforms cross-detection, in which LLMs identify the outputs of other LLMs, although performance remains suboptimal in both settings. Introducing the ternary classification framework improves detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses on human-annotated datasets, we identify the major explanation failures: reliance on incorrect features, hallucinations, and faulty reasoning. These findings highlight the limitations of current LLMs in self-detection and self-explanation, and underscore the need for further research to address overfitting and improve generalization.