This paper evaluates the performance of large language models (LLMs) on argument mining (AM) tasks across several datasets, including Args.me and UKP. We compare and analyze multiple LLMs, such as GPT, Llama, and DeepSeek, along with reasoning-enhanced variants that use Chain-of-Thought prompting. ChatGPT-4o achieved the best performance on general argument classification benchmarks, while DeepSeek-R1 performed best among the models with added reasoning capabilities. However, even the best-performing models made errors; we analyze the types of these errors and suggest directions for future improvement. In addition, we point out the limitations of the existing prompting algorithms and present an in-depth analysis of the shortcomings of the argument datasets used. To the best of our knowledge, this study is the first extensive analysis of the Args.me and UKP datasets using LLMs and prompting algorithms.