Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance

Created by
  • Haebom

Author

Wael Etaiwi, Bushra Alhijawi

Outline

In this paper, we evaluate the performance of ChatGPT and DeepSeek, two large language models (LLMs), across five major natural language processing (NLP) tasks: sentiment analysis, topic classification, text summarization, machine translation, and text entailment. To ensure fairness and minimize variability, we follow a structured experimental protocol: both models receive the same neutral prompts and are evaluated on two benchmark datasets per task (covering news, reviews, and formal/informal texts). Our experiments show that DeepSeek is stronger in classification consistency and logical reasoning, while ChatGPT is stronger in tasks requiring nuanced understanding and flexibility.
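The protocol described above — issuing the same neutral prompt to both models for each task and example — can be sketched as follows. This is a minimal illustration, not the paper's actual code: `query_model` is a hypothetical placeholder for a real LLM API call, and the prompt templates are illustrative.

```python
# Sketch of the evaluation protocol: each task uses one neutral prompt
# template, applied identically to both models on the same examples.
# `query_model` is a hypothetical stand-in for a real LLM API call.

TASKS = {
    "sentiment_analysis": "Classify the sentiment of the following text as positive or negative:\n{text}",
    "topic_classification": "Assign a topic label to the following text:\n{text}",
    "summarization": "Summarize the following text in one sentence:\n{text}",
    "machine_translation": "Translate the following text into English:\n{text}",
    "text_entailment": "Does the premise entail the hypothesis? Answer yes or no:\n{text}",
}

def query_model(model_name: str, prompt: str) -> str:
    # Placeholder: a real implementation would call the model's API here.
    return f"[{model_name} response to a {len(prompt)}-char prompt]"

def evaluate(models, tasks, datasets):
    """Run every model on the identical prompt for every (task, example) pair."""
    results = {m: {} for m in models}
    for task, template in tasks.items():
        for example in datasets.get(task, []):
            prompt = template.format(text=example)  # same prompt for all models
            for m in models:
                results[m].setdefault(task, []).append(query_model(m, prompt))
    return results

datasets = {"sentiment_analysis": ["Great product!", "Terrible service."]}
results = evaluate(["ChatGPT", "DeepSeek"], TASKS, datasets)
```

Holding the prompt fixed across models isolates model behavior from prompt wording, which is the fairness property the paper's protocol aims for.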

Takeaways, Limitations

Takeaways:
Provides insight into choosing the right LLM for a specific NLP task.
Clearly presents the strengths and weaknesses of ChatGPT and DeepSeek.
Offers a comparative analysis of LLM performance across various NLP tasks.
Emphasizes the importance of a structured experimental protocol (fairness and minimized variability).
Limitations:
The evaluation covers only ChatGPT and DeepSeek; studies including a more diverse set of LLMs are needed.
Only a limited number of NLP tasks were evaluated; a broader range of tasks should be assessed.
The generalizability of the benchmark datasets used requires further review.
The impact of prompt engineering is not considered.