Research on Automating Prompt Evaluation - Prompt Project

KLEAR Textbook Activity Tool Development Project

The Prompt Company has been building AI Activity tools and chatbots for Korean language education since January 2024. The project is a collaboration with the US publisher UH Press and the Korean program of EALL at the University of Hawaii at Manoa. In the spring 2024 semester we completed a pilot test of the tools and chatbots with students taking beginner Korean at the University of Hawaii. In this first phase of the project we built 48 AI tools and chatbots. The outcome was that prompts alone can compensate for the limitations of LLMs and provide meaningful learning tools matched to each student's level. This post presents the study: an introduction to the first phase, the project background, how the project was carried out, and the survey. The study was accepted at the 29th AATK conference, held in 2024 at Indiana University in Indiana, USA.

Abstract

Enhancing Korean Language Learning through AI and Chatbots: An In-Depth Study of Efficacy, User Experiences, and Challenges

The incorporation of Artificial Intelligence (AI) has become widespread in the field of language learning and teaching over the past decade. One form of AI, the chatbot, has gained popularity in language education for its ability to facilitate student learning in various aspects. This study focuses on evaluating the impact of these technologies on the Korean language learning experience, particularly at the beginner level. Its aim is to explore the effectiveness of generative AI in facilitating language acquisition, with a special emphasis on learners' engagement and perceptions. Empirical studies by Zhang and Aslan (2021), Schmidt and Strasser (2022), Jeon (2021), and Aihua (2021), assert the potential benefits of AI chatbots in language learning. Language learning applications like Duolingo and Babbel utilize AI to provide personalized feedback, serving as valuable adjunct tools in the language learning experience. However, applying generative AIs to language learning presents unique challenges due to their peculiar characteristics. Therefore, this study developed AI tools and chatbots utilizing the advanced capabilities of ChatGPT-3.5 and ChatGPT-4 models, tailoring them specifically for Korean language learning contexts. The current study involves the development of 48 unique AI tools and chatbots using advanced prompt engineering techniques.
Designed to reinforce key language elements such as vocabulary, sentence structure, and verb conjugations, these chatbots were programmed to simulate real-world conversations, mirroring the scenarios presented in each textbook lesson. This functionality provides real-time feedback and correction on grammatical errors and vocabulary misuse. The primary aim is to create an immersive learning environment where learners can apply and test their language skills in practical, conversational contexts. Using the pilot test of these chatbots, a detailed user survey was conducted with 50 learners at a university in the US who had prior experience in learning Korean. This survey sought to understand various aspects of their learning process, including their firsthand experiences after using the specifically chosen AI tools and chatbots for this survey. According to the survey results, 70% of participants affirmed the utility of AI supplementary tools in their language progress. The conjugation tool, in particular, received high praise for its effectiveness in practicing grammar and vocabulary in a conversational setting. Notably, more than half of the participants acknowledged that AI tools positively influenced their understanding of Korean, especially in recognizing the subtle differences between Korean and English and appreciating the intricate details of language learning. However, the study also highlighted significant limitations and challenges. A notable issue was the occurrence of AI-generated inaccuracies or 'hallucinations', observed in about 20% of the cases. Participants reported instances where the AI provided incorrect sentence structures or tense forms, underlining its imperfection as a learning aid. The AI tools also struggled with nuances, specific contexts, and appropriate situational responses. 
The findings suggest that while AI can be a powerful aid in language learning, educators and developers must address its current limitations and ensure learners are well-informed about these technologies.

Results and Insights:

Positive Feedback: 70% of participants affirmed the utility of AI supplementary tools in their language learning progress. The conjugation tool was particularly praised for its effectiveness in practicing grammar and vocabulary in conversational settings.

Improved Understanding: More than half of the participants acknowledged that AI tools positively influenced their understanding of Korean. They noted an enhanced ability to recognize subtle differences between Korean and English and appreciated the intricate details of language learning facilitated by these tools.

Sujin_Kang

2024/07/19 6:31 PM

Research on Automating Prompt Evaluation

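Conversation data labeling by human annotators

There are plenty of quantitative benchmarks for feeding an input (a prompt) to an LLM and evaluating its answer. New metrics of this kind keep pouring out on arXiv. Most of them, however, focus on how accurately a language model answers questions that have a "correct answer": math, arithmetic, general-knowledge problems and the like. User prompts, by contrast, often have no single answer, so a "qualitative" approach is needed. That is difficult, because you have to judge which model's answer counts as good, why it is good, and what the criteria are. As a result, trustworthy qualitative metrics are hard to find.

✅ Research on "qualitative" metrics
I am in the middle of research on automating prompt evaluation. I think the answer lies in how satisfied or dissatisfied the end user is with the language model's output. Because generative AI is a conversational interface, the turn structure reveals a lot.

✅ Conversation analysis: preferred and dispreferred organization
When users like the model's answer, their next turn shows a preferred structure. When they do not, it clearly shows a dispreferred organization, and in explicit language. A metric can therefore be built by looking in the conversation for what caused the satisfaction or dissatisfaction.

✅ LLMs vs. humans as evaluators of prompt responses
This is the stage where each prompt and its output are scored against the metrics. For example, with a dataset of 100 conversations, you set 10 metrics and have both an LLM and humans do the scoring. I have an experience from this stage that I want to share. An LLM stays highly self-consistent no matter how many times it evaluates, and it is far faster than a person. A person's scores, however, are very inconsistent between a first and a second pass. Early on, when I was building the automated prompt metrics, I believed that humans were unconditionally better than LLMs, so I asked four friends to do the evaluation. The workload they were given was:

✔ 900 turns (single-turn and multi-turn combined, about 17,000 items) × 3 LLMs = 51,000 sentences 😢

Three of the four gave up partway through, and only one completed 50% of the total, with dismal results. I suspect the scores were assigned more or less at random... As the graph shows, consistency is low among the human raters, and the human results do not match the model's evaluations either. In the graph, the darker the color, the lower the consistency.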
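The consistency comparison described above can be made concrete with an agreement statistic. The sketch below is only an illustration, not the analysis behind the graph. It assumes each answer receives an integer score from 1 to 5 on a single metric, and it uses Cohen's kappa as the agreement measure, since the post does not say which measure was used. The helper names `percent_agreement` and `cohen_kappa` are hypothetical, and all scores are made-up sample data.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of items that received the same score in both lists."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both lists pick the same label
    # if each followed its own label distribution independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores for the same eight answers on one metric
# (assumed sample data, not results from the study).
llm_pass_1 = [5, 4, 4, 2, 1, 3, 5, 2]
llm_pass_2 = [5, 4, 4, 2, 1, 3, 5, 2]
human_pass_1 = [5, 3, 4, 2, 2, 3, 4, 1]
human_pass_2 = [4, 4, 2, 3, 2, 5, 4, 2]

# Self-consistency: the same rater scoring the same answers twice.
llm_self = cohen_kappa(llm_pass_1, llm_pass_2)
human_self = cohen_kappa(human_pass_1, human_pass_2)
# Agreement between the LLM and the human rater on their first passes.
llm_vs_human = cohen_kappa(llm_pass_1, human_pass_1)

print(f"LLM self-consistency:   {llm_self:.2f}")
print(f"Human self-consistency: {human_self:.2f}")
print(f"LLM vs. human:          {llm_vs_human:.2f}")
```

With this sample data, the LLM's two passes agree perfectly, while the human rater's two passes agree barely better than chance. On a real dataset, the same two functions could be run per metric and per rater pair to fill in an agreement matrix.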

Sujin_Kang

2024/07/19 5:48 PM