Prompt Research
Prompting tips for Llama 2&3 Models
Sujin_Kang
Open-source Prompt Engineering with Llama-2
A free one-hour course offered by DeepLearning.AI.
In it, Amit Sangani, Senior Director of Partner Engineering at Meta, explains prompt engineering techniques for working with the Llama 2 & 3 models.
Prompting
The words you choose when you prompt the model affect how it responds.
Prompt engineering is the science and the art of communicating with a large language model so that it responds or behaves in a way that's useful for you.
Below is a summary of the key points from the Prompt Engineering part of the course.
Prompting Llama models
[INST] Write a birthday card message [/INST]
- instruction tags
### base model
prompt = "What is the capital of France?"
response = llama(prompt,
                 verbose=True,
                 add_inst=False,
                 model="togethercomputer/llama-2-7b")

### chat model (verbose output)
# prompt: [INST] What is the capital of France? [/INST]
# model: togethercomputer/llama-2-7b-chat
print(response)
# The capital of France is Paris.
! For prompts that should not be wrapped in [INST] and [/INST], pass add_inst=False.
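The instruction-tag wrapping can be sketched as plain string formatting. This is a minimal sketch: the course's `llama` helper does this internally, and `build_prompt` here is an illustrative name, not part of the course code.

```python
def build_prompt(user_text: str, add_inst: bool = True) -> str:
    """Wrap user text in Llama 2 chat instruction tags.

    With add_inst=False (e.g., for base, non-chat models),
    the text is passed to the model unmodified.
    """
    if add_inst:
        return f"[INST] {user_text} [/INST]"
    return user_text

# Chat model: wrapped in instruction tags
print(build_prompt("What is the capital of France?"))
# Base model: raw prompt, no tags
print(build_prompt("What is the capital of France?", add_inst=False))
```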
Basic prompt engineering tips
1. Providing examples of the task you are trying to carry out
2. Specifying how to format responses
3. Requesting that the model assume a particular "role or persona" when creating its response
4. Including additional information or data for the model to use in its response
In-context learning
An LLM infers the task to perform from the examples in the prompt. As in the example below, you don't have to ask for the sentiment in a full sentence; the label Sentiment: alone is enough for the model to carry out the task.
prompt = f"""
Message: Hi Amit, I loved my birthday card!
Sentiment:
"""
# response:
# Sentiment: Positive
Zero-shot Prompting
You only provide the structure to the model, without any examples of the completed task.
Few-shot Prompting
(1) In addition to giving the model the structure, you provide two or more completed examples.
structure: the Message: / Sentiment: format
(2) Specify the output format, e.g., "Give a one word response."
prompt = """
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative

Message: Can't wait to order pizza for dinner tonight
Sentiment: Positive

Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: ?

Give a one word response.
"""
response = llama(prompt)
print(response)
! With smaller models, the prompt needs instructions the model can easily follow, and it helps to constrain the output format to a narrower range.
e.g., instead of "Give a one word response," say "Answer with Positive, Negative, or Neutral."
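The few-shot setup above can also be built programmatically. A minimal sketch: the `build_few_shot` helper and its label list are illustrative assumptions, not course code.

```python
def build_few_shot(examples, query,
                   labels=("Positive", "Negative", "Neutral")):
    """Build a few-shot sentiment prompt with a constrained label set."""
    parts = []
    for message, sentiment in examples:
        parts.append(f"Message: {message}\nSentiment: {sentiment}")
    parts.append(f"Message: {query}\nSentiment: ?")
    # Constrain the output format, which helps smaller models
    parts.append(f"Respond with one of: {', '.join(labels)}.")
    return "\n\n".join(parts)

examples = [
    ("Hi Dad, you're 20 minutes late to my piano recital!", "Negative"),
    ("Can't wait to order pizza for dinner tonight", "Positive"),
]
prompt = build_few_shot(
    examples, "Hi Amit, thanks for the thoughtful birthday card!")
print(prompt)
```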
Role Prompting
A role describes to the LLM what kind of answer you want.
Llama 2 often gives more consistent responses when it is given a role.
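A role prompt simply prepends a persona description to the question. A hedged sketch; the role text and question below are illustrative, not from the course.

```python
# Illustrative role/persona prepended to the actual question
role = "You are an expert at writing concise, friendly messages."
question = "Write a birthday card message for my friend Amit."

prompt = f"""
{role}

{question}
"""
print(prompt)
```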
Summarization
Insert the body of the email as {email}, ask the model to summarize it, then print the result.
prompt = f"""
Summarize this email and extract some key points.
What did the author say about llama models?:

email: {email}
"""
response = llama(prompt)
print(response)
Providing New Information in the Prompt
The model only knows what it saw during training; it does not know about events after its training cutoff date.
When asking about recent events, you can supply the relevant information in the prompt to get a correct answer. (The context below was copied and pasted from Wikipedia.)
The context is inserted as context: {context}.
prompt = """
Who won the 2023 Women's World Cup?
"""

context = """
The 2023 FIFA Women's World Cup (Māori: Ipu Wahine o te Ao FIFA i 2023)[1] was the ninth edition of the FIFA Women's World Cup, the quadrennial international women's football championship contested by women's national teams and organised by FIFA. The tournament, which took place from 20 July to 20 August 2023, was jointly hosted by Australia and New Zealand.[2][3][4] It was the first FIFA Women's World Cup with more than one host nation, as well as the first World Cup to be held across multiple confederations, as Australia is in the Asian confederation, while New Zealand is in the Oceanian confederation. It was also the first Women's World Cup to be held in the Southern Hemisphere.[5]
This tournament was the first to feature an expanded format of 32 teams from the previous 24, replicating the format used for the men's World Cup from 1998 to 2022.[2] The opening match was won by co-host New Zealand, beating Norway at Eden Park in Auckland on 20 July 2023 and achieving their first Women's World Cup victory.[6]
Spain were crowned champions after defeating reigning European champions England 1–0 in the final. It was the first time a European nation had won the Women's World Cup since 2007 and Spain's first title, although their victory was marred by the Rubiales affair.[7][8][9] Spain became the second nation to win both the women's and men's World Cup since Germany in the 2003 edition.[10] In addition, they became the first nation to concurrently hold the FIFA women's U-17, U-20, and senior World Cups.[11] Sweden would claim their fourth bronze medal at the Women's World Cup while co-host Australia achieved their best placing yet, finishing fourth.[12] Japanese player Hinata Miyazawa won the Golden Boot scoring five goals throughout the tournament. Spanish player Aitana Bonmatí was voted the tournament's best player, winning the Golden Ball, whilst Bonmatí's teammate Salma Paralluelo was awarded the Young Player Award. England goalkeeper Mary Earps won the Golden Glove, awarded to the best-performing goalkeeper of the tournament.
Of the eight teams making their first appearance, Morocco were the only one to advance to the round of 16 (where they lost to France; coincidentally, the result of this fixture was similar to the men's World Cup in Qatar, where France defeated Morocco in the semi-final). The United States were the two-time defending champions,[13] but were eliminated in the round of 16 by Sweden, the first time the team had not made the semi-finals at the tournament, and the first time the defending champions failed to progress to the quarter-finals.[14]
Australia's team, nicknamed the Matildas, performed better than expected, and the event saw many Australians unite to support them.[15][16][17] The Matildas, who beat France to make the semi-finals for the first time, saw record numbers of fans watching their games, their 3–1 loss to England becoming the most watched television broadcast in Australian history, with an average viewership of 7.13 million and a peak viewership of 11.15 million viewers.[18] It was the most attended edition of the competition ever held.
"""

prompt = f"""
Given the following context, who won the 2023 Women's World cup?

context: {context}
"""
response = llama(prompt)
print(response)
You can now generalize this prompt as follows.
context = """
<paste context in here>
"""
query = "<your query here>"
prompt = f"""
Given the following context, {query}

context: {context}
"""
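The same template can be wrapped in a small helper so any query/context pair can be injected. An illustrative sketch; `ask_with_context` is an assumed name, not part of the course code.

```python
def ask_with_context(query: str, context: str) -> str:
    """Prepend retrieved context so the model can answer questions
    about events outside its training data."""
    return f"""
Given the following context, {query}

context: {context}
"""

# Usage with a one-line stand-in for the Wikipedia passage
prompt = ask_with_context(
    "who won the 2023 Women's World Cup?",
    "Spain were crowned champions after defeating England 1-0 in the final.",
)
print(prompt)
```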
Chain-of-thought Prompting
Llama models also reason better on math problems when told to "think step by step."
Additional instructions also matter for solving the problem well.
The order of the instructions matters too ("answer first" vs. "answer later" give different results).
Because LLMs predict their answer one token at a time, the best approach is to ask the model to think step by step and to provide the answer only after it has explained its reasoning.
In the examples below, prompt A (explain first → then give the answer) reasons better.
Prompt B asks for the answer first → explanation afterward, and performs worse than A.
A. (math problem) Think step by step. Explain each intermediate step. Only when you are done with all your steps, provide the answer based on your intermediate steps.
B. (math problem) Think step by step. Provide the answer as a single yes/no answer first. Then explain each intermediate step.
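The two orderings can be produced from one helper, which makes the A/B comparison easy to run. A minimal sketch; `cot_prompt` is an illustrative name, not course code.

```python
def cot_prompt(problem: str, answer_first: bool = False) -> str:
    """Build a chain-of-thought prompt. Reasoning-first (the default)
    tends to work better, since the model generates one token at a time."""
    if answer_first:
        steps = ("Provide the answer as a single yes/no answer first. "
                 "Then explain each intermediate step.")
    else:
        steps = ("Explain each intermediate step. Only when you are done "
                 "with all your steps, provide the answer based on your "
                 "intermediate steps.")
    return f"{problem}\nThink step by step. {steps}"

problem = "Can 23 people be seated in 4 cars that hold 5 people each?"
print(cot_prompt(problem))                     # variant A: reason, then answer
print(cot_prompt(problem, answer_first=True))  # variant B: answer, then reason
```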
Source:
#promptengineering #promptingtips #Llama2 #Llama3 #프롬프트엔지니어링
Sujin_Kang
Finetuned Language Models Are Zero-Shot Learners (2022)
Summary of Key Points:
- Instruction Tuning: The study presents a method called "instruction tuning," which fine-tunes large language models on a wide range of tasks expressed as natural language instructions, improving zero-shot performance on unseen tasks.
- Performance of FLAN: A 137B-parameter model (called FLAN) was instruction-tuned and evaluated on unseen tasks. It outperformed GPT-3's zero-shot and even its few-shot performance on multiple benchmarks.
- Key Results: FLAN achieved better results in natural language inference, reading comprehension, and closed-book QA compared to GPT-3, especially on tasks it wasn't directly trained on.
- Scaling and Effectiveness: Instruction tuning becomes increasingly beneficial as model scale increases, with results showing better generalization to new tasks as the number of instruction-tuned tasks increases.

Significance of the Study:
- Improvement in Zero-Shot Learning: This work demonstrates how instruction tuning significantly enhances zero-shot learning capabilities in large language models, enabling them to perform better on unseen tasks with minimal prompt engineering.
- Broader Applicability: The study indicates that instruction tuning can generalize models across a wider range of unseen tasks, potentially lowering the barrier to their application in diverse NLP problems.
- Surpassing GPT-3: FLAN's performance surpasses GPT-3 in many zero-shot and few-shot tasks, showcasing the potential of instruction-tuned models to reduce the need for extensive fine-tuning for individual tasks.

Limitations:
- Smaller Model Issues: Instruction tuning does not help smaller models (less than 8B parameters), and in some cases it actually worsens performance on unseen tasks.
- Commonsense and Coreference Resolution Tasks: Instruction tuning was found to be less effective for tasks like commonsense reasoning and coreference resolution, where task-specific instructions might be redundant.
- Limited Exploration of Instructions: The study only explored short, single-sentence instructions. More complex or detailed instructions were not evaluated.
- Training Cost: The model's large size (137B parameters) makes it computationally expensive to train and serve, limiting its practical deployment in certain environments.

Important Graphs:
(1) Shows how instruction tuning improves performance across different tasks (Natural Language Inference, Reading Comprehension, and Closed-Book QA) compared to GPT-3 and other models.
(2) Illustrates zero-shot performance improvement for FLAN across Natural Language Inference, Reading Comprehension, and Closed-Book QA.
(3) Demonstrates that adding more task clusters during instruction tuning improves performance across held-out clusters, showing no saturation in performance gains as more tasks are added.
Sujin_Kang
GPT-4 Technical Report (2024)
Summary of Key Points:
- Development of GPT-4: GPT-4 is a large-scale multimodal model, capable of processing both image and text inputs and producing text outputs. It outperforms previous models on academic and professional benchmarks, showing human-level performance on tasks such as the simulated bar exam.
- Performance and Capabilities: GPT-4 significantly improves on tasks in various languages and domains. It was tested on multiple-choice exams, coding tasks, and benchmarks such as MMLU, where it performed exceptionally well in different languages.
- Safety and Limitations: Despite its advancements, GPT-4 has limitations such as hallucinations, reasoning errors, and biases. OpenAI has made improvements to reduce these, especially in adversarial settings.
- Predictable Scaling: A notable feature of GPT-4's development is its infrastructure, which allowed accurate prediction of performance from models trained with significantly less compute.
- Ethical Considerations: The model introduces new risks due to its capabilities, and efforts have been made to address harmful outputs via reinforcement learning and adversarial testing.

Significance of the Study:
- AI Advancements: GPT-4 represents a significant leap in AI capability, particularly in natural language understanding across multiple languages and tasks.
- Ethical and Safety Developments: The study highlights the need for safe AI deployment and introduces methods such as red-teaming and rule-based reward models to mitigate risks.
- Benchmark in AI Progress: Its human-level performance on complex benchmarks marks GPT-4 as a key model in AI development, setting a new standard for NLP systems.

Important Graphs:
(1) Illustrates GPT-4's final loss prediction against smaller models, highlighting the power-law fit for scaling predictions.
(2) Shows GPT-4's performance on various academic and professional exams, compared to GPT-3.5, showcasing its superiority across a range of subjects.
(3) Demonstrates GPT-4's improvements in factuality, showing a 19% increase in performance on adversarial factual evaluations compared to previous models.

Limitations:
- Hallucinations and Errors: GPT-4 still generates incorrect or nonsensical outputs in some cases, particularly when handling complex tasks or novel inputs.
- Overreliance: It may give incorrect but confident responses, leading users to trust it more than they should.
- Limited Context Window: Despite improvements, GPT-4 has limitations in handling long-term dependencies across large contexts.
- Biases: Efforts have been made to reduce bias, but the model still exhibits biases in certain situations.

The report also includes sample prompts and example "jailbreaks" for GPT-4-launch (with a content warning for graphic content).
Sujin_Kang
Language Models are Few-Shot Learners_OpenAI (2020)
Title: Language Models are Few-Shot Learners

Key Summary:
- Objective: The paper investigates the capability of large-scale language models, particularly GPT-3, to perform a wide range of natural language processing (NLP) tasks without task-specific fine-tuning, using few-shot, one-shot, and zero-shot learning.
- GPT-3: The model has 175 billion parameters, ten times larger than previous non-sparse models. It utilizes an autoregressive architecture and is trained on a massive corpus of text. GPT-3's size enables it to generalize to tasks better with minimal task-specific data.
- Few-Shot Learning: GPT-3 shows strong performance across various NLP tasks, such as translation, question-answering, reading comprehension, and arithmetic, without fine-tuning. It uses few-shot learning, meaning it can adapt to a new task by seeing a few examples (often 10-100) within its context window.
- Tasks Covered: The model demonstrates strong performance in tasks such as:
  - Language modeling and completion tasks: predicting missing words and completing sentences.
  - Question-answering: particularly strong on the TriviaQA and CoQA datasets.
  - Translation: moderate capability in English-French and other language pairs.
  - Commonsense reasoning and arithmetic: performs well on commonsense tasks and basic math.
- Societal Impact: The paper discusses the ethical implications of GPT-3's capabilities, including potential misuse in generating misleading or fake content, concerns about bias in its outputs, and the environmental impact of large-scale model training.

Significance of This Article:
- Few-Shot Learning Breakthrough: GPT-3 demonstrated that large-scale models can perform well on many tasks with minimal examples, without requiring task-specific fine-tuning.
- Scaling Impact: With 175 billion parameters, GPT-3 showed that scaling up model size significantly improves performance across a variety of tasks.
- In-Context Learning: The model can adapt to new tasks through prompts, mimicking human-like learning without altering the model's parameters.
- Societal and Ethical Concerns: While it enables human-like text generation, it also raises concerns about misuse, bias, and fairness.
- Foundational Impact: It set the stage for future AI models, influencing research directions and practical applications in industries like content creation and customer service.

Limitations:
- Few-shot struggles: Despite GPT-3's overall success, it struggles with certain types of tasks, such as natural language inference (NLI) and some reading comprehension datasets, indicating that its ability to understand and compare sentence-level meaning remains limited.
- Data contamination: GPT-3's training on large web corpora presents a challenge with data contamination. This occurs when the test set overlaps with training data from sources like Common Crawl, which inflates the model's performance on some benchmarks.