Finetuned Language Models Are Zero-Shot Learners (2022)
Summary of Key Points:
- Instruction Tuning: The study introduces "instruction tuning": fine-tuning a large language model on a broad collection of tasks phrased as natural-language instructions, which improves zero-shot performance on unseen tasks.
- Performance of FLAN: A 137B-parameter instruction-tuned model (FLAN) was evaluated on held-out tasks. It outperformed zero-shot GPT-3 on most benchmarks, and even few-shot GPT-3 on several.
- Key Results: FLAN beat GPT-3 on natural language inference, reading comprehension, and closed-book QA, especially on task types it was not directly trained on.
- Scaling and Effectiveness: Instruction tuning becomes increasingly beneficial as model scale grows, and generalization to new tasks improves as the number of instruction-tuning tasks increases.

Significance of the Study:
- Improvement in Zero-Shot Learning: The work demonstrates that instruction tuning significantly enhances zero-shot capabilities in large language models, letting them perform well on unseen tasks with minimal prompt engineering.
- Broader Applicability: The results indicate that instruction tuning helps models generalize across a wider range of unseen tasks, potentially lowering the barrier to applying them to diverse NLP problems.
- Surpassing GPT-3: FLAN surpasses GPT-3 on many zero-shot and few-shot tasks, showing that instruction-tuned models can reduce the need for extensive per-task fine-tuning.

Limitations:
- Smaller Model Issues: Instruction tuning does not help smaller models (8B parameters and below), and in some cases it actually worsens their performance on unseen tasks.
- Commonsense and Coreference Resolution Tasks: Instruction tuning was less effective for commonsense reasoning and coreference resolution, where instructions may be largely redundant with the plain input.
- Limited Exploration of Instructions: The study only explored short, typically single-sentence instructions; more complex or detailed instructions were not evaluated.
- Training Cost: At 137B parameters, the model is computationally expensive to train and serve, limiting practical deployment in some environments.

Important Graphs:
(1) Shows how instruction tuning improves performance across task types (natural language inference, reading comprehension, and closed-book QA) compared to GPT-3 and other models.
(2) Illustrates FLAN's zero-shot performance gains on natural language inference, reading comprehension, and closed-book QA.
(3) Demonstrates that adding more task clusters during instruction tuning improves performance on held-out clusters, with no saturation in the gains as more tasks are added.
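The instruction-tuning setup summarized above can be sketched with a toy data-construction step: each labeled example is rendered through several natural-language instruction templates into (prompt, target) pairs for fine-tuning. This is a minimal illustration; the template wording, field names, and helper function here are hypothetical, not taken from the paper (FLAN composed multiple templates per dataset).

```python
# Illustrative sketch of instruction-tuning data construction.
# Template text and field names are hypothetical, not from the FLAN paper.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, "
    'can we conclude that "{hypothesis}"?',
]

def to_instruction_examples(record, templates):
    """Render one labeled record into instruction-formatted
    (prompt, target) pairs, one per template."""
    return [
        (template.format(**record["inputs"]), record["target"])
        for template in templates
    ]

record = {
    "inputs": {
        "premise": "A dog is running in the park.",
        "hypothesis": "An animal is outdoors.",
    },
    "target": "yes",
}

pairs = to_instruction_examples(record, NLI_TEMPLATES)
for prompt, target in pairs:
    print(prompt, "->", target)
```

Phrasing the same example several ways is what lets the fine-tuned model respond to unseen instructions at inference time, rather than overfitting to one fixed input format.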