Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Lessons from Studying Two-Hop Latent Reasoning

Created by
  • Haebom

Authors

Mikita Balesni, Tomek Korbak, Owain Evans

Outline

This paper investigates the latent reasoning capabilities of large language models (LLMs), specifically their ability to compose two facts when answering two-hop questions. Prior work has shown that LLMs struggle with two-hop question answering without chain-of-thought (CoT) reasoning. This study fine-tunes LLMs on synthetic facts, making it possible to assess genuine composition without contamination from memorization or reasoning shortcuts. Experiments with models including Llama 3 8B and GPT-4o show that while these models fail to compose two synthetic facts, they succeed when combining one synthetic fact with one naturally acquired fact. This suggests that LLMs do possess latent two-hop reasoning ability, although it remains unclear how this ability scales with model size. Finally, the authors stress that researchers studying LLM reasoning must guard against both false successes caused by memorization or reasoning shortcuts and false failures caused by artificial experimental setups.
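To make the setup concrete, here is a minimal Python sketch of what such a synthetic two-hop dataset might look like. The entity names and fact templates are invented for illustration; they are not the paper's actual data or prompts.

```python
# A hypothetical sketch of synthetic two-hop data in the style the paper
# describes: each item pairs two atomic facts about fictional entities
# (so nothing can be answered from memorization), and the evaluation
# question is answerable only by composing both facts in a single forward
# pass, without chain-of-thought.

PEOPLE  = ["Velora Quint", "Dashel Mirov", "Orin Palt"]    # invented names
SPOUSES = ["Casmir Yble", "Thessaly Brun", "Ilka Voss"]    # invented names
CITIES  = ["Brimhollow", "Tessargate", "Vintermoor"]       # invented places

def make_example(person: str, spouse: str, city: str) -> dict:
    """Build one two-hop item: two fine-tuning facts plus the composed question."""
    return {
        # Atomic facts shown (separately) during fine-tuning:
        "finetune_facts": [
            f"The spouse of {person} is {spouse}.",  # hop 1
            f"{spouse} was born in {city}.",         # hop 2
        ],
        # Two-hop question asked at test time; the composition itself
        # never appears in the training data:
        "eval_question": f"In which city was the spouse of {person} born?",
        "answer": city,
    }

dataset = [make_example(p, s, c) for p, s, c in zip(PEOPLE, SPOUSES, CITIES)]

for ex in dataset:
    print(ex["eval_question"], "->", ex["answer"])
```

Replacing one of the two synthetic facts with a naturally acquired fact (e.g., a real city a well-known person was born in) yields the mixed condition in which, per the summary, the models succeed.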

Takeaways, Limitations

Takeaways: The paper presents a controlled experimental setup demonstrating that LLMs possess latent two-hop reasoning ability. By using synthetic facts, it evaluates genuine composition while ruling out memorization and reasoning shortcuts, and it offers guidance for avoiding both false successes and false failures when studying LLM reasoning.
Limitations: It remains unclear how two-hop latent reasoning scales with model size. The finding that models succeed only when combining a synthetic fact with a natural one suggests further research is needed to characterize composition in general, and it is not yet established whether the proposed experimental setup generalizes to all types of two-hop reasoning problems.