Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge

Created by
  • Haebom

Author

Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish, Shammur Absar Chowdhury

Outline

This paper argues that large-scale resources grounded in multilingual, regional, and cultural contexts are needed to address concerns about the cultural bias, fairness, and applicability of large language models (LLMs) across diverse languages and low-resource regions. To this end, the authors propose the NativQA framework, which seamlessly builds large-scale question-answering (QA) datasets tailored to diverse cultures and regions by expanding user-defined seed queries and retrieving location-specific everyday information from search engines. Evaluations across 24 countries, 39 regions, and 7 languages (ranging from low- to high-resource) yielded over 300,000 question-answer pairs that can be used for LLM benchmarking and further fine-tuning. The NativQA framework is publicly available ( https://gitlab.com/nativqa/nativqa-framework ).
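The pipeline described above can be sketched as a minimal mock in Python. This is an illustrative assumption of how such a framework might be structured, not the actual NativQA API: `expand_seed_query`, `mock_search`, and `build_qa_pairs` are hypothetical names, and a real implementation would call a search-engine API instead of the stand-in below.

```python
# Hypothetical sketch of a NativQA-style pipeline: expand user-defined seed
# queries into region-specific question variants, retrieve snippets from a
# (mocked) search backend, and collect question-answer pairs.
# All names here are illustrative, not the framework's actual API.

def expand_seed_query(seed, locations):
    """Generate one region-specific question per location from a seed query."""
    return [f"{seed} in {loc}?" for loc in locations]

def mock_search(question):
    """Stand-in for a search-engine call; a real pipeline would query a
    search API and keep the top result snippets."""
    return [f"Snippet answering: {question}"]

def build_qa_pairs(seed_queries, locations):
    """Collect QA pairs by pairing each expanded question with its snippets."""
    pairs = []
    for seed in seed_queries:
        for question in expand_seed_query(seed, locations):
            for snippet in mock_search(question):
                pairs.append({"question": question, "answer": snippet})
    return pairs

# Example: one seed query expanded over two locations yields two QA pairs.
pairs = build_qa_pairs(["best local breakfast"], ["Doha", "Dhaka"])
```

In the real framework, a QA-validation step (filtering low-quality or irrelevant pairs) would follow retrieval before the pairs are released for benchmarking.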

Takeaways, Limitations

Takeaways:
Provides an efficient framework for building large-scale QA datasets that account for multilingual, regional, and cultural contexts.
Enables evaluating and improving LLM performance across diverse language environments, including low-resource languages.
Facilitates research-community engagement and progress through an open, publicly available framework.
Limitations:
Search-engine dependency: dataset quality may be affected by the quality of search-engine results.
Potential regional bias: data collection may be skewed toward certain regions.
Dataset size: 300K QA pairs may not be enough for large-scale LLM training.
Generalizability of the framework: further research is needed on its applicability to other linguistic and cultural contexts.