Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Created by
  • Haebom

Author

Jaskaranjeet Singh, Rakesh Thakur

PunGPT2: A Large-Scale Punjabi Language Model

Outline

Despite advances in large-scale language models (LLMs), low-resource languages remain underrepresented in NLP, limiting digital accessibility for millions. To address this, we present PunGPT2, a fully open-source generative model suite tailored for Punjabi. Trained on a 35GB corpus of literature, religious texts, news, and social discourse, it captures the syntactic and morphological richness of Punjabi through tokenizers optimized for the Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmentation framework that integrates PunGPT2 with a FAISS retriever, and Pun-Instruct, which uses QLoRA for instruction-tuned zero-shot summarization, translation, and question answering. Furthermore, we develop Quantum-RAG, which fuses sparse, dense, and quantum kernel embeddings to enable efficient, context-aware retrieval with low memory overhead, marking the first practical implementation of quantum-inspired retrieval in low-resource LLMs. These models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and the new PunjabiEval suite; Quantum-RAG achieves +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5 on PunjabiEval. By releasing the 35GB Punjabi corpus, the PunjabiEval benchmark, all model weights, training scripts, hyperparameters, and the evaluation pipeline, we establish a new state of the art in Punjabi generation and retrieval.
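The abstract describes Quantum-RAG as fusing sparse, dense, and quantum kernel signals into one retrieval score. The paper's actual implementation is not reproduced here; the following is a minimal, hypothetical sketch of such score fusion. The function names, the term-overlap sparse score, the fidelity-style kernel, and the fusion weights are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def sparse_score(query_terms, doc_terms):
    # Term-overlap stand-in for a BM25-style sparse retriever score.
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / max(len(q), 1)

def dense_score(q_vec, d_vec):
    # Cosine similarity between dense embeddings (e.g. from the LM encoder).
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(q @ d)

def quantum_kernel_score(q_vec, d_vec):
    # Fidelity-style quantum-inspired kernel |<q|d>|^2 on unit "state" vectors.
    return dense_score(q_vec, d_vec) ** 2

def fused_score(query_terms, doc_terms, q_vec, d_vec,
                weights=(0.3, 0.4, 0.3)):
    # Weighted fusion of the three channels; the weights are a guess,
    # not values reported in the paper.
    s = sparse_score(query_terms, doc_terms)
    d = dense_score(q_vec, d_vec)
    k = quantum_kernel_score(q_vec, d_vec)
    return weights[0] * s + weights[1] * d + weights[2] * k

# Rank a tiny toy corpus against a query.
docs = {
    "doc_a": (["punjabi", "poetry"], np.array([0.9, 0.1])),
    "doc_b": (["weather", "news"], np.array([0.1, 0.9])),
}
query = (["punjabi", "literature"], np.array([1.0, 0.0]))
ranked = sorted(
    docs,
    key=lambda d: fused_score(query[0], docs[d][0], query[1], docs[d][1]),
    reverse=True,
)
```

In a real system the dense and kernel channels would run over a FAISS index rather than a Python loop; the sketch only shows how the three similarity signals could be combined into one ranking score.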

Takeaways, Limitations

Takeaways:
  • An LLM specialized for Punjabi, a low-resource language, improves digital accessibility for its speakers.
  • Quantum-RAG, a novel retrieval technique, enables efficient context-aware retrieval and improves the performance of low-resource LLMs.
  • By releasing all resources (data, models, code) openly, the work supports further research and development for Punjabi NLP.
  • Model performance is demonstrated across multiple evaluation metrics and benchmarks.
Limitations:
  • The relationship between Quantum-RAG's quantum-inspired techniques and practical quantum computing may need further explanation.
  • Generalizability to other low-resource languages requires further study.
  • A more in-depth analysis of the model's biases and ethical concerns is needed.
  • Whether the 35GB corpus covers all registers of Punjabi, and whether additional data is needed, merits consideration.