Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Created by
  • Haebom

Author

Jaskaranjeet Singh, Rakesh Thakur

PunGPT2: A Large-Scale Punjabi Language Model

Outline

Despite advances in large-scale language models (LLMs), low-resource languages remain underrepresented in NLP, limiting digital accessibility for millions. To address this, we present PunGPT2, a fully open-source generative model suite tailored for Punjabi. Trained on a 35GB corpus of literature, religious texts, news, and social discourse, it captures the syntactic and morphological richness of Punjabi through tokenizers optimized for the Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmentation framework that integrates PunGPT2 with a FAISS retriever, and Pun-Instruct, which uses QLoRA for instruction-tuned zero-shot summarization, translation, and question answering. Furthermore, we develop Quantum-RAG, which fuses sparse, dense, and quantum kernel embeddings to enable efficient, context-aware retrieval with low memory overhead, marking the first practical implementation of quantum-inspired retrieval in low-resource LLMs. These models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and the new PunjabiEval suite; Quantum-RAG achieves +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5 on PunjabiEval. By releasing the 35GB Punjabi corpus, the PunjabiEval benchmark, all model weights, training scripts, hyperparameters, and the evaluation pipeline, we establish a new state of the art in Punjabi generation and retrieval.
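The abstract describes Quantum-RAG as fusing sparse, dense, and quantum kernel signals into one retrieval score. The paper's actual implementation is not reproduced here; the following is a minimal, hypothetical sketch of such score fusion. The function names, the term-overlap sparse score, the fidelity-style kernel, and the fusion weights are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def sparse_score(query_terms, doc_terms):
    # Term-overlap stand-in for a BM25-style sparse retriever score.
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / max(len(q), 1)

def dense_score(q_vec, d_vec):
    # Cosine similarity between dense embeddings (e.g. from the LM encoder).
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(q @ d)

def quantum_kernel_score(q_vec, d_vec):
    # Fidelity-style quantum-inspired kernel |<q|d>|^2 on unit "state" vectors.
    return dense_score(q_vec, d_vec) ** 2

def fused_score(query_terms, doc_terms, q_vec, d_vec,
                weights=(0.3, 0.4, 0.3)):
    # Weighted fusion of the three channels; the weights are a guess,
    # not values reported in the paper.
    s = sparse_score(query_terms, doc_terms)
    d = dense_score(q_vec, d_vec)
    k = quantum_kernel_score(q_vec, d_vec)
    return weights[0] * s + weights[1] * d + weights[2] * k

# Rank a tiny toy corpus against a query.
docs = {
    "doc_a": (["punjabi", "poetry"], np.array([0.9, 0.1])),
    "doc_b": (["weather", "news"], np.array([0.1, 0.9])),
}
query = (["punjabi", "literature"], np.array([1.0, 0.0]))
ranked = sorted(
    docs,
    key=lambda d: fused_score(query[0], docs[d][0], query[1], docs[d][1]),
    reverse=True,
)
```

In a real system the dense and kernel channels would run over a FAISS index rather than a Python loop; the sketch only shows how the three similarity signals could be combined into one ranking score.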

Takeaways, Limitations

Takeaways:
  • An LLM specialized for Punjabi, a low-resource language, improves digital accessibility for its speakers.
  • Quantum-RAG, a novel retrieval technique, enables efficient context-aware retrieval and improves the performance of low-resource LLMs.
  • By releasing all resources (data, models, code) openly, the work supports further research and development for Punjabi NLP.
  • Model performance is demonstrated across multiple evaluation metrics and benchmarks.
Limitations:
  • The relationship between Quantum-RAG's quantum-inspired techniques and practical quantum computing may need further explanation.
  • Generalizability to other low-resource languages requires further study.
  • A more in-depth analysis of the model's biases and ethical concerns is needed.
  • Whether the 35GB corpus covers all registers of Punjabi, and whether additional data is needed, merits consideration.