Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Created by
  • Haebom

Authors

Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan

Outline

TokLIP is a visual tokenizer that "semantizes" vector-quantized (VQ) tokens by infusing them with CLIP-level semantics, addressing the high training overhead and weak comprehension performance that stem from the lack of high-level semantics in standard VQ tokens. It enables end-to-end multimodal autoregressive training while reusing existing VQ tokens: a low-level discrete VQ tokenizer is combined with a ViT-based token encoder that captures high-level continuous semantics. Unlike prior methods that discretize high-level features (e.g., VILA-U), TokLIP decouples the training objectives for comprehension and generation, so advanced VQ tokenizers can be applied directly without custom quantization operations. Experiments show that TokLIP achieves strong data efficiency, equipping visual tokens with high-level semantic understanding while also improving low-level generative capability, which makes it well suited to autoregressive transformers for both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
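The dual-branch design described above can be pictured with a minimal PyTorch sketch. Everything below is illustrative: the class names, layer sizes, and mean-pooling are assumptions for exposition, not the paper's implementation (see the linked repository for that). The sketch shows only the data flow: a frozen VQ tokenizer yields discrete ids for autoregressive generation, while a ViT-style token encoder turns the same token embeddings into CLIP-level continuous features for comprehension.

import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    """Stand-in for an off-the-shelf VQ tokenizer (frozen and reused as-is)."""
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 256px image -> 16x16 token grid

    @torch.no_grad()
    def forward(self, images):
        feats = self.encoder(images).flatten(2).transpose(1, 2)        # (B, L, D) patch features
        dists = torch.cdist(feats, self.codebook.weight.unsqueeze(0))  # distance to every codebook entry
        ids = dists.argmin(dim=-1)                                     # (B, L) discrete token ids
        return ids, self.codebook(ids)                                 # ids feed the AR generator

class TokLIPSketch(nn.Module):
    """Semantic branch: a ViT-style encoder that 'semantizes' VQ token embeddings."""
    def __init__(self, vq, dim=256, clip_dim=512, depth=4, heads=8):
        super().__init__()
        self.vq = vq
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.token_encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, clip_dim)  # project into a CLIP-level semantic space

    def forward(self, images):
        ids, embeds = self.vq(images)     # low-level discrete tokens, kept for generation
        sem = self.token_encoder(embeds)  # high-level continuous semantics
        sem = self.proj(sem.mean(dim=1))  # pooled image-level feature for comprehension
        return ids, sem

model = TokLIPSketch(ToyVQTokenizer())
ids, sem = model(torch.randn(2, 3, 256, 256))
print(ids.shape, sem.shape)  # torch.Size([2, 256]) torch.Size([2, 512])

Decoupling, in this picture, means the ids branch can be trained with a generation objective and the sem branch with a comprehension objective, without forcing the high-level features through a quantizer.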

Takeaways, Limitations

Takeaways:
  • Overcomes the limitations of existing token-based multimodal models by incorporating high-level semantics.
  • Achieves strong data efficiency while simultaneously improving low-level generation and high-level semantic understanding.
  • Enables end-to-end multimodal autoregressive training by directly reusing existing VQ tokenizers (a hedged sketch of the decoupled losses follows this list).
  • Applies effectively to autoregressive transformers for both comprehension and generation tasks.
Limitations:
  • The paper does not explicitly state its limitations; further experiments and comparative studies would be needed to identify them.
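As a complement to the takeaways, the decoupled objectives can be sketched as two independent losses: a CLIP-style symmetric contrastive loss on the pooled semantic features (comprehension) and a next-token cross-entropy over the discrete VQ ids (generation). The function names, the unweighted sum, and the temperature below are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled image semantics and text features (CLIP-style)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def next_token_loss(logits, vq_ids):
    """Autoregressive cross-entropy over discrete VQ ids, shifted by one position."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           vq_ids[:, 1:].reshape(-1))

# Illustrative call with random tensors standing in for model outputs.
B, L, V, D = 2, 256, 8192, 512
total = contrastive_loss(torch.randn(B, D), torch.randn(B, D)) \
      + next_token_loss(torch.randn(B, L, V), torch.randint(V, (B, L)))
print(total.item())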