Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Created by
  • Haebom

Author

V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Outline

GLM-4.1V-Thinking and GLM-4.5V are vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. The paper shares key findings from the development of a reasoning-centric training framework: large-scale pretraining is first used to build a vision foundation model with strong potential, and Reinforcement Learning with Curriculum Sampling (RLCS) is then applied to unlock the model's capabilities across a wide range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long-document interpretation. In a comprehensive evaluation on 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all of them among open-source models of similar size, and is competitive with or superior to closed-source models such as Gemini-2.5-Flash on challenging tasks such as coding and GUI agents. The smaller GLM-4.1V-9B-Thinking also remains competitive, outperforming the much larger Qwen2.5-VL-72B on 29 benchmarks. Both GLM-4.1V-9B-Thinking and GLM-4.5V are released as open source.
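The exact RLCS procedure is described in the paper; as a rough, hypothetical illustration of the general idea behind curriculum sampling in RL training (not the authors' actual algorithm), the sketch below weights task pools by how "learnable" they currently are, favoring tasks the model neither always solves nor always fails. All class names, the success-rate tracking, and the weighting rule are assumptions made for illustration.

```python
import random

# Illustrative sketch only: curriculum sampling over RL task pools.
# The weighting rule (favor pools with success rates near 0.5) is an
# assumption for illustration, not the paper's exact RLCS procedure.

class CurriculumSampler:
    def __init__(self, task_pools):
        # task_pools: dict mapping pool name -> list of task instances
        self.task_pools = task_pools
        # Running success-rate estimate per pool, initialised to 0.5 (unknown).
        self.success_rate = {name: 0.5 for name in task_pools}

    def _weight(self, rate):
        # Highest weight when the model solves a pool about half the time:
        # such tasks are neither trivially easy nor hopelessly hard, so they
        # provide the strongest learning signal under an RL objective.
        return max(rate * (1.0 - rate), 1e-3)

    def sample(self):
        names = list(self.task_pools)
        weights = [self._weight(self.success_rate[n]) for n in names]
        pool = random.choices(names, weights=weights, k=1)[0]
        return pool, random.choice(self.task_pools[pool])

    def update(self, pool, solved, momentum=0.95):
        # Exponential moving average of rollout success for this pool.
        r = self.success_rate[pool]
        self.success_rate[pool] = momentum * r + (1.0 - momentum) * float(solved)


# Example with three hypothetical task pools of differing difficulty.
sampler = CurriculumSampler({
    "stem": ["q1", "q2"],
    "gui_agent": ["episode1", "episode2"],
    "grounding": ["img1", "img2"],
})
pool, task = sampler.sample()
sampler.update(pool, solved=True)
```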

Takeaways, Limitations

Takeaways:
Presents an effective VLM training framework that combines large-scale pre-training with RLCS.
Introduces the GLM-4.1V-Thinking and GLM-4.5V models, which show strong performance across a wide variety of tasks.
Shows that open-source models can be competitive with closed-source models.
Demonstrates strong performance relative to model size.
Limitations:
The paper does not explicitly discuss its own limitations or directions for future research.
Performance comparisons focus on specific benchmark tasks, so a deeper analysis of the models' generalization ability is still needed.
Although the models are open source, their complexity may still limit accessibility; a minimal loading sketch follows below.
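Since the 9B model is released openly, one low-effort way to try it is through Hugging Face transformers. The sketch below assumes the checkpoint is published on the Hub under an identifier like THUDM/GLM-4.1V-9B-Thinking and that the installed transformers version includes support for the model; the exact model ID, required version, and hardware needs should be checked against the official repository.

```python
# Minimal sketch for trying the open 9B checkpoint via Hugging Face transformers.
# The model identifier and required transformers version are assumptions;
# consult the official repository for the exact values.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",                # multimodal chat-style pipeline
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed Hub identifier
    device_map="auto",                   # place weights across available devices
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image URL
        {"type": "text", "text": "Describe the trend shown in this chart."},
    ],
}]

print(pipe(text=messages, max_new_tokens=256))
```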