To address the inability of ViT-based large multimodal models (LMMs) to capture subtle visual differences in geometric scenarios, this paper proposes a hard-negative contrastive learning framework. The framework combines image-based contrastive learning, which uses generative hard negatives produced by modifying diagram-generation code, with text-based contrastive learning, which uses rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected by caption similarity. Using this hard-negative training method, we train a CLIP visual encoder, Multimodal Math CLIP (MMCLIP), which in turn is used to train an LMM for solving geometric problems. Experimental results show that the 7B MMGeoLM model significantly outperforms other open-source models on three geometric reasoning benchmarks, achieving performance comparable to strong closed-source models such as GPT-4o. Additionally, through analysis of hard-negative types, the efficiency of image-based negatives, and training configurations, we gain insights into optimizing the visual-encoder training pipeline for fine-grained geometric reasoning tasks.
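To make the core training objective concrete, below is a minimal sketch of how per-sample hard negatives can be folded into an InfoNCE-style contrastive loss alongside the usual in-batch negatives. The function name, the per-image grouping of hard negatives, and the exact loss form are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def hard_negative_info_nce(img_emb, txt_emb, hard_neg_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss with extra hard negatives (assumed form).

    img_emb:      (B, D) image embeddings
    txt_emb:      (B, D) matching caption embeddings (positives)
    hard_neg_emb: (B, K, D) K hard-negative embeddings per image, e.g. captions
                  from rule-modified geometric descriptions, retrieved similar
                  captions, or renders of perturbed diagram code
    """
    # Normalize so dot products are cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: every other caption in the batch is a negative. (B, B)
    logits = img @ txt.t()

    # Hard-negative similarities: each image vs. its own K hard negatives. (B, K)
    hard_logits = torch.einsum("bd,bkd->bk", img, neg)

    # Candidate set per image: B in-batch captions + K dedicated hard negatives.
    logits = torch.cat([logits, hard_logits], dim=1) / temperature

    # The positive for image i is caption i (index i in the candidate list).
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)
```

The key design point this sketch captures is that hard negatives are appended to the softmax candidate set per image, so the encoder is penalized specifically for assigning high similarity to near-miss geometric descriptions or diagrams rather than only to random in-batch mismatches.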