Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

Neural Diffusion Processes for Physically Interpretable Survival Prediction

Tenyidie Syllabification corpus creation and deep learning applications

On Predictability of Reinforcement Learning Dynamics for Large Language Models

EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Normal-Abnormal Guided Generalist Anomaly Detection

Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

More Thoughts, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

The AI Productivity Index (APEX)

Discontinuous Epitope Fragments as Sufficient Target Templates for Efficient Binder Design

Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder

GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries

Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact

IndexNet: Timestamp and Variable-Aware Modeling for Time Series Forecasting

An effective control of large systems of active particles: An application to evacuation problem

Discovering Software Parallelization Points Using Deep Neural Networks

SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models

Machines are more productive than humans until they aren't, and vice versa

Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji

Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Forecasting the Ionosphere from Sparse GNSS Data with Temporal-Fusion Transformers

Towards Methane Detection Onboard Satellites

Tackling Federated Unlearning as a Parameter Estimation Problem

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

Legal Knowledge Graph Foundations, Part I: URI-Addressable Abstract Works (LRMoo F1 to schema.org)

An Architecture for Spatial Networking

AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

VITA: Vision-to-Action Flow Matching Policy

Can LLMs Find Fraudsters? Multi-level LLM Enhanced Graph Fraud Detection

Model Parallelism With Subnetwork Data Parallelism

A Novel Approach for Estimating Largest Lyapunov Exponents in One-Dimensional Chaotic Time Series Using Machine Learning

PlaceFM: A Training-free Geospatial Foundation Model of Places using Large-Scale Point of Interest Data

MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

MS-DFTVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Deformable Convolution

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

WWAggr: A Window Wasserstein-based Aggregation for Ensemble Change Point Detection

Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering

Localized Forest Fire Risk Prediction: A Department-Aware Approach for Operational Decision Support

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation

Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Enhanced DACER Algorithm with High Diffusion Efficiency

What happens when generative AI models train recursively on each others' outputs?

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection

Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap

Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Time-o1: Time-Series Forecasting Needs Transformed Label Alignment

MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

ScSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data

AI-Powered Inverse Design of Ku-Band SIW Resonant Structures by Iterative Residual Correction Network

Feature Representation Transferring to Lightweight Models via Perception Coherence

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

FalconWing: An Ultra-Light Indoor Fixed-Wing UAV Platform for Vision-Based Autonomy

WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms

Towards Effective E-Participation of Citizens in the European Union: The Development of AskThePublic

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Knowledge-guided machine learning for county-level corn yield prediction under drought

Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning

FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning

Interpretable Text Embeddings and Text Similarity Explanation: A Survey

CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation

Forget Forgetting: Continual Learning in a World of Abundant Memory

Out-of-Distribution Detection using Synthetic Data Generation

Handling Heterophily in Recommender Systems with Wavelet Hypergraph Diffusion

Paper Quality Assessment based on Individual Wisdom Metrics from Open Peer Review

Diffusion Adversarial Post-Training for One-Step Video Generation

Unraveling Indirect In-Context Learning Using Influence Functions

Synergizing LLMs and Knowledge Graphs: A Novel Approach to Software Repository-Related Question Answering

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning

Reasoning over User Preferences: Knowledge Graph-Augmented LLMs for Explainable Conversational Recommendations

Reliable Decision Making via Calibration Oriented Retrieval Augmented Generation

Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

There and Back Again: On the relationship between Noise and Image Inversions in Diffusion Models

QSpec: Speculative Decoding with Complementary Quantization Schemes

Superficial Safety Alignment Hypothesis

AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

R2 v2: The Pareto-compliant R2 Indicator for Better Benchmarking in Bi-objective Optimization

Hierarchical place recognition with omnidirectional images and curriculum learning-based loss functions

Neural Network Parameter-optimization of Gaussian pmDAGs

Semantic Bridges Between First Order c-Representations and Cost-Based Semantics: An Initial Perspective

Rethinking Reward Models for Multi-Domain Test-Time Scaling

Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Created by

Haebom

Authors

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

Outline

This paper studies Euclidean geometry problem solving as a surrogate task for improving spatial intelligence in multimodal large language models (MLLMs). Spatial intelligence here covers abilities such as visual shape transformation, object rotation, relative position judgment, and numerical estimation. The authors constructed Euclid30K, a multimodal dataset of approximately 30,000 plane and solid geometry problems, and fine-tuned the Qwen2.5VL and RoboBrain2.0 models on it using Group Relative Policy Optimization (GRPO). After training on Euclid30K, and without any task-specific adaptation, the models showed zero-shot improvements on four spatial reasoning benchmarks: Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube. In particular, the average accuracy of all models on VSI-Bench increased by 5.5 percentage points, from 34.5% to 40.5%, and the RoboBrain2.0-Euclid-7B model achieved an accuracy of 49.6%, outperforming the previous best-performing model, Spatial-MLLM. The authors describe this as the first systematic study showing that geometry-focused fine-tuning can give vision-language models broadly transferable spatial skills.
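For readers unfamiliar with GRPO, the core idea can be sketched in a few lines: several answers are sampled for the same problem, each is scored, and every answer's advantage is its reward normalized against the other answers in its group. The sketch below is illustrative only and is not the authors' training code. The function name `group_relative_advantages`, the binary correctness reward, the use of the population standard deviation, and the `eps` value are all assumptions, since the summary does not describe the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same problem.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one geometry problem:
# 1.0 if the answer matches the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, so the policy update favors answers that beat their own group's average. When all answers in a group score the same, the advantages are zero and that group contributes no learning signal.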

Takeaways, Limitations

• Takeaways:
  ◦ Fine-tuning on geometry problems improves the spatial reasoning capabilities of MLLMs.
  ◦ The results demonstrate the effectiveness of the Euclid30K dataset and the GRPO methodology.
  ◦ Zero-shot performance improves across various spatial reasoning benchmarks.
  ◦ The fine-tuned models surpass existing top-performing models.
  ◦ The work presents a new approach to spatial intelligence research.

• Limitations:
  ◦ Little information is given about the Euclid30K dataset and the resources used to train the models.
  ◦ Further validation of the models' generalization ability is needed.
  ◦ Applicability and performance on other spatial reasoning tasks still need to be verified.
  ◦ The models' reasoning process is not analyzed in depth.

