Training omnimodal Large Language Models (LLMs) remains a significant challenge because the heterogeneous model architectures required to handle diverse modalities demand sophisticated system design for large-scale training. Existing frameworks typically intertwine model definition with parallel logic, which limits scalability and inflates the engineering overhead of end-to-end omnimodal training. This paper presents VeOmni, a modular and efficient training framework for accelerating omnimodal LLM development. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omnimodal LLMs. It also provides a flexible configuration interface that allows new modalities to be integrated with minimal code changes. Using VeOmni, the authors train an omnimodal Mixture-of-Experts (MoE) model with 30B parameters at a throughput of 2,800 tokens/second/GPU and scale it to a 160K context length with 3D parallelism on 128 GPUs, demonstrating strong efficiency and scalability for large-scale omnimodal LLM training.
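To make the "model-centric distributed recipe" idea concrete, the sketch below shows what separating a plain model definition from its parallel layout might look like in code. It is a minimal illustration only: the ParallelRecipe dataclass, its fields, and the OmniBlock module are hypothetical names introduced here for explanation and are not VeOmni's actual API.

```python
# Minimal sketch (hypothetical, not the actual VeOmni API): the model is written as a
# plain PyTorch module with no communication code, while a separate "recipe" object
# describes how each sub-module should be laid out for 3D parallelism
# (FSDP x sequence parallel x expert parallel). A framework like VeOmni would consume
# such a recipe and inject the collective communication at runtime.
from dataclasses import dataclass, field

import torch.nn as nn


@dataclass
class ParallelRecipe:
    """Model-centric parallel plan, kept entirely outside the model definition."""
    fsdp_shard: bool = True          # shard parameters/optimizer state (data-parallel dim)
    sequence_parallel_size: int = 8  # split long sequences across GPUs (context dim)
    expert_parallel_size: int = 4    # place MoE experts on different GPUs (expert dim)
    per_module: dict = field(default_factory=dict)  # optional per-module overrides


class OmniBlock(nn.Module):
    """A plain transformer-style block: no parallelism or communication logic here."""

    def __init__(self, hidden: int = 1024, num_experts: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=16, batch_first=True)
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

    def forward(self, x):
        x, _ = self.attn(x, x, x)
        # (routing logic omitted; a real MoE layer would dispatch tokens to experts)
        return x


# The distributed layout lives in the recipe, so changing the parallel strategy
# requires no change to OmniBlock itself.
recipe = ParallelRecipe(sequence_parallel_size=8, expert_parallel_size=4,
                        per_module={"experts": "expert_parallel"})
```

Keeping the parallel plan in a separate object is one plausible way to read the paper's claim that communication is decoupled from computation; the exact mechanism VeOmni uses may differ.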
Takeaways, Limitations
•
Takeaways:
◦
Presents VeOmni, a novel framework that significantly improves the efficiency and scalability of omnimodal LLM training by decoupling model definition from communication and parallel logic.
◦
Enables large-scale omnimodal LLM training through efficient 3D parallelism.
◦
Allows easy integration of new modalities through a flexible configuration interface (see the configuration sketch after this list).
◦
Experiments on a 30B-parameter omnimodal MoE model (2,800 tokens/second/GPU throughput and 160K context lengths on 128 GPUs) demonstrate VeOmni's efficiency and scalability.
•
Limitations:
◦
Further research is needed on the practical applications of VeOmni and its generalizability to various omnimodal LLM architectures.
◦
The framework may be optimized for a specific hardware environment; its portability to other hardware setups needs to be verified.
◦
Further experiments and analysis are needed to assess training efficiency and stability at even larger model scales.
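The configuration sketch referenced in the takeaway on modality integration follows here. It is a rough illustration of how a registry-style modality interface could keep new-modality support to a few lines of code; the register_modality decorator, MODALITY_ENCODERS registry, and AudioEncoder class are hypothetical and do not reflect VeOmni's real configuration interface.

```python
# Hypothetical registry-style modality interface (not VeOmni's actual API): a new
# modality is added by registering an encoder class under a name and referencing that
# name in the training configuration, leaving the core training loop untouched.
import torch
import torch.nn as nn

MODALITY_ENCODERS: dict[str, type[nn.Module]] = {}


def register_modality(name: str):
    """Class decorator that makes an encoder discoverable by name."""
    def wrapper(cls):
        MODALITY_ENCODERS[name] = cls
        return cls
    return wrapper


@register_modality("audio")
class AudioEncoder(nn.Module):
    """Maps raw audio features to the LLM's hidden size."""

    def __init__(self, in_dim: int = 128, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


# A config then only needs to name the encoder; no framework code changes are required.
config = {"modalities": ["text", "image", "audio"], "encoders": {"audio": "audio"}}
encoder = MODALITY_ENCODERS[config["encoders"]["audio"]]()
print(encoder(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 1024])
```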