Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Long Chain-of-Thought Reasoning Across Languages

Created by
  • Haebom

Author

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

Outline

While large reasoning models have demonstrated a remarkable ability to generate long chains of thought (CoTs) in English, our understanding of how this long-form reasoning ability transfers to the majority of the world's languages remains limited. This study systematically examines four key stages of model development (scale, pretraining, post-training, and inference) to understand how long-CoT capability extends beyond English. We compare two inference settings across nine non-English target languages: En-CoT, where the model receives input in the target language but reasons in English, and Target-CoT, where the model both receives input and generates its long CoT in the target language. Increasing model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind, and this gap widens further on tasks requiring long, multi-step CoTs, such as mathematical reasoning. Turning to pretraining, adding a specialized reasoning stage improves En-CoT performance but degrades Target-CoT, whereas extensive multilingual pretraining improves both modes simultaneously. Because high-quality reasoning traces are scarce in languages other than English, we explore synthetic data curation for post-training and show that fine-tuning on traces machine-translated from gold English traces outperforms fine-tuning on target-language traces distilled from a large reasoning model. Finally, we report discrepancies in reasoning efficiency across languages and identify language-specific failure modes in CoT. We publicly release our models, datasets, and code to support further research.
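The two evaluation settings can be pictured as two prompt templates that differ only in which language the chain of thought is requested in. The sketch below is a minimal illustration of that idea; the prompt wording, the build_prompt helper, and the example question are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch of the En-CoT vs. Target-CoT settings compared in the paper.
# The prompt wording and helper function are illustrative assumptions,
# not the authors' exact implementation.

def build_prompt(question: str, target_lang: str, setting: str) -> str:
    """Build a prompt for either En-CoT or Target-CoT evaluation."""
    if setting == "en_cot":
        # Input stays in the target language, but the model is asked
        # to carry out its chain of thought in English.
        instruction = (
            "Read the following problem, reason step by step in English, "
            "then give the final answer."
        )
    elif setting == "target_cot":
        # Both the input and the generated chain of thought are in the
        # target language.
        instruction = (
            f"Read the following problem, reason step by step in {target_lang}, "
            f"then give the final answer in {target_lang}."
        )
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{instruction}\n\nProblem ({target_lang}):\n{question}"


# Example: the same Swahili question rendered under both settings.
question_sw = "Juma ana tufaha 12 na anampa rafiki yake 5. Amebakiwa na tufaha ngapi?"
for setting in ("en_cot", "target_cot"):
    print(build_prompt(question_sw, "Swahili", setting))
    print("---")
```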

Takeaways, Limitations

Increasing model size improves En-CoT (reasoning in English) performance, but Target-CoT (reasoning in the target language) performance lags behind.
The gap between En-CoT and Target-CoT widens on complex tasks such as mathematical reasoning.
Adding a specialized reasoning stage helps En-CoT but hurts Target-CoT.
Extensive multilingual pretraining benefits both En-CoT and Target-CoT.
Fine-tuning on reasoning traces machine-translated from English is more effective than fine-tuning directly on target-language traces; a sketch of this curation step follows the list.
Reasoning efficiency and CoT failure modes differ across languages.
The scarcity of high-quality reasoning traces in non-English languages is a limitation.
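The translated-trace curation mentioned above can be sketched as follows. This is a rough illustration under stated assumptions: the translate callable stands in for whatever machine-translation system is used, and the record schema is invented for the example rather than taken from the released dataset.

```python
# Sketch of building post-training data by machine-translating gold English
# reasoning traces into a target language. The `translate` callable is a
# placeholder for an MT system; the record schema is an illustrative assumption.

from typing import Callable

def curate_translated_traces(
    english_examples: list[dict],          # each: {"question", "trace", "answer"}
    target_lang: str,
    translate: Callable[[str, str], str],  # (text, target_lang) -> translated text
) -> list[dict]:
    """Turn gold English (question, trace, answer) triples into
    target-language fine-tuning examples."""
    curated = []
    for ex in english_examples:
        curated.append(
            {
                "question": translate(ex["question"], target_lang),
                "trace": translate(ex["trace"], target_lang),
                "answer": translate(ex["answer"], target_lang),
                "source": "mt_from_gold_english",
            }
        )
    return curated

# Usage with a dummy identity "translator", just to show the shape of the data.
dummy_translate = lambda text, lang: text
examples = [{"question": "2+2=?", "trace": "Add 2 and 2 to get 4.", "answer": "4"}]
print(curate_translated_traces(examples, "sw", dummy_translate))
```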