Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Created by
  • Haebom

Author

Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin

Outline

This paper proposes DeepScholar-bench, a novel benchmark for evaluating generative research synthesis systems. Existing question-answering benchmarks focus on short, factual responses, and their expert-curated datasets are often outdated or prone to data contamination, failing to capture the complexity and evolving nature of real-world research synthesis. DeepScholar-bench instead targets a live, realistic task: extracting queries from recent, high-quality arXiv articles and generating the corresponding related-work sections, which requires retrieving, synthesizing, and citing relevant research. The evaluation framework assesses three key aspects: knowledge synthesis, retrieval quality, and verifiability. The authors also develop DeepScholar-base, an efficient reference pipeline implemented with the LOTUS API, and use the DeepScholar-bench framework to systematically evaluate existing open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. DeepScholar-base establishes a robust baseline, achieving performance competitive with or better than the other systems. Nevertheless, no system exceeds 19% on any metric, showing that DeepScholar-bench is far from saturated.
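The retrieve → synthesize → cite loop and the verifiability check described above can be illustrated with a toy sketch. This is not the paper's implementation or the LOTUS API; all function names, the lexical-overlap retriever, and the citation-coverage metric are simplified stand-ins for the real pipeline and metrics.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str

def retrieve(query: str, corpus: list[Doc], k: int = 2) -> list[Doc]:
    # Toy lexical retrieval: rank documents by word overlap with the query.
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.text.lower().split())))
    return scored[:k]

def synthesize(docs: list[Doc]) -> str:
    # Toy synthesis: emit one cited sentence per retrieved document.
    return " ".join(f"{d.text} [{d.doc_id}]." for d in docs)

def verifiability(report: str, retrieved_ids: list[str]) -> float:
    # Toy verifiability score: fraction of sentences citing a retrieved doc.
    sents = [s for s in report.split(".") if s.strip()]
    cited = sum(any(f"[{i}]" in s for i in retrieved_ids) for s in sents)
    return cited / len(sents) if sents else 0.0
```

Usage sketch: run `retrieve` over a small corpus, feed the hits to `synthesize`, then score the cited report with `verifiability`; a report whose every sentence carries a citation scores 1.0.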

Takeaways, Limitations

Takeaways:
  • Presents DeepScholar-bench, a new live benchmark for evaluating generative research synthesis systems.
  • Its design reflects real research tasks, enabling realistic evaluation.
  • Provides DeepScholar-base, a strong reference pipeline.
  • Establishes important criteria for progress in generative research synthesis.
  • Open-source code release makes the benchmark easy to extend and reproduce.
Limitations:
  • Scores remain low (no system exceeds 19% on any metric), leaving significant room for improvement.
  • The dataset is limited to arXiv papers, so generalizability to other domains needs further study.
  • Although the evaluation metrics are comprehensive, other aspects of synthesis quality may still need additional evaluation.
  • Dependence on the LOTUS API may limit accessibility.