Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images

PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark

MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian

OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models

Beyond Imaging: Vision Transformer Digital Twin Surrogates for 3D+T Biological Tissue Dynamics

TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

Test-time Corpus Feedback: From Retrieval to RAG

$\mathrm{TIME}[t]\subseteq \mathrm{SPACE}[O(\sqrt{t})]$ via Tree Height Compression

ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students’ Cognitive Abilities

Documenting Deployment with Fabric: A Repository of Real-World AI Governance

High-Throughput Low-Cost Segmentation of Brightfield Microscopy Live Cell Images

Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches

Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering

SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance

Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis

Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization

LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention

Fourier-Guided Attention Upsampling for Image Super-Resolution

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models

Neural Logic Networks for Interpretable Classification

DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models

Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

Leveraging GNN to Enhance MEF Method in Predicting ENSO

Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

SGD Convergence under Stepsize Shrinkage in Low-Precision Training

CLAP: Coreference-Linked Augmentation for Passage Retrieval

Geometry-Aware Spiking Graph Neural Network

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation

Enhancing material behavior discovery using embedding-oriented Physically-Guided Neural Networks with Internal Variables

Agentic large language models improve retrieval-based radiology question answering

PARROT: An Open Multilingual Radiology Reports Dataset

Trusted Knowledge Extraction for Operations and Maintenance Intelligence

MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

Content-based 3D Image Retrieval and a ColBERT-inspired Re-ranking for Tumor Flagging and Staging

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Combining Cost-Constrained Runtime Monitors for AI Safety

QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks

FlexOlmo: Open Language Models for Flexible Data Use

Multi-Level Fusion Graph Neural Network for Molecule Property Prediction

BiMark: Unbiased Multilayer Watermarking for Large Language Models

A foundation model with multi-variate parallel attention to generate neuronal activity

Effective Red-Teaming of Policy-Adherent Agents

From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis

LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models

AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments

EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models

Accountability Attribution: Tracing Model Behavior to Training Processes

Equivariant Spherical Transformer for Efficient Molecular Modeling

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate

Security Concerns for Large Language Models: A Survey

Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection

Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Learning with Spike Synchrony in Spiking Neural Networks

Explainable Prediction of the Mechanical Properties of Composites with CNNs

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

RT-Cache: Training-Free Retrieval for Real-Time Manipulation

DSADF: Thinking Fast and Slow for Decision Making

WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales

ICQuant: Index Coding enables Low-bit LLM Quantization

Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

Theory of Mind in Large Language Models: Assessment and Enhancement

SVD Based Least Squares for X-Ray Pneumonia Classification Using Deep Features

VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

CLaP -- State Detection from Time Series

Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation

Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios

ImF: Implicit Fingerprint for Large Language Models

HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models

MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot

BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities

LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Forgotten Polygons: Multimodal Large Language Models are Shape-Blind

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

Investigating the Robustness of Deductive Reasoning with Large Language Models

Field Matching: an Electrostatic Paradigm to Generate and Transfer Data

Evaluation of Large Language Models via Coupled Token Generation

Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities

LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Created by

Haebom

Author

Ruiyan Qi, Congding Wen, Weibo Zhou, Jiwei Li, Shangsong Liang, Lingbo Li

Outline

This paper proposes LETToT (Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought), a framework that evaluates large language models (LLMs) using expert-derived reasoning structures instead of labeled data. It targets two obstacles to evaluating LLMs in specialized domains such as tourism: the high cost of annotated benchmarks and persistent issues such as hallucinations. LETToT iteratively refines and validates hierarchical Tree-of-Thought (ToT) components using common quality dimensions and expert feedback. In experiments, the systematically optimized expert ToT achieves relative quality improvements of 4.99-14.15% over baselines. The authors then evaluate models ranging from 32B to 671B parameters and report two findings. First, scaling laws hold even in this specialized domain (DeepSeek-V3 leads), yet reasoning-enhanced smaller models such as DeepSeek-R1-Distill-Llama-70B close the gap. Second, among models with fewer than 72B parameters, those with an explicit reasoning architecture show superior accuracy and conciseness (p<0.05). The study establishes a scalable, label-free paradigm for domain-specific LLM evaluation and offers an alternative to existing annotated benchmarks.
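To make the idea of scoring against a hierarchical expert tree more concrete, here is a minimal Python sketch. It is not the paper's implementation: the `ToTNode` class, the weights, the keyword-based `mentions` scorer (a stand-in for whatever judge the authors use), and the `relative_gain` helper are all hypothetical names introduced for illustration. It assumes that each leaf criterion yields a score in [0, 1] and that parent nodes take a weighted average of their children.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToTNode:
    """One node of a hypothetical expert tree: a criterion with weighted children."""
    name: str
    weight: float = 1.0
    children: List["ToTNode"] = field(default_factory=list)
    # Leaf scorer (assumption): maps a response string to a score in [0, 1].
    scorer: Optional[Callable[[str], float]] = None

    def score(self, response: str) -> float:
        # Leaf: delegate to the scorer attached to this criterion.
        if not self.children:
            if self.scorer is None:
                raise ValueError(f"leaf {self.name!r} has no scorer")
            return self.scorer(response)
        # Inner node: weighted average of the child scores.
        total_weight = sum(child.weight for child in self.children)
        weighted = sum(child.weight * child.score(response) for child in self.children)
        return weighted / total_weight


def mentions(keyword: str) -> Callable[[str], float]:
    """Toy scorer: 1.0 if the keyword appears in the response, else 0.0."""
    return lambda response: 1.0 if keyword.lower() in response.lower() else 0.0


def relative_gain(optimized: float, baseline: float) -> float:
    """Relative quality improvement of `optimized` over `baseline`, in percent."""
    return 100.0 * (optimized - baseline) / baseline


# Hypothetical tourism tree; criteria and weights are illustrative only.
tree = ToTNode("itinerary quality", children=[
    ToTNode("logistics", weight=2.0, children=[
        ToTNode("transport", scorer=mentions("train")),
        ToTNode("opening hours", scorer=mentions("opens")),
    ]),
    ToTNode("budget", weight=1.0, scorer=mentions("cost")),
])

response = "Take the train to the museum; it opens at 9 am."
quality = tree.score(response)
gain = relative_gain(0.8, 0.7)
print(f"quality={quality:.3f}, relative gain={gain:.2f}%")
```

No labeled reference answer appears anywhere in the sketch: the response is judged only against the criteria encoded in the tree, which is the sense in which such an evaluation is label-free. The relative gain follows the usual definition, (optimized - baseline) / baseline, which is presumably how figures such as 4.99-14.15% are computed, though the summary does not state the formula.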

Takeaways, Limitations

• Takeaways:

◦ Presents LETToT, a novel label-free framework for evaluating LLMs in specific domains such as tourism.

◦ Reduces dependence on annotated data by leveraging expert-derived reasoning structures.

◦ Analyzes scaling laws and the effectiveness of reasoning architectures by comparing LLMs of various scales.

◦ Offers an alternative evaluation method that overcomes the limitations of existing benchmarks.

◦ Suggests that enhanced reasoning can improve the performance of smaller models.

• Limitations:

◦ LETToT's performance may depend on the quality of the expert-provided reasoning structure.

◦ Generalization may be limited because the results cover only one domain (tourism).

◦ Further research is needed to ensure the objectivity of the evaluation metrics and expert feedback.

◦ Scalability to other domains remains to be verified.
