FinNLI is a benchmark dataset for financial natural language inference (NLI) built from diverse financial texts such as SEC filings, annual reports, and earnings releases. It consists of 21,304 premise-hypothesis pairs and includes a high-quality test set of 3,304 instances annotated by experts. Its dataset framework is designed to yield diverse pairs while minimizing spurious correlations. Evaluation results show that the performance of general-domain NLI models degrades significantly under domain shift. The best Macro F1 scores of pre-trained language models (PLMs) and large language models (LLMs) are 74.57% and 78.62%, respectively, indicating that the dataset is challenging. Interestingly, instruction-tuned financial LLMs underperform, suggesting limited generalization ability. FinNLI exposes weaknesses in the financial inference capabilities of current LLMs and shows that there is considerable room for improvement.
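
For context, the following is a minimal sketch of how three-way NLI predictions (entailment, neutral, contradiction) can be scored with Macro F1, the metric reported above; the premise-hypothesis labels and predictions shown are invented for illustration and are not drawn from FinNLI.

```python
# Illustrative sketch: scoring three-way NLI predictions with Macro F1,
# the metric reported for FinNLI. The gold labels and predictions below
# are hypothetical examples, not FinNLI data.
from sklearn.metrics import f1_score

LABELS = ["entailment", "neutral", "contradiction"]

# Hypothetical gold labels and model predictions for a handful of pairs.
gold = ["entailment", "contradiction", "neutral", "entailment", "neutral"]
pred = ["entailment", "neutral", "neutral", "contradiction", "neutral"]

# Macro F1 averages the per-class F1 scores, so each class contributes
# equally regardless of how often it occurs in the test set.
macro_f1 = f1_score(gold, pred, labels=LABELS, average="macro")
print(f"Macro F1: {macro_f1:.2%}")
```

Macro averaging matters here because label distributions in domain-specific test sets are often imbalanced, and it prevents a frequent class from dominating the reported score.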