This paper addresses the lack of scalable and reliable evaluation data for improving the planning and reasoning capabilities of large language models (LLMs). To this end, we choose automated workflow generation as a suitable domain and present NL2Flow, a fully automated system for generating planning problems expressed in natural language, a structured intermediate representation, and formal PDDL. Using NL2Flow, we generate a dataset of 2,296 low-difficulty problems and evaluate several open-source, instruction-tuned LLMs without task-specific optimization or architectural modification. The best-performing model achieves a success rate of 86% in generating valid plans and 69% in generating optimal plans for problems that admit feasible solutions. Regression analysis shows that the influence of problem characteristics on performance varies with the model and the prompt design. Furthermore, we investigate the potential of LLMs as natural-language-to-JSON translators for workflow definitions and evaluate their translation performance on natural language workflow descriptions, with the aim of enabling integration with downstream symbolic computation tools and symbolic planners. Translating natural language into a JSON representation of the workflow problem yielded lower success rates than generating a plan directly, suggesting that unnecessary decomposition of the reasoning task can degrade performance and highlighting the advantage of models capable of reasoning directly from natural language to actions. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.