Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

Created by
  • Haebom

Authors

Xiaolong Wang, Zhi-Qi Cheng, Jue Wang, Xiaojiang Peng

Outline

This paper proposes a novel multimodal fashion image editing architecture, the Detail-Preserving Diffusion Model (DPDEdit). Built on a latent diffusion model, DPDEdit integrates text prompts, region masks, human pose images, and clothing texture images to guide fashion image generation. Grounded-SAM predicts the editing region from the user's textual description, and local editing is performed by combining that region with the other conditions. To transfer the details of a given clothing texture to the target fashion image, the authors propose a texture injection and enhancement mechanism that preserves the high-frequency details of the generated clothing texture using a separate cross-attention layer and an auxiliary U-Net. Furthermore, they extend the VITON-HD dataset with a multimodal large language model to generate paired samples of texture images and textual descriptions. Experimental results demonstrate that DPDEdit outperforms state-of-the-art methods in image fidelity and in consistency with the given multimodal input.
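To make the texture-injection idea concrete, the sketch below shows one plausible way a separate cross-attention layer could fuse texture features (e.g., from an auxiliary U-Net) into the main U-Net's hidden states. This is a minimal illustration, not the paper's implementation; the module name, feature dimensions, and residual wiring are all assumptions for the sketch.

```python
import torch
import torch.nn as nn


class TextureCrossAttention(nn.Module):
    """Hypothetical extra cross-attention layer: U-Net hidden states attend
    over texture features produced by an auxiliary encoder/U-Net."""

    def __init__(self, dim: int, tex_dim: int, heads: int = 4):
        super().__init__()
        # kdim/vdim let keys and values come from the texture feature space
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=tex_dim, vdim=tex_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, tex_feats: torch.Tensor) -> torch.Tensor:
        # hidden:    (B, N, dim)     flattened U-Net spatial tokens
        # tex_feats: (B, M, tex_dim) texture tokens from the auxiliary branch
        out, _ = self.attn(self.norm(hidden), tex_feats, tex_feats)
        return hidden + out  # residual add keeps the base features intact


# Usage sketch: inject 32-dim texture tokens into 64-dim U-Net tokens
layer = TextureCrossAttention(dim=64, tex_dim=32)
hidden = torch.randn(2, 10, 64)
textures = torch.randn(2, 7, 32)
fused = layer(hidden, textures)  # same shape as `hidden`: (2, 10, 64)
```

The residual connection is a common design choice here: it lets the texture branch refine, rather than overwrite, the base denoising features.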

Takeaways, Limitations

Takeaways:
Effectively leverages multimodal inputs (text, masks, poses, textures) for accurate, detail-preserving fashion image editing.
Addresses edit-region identification via Grounded-SAM and texture-detail preservation via the texture injection and enhancement mechanism.
The extended VITON-HD dataset enables model training on richer paired data.
Achieves state-of-the-art performance in image fidelity and multimodal consistency.
Limitations:
No analysis of the computational cost or inference time of the proposed method.
Generalization across diverse fashion styles and clothing types needs further evaluation.
No actual user-interface implementation or usability evaluation.
Because the method depends on Grounded-SAM, that model's limitations may propagate to DPDEdit.