/
/
Daily Arxiv
Daily Arxiv
世界中で発行される人工知能関連の論文をまとめるページです。
このページはGoogle Geminiを活用して要約し、非営利で運営しています。
論文の著作権は著者および関連機関にあり、共有する際は出典を明記してください。
CTA: Cross-Task Alignment for Better Test Time Training
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
Domain Generalizable Portrait Style Transfer
StreamDiT: Real-Time Streaming Text-to-Video Generation
From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Neural-Network solver of ideal MHD equilibria
RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria
Hita: Holistic Tokenizer for Autoregressive Image Generation
Empirical Analysis Of Heuristic and Approximation Algorithms for the The Mutual-Visibility Problem
Horus: A Protocol for Trustless Delegation Under Uncertainty
Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-hot Subsurface Understanding
SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures
WATS: Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling
IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
Enhancing Generalization of Spiking Neural Networks Through Temporal Regularization
Instruction Following by Boosting Attention of Large Language Models
Evaluating Logit-Based GOP Scores for Mispronunciation Detection
LLMs on support of privacy and security of mobile apps: state of the art and research directions
On the Fundamental Impossibility of Hallucination Control in Large Language Models
Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting
CuVSLAM: CUDA accelerated visual odometry and mapping
Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge
An empirical study of task and feature correlations in the reuse of pre-trained models
EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG
Hume: Introducing System-2 Thinking in Visual-Language-Action Model
Towards General Continuous Memory for Vision-Language Models
Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)
Bayesian Hierarchical Invariant Prediction
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling
Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review
The GenAI Generation: Student Views of Awareness, Preparedness, and Concern
Variational OOD State Correction for Offline Reinforcement Learning
Heat Diffusion Models - Interpixel Attention Mechanism
NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
Offline Learning and Forgetting for Reasoning with Large Language Models
Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models
PVChat: Personalized Video Chat with One-Shot Learning
Challenges and Trends in Egocentric Vision: A Survey
Eyes on the Environment: AI-Driven Analysis for Fire and Smoke Classification, Segmentation, and Detection
Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model
A Survey on Transformer Context Extension: Approaches and Evaluation
Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government
A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models
Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records
GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification
Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
RSPO: Regularized Self-Play Alignment of Large Language Models
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
Efficient Risk-sensitive Planning via Entropic Risk Measures
Bayesian Optimization for Controlled Image Editing via LLMs
Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation
Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment
Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions
A Theory for Conditional Generative Modeling on Multiple Data Sources
Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport
Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics
DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution
Aria-UI: Visual Grounding for GUI Instructions
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG
Random Walks with Tweedie: A Unified View of Score-Based Diffusion Models
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning
Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model
An AI Theory of Mind Will Enhance Our Collective Intelligence
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Longitudinal Ensemble Integration for sequential classification with multimodal data
Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
The Nexus of AR/VR, AI, UI/UX, and Robotics Technologies in Enhancing Learning and Social Interaction for Children with Autism Spectrum Disorders: A Systematic Review
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI
Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort
Empirical evidence of Large Language Model's influence on human spoken communication
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics
CoDy: Counterfactual Explainers for Dynamic Graphs
Optimal Transport for Domain Adaptation through Gaussian Mixture Models
Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs
Detecting value-expressive text posts in Russian social media
Deep neural networks have an inbuilt Occam's razor
TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
MedGemma Technical Report
Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift
Activation Steering for Chain-of-Thought Compression
Load more
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
Created by
Haebom
作者
Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
概要
本論文は、画像生成分野で発展する拡散モデルの制御可能性向上を目指し、様々な条件(テキストプロンプト、空間地図、基準画像など)を効果的に組み合わせて一貫した画像生成を可能にするUniCombineフレームワークを提示します。 UniCombine は Diffusion with Image Transformation (DiT) ベースであり、新しい Conditional MMDiT Attention メカニズムと学習可能な LoRA (Low-Rank Adaptation) モジュールを導入することで、学習なしでも学習でも動作できます。また、さまざまな条件を含む新しいデータセットSubjectSpatial200Kを作成して実験を進め、最先端のパフォーマンスを達成することを示しています。
Takeaways、Limitations
•
Takeaways:
◦
さまざまな条件(テキスト、空間情報、画像など)を効果的に組み合わせて画像生成を制御できる新しいフレームワークUniCombineを提示
◦
学習なしで動作可能なUniCombineのLoRAベースの実装で効率を向上
◦
多条件画像生成用の新しいデータセットSubjectSpatial200K公開
◦
従来法より優れた性能を示す実験結果の提示
•
Limitations:
◦
SubjectSpatial200Kデータセットの規模を今後さらに拡大する必要があります。
◦
さまざまな条件の組み合わせで発生する可能性がある一貫性の問題に関する追加の研究が必要です。
◦
提案されたフレームワークの他の拡散モデルまたは異なるタイプの条件入力に対する一般化性能に関するさらなる研究が必要である。
PDFを見る
Made with Slashpage