Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CTA: Cross-Task Alignment for Better Test Time Training

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Domain Generalizable Portrait Style Transfer

StreamDiT: Real-Time Streaming Text-to-Video Generation

From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Neural-Network solver of ideal MHD equilibria

RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Hita: Holistic Tokenizer for Autoregressive Image Generation

Empirical Analysis Of Heuristic and Approximation Algorithms for the Mutual-Visibility Problem

Horus: A Protocol for Trustless Delegation Under Uncertainty

Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding

SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

WATS: Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling

IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Enhancing Generalization of Spiking Neural Networks Through Temporal Regularization

Instruction Following by Boosting Attention of Large Language Models

Evaluating Logit-Based GOP Scores for Mispronunciation Detection

LLMs on support of privacy and security of mobile apps: state of the art and research directions

On the Fundamental Impossibility of Hallucination Control in Large Language Models

Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting

CuVSLAM: CUDA accelerated visual odometry and mapping

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge

An empirical study of task and feature correlations in the reuse of pre-trained models

EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Towards General Continuous Memory for Vision-Language Models

Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)

Bayesian Hierarchical Invariant Prediction

Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

Overcoming Data Scarcity in Generative Language Modeling for Low-Resource Languages: A Systematic Review

The GenAI Generation: Student Views of Awareness, Preparedness, and Concern

Variational OOD State Correction for Offline Reinforcement Learning

Heat Diffusion Models -- Interpixel Attention Mechanism

NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Offline Learning and Forgetting for Reasoning with Large Language Models

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

PVChat: Personalized Video Chat with One-Shot Learning

Challenges and Trends in Egocentric Vision: A Survey

Eyes on the Environment: AI-Driven Analysis for Fire and Smoke Classification, Segmentation, and Detection

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

A Survey on Transformer Context Extension: Approaches and Evaluation

Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government

A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records

GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

RSPO: Regularized Self-Play Alignment of Large Language Models

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

Efficient Risk-sensitive Planning via Entropic Risk Measures

Bayesian Optimization for Controlled Image Editing via LLMs

Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment

Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions

A Theory for Conditional Generative Modeling on Multiple Data Sources

Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport

Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics

DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution

Aria-UI: Visual Grounding for GUI Instructions

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Random Walks with Tweedie: A Unified View of Score-Based Diffusion Models

Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning

Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model

An AI Theory of Mind Will Enhance Our Collective Intelligence

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Longitudinal Ensemble Integration for sequential classification with multimodal data

Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

The Nexus of AR/VR, AI, UI/UX, and Robotics Technologies in Enhancing Learning and Social Interaction for Children with Autism Spectrum Disorders: A Systematic Review

What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI

Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort

Empirical evidence of Large Language Model's influence on human spoken communication

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics

CoDy: Counterfactual Explainers for Dynamic Graphs

Optimal Transport for Domain Adaptation through Gaussian Mixture Models

Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs

Detecting value-expressive text posts in Russian social media

Deep neural networks have an inbuilt Occam's razor

TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

MedGemma Technical Report

Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift

Activation Steering for Chain-of-Thought Compression

StreamDiT: Real-Time Streaming Text-to-Video Generation

Created by

Haebom

Authors

Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao

Outline

This paper notes that, despite recent advances in text-to-video generation with large-scale transformer-based diffusion models, existing models generate only short clips and are poorly suited to real-time interactive applications. To address this, the authors propose StreamDiT, a real-time streaming video generation model. StreamDiT is trained with flow matching over a moving buffer of frames, and mixed training (blending several schemes for partitioning the buffered frames) improves content consistency and visual quality. The model is based on adaLN DiT with varying time embedding and window attention, and the authors train a StreamDiT model with 4 billion parameters. They also propose a multistep distillation method tailored to StreamDiT: sampling distillation is performed in each segment of the chosen partitioning scheme, which reduces the number of function evaluations enough to reach real-time performance (16 FPS at 512p resolution). Performance is verified through quantitative metrics and human evaluation, and the authors point to real-time applications such as streaming generation, interactive generation, and video-to-video conversion.
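The moving-buffer idea in the outline can be illustrated with a toy sketch. This is not the authors' implementation: the function name `stream_frames`, the `steps_left` bookkeeping, and the one-frame-per-step partitioning are all assumptions made for illustration, and the model call itself is replaced by a counter. The sketch also assumes the buffer starts already staggered, with the warm-up phase left out. It only shows why a buffer whose frames sit at staggered noise levels can emit one clean frame per model call while a new pure-noise frame enters at the back.

```python
from collections import deque


def stream_frames(num_frames, buffer_size):
    """Yield (frame_id, model_calls_so_far) as frames leave a moving buffer.

    Toy bookkeeping only: each frame is tracked by the number of denoising
    steps it still needs, and no real model is called.
    """
    # Assumed staggered start: slot i still needs i + 1 denoising steps.
    buffer = deque(
        {"id": i, "steps_left": i + 1} for i in range(buffer_size)
    )
    next_id = buffer_size
    calls = 0
    emitted = 0
    while emitted < num_frames:
        # One (hypothetical) model call denoises every buffered frame by one step.
        calls += 1
        for frame in buffer:
            frame["steps_left"] -= 1
        # The front frame is now fully denoised and leaves the buffer.
        done = buffer.popleft()
        assert done["steps_left"] == 0
        yield done["id"], calls
        emitted += 1
        # A fresh pure-noise frame enters at the back of the buffer.
        buffer.append({"id": next_id, "steps_left": buffer_size})
        next_id += 1


if __name__ == "__main__":
    for frame_id, calls in stream_frames(num_frames=5, buffer_size=4):
        print(f"frame {frame_id} emitted after {calls} model call(s)")
```

In this toy, every frame still receives `buffer_size` denoising steps in total, but the stream advances by one frame per call. Coarser partitions of the buffer, the mixture of partitioning schemes, and the distillation step described in the outline are not modeled here.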

Takeaways, Limitations

• Takeaways:

  ◦ Proposes StreamDiT, a model that enables real-time streaming video generation.

  ◦ Runs a 4-billion-parameter model in real time at 16 FPS and 512p resolution.

  ◦ Improves content consistency and image quality through mixed training and multistep distillation.

  ◦ Opens up a range of real-time applications, including streaming generation, interactive generation, and video-to-video conversion.

• Limitations:

  ◦ Output is currently limited to 512p resolution; supporting higher resolutions needs further research.

  ◦ The generalizability of the proposed distillation method needs further study.

  ◦ The model's computational cost and memory consumption are not analyzed in detail.

  ◦ Robustness to diverse text inputs needs further study.

