
Current state of media-generating AI research among China's big tech companies

Haebom
After seeing the Facebook post shared by Ryu Naewon, I added some additional explanations and my personal opinions on each of the papers. (Original post) As Naewon noted, it is not only China's language models: a flood of media generation, processing, and recognition models has been appearing recently as well. I am sharing this partly out of concern that attention in Korea is overly tilted toward the US. What stands out most is that Chinese AI companies are no longer taking a closed approach; they are publishing open-source code, papers, and even demos, and demonstrating meaningful research results not just on paper but through actual public releases.

Alibaba

Alibaba, led by its subsidiary Alibaba Cloud, has announced Tongyi Wanxiang and ModelScopeGPT, making it one of the most active players in this field. Below are highlights of Alibaba's major research results. Alibaba stands out in particular for developing image-processing models with broad and versatile applications.
M3DDM: Hierarchical Masked 3D Diffusion Model for Video Outpainting (23.09)
This paper tackles video outpainting, that is, completing the missing regions beyond the existing edges of video frames. It maintains temporal consistency using a masked 3D diffusion model, connects results across clips by conditioning on guide frames, and alleviates artifact accumulation through a hybrid segmentation-refinement inference pipeline.
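To make the setup concrete, here is a small, hypothetical helper (not from the M3DDM code) showing the kind of inputs a masked video-outpainting model conditions on: the known frames padded out to the target size, plus a 3D mask marking which pixels must be generated.

```python
import torch

def make_outpaint_inputs(video, pad_left, pad_right, pad_top, pad_bottom):
    """Pad a clip to the target size and build the matching mask.

    video: (T, C, H, W) tensor holding the known frames.
    Returns the padded clip (unknown regions zeroed) and a mask that is
    1 where content has to be generated and 0 where it is already known.
    """
    T, C, H, W = video.shape
    out_h, out_w = H + pad_top + pad_bottom, W + pad_left + pad_right

    padded = torch.zeros(T, C, out_h, out_w)
    padded[:, :, pad_top:pad_top + H, pad_left:pad_left + W] = video

    mask = torch.ones(T, 1, out_h, out_w)
    mask[:, :, pad_top:pad_top + H, pad_left:pad_left + W] = 0.0
    return padded, mask

# Example: widen a 16-frame 256x256 clip to 256x384 (outpaint 64 px on each side).
clip = torch.rand(16, 3, 256, 256)
padded, mask = make_outpaint_inputs(clip, 64, 64, 0, 0)
print(padded.shape, mask.shape)  # (16, 3, 256, 384) and (16, 1, 256, 384)
```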
FaceChain: A Playground for Identity-Preserving Portrait Generation (23.08)
FaceChain is a framework focused on personalized portrait generation. It combines several face-related models, covering face recognition, deep face embedding extraction, and facial attribute analysis, to generate accurate, personalized portraits. By applying state-of-the-art face models, it resolves issues seen in earlier approaches and makes label tagging, data processing, and model post-processing more efficient.
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (23.08)
Qwen-VL is a large-scale vision-language model that handles both text and images. It delivers strong performance across a wide range of tasks, such as image captioning, question answering, and visual grounding, and outperforms previous large vision-language models in evaluations on these tasks.
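For reference, the released Qwen-VL-Chat checkpoint can be tried through Hugging Face transformers. The sketch below follows the usage pattern published in the Qwen-VL repository (`from_list_format` and `model.chat` come from the model's own remote code, which is why `trust_remote_code=True` is needed); treat the image path and prompt as placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-VL ships its own modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Mix an image and a text question in one query (local path or URL both work).
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Describe this picture and point out the main object."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```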
Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation (23.07)
This paper introduces Points-to-3D, a framework for generating 3D models from text. By leveraging both 2D and 3D diffusion models, it produces 3D content from text with greater accuracy and controllable shape, using an efficient point-cloud guidance loss to shape the geometry of the NeRF.
VideoComposer: Compositional Video Synthesis with Motion Controllability (23.06)
💡
I2VGen-XL (23.08) was released based on this paper. (Article)
VideoComposer is a framework that flexibly composes videos by considering text, spatial, and temporal conditions all together. The framework uses motion vectors extracted from compressed video as explicit control signals to guide temporal dynamics, and through an STC-encoder (space-time conditional encoder), it effectively integrates spatial and temporal relationships of sequential inputs for higher consistency between frames.

Tencent

While Tencent is best known in Korea for games, in China it holds overwhelming platform dominance, something like a blend of Kakao and Naver that also happens to excel at game development. In any case, Tencent has been expanding aggressively of late, and it is clearly releasing a lot of technology with applications in content production and social media.
VideoCrafter: A Toolkit for Text-to-Video Generation and Editing (23.04)
VideoCrafter is an open-source toolbox for generating and editing videos from text prompts. It includes a base Text-to-Video (T2V) model, a VideoLoRA model, and a VideoControl model, so videos can be created and edited in a variety of styles and under different conditions.
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation (23.09)
StyleAdapter proposes a method for stylized image generation that takes a text prompt and a style reference image as inputs and produces a stylized image in a single pass, without relying on LoRA (Low-Rank Adaptation) fine-tuning.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (23.08)
IP-Adapter is an adapter that enables effective use of image prompts in text-to-image diffusion models. It doesn't require complex prompt engineering and is compatible with many existing models. With just 22M parameters, it delivers impressive performance and can be used with various custom models, too.
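For context, the released IP-Adapter weights can be attached to a Stable Diffusion pipeline. The minimal sketch below assumes a recent diffusers version with IP-Adapter support and the publicly released `h94/IP-Adapter` weights; the file names and prompt are placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the lightweight IP-Adapter weights to the base model.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers the result

reference = load_image("reference.png")  # the image prompt
image = pipe(
    prompt="a portrait in a snowy forest",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("output.png")
```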
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (23.07)
This paper proposes a new framework for creating coherent story-driven videos using existing video clips. It achieves this through two main components: motion structure retrieval and structure-guided text-to-video synthesis, letting you specify the desired character identities directly from text prompts.
GenMM: Example-based Motion Synthesis via Generative Motion Matching (23.06)
GenMM is a generative model that extracts diverse motion patterns from one or a few example sequences. It builds on the high quality of Motion Matching methods that require no separate training, and can quickly produce high-quality motions even for large and complex skeletal structures.
T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations (23.01)
T2M-GPT is a model that generates human motion from textual descriptions. By combining a VQ-VAE with a GPT-style model, it learns high-quality discrete motion representations and outperforms recent diffusion-based approaches. It performs especially well on the HumanML3D dataset, and the results suggest that dataset size is a critical factor limiting this line of work.
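As a heavily simplified, hypothetical sketch of that two-part design (not the authors' code): a VQ-VAE maps a motion sequence to discrete codebook indices, and a GPT-style autoregressive model then predicts those indices from a text embedding.

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Toy VQ-VAE: encode per-frame pose features, snap them to a learned codebook."""
    def __init__(self, pose_dim=263, hidden=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Linear(pose_dim, hidden)
        self.decoder = nn.Linear(hidden, pose_dim)
        self.codebook = nn.Embedding(codebook_size, hidden)

    def quantize(self, motion):                       # motion: (T, pose_dim)
        z = self.encoder(motion)                      # (T, hidden)
        dists = torch.cdist(z, self.codebook.weight)  # distance to every code
        return dists.argmin(dim=-1)                   # (T,) discrete motion tokens

    def decode(self, tokens):                         # tokens: (T,)
        return self.decoder(self.codebook(tokens))    # back to (T, pose_dim) poses

class TextToMotionGPT(nn.Module):
    """Toy autoregressive prior: text embedding -> next motion token."""
    def __init__(self, codebook_size=512, text_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.token_emb = nn.Embedding(codebook_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, codebook_size)

    def forward(self, text_emb, prev_tokens):
        # Prepend the projected text embedding, then predict the next token.
        x = torch.cat([self.text_proj(text_emb)[:, None, :],
                       self.token_emb(prev_tokens)], dim=1)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.head(h[:, -1])                    # logits for the next motion token

# At inference time: embed the text (e.g., with a CLIP text encoder), sample motion
# tokens one by one from TextToMotionGPT, then decode them with MotionVQVAE.decode.
```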
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models (2022.12)
Dream3D presents a novel approach for generating 3D models from text. By using an explicit 3D shape prior and a text-to-image diffusion model, it achieves better visual quality and shape accuracy. This approach outperforms previous CLIP-guided 3D optimization methods.

ByteDance (TikTok)

ByteDance, the creator of TikTok, makes what is in my opinion the best short-form video shooting and editing app; if you want to try it, tap the + button in TikTok or use CapCut. ByteDance is introducing technologies that let you shoot and edit videos in more compelling and diverse formats.
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation (23.09)
MagicProp is a framework for editing videos that preserves motion while changing a video's visual appearance. It works in two stages: the first edits the appearance of a selected frame with image-editing techniques, and the second propagates that edited appearance to the remaining frames with a diffusion model called PropDPM, ensuring the final video stays temporally consistent.
MVDream: Multi-view Diffusion for 3D Generation (23.08)
This paper introduces a multi-view diffusion model called MVDream. It can generate geometrically consistent multi-view images from a given text prompt. A key feature of this model is that it achieves both the broad generality of 2D diffusion and the consistency of 3D data.
MagicAvatar: Multimodal Avatar Generation and Animation (23.08)
MagicAvatar is a framework that generates and animates avatars from multimodal inputs. It converts inputs such as text, video, or audio into motion signals and uses them to generate and animate avatars in a range of styles and expressions.
AudioLDM 2: A General Framework for Audio, Music, and Speech Generation (23.08)
AudioLDM 2 is a unified framework for generating audio, music, and speech. Based on the "Language of Audio" (LOA), it uses GPT-2 and latent diffusion models for a variety of audio generation tasks.
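AudioLDM 2 also has a diffusers integration; the sketch below assumes the `AudioLDM2Pipeline` class and the public `cvssp/audioldm2` checkpoint from that integration, with the prompt and output file name as placeholders.

```python
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
).to("cuda")

# One text-to-audio pipeline covers sound effects, music, and speech-like audio.
result = pipe(
    "gentle rain on a tin roof with distant thunder",
    num_inference_steps=200,
    audio_length_in_s=10.0,
)
audio = result.audios[0]

# The pipeline outputs 16 kHz mono audio as a NumPy array.
scipy.io.wavfile.write("rain.wav", rate=16000, data=audio)
```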
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° (23.03)
PanoHead is a model that generates high-quality 3D head images from a full 360-degree range of views. It introduces a new, two-step self-adaptive image alignment process and a tri-grid neural volume representation, allowing it to create more accurate and diverse 3D heads.