The state of media-generation AI research at China's big tech companies
Haebom
After seeing Ryu Nae-won's Facebook post, I added explanations and personal opinions on each paper. ( Main text ) As Nae-won mentioned, China has been releasing language models as well, and media generation, processing, and recognition models have been pouring out recently. I am sharing this partly out of concern that domestic attention is too focused on the United States, but the biggest implication is that Chinese AI companies are producing meaningful research results rather than just appealing with documents: instead of their previously closed approach, they are releasing open source code, papers, and even demos.
Alibaba
Alibaba is showing the most active movement, led by its subsidiary Alibaba Cloud, Tongyi Wanxiang, and ModelScopeGPT. Below are Alibaba's major research results. What stands out is that Alibaba is building image-processing models versatile enough to be applied in almost any direction.
M3DDM: Hierarchical Masked 3D Diffusion Model for Video Outpainting (23.09)
This paper addresses video outpainting, i.e., filling in the missing edges of video frames. It maintains temporal consistency with a masked 3D diffusion model, uses other frames as guides to connect the results, and mitigates artifact accumulation through a hybrid coarse-to-fine inference pipeline.
FaceChain: A Playground for Identity-Preserving Portrait Generation (23.08)
FaceChain is a framework focused on generating identity-preserving, personalized portraits. It combines several face-related models, such as face recognition, deep face embedding extraction, and face attribute recognition, to produce accurate and personalized portraits. Applying up-to-date face models to known failure cases also enables more efficient label tagging, data processing, and model post-processing.
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (23.08)
Qwen-VL is a large vision-language model that can process both text and images. It performs well on a variety of tasks, including image captioning, question answering, and visual grounding, and outperforms existing LVLMs across the benchmarks it was evaluated on.
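If you want to try it yourself, Qwen-VL-Chat is published on Hugging Face. Below is a minimal sketch following the usage pattern on the model card; the image URL and question are placeholders, and the interface comes from the model's own custom code (trust_remote_code), so details may change.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-VL-Chat ships its own modeling/tokenizer code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Interleave an image and a question; from_list_format builds the multimodal prompt.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpg"},  # placeholder image URL
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```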
Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation (23.07)
This paper introduces Points-to-3D, a novel framework for generating 3D models from text. The framework generates more accurate and shape-controlled 3D models from text by leveraging knowledge of both 2D and 3D diffusion models. It effectively adjusts the geometry of NeRF using an efficient point cloud guidance loss.
VideoComposer: Compositional Video Synthesis with Motion Controllability (23.06)
💡
Based on this paper, I2VGen-XL (23.08) was released. ( Article )
VideoComposer is a flexible framework for synthesizing videos under textual, spatial, and temporal conditions. It guides temporal dynamics by using motion vectors extracted from compressed video as explicit control signals, and it integrates the spatial and temporal relations of sequential inputs through a spatio-temporal condition encoder (STC-encoder), achieving higher frame-to-frame consistency.
Tencent
In Korea, Tencent is better known for games, but in China it is a platform company with an overwhelming presence. It feels like a mix of Kakao and Naver that also happens to be good at game development... In any case, Tencent has been branching out in many directions recently, and it is clearly pouring out technology that can be used in content production and social media.
VideoCrafter: A Toolkit for Text-to-Video Generation and Editing (23.04)
VideoCrafter is an open-source toolkit for creating and editing videos from text. It can generate and edit videos in various styles and under various conditions, and it includes a base Text-to-Video (T2V) model, VideoLoRA models, and VideoControl models.
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation (23.09)
StyleAdapter presents a LoRA-free method (LoRA: Low-Rank Adaptation) that takes a text prompt and a style reference image as input and produces a stylized image in a single pass.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (23.08)
IP-Adapter is an adapter that effectively applies image prompts to text-to-image diffusion models. It requires no complex prompt engineering, achieves strong performance with only about 22M parameters, and is compatible with existing models as well as custom models fine-tuned from them.
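To give a rough idea of how lightweight it is to plug in, here is a sketch using the IP-Adapter integration in the diffusers library rather than the authors' original codebase; the checkpoint repo, weight file name, and image path are common defaults and placeholders, so treat them as assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the ~22M-parameter IP-Adapter weights to the existing text-to-image pipeline.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

reference = load_image("reference.png")  # placeholder: the image used as a prompt
result = pipe(
    prompt="best quality, a portrait photo",
    ip_adapter_image=reference,   # image prompt, used alongside the text prompt
    num_inference_steps=30,
).images[0]
result.save("ip_adapter_result.png")
```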
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (23.07)
This paper proposes a framework for generating coherent storytelling videos by drawing on existing video clips. It works through two main modules, motion structure retrieval and structure-guided text-to-video synthesis, and desired character identities can be specified through text prompts.
GenMM: Example-based Motion Synthesis via Generative Motion Matching (23.06)
GenMM is a generative model that extracts various motions from a single or a small number of example sequences. This model inherits the excellent quality of Motion Matching methods that do not require training, and can quickly generate high-quality motions even on complex and large skeletal structures.
T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations (23.01)
T2M-GPT is a model that generates human motion from text descriptions. It combines a VQ-VAE with a GPT to obtain high-quality discrete representations and outperforms recent diffusion-based approaches. In particular, it performs strongly on the HumanML3D dataset, and the results suggest that dataset size, rather than the model itself, is the main limitation of this approach.
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models (22.12)
Dream3D proposes a new method for generating 3D models from text. It achieves better visual quality and shape accuracy by using an explicit 3D shape prior together with a text-to-image diffusion model, and it outperforms existing CLIP-guided 3D optimization methods.
ByteDance (TikTok)
ByteDance, the company behind TikTok, has what I personally consider the best app for shooting and editing short-form video. If you want to try it, press the + button in TikTok or use CapCut. ByteDance is showcasing technology for shooting and editing video in more attractive and varied formats.
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation (23.09)
MagicProp is a new framework that preserves motion while modifying the visual appearance of a video. The framework is divided into two stages, the first stage applies image editing techniques, and the second stage uses the PropDPM model to generate the remaining frames. This process ensures temporal consistency of the resulting video.
MVDream: Multi-view Diffusion for 3D Generation (23.08)
This paper introduces a multi-view diffusion model called MVDream. This model can generate geometrically consistent multi-view images from given text prompts. The most notable feature of this model is that it achieves both the generality of 2D diffusion and the consistency of 3D data.
MagicAvatar: Multimodal Avatar Generation and Animation (23.08)
MagicAvatar is a framework for generating and animating avatars from multimodal inputs. It can produce avatars with various styles and expressions and animate them according to the given input signals.
AudioLDM 2: A General Framework for Audio, Music, and Speech Generation (23.08)
AudioLDM 2 is a unified framework for audio, music, and speech generation. It is based on the "Language of Audio" (LOA) and utilizes GPT-2 and latent diffusion models to perform various audio generation tasks.
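AudioLDM 2 also has a pipeline in the diffusers library, which is probably the easiest way to try it. A minimal sketch follows; the checkpoint ID is the commonly used public release, and the prompt, step count, and clip length are arbitrary choices.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
).to("cuda")

prompt = "A gentle piano melody with rain falling softly in the background"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# The pipeline returns a mono waveform sampled at 16 kHz.
scipy.io.wavfile.write("audioldm2_sample.wav", rate=16000, data=audio)
```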
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° (23.03)
PanoHead is a model that generates high-quality 3D heads viewable from all angles in full 360°. It introduces a two-stage self-adaptive image alignment and a tri-grid neural volume representation to produce more accurate and diverse 3D heads.