Boyuan Jiang (姜博源)

I am a senior researcher at Tencent Youtu Lab, where I work on computer vision and machine learning. I am currently developing high-fidelity virtual try-on models for Tencent Cloud.

I received my B.A. from the Harbin Institute of Technology (HIT) in 2017 and my M.A. from Zhejiang University (ZJU) in 2020. I worked as a research intern at NetEase, SenseTime, and Hikvision before joining Tencent Youtu Lab in 2020.

news

11/2024	We released FitDiT, a high-fidelity virtual try-on method based on SD3.
10/2024	We released FluxFit, a virtual try-on method based on FLUX.1-dev.
02/2024	One paper about fast identity-preserved personalization accepted by CVPR’24.
12/2023	One paper about video action recognition accepted by AAAI’24.
09/2023	One paper about video frame interpolation accepted by IEEE Transactions on Image Processing.
07/2022	One paper about image colorization accepted by ECCV’22.
03/2022	One paper about video frame interpolation accepted by CVPR’22.
03/2021	Our team, Imagination, won the CVPR NTIRE 2021 Challenge on Video Spatial-Temporal Super-Resolution.
12/2020	One paper about action recognition accepted by AAAI’21.
04/2020	I joined Tencent Youtu Lab.
03/2020	I graduated from Zhejiang University.
02/2020	One paper about domain adaptation accepted by CVPR’20.
07/2019	One paper about action recognition accepted by ICCV’19.
11/2018	One paper about unsupervised domain adaptation accepted by AAAI’19.

selected publications

ColorFormer: Image Colorization via Color Memory assisted Hybrid-attention Transformer

Ji, Xiaozhong*, Jiang, Boyuan*, Luo, Donghao, Tao, Guangpin, Chu, Wenqing, Xie, Zhifeng, Wang, Chengjie, and Tai, Ying

European Conference on Computer Vision (ECCV) 2022


Automatic image colorization is a challenging task that attracts a lot of research interest. Previous methods employing deep neural networks have produced impressive results. However, the colorized images are still unsatisfactory and far from practical applications. The reason is that semantic consistency and color richness are two key elements ignored by existing methods. In this work, we propose an automatic image colorization method via color memory assisted hybrid-attention transformer, namely ColorFormer. Our network consists of a transformer-based encoder and a color memory decoder. The core module of the encoder is our proposed global-local hybrid attention operation, which improves the ability to capture global receptive field dependencies. With the strong power to model contextual semantic information of grayscale images in different scenes, our network can produce semantic-consistent colorization results. In the decoder, we design a color memory module which stores various semantic-color mappings for image-adaptive queries. The queried color priors are used as references to help the decoder produce more vivid and diverse results. Experimental results show that our method can generate more realistic and semantically matched color images compared with state-of-the-art methods. Moreover, owing to the proposed end-to-end architecture, the inference speed reaches 40 FPS on a V100 GPU, which meets the real-time requirement.

IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation

Kong, Lingtong*, Jiang, Boyuan*, Luo, Donghao, Chu, Wenqing, Huang, Feiyue, Tai, Ying, Wang, Chengjie, and Yang, Jie

Computer Vision and Pattern Recognition (CVPR) 2022


Prevailing video frame interpolation algorithms, which generate the intermediate frames from consecutive inputs, typically rely on complex model architectures with heavy parameters or large delay, hindering them from diverse real-time applications. In this work, we devise an efficient encoder-decoder based network, termed IFRNet, for fast intermediate frame synthesis. It first extracts pyramid features from given inputs, and then refines the bilateral intermediate flow fields together with a powerful intermediate feature until generating the desired output. The gradually refined intermediate feature can not only facilitate intermediate flow estimation, but also compensate for contextual details, so IFRNet does not need an additional synthesis or refinement module. To fully release its potential, we further propose a novel task-oriented optical flow distillation loss to focus on learning the useful teacher knowledge for frame synthesis. Meanwhile, a new geometry consistency regularization term is imposed on the gradually refined intermediate features to better preserve structural layout. Experiments on various benchmarks demonstrate the excellent performance and fast inference speed of the proposed approaches.

Learning Comprehensive Motion Representation for Action Recognition

Wu, Mingyu*, Jiang, Boyuan*, Luo, Donghao, Yan, Junchi, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Yang, Xiaokang

AAAI Conference on Artificial Intelligence (AAAI) 2021


For action recognition learning, 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame. Recent efforts attempt to capture motion information by establishing inter-frame connections but still suffer from a limited temporal receptive field or high latency. Moreover, feature enhancement in action recognition is often performed along only the channel or spatial dimension. To address these issues, we first devise a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector. The channel gates generated by CME incorporate the information from all the other frames in the video. We further propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps. The intuition is that the change of background is typically slower than the motion area. Both CME and SME have clear physical meaning in capturing action clues. By integrating the two modules into the off-the-shelf 2D network, we finally obtain a Comprehensive Motion Representation (CMR) learning method for action recognition, which achieves competitive performance on Something-Something V1 & V2 and Kinetics-400. On the temporal reasoning datasets Something-Something V1 and V2, our method outperforms the current state-of-the-art by 2.3% and 1.9% when using 16 frames as input, respectively.

STM: SpatioTemporal and Motion Encoding for Action Recognition

Jiang, Boyuan, Wang, MengMeng, Gan, Weihao, Wu, Wei, and Yan, Junjie

International Conference on Computer Vision (ICCV) 2019


Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together.