I am a third-year (2017-) Computer Science Ph.D. student at The University of Texas at Austin, supervised by Prof. Philipp Krähenbühl. I obtained my bachelor's degree from the School of Computer Science at Fudan University, advised by Prof. Wei Zhang and Prof. Xiangyang Xue. I have been fortunate to intern with Dr. Yichen Wei at Microsoft Research Asia, Tyler Zhu and Dr. Kevin Murphy at Google Research, and Dr. Vladlen Koltun at Intel Labs. Here is my CV.
The profile photo was taken by my lovely girlfriend Jiarui Gao.
My research focuses on computer vision and computer graphics. Specifically, I have been working on various projects on object keypoint estimation. Here is a slide deck summarizing my recent work on generalized keypoint estimation. Here is an earlier (March 2018) seminar slide deck showing the connections and motivations of my work on semantic keypoint estimation.
Xingyi Zhou, Dequan Wang, Philipp Krähenbühl
arXiv technical report, 2019
bibtex  /  code /  model zoo
Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point -- the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding boxes on the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.
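As a rough illustration of the center-point decoding described above, here is a minimal NumPy sketch. The function name, the single-class setup, the confidence threshold, and the 3x3 local-maximum window standing in for NMS are all illustrative assumptions, not the released CenterNet code:

```python
import numpy as np

def decode_centers(heatmap, size_map, threshold=0.5):
    """Decode boxes from a center heatmap plus a per-pixel size map.

    heatmap:  (H, W) array of center-point scores in [0, 1]
    size_map: (H, W, 2) array of predicted (width, height) per location
    Returns a list of (x1, y1, x2, y2, score) boxes.
    """
    H, W = heatmap.shape
    boxes = []
    for y in range(H):
        for x in range(W):
            s = heatmap[y, x]
            if s < threshold:
                continue
            # keep only local maxima in a 3x3 window (replaces NMS)
            window = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if s < window.max():
                continue
            w, h = size_map[y, x]
            boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, float(s)))
    return boxes
```

Because each peak already carries its size regression, the decoding needs no anchor enumeration or box classification step.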
Xingyi Zhou, Jiacheng Zhuo, Philipp Krähenbühl
Computer Vision and Pattern Recognition (CVPR), 2019
bibtex  /  code /  model /  supplementary
Object detection drifted from a bottom-up to a top-down recognition problem. State of the art algorithms enumerate a near-exhaustive list of object locations and classify each as object or not. In this paper, we show that bottom-up approaches still perform competitively. We detect four extreme points (top-most, left-most, bottom-most, right-most) and one center point of objects using a standard keypoint estimation network. We group the five keypoints into a bounding box if they are geometrically aligned. Object detection is then a purely appearance-based keypoint estimation problem, without region classification or implicit feature learning. The proposed method performs on-par with the state-of-the-art region based detection methods, with a bounding box AP of 43.2% on COCO test-dev. In addition, our estimated extreme points directly span a coarse octagonal mask, with a COCO Mask AP of 18.9%, much better than the Mask AP of vanilla bounding boxes. Extreme point guided segmentation further improves this to 34.6% Mask AP.
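The geometric-alignment check at the heart of the grouping can be sketched as follows. This is a simplified illustration that scores a single candidate set of extreme points (the real method enumerates combinations of extreme-point peaks); the function name and thresholds are hypothetical:

```python
import numpy as np

def group_extreme_points(top, left, bottom, right, center_heatmap,
                         center_thresh=0.3):
    """Accept four extreme points as one object if their geometric
    center scores highly on the center heatmap.

    Each extreme point is an (x, y, score) triple.
    Returns (box, score) on success, or None if the points do not
    agree on a common center.
    """
    cx = (left[0] + right[0]) / 2.0
    cy = (top[1] + bottom[1]) / 2.0
    center_score = center_heatmap[int(round(cy)), int(round(cx))]
    if center_score < center_thresh:
        return None  # not geometrically aligned
    box = (left[0], top[1], right[0], bottom[1])
    score = (top[2] + left[2] + bottom[2] + right[2] + center_score) / 5.0
    return box, score
```

The center heatmap thus acts as a purely geometric verifier: no region features are extracted and no box classifier is run.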
Xingyi Zhou, Arjun Karpur, Linjie Luo, Qixing Huang
European Conference on Computer Vision (ECCV), 2018
bibtex  /  code /  model /  supplementary /  poster
We propose a category-agnostic keypoint representation encoded with their 3D locations in the canonical object views. The representation consists of a single channel, multi-peak heatmap (StarMap) for all the keypoints and their corresponding features as 3D locations in the canonical object view (CanViewFeature) defined for each category. Not only is our representation flexible, but we also demonstrate competitive performance in keypoint detection and localization compared to category-specific state-of-the-art methods. Additionally, we show that when augmented with an additional depth channel (DepthMap) to lift the 2D keypoints to 3D, our representation can achieve state-of-the-art results in viewpoint estimation.
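Reading keypoints and their canonical-view features out of the two maps might look like the minimal sketch below. The function name and the 3x3 peak test are illustrative assumptions, not the released StarMap code:

```python
import numpy as np

def extract_starmap_keypoints(starmap, canview_feat, threshold=0.5):
    """Extract keypoints from a single-channel multi-peak heatmap and
    look up their canonical-view 3D features at the peak locations.

    starmap:      (H, W) heatmap with one peak per visible keypoint
    canview_feat: (H, W, 3) per-pixel 3D location in the canonical view
    Returns a list of ((x, y), (X, Y, Z)) pairs.
    """
    H, W = starmap.shape
    keypoints = []
    for y in range(H):
        for x in range(W):
            s = starmap[y, x]
            if s < threshold:
                continue
            # one keypoint per 3x3 local maximum
            window = starmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if s < window.max():
                continue
            keypoints.append(((x, y), tuple(canview_feat[y, x])))
    return keypoints
```

Because the heatmap channel is shared by all keypoints and the 3D feature identifies each one, no category-specific output head is needed.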
Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, Qixing Huang
European Conference on Computer Vision (ECCV), 2018
bibtex  /  code /  model /  poster
We introduce an unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan/image. Our key idea is to utilize the fact that predictions from different views of the same or similar objects should be consistent with each other. Such view consistency provides effective regularization for keypoint prediction on unlabeled instances. In addition, we introduce a geometric alignment term to regularize predictions in the target domain. The resulting loss function can be effectively optimized via alternating minimization.
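The view-consistency idea can be illustrated with a short sketch: assuming the relative pose between two views is known, it penalizes disagreement after mapping one view's predictions into the other's frame. The function name and interface are hypothetical:

```python
import numpy as np

def view_consistency_loss(pred_a, pred_b, R_ab, t_ab):
    """Penalize disagreement between keypoints predicted from two views
    of the same object -- a supervision signal that needs no labels.

    pred_a, pred_b: (K, 3) keypoints in each view's camera frame
    R_ab, t_ab:     rotation (3, 3) and translation (3,) taking
                    view a's frame into view b's frame
    """
    pred_a_in_b = pred_a @ R_ab.T + t_ab
    return float(np.mean(np.sum((pred_a_in_b - pred_b) ** 2, axis=1)))
```

Minimizing this over unlabeled target-domain instances regularizes the predictor without any target annotations.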
Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, Yichen Wei
International Conference on Computer Vision (ICCV), 2017
bibtex  /  code (torch) /  code (PyTorch) /  model /  supplementary /  poster
We propose a weakly-supervised transfer learning method that learns an end-to-end network using training data with mixed 2D and 3D labels. The network augments a state-of-the-art 2D pose estimation network with a 3D depth regression network. The 3D pose labels in controlled environments are transferred to images in the wild that only possess 2D annotations. Importantly, we introduce a 3D geometric constraint to regularize the predicted 3D poses, which is effective on images that only have 2D annotations.
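One simple form such a geometric constraint could take is a soft penalty keeping predicted bone-length ratios close to those of a reference skeleton; the sketch below is an illustrative stand-in, not the exact regularizer from the paper:

```python
import numpy as np

def bone_length_ratio_loss(joints_3d, bones, ref_ratios):
    """Soft skeletal constraint usable on images with only 2D labels:
    predicted bone lengths should preserve known skeletal proportions.

    joints_3d:  (J, 3) predicted 3D joint positions
    bones:      list of (parent, child) joint-index pairs
    ref_ratios: reference bone lengths, normalized to sum to 1
    """
    lengths = np.array([np.linalg.norm(joints_3d[c] - joints_3d[p])
                        for p, c in bones])
    ratios = lengths / lengths.sum()
    return float(np.sum((ratios - ref_ratios) ** 2))
```

Because the loss depends only on the 3D prediction and a fixed skeleton prior, it supplies a 3D training signal even where no 3D ground truth exists.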
Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, Yichen Wei
ECCV Workshop on Geometry Meets Deep Learning, 2016
bibtex  /  code /  poster
We propose to directly embed a kinematic object model into the deep neural network learning for general articulated object pose estimation. The kinematic function is defined on the appropriately parameterized object motion variables. We show convincing experimental results on a toy example, and we achieve state-of-the-art results on the Human3.6M dataset for the 3D human pose estimation problem.
Xingyi Zhou, Qingfu Wan, Wei Zhang, Xiangyang Xue, Yichen Wei
International Joint Conference on Artificial Intelligence (IJCAI), 2016
bibtex  /  code /  slides /  poster
We propose a model-based deep learning approach that adopts a forward-kinematics-based layer to ensure the geometric validity of estimated poses. For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation.