Collecting accurate camera poses of training images has been shown to greatly benefit the learning of 3D-aware generative adversarial networks (GANs), yet doing so can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field of 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and to further efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives.
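A minimal sketch of the pose-estimation idea described above: each candidate camera pose is scored by how well a feature rendering of the canonical template matches the 2D semantic features extracted from a real image. The names `render_template` and `candidate_poses` are illustrative assumptions, not the paper's API, and the exhaustive search stands in for whatever efficient estimator the method actually uses.

```python
import torch

def estimate_pose(image_feats, render_template, candidate_poses):
    """Pick the camera pose whose template-feature rendering best matches
    the 2D semantic features of a real image (cosine similarity)."""
    best_pose, best_score = None, -float("inf")
    for pose in candidate_poses:
        rendered = render_template(pose)  # (C, H, W) rendered template features
        score = torch.cosine_similarity(
            rendered.flatten(), image_feats.flatten(), dim=0).item()
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose
```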
2023
VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis
Xinya Chen, Jiaxin Huang, Yanrui Bin, Lu Yu, and Yiyi Liao
Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2023
Unsupervised learning of 3D-aware generative adversarial networks has recently made significant progress. Some recent work demonstrates promising results in learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not generalize well to novel poses/shapes and are not part-controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by the vertices of the parametric human template SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex features and local coordinates to map it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, and shape, as well as enabling part-level editing.
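The sketch below illustrates the vertex-based parameterization in spirit: each query point is assigned to its K nearest template vertices, and a small decoder maps vertex features plus local offsets to color and density. This is a toy rendition under assumed shapes and names (`VertexRadianceField`, a learned per-vertex feature table, a distance-weighted blend), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class VertexRadianceField(nn.Module):
    """Toy sketch: decode (RGB, density) for each query point from the
    features and local offsets of its K nearest SMPL vertices."""
    def __init__(self, num_vertices=6890, feat_dim=32, k=4):  # SMPL has 6890 vertices
        super().__init__()
        self.k = k
        self.vertex_feats = nn.Parameter(torch.randn(num_vertices, feat_dim))
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 4))  # RGB + sigma

    def forward(self, points, vertices):
        # points: (N, 3) query points; vertices: (V, 3) posed SMPL vertices
        dist, idx = torch.cdist(points, vertices).topk(self.k, largest=False)
        offsets = points[:, None] - vertices[idx]            # local coords (N, K, 3)
        feats = self.vertex_feats[idx]                       # (N, K, F)
        out = self.decoder(torch.cat([feats, offsets], -1))  # (N, K, 4)
        w = torch.softmax(-dist, dim=-1)[..., None]          # nearer vertices weigh more
        out = (w * out).sum(dim=1)
        return out[:, :3].sigmoid(), out[:, 3:].relu()       # rgb, sigma
```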
2020
Adversarial semantic data augmentation for human pose estimation
Yanrui Bin, Xuan Cao, Xinya Chen, Yanhao Ge, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Changxin Gao, and Nong Sang
Proc. of the European Conf. on Computer Vision (ECCV), 2020
Human pose estimation is the task of localizing body keypoints from still images. State-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion, and nearby persons. To enlarge the number of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks. The code is publicly available at https://github.com/Binyr/ASDA.
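In outline, the adversarial loop alternates a generator step that makes the pasting configuration harder and a discriminator step that trains the pose network on the resulting sample. The sketch below assumes a hypothetical differentiable compositing function `paste_parts` and a heatmap-regression loss; it mirrors the described min-max objective rather than the released code.

```python
import torch.nn.functional as F

def asda_step(generator, pose_net, image, heatmap_gt, paste_parts, g_opt, d_opt):
    # Generator predicts a pasting configuration (e.g. part, scale, rotation,
    # position) meant to maximally confuse the pose estimator.
    cfg = generator(image)
    augmented = paste_parts(image, cfg)  # assumed differentiable compositing
    g_loss = -F.mse_loss(pose_net(augmented), heatmap_gt)  # maximize pose loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    # The pose estimator (discriminator) then learns from the hard sample.
    d_loss = F.mse_loss(pose_net(paste_parts(image, cfg.detach())), heatmap_gt)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
```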
Structure-aware human pose estimation with graph convolutional networks
Human pose estimation is the task of localizing body keypoints from still images. As body keypoints are inter-connected, it is desirable to model the structural relationships between them to further improve localization performance. In this paper, building on standard graph convolutional networks, we propose a novel model, termed Pose Graph Convolutional Network (PGCN), to exploit these important relationships for pose estimation. Specifically, our model builds a directed graph between body keypoints according to the natural compositional model of a human body. Each node (keypoint) is represented by a 3-D tensor consisting of multiple feature maps, initially generated by our backbone network, to retain accurate spatial information. Furthermore, an attention mechanism is introduced to focus on crucial edges (structural information) between keypoints. PGCN is then learned to map the graph into a set of structure-aware keypoint representations that encode both the structure of the human body and the appearance information of specific keypoints. Additionally, we propose two modules for PGCN, i.e., the Local PGCN (L-PGCN) module and the Non-Local PGCN (NL-PGCN) module. The former utilizes spatial attention to capture correlations between the local areas of adjacent keypoints and refine keypoint locations, while the latter captures long-range relationships via non-local operations to associate challenging keypoints. Equipped with these two modules, our PGCN can further improve localization performance. Experiments on both single- and multi-person estimation benchmark datasets show that our method consistently outperforms competing state-of-the-art methods.
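A minimal sketch of message passing over the keypoint graph, assuming each keypoint carries a feature-map tensor and a fixed directed adjacency built from the body's kinematic structure; the edge attention and the L-/NL-PGCN modules are omitted, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class PoseGraphConv(nn.Module):
    """Sketch: one round of message passing over K keypoint nodes, where each
    node holds a (C, H, W) feature map and edges follow a directed adjacency."""
    def __init__(self, channels, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)            # (K, K) directed adjacency
        self.update = nn.Conv2d(channels, channels, 1)  # shared per-node 1x1 update

    def forward(self, x):
        # x: (B, K, C, H, W) keypoint feature maps
        B, K, C, H, W = x.shape
        msgs = torch.einsum("ij,bjchw->bichw", self.A, x)  # aggregate neighbors
        out = self.update((x + msgs).reshape(B * K, C, H, W))
        return out.reshape(B, K, C, H, W).relu()
```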
Crowd counting is a challenging task of broad interest in computer vision. Existing density-map-based methods focus excessively on localizing individuals, which harms counting performance in highly congested scenes. In addition, the dependency between regions of different density is also ignored. In this paper, we propose Relevant Region Prediction (RRP) for crowd counting, which consists of the Count Map and the Region Relation-Aware Module (RRAM). Each pixel in the count map represents the number of heads falling into the corresponding local area of the input image, which discards detailed spatial information and forces the network to pay more attention to counting rather than localizing individuals. Based on the Graph Convolutional Network (GCN), the Region Relation-Aware Module is proposed to capture and exploit the important region dependency. The module builds a fully connected directed graph between regions of different density, where each node (region) is represented by a weighted, globally pooled feature, and a GCN is learned to map this region graph to a set of relation-aware region representations. Experimental results on three datasets show that our method clearly outperforms existing state-of-the-art methods.
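The count-map target can be pictured as follows: each cell stores how many annotated head points fall into the corresponding local region, so summing the map recovers the total count. The cell size and the function name below are illustrative assumptions, not the paper's exact settings.

```python
import torch

def build_count_map(head_points, image_hw, cell=32):
    """Each count-map cell stores how many head annotations fall inside the
    corresponding `cell x cell` region of the input image."""
    H, W = image_hw
    gh, gw = H // cell, W // cell
    count_map = torch.zeros(gh, gw)
    for x, y in head_points:  # (x, y) head annotations in image coordinates
        i = min(int(y) // cell, gh - 1)
        j = min(int(x) // cell, gw - 1)
        count_map[i, j] += 1
    return count_map  # total crowd count = count_map.sum()
```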
Crowd counting is a widely studied yet challenging task in computer vision. The difficulty is particularly pronounced by scale variations in crowd images. Most state-of-the-art approaches tackle the multi-scale problem by adopting multi-column CNN architectures, where different columns are designed with different filter sizes to adapt to variable pedestrian/object sizes. However, this structure is bloated and inefficient, and adopting multiple deep columns is infeasible due to the huge resource cost. We instead propose a Scale Pyramid Network (SPN), which adopts a shared single deep column structure and extracts multi-scale information in high layers with a Scale Pyramid Module. In the Scale Pyramid Module, we employ dilated convolutions with different rates in parallel instead of traditional convolutions with different kernel sizes. Compared to other methods of coping with scale issues, our single-column structure with the Scale Pyramid Module achieves more accurate estimation with a simpler structure and lower training complexity, and the module can be easily applied to any deep network. Experimental results on four datasets show that our method achieves state-of-the-art performance. On the ShanghaiTech Part A dataset, which is challenging for its highly congested scenes and scale variation, we achieve 9.5% lower MAE and 13.5% lower MSE than the previous state-of-the-art method. We also extend our model to the TRANCOS vehicle counting dataset and achieve 5.9% lower GAME(0), 10% lower GAME(1), 24.5% lower GAME(2), and 38.7% lower GAME(3) than the previous state-of-the-art method. These results demonstrate the robustness of our model for crowd counting, especially under scale variations.
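A sketch of the Scale Pyramid Module's core idea: parallel 3x3 convolutions with different dilation rates share one column yet cover multiple receptive fields, and a 1x1 convolution fuses the branches. The specific dilation rates and fusion scheme here are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Sketch: parallel dilated 3x3 convolutions over a shared feature map,
    fused by a 1x1 convolution (dilation rates are illustrative)."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        feats = [branch(x).relu() for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```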