Abstract

Current human pose estimation (HPE) methods mainly rely on Convolutional Neural Networks (CNNs). These CNNs typically consist of high-to-low-resolution subnetworks (encoder) that learn semantic information and low-to-high-resolution subnetworks (decoder) that raise the resolution for keypoint localization. Because overly low-resolution feature maps in the encoder inevitably lose spatial information that cannot be recovered in the upsampling stages, keeping high spatial resolution features is critical for human pose estimation. In addition, because the scales of human body parts vary, multiscale features are also very important for human pose estimation. In this paper, a novel backbone network named High Spatial Resolution and Multiscale Networks (HSR-MSNet) is proposed specifically for HPE; it maintains high spatial resolution features in the deeper layers of the encoder and, at the same time, constructs multiscale features within a single residual block via subgroup splitting and fusion of feature maps. Experiments show that our approach outperforms other state-of-the-art methods with more accurate keypoint locations on the COCO dataset.

1. Introduction

Human pose estimation (HPE) is one of the most fundamental tasks in computer vision, aimed at predicting the locations of body joints from input images. Recently, human pose estimation methods based on Convolutional Neural Networks (CNNs) have achieved a great breakthrough [16], since CNNs have a powerful ability to learn rich convolutional feature representations [7]. For example, in single-person pose estimation, state-of-the-art models have improved performance from less than 50% PCKh@0.5 to more than 90% PCKh@0.5 [8–12] on the MPII benchmark [13]. However, multiperson pose estimation still faces two main challenges: (1) there may be occlusion between different people, which causes ambiguities of joints, and (2) some invisible joints are hard to predict.

To address these challenges, existing methods such as CPN [14] and SimpleBaseline [15] employ ResNet [16] as the backbone and obtain feature maps with a large downsampling factor. However, too large a downsampling factor causes a loss of spatial information in the image [17], making it difficult to recover joint context. On the other hand, due to camera view changes or foreshortening, the scales of different body parts may still be inconsistent even if training images are warped to the same scale [9]. Therefore, scale variation of human body parts is also one of the main challenges. Previous works [9, 10, 18] have shown that multiscale or pyramid features are beneficial for solving the problems caused by scale changes.

In this paper, a novel backbone network named High Spatial Resolution and Multiscale Networks (HSR-MSNet) is proposed specifically for HPE. The network maintains high spatial resolution features in deeper layers while keeping large receptive fields, and constructs multiscale features within a single residual block by channel splitting and fusion. Experiments on the COCO keypoint detection dataset demonstrate the effectiveness of HSR-MSNet. At the same time, the architecture of HSR-MSNet is very lightweight, which makes it possible to implement functionality similar to MobileNet [19] on Internet-of-Things (IoT) devices.

2. Related Work

2.1. Single-Person Pose Estimation

DeepPose [5] is the first deep-learning-based human pose estimation method; it treats body joint localization as a CNN-based regression problem, with a network consisting of five convolution layers, two fully connected layers, and a softmax classifier. Subsequent methods mostly apply heatmaps, which characterize the probability of each keypoint appearing at each location [20]; the precise location of a keypoint is then estimated by selecting the maximum value in the aggregated heatmaps. Convolutional Pose Machine (CPM) [21] is a multistage architecture in which the belief maps and image features generated in the previous stage serve as input for the next stage [22]. CPM uses large receptive fields to learn long-range spatial relationships and eliminates the vanishing-gradient problem through intermediate supervision. The stacked hourglass network [23] (Hourglass) processes features across all scales and consolidates them to best capture the various spatial relationships associated with the body [23]. Both models (CPM and Hourglass) achieve state-of-the-art performance, and both adopt intermediate supervision to generate detailed heatmaps for joint localization.

2.2. Multiperson Pose Estimation

Multiperson pose estimation is a more challenging problem than single-person pose estimation for many computer vision applications. Due to occlusion and complex backgrounds, it is difficult to obtain accurate localization results. A common approach is bottom-up, which detects human joints throughout the image and then groups the joint candidates for each person. The main problem of the bottom-up approach is modeling the joint-to-individual associations [24]. Cao et al. [25] proposed a model that detects the 2D poses of several people in an image, using a nonparametric representation (Part Affinity Fields (PAFs)) to associate body parts with individuals. The architecture learns part positions and their associations jointly via two branches of the same sequential prediction process.

The other pipeline for multiperson pose estimation is top-down [14, 15, 26, 27], which first detects each person in the image and then conducts single-person pose estimation on each detection. The top-down approach struggles when people are in close proximity, because the bounding boxes of different people overlap significantly. Fang et al. [27] proposed the regional multiperson pose estimation (RMPE) framework, which applies SSD [28] or Faster RCNN [29] to locate persons in an image and uses Hourglass [23] to predict the pose of each person. Chen et al. [14] proposed the Cascaded Pyramid Network (CPN) for multiperson pose estimation, which uses a Mask RCNN-based detector to find persons and then applies CPN to predict each person's pose.

2.2.1. High Spatial Resolution Features

Some state-of-the-art human pose estimation architectures, such as Hourglass [23], CPN [14], and SimpleBaseline [15], are shown in Figure 1. These CNN-based methods typically follow an encoder-decoder architecture, which consists of high-to-low-resolution subnetworks (encoder) that learn semantic information and low-to-high-resolution subnetworks (decoder) that raise the resolution for keypoint localization. Hourglass stacks multiple encoder-decoder subnetworks to obtain progressively refined heatmaps. The “RefineNet” of CPN explores the context information of “hard” keypoints to further improve performance. SimpleBaseline simply takes ResNet as its backbone and adds deconvolution layers to raise the resolution of the feature maps for predicting keypoint heatmaps.

Because overly low-resolution feature maps in the encoder inevitably lose spatial information that cannot be recovered in the upsampling stages, keeping high spatial resolution features is critical to improving the performance of human pose estimation.

In [30], DetNet maintains high spatial resolution in deeper layers to address the problem that large downsampling factors compromise localization capability. Sun et al. [18] proposed the High-Resolution Net (HRNet), which consists of parallel high-to-low-resolution subnetworks with multiscale feature fusion and learns reliable high-resolution features by maintaining high-resolution representations throughout the whole network. Information is exchanged repeatedly across the multiresolution subnetworks, each of which receives information from the other parallel ones.

2.2.2. Multiscale Features

Multiscale features are very important for pose estimation due to the scale variation of human body parts. Most existing methods [14, 29] represent multiscale features in a layer-wise manner, fusing features of different levels (scales) together.

PyraNet proposed by Yang et al. [8] and Res2Net proposed by Gao et al. [31] both construct multiscale features within a single residual block. By extending the residual block to multiscale pyramids, PyraNet [8] designs Pyramid Residual Modules (PRMs) to enhance invariance to scale. Multiscale features are obtained by applying different subsampling rates to input features in a multibranch residual block. Res2Net [31] represents multiscale features at a granular level and increases the range of receptive fields in each network layer. Specifically, Res2Net implements multiscale features by splitting the channels of the feature maps into subgroups and fusing these channel groups hierarchically. Res2Net has been shown to boost many backbone networks on vision tasks including object detection, semantic segmentation [32, 33], and salient object detection [29].

3. Our Approach

Similar to previous works [14, 15, 25], we adopt the top-down pipeline for multiperson pose estimation, as illustrated in Figure 2. The whole framework consists of two parts: human detection and human pose estimation. First, a human detector finds all persons in the input image and generates a set of human bounding boxes. Then, a human pose estimation network predicts the keypoints of each single person from the corresponding bounding box, as sketched below.
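To make the pipeline concrete, the following is a minimal sketch of the two-stage top-down procedure. Here `detect_persons` and `estimate_pose` are hypothetical placeholders for the YOLOv3 detector and the pose network described below, not functions from our released code.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

def top_down_pose_estimation(
    image: np.ndarray,
    detect_persons: Callable[[np.ndarray], List[Box]],
    estimate_pose: Callable[[np.ndarray], np.ndarray],
) -> List[np.ndarray]:
    """Stage 1: detect all persons; stage 2: per-box single-person HPE."""
    results = []
    for (x, y, w, h) in detect_persons(image):
        crop = image[y:y + h, x:x + w]            # crop one human bounding box
        keypoints = estimate_pose(crop)           # (K, 2) keypoints in crop coords
        keypoints = keypoints + np.array([x, y])  # map back to image coordinates
        results.append(keypoints)
    return results
```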

3.1. Human Detection

The state-of-the-art object detector YOLOv3 [34] is utilized for human detection. YOLOv3 is trained on all eighty categories of the COCO dataset [35], but only the human bounding boxes are used by our model. The YOLOv3 network uses 53 convolution layers with shortcut connections for image feature extraction, and the size of the feature maps is adjusted through the convolution stride. Drawing on the idea of feature pyramid networks, YOLOv3 detects objects of different sizes at multiple scales: the finer the grid cells, the smaller the objects that can be detected. In addition, the softmax is replaced with independent logistic classifiers, so that multilabel object detection is supported. More details about YOLOv3 can be found in [34].
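The toy snippet below illustrates why independent logistic (sigmoid) classifiers support multilabel detection while a softmax does not; the class labels in the comments are purely illustrative.

```python
import torch

# Raw class scores for one detected box; labels are illustrative only.
logits = torch.tensor([3.0, 2.5, -4.0])  # ["person", "woman", "car"]

# Softmax makes classes compete (probabilities sum to 1), so two
# overlapping labels cannot both receive high confidence.
softmax_probs = torch.softmax(logits, dim=0)

# Independent logistic (sigmoid) classifiers, as in YOLOv3: each class is
# scored on its own, so "person" and "woman" can both exceed the threshold.
sigmoid_probs = torch.sigmoid(logits)
keep_for_pose = sigmoid_probs[0] > 0.5  # retain only person boxes for HPE
```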

3.2. Human Pose Estimation with High Spatial Resolution Features
3.2.1. Motivation

Backbone networks play an important role in human pose estimation because of their ability to extract effective features from input images, which is critical for classification and keypoint localization. ResNet [16], a classical backbone network, has been widely used [36] and achieves outstanding performance in many state-of-the-art networks for human pose estimation, such as SimpleBaseline [15] and CPN [14], since it extracts image features efficiently. However, ResNet still has the following shortcomings when used as the backbone network for human pose estimation:

(1) Poor accuracy of keypoint localization. ResNet has five stages, and in each stage the feature maps are downsampled. Compared to the input image, the final feature maps have a stride of 32, strong semantic information, and large valid receptive fields, which yields great performance in classification tasks. However, downsampling by a factor of 32 loses local information and thereby hurts the accuracy of keypoint localization in human pose estimation.

(2) Invisibility of small joints. Another drawback of the large stride is the loss of small keypoints. For occluded joints, which already carry little information, the spatial resolution of the input image is greatly reduced when large-stride feature maps are extracted.

(3) Ambiguities. When multiple persons occlude each other, one human bounding box may contain keypoints of other persons. Context information is lost while the input image is converted into feature maps with large strides, and without that context it is hard to decide which keypoints belong to which person.

To solve these problems, inspired by DetNet [30], we retain the first four stages of ResNet and replace the fifth stage with two new stages, as shown in Figure 3(b); we name the resulting network HSRNet (High Spatial Resolution Network). In these two stages, the feature maps are no longer downsampled while the valid receptive fields are expanded. Thus, we not only preserve the classification of each keypoint but also retain more of the feature maps' spatial information, which helps improve the accuracy of keypoint localization.

3.2.2. Our Model for Keypoint Localization

Our model adopts a simple network structure, as shown in Figure 4(b). First, we use HSRNet as the backbone network to generate feature maps with rich semantic information. Then, a few deconvolutional layers with batch normalization and ReLU activation generate heatmaps from the low-resolution feature maps. Finally, the Mean Squared Error (MSE) between the predicted and target heatmaps is used as the loss; the target heatmap for each keypoint is generated by applying a Gaussian centered at the ground-truth keypoint location. Compared with SimpleBaseline [15], HSRNet replaces the original ResNet, and since the output feature maps of HSRNet and ResNet differ in size, the number of deconvolutional layers is reduced to keep the output heatmaps the same size.
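Below is a minimal PyTorch sketch of such a head. The channel width (256), the number of deconvolution layers (two, matching a stride-16 backbone), and the Gaussian sigma are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

# Keypoint head: two deconvolutions raise a stride-16 feature map to stride 4,
# then a 1x1 convolution predicts one heatmap per COCO keypoint.
head = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(inplace=True),
    nn.Conv2d(256, 17, kernel_size=1),
)

def gaussian_target(h, w, cx, cy, sigma=2.0):
    """Target heatmap: a 2D Gaussian centered at the ground-truth location."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

criterion = nn.MSELoss()  # MSE between predicted and target heatmaps
```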

As shown in Figure 5(b), HSRNet is our backbone network, built from the existing ResNet [16] with two new stages. As shown in Figure 6, in the two new stages, the original bottlenecks A and B are slightly altered into two new bottlenecks C and D: the original 3 × 3 convolution layer is converted into a 3 × 3 convolution layer with a dilation of 2, and bottleneck C has no downsampling, which ensures that the size of the feature maps does not change during these two new stages. Similar to ResNet, the same stacking scheme is applied, with the original bottlenecks A and B replaced by bottlenecks C and D, as shown in Figure 5. Since the fifth stage of ResNet downsamples with a stride of 2, simply removing that downsampling in a single new stage would shrink the valid receptive field and hurt human pose estimation. Therefore, we use two new stages to obtain feature maps of higher spatial resolution without sacrificing the valid receptive field.
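A minimal PyTorch sketch of a dilated bottleneck in the spirit of bottleneck C follows; the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Bottleneck C sketch: the 3x3 convolution uses dilation 2 to enlarge the
    receptive field, and no stride-2 downsampling is applied, so the spatial
    resolution of the feature maps is preserved."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # dilation=2 with padding=2 keeps the feature-map size unchanged
            nn.Conv2d(mid, mid, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut, no downsampling
```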

3.3. Human Pose Estimation with Multiscale Features

Based on Res2Net [31], we design a new multiscale module (MSNet), shown in Figure 7(b), and integrate it into HSRNet to learn multiscale features. The resulting framework is named HSR-MSNet.

The structure of Res2Net [31] is shown in Figure 7(a). Res2Net uses hierarchical groups of convolution filters to extract multiscale features. Specifically, it splits the channels of the feature maps into subgroups (3 groups in Figure 7(a)) and fuses these channel groups hierarchically. Since we retain the first four stages of ResNet, the Res2Net module can easily be integrated into the first four stages of our backbone. As shown in Figure 7(b), compared with Res2Net, we add an extra convolution layer in order to increase the receptive fields. In addition, we use multiple smaller convolution layers to replace the single convolution layer with stride 2. This structure has two advantages. First, multiple smaller convolution layers learn more context information about keypoints than a single stride-2 convolution layer, especially for small-scale persons; since the output feature maps of stage 4 have a stride of 16 with respect to the input image, a stride-2 convolution layer has more difficulty extracting detailed semantic features for small-scale persons. Second, the MSNet module extracts deep semantic features in a multiscale manner while keeping large receptive fields. We denote the output feature maps of the branch convolutions as $y_i$, $i \in \{1, 2, \dots, s\}$, where $s$ is the total number of subgroups into which the feature maps are split evenly. Then $y_i$ can be expressed as

$$y_i = \begin{cases} x_i, & i = 1, \\ K_i(x_i), & i = 2, \\ K_i(x_i + y_{i-1}), & 2 < i \le s, \end{cases}$$

where $x_i$ denotes the $i$-th subgroup of the input feature maps after the first convolution and the even channel split, and $K_i$ denotes the convolution applied to the $i$-th branch.
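The following PyTorch sketch implements the hierarchical split-and-fuse computation above in the Res2Net style that MSNet builds on; the number of subgroups ($s = 4$) and channel width are illustrative assumptions, and the extra convolution and stride replacement of the full MSNet module are omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiScaleSplit(nn.Module):
    """Split channels into s subgroups, pass each through its own convolution,
    and fuse hierarchically: y_1 = x_1, y_2 = K_2(x_2), y_i = K_i(x_i + y_{i-1})."""
    def __init__(self, channels=256, s=4):
        super().__init__()
        assert channels % s == 0
        self.s, width = s, channels // s
        # one 3x3 convolution per subgroup, except the identity branch
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(s - 1)
        )

    def forward(self, x):
        xs = torch.chunk(x, self.s, dim=1)  # split channels evenly into s groups
        ys = [xs[0]]                        # y_1 = x_1 (identity branch)
        prev = None
        for i in range(1, self.s):
            inp = xs[i] if prev is None else xs[i] + prev  # hierarchical fusion
            prev = self.convs[i - 1](inp)
            ys.append(prev)
        return torch.cat(ys, dim=1)         # concatenate all branch outputs
```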

Among the multiscale features, not all features are equally useful for human pose estimation. To balance the relationship among channel features, we add an SE (Squeeze-and-Excitation) block [37] before the residual connection (Figure 7(b)), which learns the importance of each feature channel, promoting important features while suppressing less useful ones for the current vision task. As a result, the final output of our MSNet module can be written as

$$Y = \mathrm{ReLU}\big(F_{se}([y_1, y_2, \dots, y_s]) + X\big),$$

where $[y_1, y_2, \dots, y_s]$ represents the concatenation of the branch outputs $y_i$ ($i \in \{1, \dots, s\}$), $Y$ represents the final output feature maps, $X$ is the input of the residual connection, ReLU is the activation layer we use, and $F_{se}$ represents the SE block together with the following convolution layer.
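For completeness, a standard SE block can be sketched in a few lines of PyTorch; the reduction ratio of 16 is the common default from [37], not necessarily our exact setting.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: globally pool each channel ("squeeze"), pass
    through a two-layer bottleneck ("excitation"), and rescale the channels
    by the learned per-channel importance weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pooling
        return x * w.view(n, c, 1, 1)    # excitation: channel reweighting
```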

4. Experiments

4.1. Datasets

We evaluate our model on the popular MS COCO benchmark [35]. The COCO dataset contains more than 200K images and 250K person instances labeled with keypoints, divided into train, validation, and test sets; 150K person instances are publicly available for training and validation. Our models are trained only on the COCO train2017 set (57K images and 150K person instances) with no extra data, and ablation studies are conducted on the COCO val2017 set. Finally, we report results on the COCO test-dev2017 set for a fair comparison with published state-of-the-art methods.

4.2. Model Training and Inference

Our models are trained on an NVIDIA Tesla P100 GPU using PyTorch and optimized by the Adam algorithm with a batch size of 32 for 140 epochs. The learning rate is initialized to 0.0001 and decreased by a factor of 0.1 at the 90th and 120th epochs. Each ground-truth human box is extended in height or width to a fixed aspect ratio (height : width = 4 : 3), then cropped from the image and resized to a fixed resolution. The default resolution is 256 × 192, the same as in the state-of-the-art methods [14, 15], for a fair comparison. During training, we initialize from a model pretrained on the ImageNet classification task [1].
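The optimizer and learning-rate schedule described above map directly onto PyTorch primitives; in this sketch, `model` is a placeholder for an HSR-MSNet instance.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 17, kernel_size=1)  # placeholder for an HSR-MSNet instance
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[90, 120], gamma=0.1)  # x0.1 at epochs 90 and 120

for epoch in range(140):
    # ... iterate over COCO train2017 with batch size 32, compute the MSE
    # heatmap loss, then call loss.backward() and optimizer.step() ...
    scheduler.step()
```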

We test HSR-MSNet with 59, 110, and 161 layers, named HSR-MSNet-59, HSR-MSNet-110, and HSR-MSNet-161, respectively. HSR-MSNet-59 is derived from [30] and can be downloaded at https://github.com/guoruoqian/DetNet_pytorch. For HSR-MSNet-110 and HSR-MSNet-161, we initialize the parameters of the first four stages from pretrained ResNet models [16].

Our approach is evaluated with the standard COCO metrics [35], which are based on OKS (object keypoint similarity): AP (average precision, averaged over OKS thresholds), AP50 and AP75 (AP at OKS thresholds of 0.50 and 0.75), APm and APl (AP for medium and large persons), and AR (average recall). OKS plays the same role as IoU (Intersection over Union) in object detection; it is calculated from the distances between predicted and ground-truth keypoints, normalized by the scale of the person.
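OKS can be computed as follows; this sketch follows the standard COCO definition, with the per-keypoint constants k and the object area supplied by the annotations.

```python
import numpy as np

def oks(pred, gt, visibility, area, k):
    """Object keypoint similarity (standard COCO definition): a Gaussian of
    each keypoint's localization error, normalized by the object scale
    (s^2 = segment area) and a per-keypoint constant k_i, averaged over the
    labeled keypoints (visibility > 0)."""
    d2 = np.sum((pred - gt) ** 2, axis=1)  # squared distances, shape (17,)
    e = d2 / (2.0 * area * k ** 2)         # normalized squared error
    labeled = visibility > 0               # only annotated keypoints count
    return float(np.mean(np.exp(-e[labeled])))
```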

4.3. Quantitative Results

Experiments with HSRNet and HSR-MSNet are conducted to investigate the effectiveness of keeping high spatial resolution features and of multiscale features for human pose estimation, respectively.

4.3.1. High Spatial Resolution Features

For the experiments with HSRNet, a human detector with AP 56.4 on COCO val2017 is used, and the performance of ResNet and HSRNet under various options is listed in Table 1. Since the feature maps of HSRNet have higher spatial resolution than those of ResNet, two deconvolution layers are utilized to keep the same output heatmap size. Methods a, b, c, d, e, g, h, i, j, and k use a 256 × 192 input and eventually generate 64 × 48 heatmaps, while methods f and l use a 384 × 288 input and generate 96 × 72 heatmaps.

(1) Size-varied backbone. Methods a, b, c and g, h, i compare HSRNet and ResNet across backbone sizes, showing that HSRNet is better than ResNet by at least 0.4 AP at comparable backbone sizes. As with ResNet, the larger the HSRNet backbone, the better the performance: as seen in Table 1, AP increases by 0.4 from HSRNet-59 to HSRNet-110 and by 1.2 from HSRNet-59 to HSRNet-161.

(2) Various kernel sizes of the deconvolution layers. Methods a, d, e and j, i, k show that HSRNet also outperforms ResNet by at least 0.8 AP across deconvolution kernel sizes. For HSRNet-59, AP increases by 0.3 from kernel size 2 to 3 and by 0.5 from kernel size 2 to 4.

(3) Different input image sizes. Methods a, f and g, l illustrate that the higher the input resolution, the better the performance; moreover, HSRNet improves on ResNet by 1 AP.

Since HSRNet has more parameters than ResNet at a comparable backbone size, these comparisons alone do not prove that the HSRNet structure is better. However, it is worth noting that methods b and g achieve the same AP; that is, HSRNet-59 is comparable to ResNet-101, which implies that high spatial resolution can significantly improve the performance of human pose estimation.

4.3.2. Multiscale Features

Experiments with HSR-MSNet are conducted and compared against HSRNet to show the importance of multiscale features for human pose estimation, as shown in Table 2.

A human detector with AP 56.4 on COCO val2017 is adopted. Because HSR-MSNet extracts multiscale semantic features and weights them via the SE module, it performs better than HSRNet with almost the same number of parameters, improving AP by 0.1. Multiscale features clearly play an important role in human pose estimation, since the task requires large-scale features for the classification of keypoints as well as small-scale features for their localization. The SE block is also indispensable, as it assigns weights to different features and makes the most effective features prominent among the others. The combination of multiscale features and weight assignment can further improve human pose estimation.

4.3.3. Comparisons with Other State-of-the-Art Methods

Table 3 shows the results of HSRNet and other state-of-the-art models, including Hourglass [23], CPN [14], and SimpleBaseline [15], on the COCO val2017 dataset. For a fair comparison, the human detector provided by SimpleBaseline (56.4 AP) is used, which is comparable to the 55.3 AP detector used by Hourglass and CPN.

Our method exceeds Hourglass by 4.5 AP at the same input size of 256 × 192. Compared with CPN, our method outperforms CPN without OHKM by more than 2.6 AP and CPN with OHKM by 1.6 AP. Although our model is based on SimpleBaseline, Table 1 shows that our method performs better than SimpleBaseline. These performance gains evidently come from keeping high spatial resolution in the deeper layers of the backbone encoder.

To obtain better performance, a human detector with 60.9 AP is applied to obtain human bounding boxes on the COCO test2017 dataset. For reference, CPN uses a human detector with 62.9 person detection AP on the COCO minival split, and SimpleBaseline uses a human detector with 60.9 AP on the COCO std-dev split. As shown in Table 4, CPN uses ResNet-Inception and SimpleBaseline uses ResNet-152 for human pose estimation. Our method uses only a small backbone yet outperforms some state-of-the-art models, including G-RMI [38] and FAIR [37], achieving 72.6 AP and 72.8 AP.

Since our model is based on SimpleBaseline, the changes we make are slight and have a negligible impact on the overall number of model parameters. In other words, the computational complexity of our model is almost the same as that of SimpleBaseline.

4.4. Qualitative Comparisons

Figure 8 shows some qualitative comparison results on the COCO test2017 dataset. We compare our approach with SimpleBaseline [15], using models of comparable size for a fair comparison: SimpleBaseline uses ResNet-50 as its backbone, and our method is HSR-MSNet-59. In Figure 8, the first row contains original images from the COCO test2017 dataset, the second row shows the predictions of SimpleBaseline [15], and the third row shows ours.

Our method clearly predicts the keypoints of partially occluded or small people better, by exploiting higher-resolution and multiscale feature maps. As the first column of Figure 8 shows, our method perceives and predicts the keypoints of small people more accurately. For complex backgrounds, the second column illustrates that our method separates the person from the background more easily with high-resolution feature maps, avoiding misidentification. In addition, our method accurately predicts the “hard” joints that are occluded or invisible in the third and fourth columns of Figure 8. Especially in the third column, HSR-MSNet detects the heavily occluded left arm more easily and gives a more accurate prediction, which we attribute to its ability to mine more context information about “hard” joints.

5. Conclusion

In this paper, a novel backbone network named HSR-MSNet is proposed specifically for human pose estimation; it maintains high spatial resolution features in the deeper layers of the encoder while still keeping large receptive fields. We also design a building module that learns multiscale features within a single residual block by splitting the channels of the feature maps into subgroups and then fusing these channel groups hierarchically. For multiperson pose estimation, our model learns useful context information for “hard” joints caused by partial occlusion or small person scales. Experiments on the COCO keypoint detection dataset show that our model outperforms other state-of-the-art methods, such as Hourglass [23], CPN [14], and SimpleBaseline [15], with respect to the standard COCO metrics. In future work, we will evaluate our model on other human pose datasets, such as the MPII dataset, and conduct further experiments on lightweight devices.

Data Availability

The data that support the findings of this study are openly available in COCO at doi:10.1007/978-3-319-10602-1_48, reference number [35].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China grant 61771299.