Abstract

Pedestrian detection with large intraclass variations remains a challenging task in computer vision. In this paper, we propose a novel pedestrian detection method based on Random Forest. First, we generate a few local templates with different sizes and locations from positive exemplars. Then, we build a Random Forest whose splitting functions are optimized by maximizing the class purity obtained when the local templates are matched to the training samples. To improve classification accuracy, we adopt a boosting-like algorithm that updates the weights of the training samples in a layer-wise fashion. During detection, the trained Random Forest votes on the category of each input sliding window. Our contributions are splitting functions based on local template matching with adaptive size and location, and an iterative weight-updating method. We evaluate the proposed method on two well-known, challenging datasets: TUD pedestrians and INRIA pedestrians. The experimental results demonstrate that our method achieves state-of-the-art or competitive performance.

1. Introduction

Pedestrian detection is an important instance of object detection. Because of its direct applications in surveillance, intelligent traffic systems, and assisted living [1, 2], it has attracted a lot of attention. However, detecting pedestrians under the demanding requirements of real-world applications is still a challenging task because of large intraclass variations caused by different views and articulated poses, partial occlusion, and changes in illumination. In recent years, a number of methods have been proposed to achieve robust and practical detection. They can be roughly classified into three categories: works built on a holistic model [3–7], part/patch-based approaches [8–16], and detectors using multiple feature channels and a boosted classifier [17–22].

Methods in the first category take the whole pedestrian as input and make decisions by SVM or template matching. In 2005, Dalal and Triggs [3] proposed the Histograms of Oriented Gradients (HOG) feature to encode the information of an entire pedestrian and trained the detector on HOG features using a linear SVM. Since then, several variations [4] and combinations [5, 6] have been proposed to improve detection performance. In [7, 23], Dominant Orientation Templates (DOT) are used for fast feature calculation, and holistic detection is performed by template matching. Holistic methods can detect pedestrians quickly and accurately in simple scenes; however, their detection performance decreases sharply when the appearance of pedestrians changes because of factors such as illumination, view, and pose.

Methods in the second category have two different implementations: those based on the Deformable Parts Model (DPM) [8–11] and those based on the Implicit Shape Model (ISM) [12–16]. DPM [8] and its variants [9–11] extend the work in [3] with multiple local parts and spatial configurations of these parts, learned by latent SVM. Methods of this kind significantly improve detection performance in cluttered scenes. However, the process is time-consuming, and some properties of the local parts, such as their number and size, must be predefined. ISM based methods [12, 13] use small, local image patches to vote for the object center with the generalized Hough transform [24]. Hough Forests [14–16] extend the standard Random Forest [25] to learn the codebook of ISM. ISM based methods have been widely used for detecting facial feature points [26] and body joints for human pose estimation [27]. However, they have had limited success in pedestrian detection because of dense sampling and the enormous number of scattered votes for the whole human.

Methods in the last category assemble multiple weak classifiers by a boosting algorithm [28]. Each weak classifier is defined by a selected feature channel [17, 18] or a representative exemplar [19]. In particular, tree-structured detectors [20–22] can not only assemble multiple weak classifiers but also model intraclass subcategories as different branches of the tree. Methods based on tree-structured classifiers are well suited to multiview, multipose pedestrian detection and can achieve very fast detection with a cascade architecture [29]. However, the weak classifiers cannot divide the sample space optimally because the feature selector and split function they use are too simple.

In this paper, we combine the advantages of the approaches mentioned above and try to solve some key problems of pedestrian detection, as shown in Figure 1. Pedestrians as a whole are clearly hard to separate by a holistic property; however, some representative local templates with small variations can be found. Generally, they correspond to the main parts of a human, such as the head, the left/right hand, and the left/right leg. Motivated by this observation, we propose a new pedestrian detection method that combines multiple weak classifiers built on local templates by means of a Random Forest. The templates are generated adaptively with different sizes and locations in positive exemplars. The splitting functions in the forest are learned by the joint use of a template selector and template matching. A weak classifier consists of the splitting functions at one depth of the forest. To improve classification accuracy, all weak classifiers are assembled by a boosting-like algorithm [21, 28] in which the sample weights are updated iteratively. Each added weak classifier increases the depth of the forest until the predefined maximum depth is reached. For fast calculation, each local template is represented by the Dominant Orientation Template (DOT) feature [23]. During detection, a sliding window is passed through each tree, and the final decision is made by averaging the estimates of all trees in the forest. To accelerate detection, we use a cascade detection architecture.

The proposed detection method is evaluated on two well-known pedestrian datasets, TUD pedestrians and INRIA pedestrians, where it achieves state-of-the-art or competitive performance. Our method is on par with the most successful part-based detection system [8] while requiring far less design complexity and computational complexity.

The major contributions of our method can be summarized as follows:
(i) We define multiple adaptive local DOT templates with different sizes and locations to represent the parts of a pedestrian.
(ii) We learn each splitting function in the forest based on a template selector and template matching.
(iii) We use a boosting-like algorithm to update the weights of the training samples in a layer-wise fashion.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 gives an overview of our method. Section 4 introduces the proposed method. Section 5 describes how the method is used for detection, followed by the implementation details and experimental results in Section 6. Section 7 presents our conclusions and future work.

2. Related Work

The approaches most closely related to ours are the works based on DPM [8–10], which extend the rigid HOG template and SVM approach of [3] with deformable parts and multiple components. In those methods, each deformable part is explicitly defined by a local template and a relative offset vector with respect to the object center. The intraclass variation is captured by dividing the training data into multiple components according to the aspect ratio. The final decision is based on the template matching scores minus a deformation cost that depends on the relative position of each part. Nevertheless, DPM has some disadvantages: (i) a different model has to be trained for each component, and (ii) the explicit definitions of local templates and their relative offset vectors are complicated and time-consuming. In contrast, we have a single model that captures the intraclass variability by different branches of the tree. Furthermore, the local parts can be shared between different components, and the positional relations between different parts are represented implicitly during assembly.

Some methods [20, 30–32] build on a similar boosting framework to learn object models. The influential work on Integral Channel Features [32] computes several feature channels, including color, gradient magnitude, and quantized gradient orientation, which are similar to the DOT feature used in our method. However, the weak classifiers in those methods are defined only by the selected feature channels and do not exploit the advantages of DPM, which has shown state-of-the-art results on several challenging datasets. In addition, with multiple tree-structured classifiers in the Random Forest, the weak classifier in our method considers multiple splits defined by different local templates, which makes it more robust to intraclass variations.

Random Forest has attracted a lot of attention in computer vision. Schulter et al. [21, 22] propose a new Alternating Decision Forest (ADF) classifier for object detection. All trees in the forest are treated as a whole, and the forest is constructed by alternating between training a single depth of the forest and updating the sample weights for the next depth until the same stopping criterion as in a standard Random Forest is reached. Our method adopts a similar strategy; however, our split function is defined by local template matching instead of a single feature comparison. Yao et al. [33] propose that each node in the forest selects a rectangular region and applies a linear SVM to that region of all samples for splitting. In contrast, although multiple features are also used in our method, the matching of local regions represented by the DOT feature can be computed rapidly by bitwise operations. Tang et al. [7] present a pedestrian detection method that combines Random Forest and the DOT feature to achieve fast detection; however, the DOT feature is used there to represent a holistic template, which has been shown to be inflexible for detecting objects with intraclass variations.

3. Overview of Our Method

In this section, we give an overview of our method. As shown in Figure 2, it consists mainly of extracting the DOT feature, generating adaptive local templates, and constructing the forest in a layer-wise manner with splitting functions defined by template matching.

The first step of our method comprises data preparation and DOT feature extraction. The training images are denoted as $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a training sample and $y_i \in \{-1, 1\}$ is the class label of the sample (−1: negative, 1: positive). For each tree $T_t$, a training set $\mathcal{X}_t$ with $N_t$ images is sampled from $\mathcal{X}$ by means of bootstrap [34]. Similarly, an exemplar set $\mathcal{E}_t$ is randomly sampled from $\mathcal{X}^{+}$ (the positive samples in $\mathcal{X}$), with a much smaller size than that of $\mathcal{X}_t$. With these two sample sets prepared for $T_t$, the corresponding DOT feature sets of the training set and the exemplar set are denoted as $\mathcal{X}_t^{\mathrm{DOT}}$ and $\mathcal{E}_t^{\mathrm{DOT}}$, respectively. Figure 2(a) illustrates the basic process of extracting the DOT feature.

With the data prepared above, the training process begins. First, a few adaptive local templates with different sizes and locations are generated for each exemplar set $\mathcal{E}_t$, as shown in Figure 2(b). With the generated local templates, the splitting function at a node in the tree is defined by a selected template and its matching with the samples at this node. Given a threshold, the samples are split into two subsets according to the matching results. The optimal splitting function is found by maximizing the class purity of the resulting subsets. Each tree is constructed by splitting the samples recursively until one of the stopping criteria is reached.

To improve classification accuracy, we train the forest in an iterative, layer-wise manner, similar to a boosting algorithm. The iterations are indexed by the current depth $d$ of the forest, as shown in Figure 2(c). To this end, we define the weight vector of the training samples of $T_t$ at depth $d$ as $w_t^{d} = \{w_{t,i}^{d}\}$, $i = 1, \dots, N_t$. It is set uniformly at the first depth and updated iteratively. The class distribution of each node is estimated from the labels and weights of its samples. With these definitions, the iterations begin. In iteration $d$, we first find the optimal splitting functions of the nodes in each tree $T_t$. Second, the samples at each node are split into two child nodes according to the learned splitting functions. Third, a new weak classifier $h_d$ consisting of the learned splitting functions is added to the trained boosting classifier $F_{d-1}$. Finally, we use the resulting boosting classifier $F_d$ to update the weights of the samples for depth $d+1$ by minimizing a global loss. After all iterations, the training process is finished.

For pedestrian detection, we adopt the standard sliding window method on the test image. Each window, represented by the DOT feature, is passed through each tree in the trained forest. The final decision is made by averaging the estimates of all trees. To accelerate detection, we adopt a cascade detection architecture that rejects negative windows as early as possible.

4. Proposed Method

In this section, we firstly introduce some basic concepts and elements of Random Forest [25] since our method is built on Random Forest. Then, we propose a novel splitting function built on adaptive local DOT template. Finally, we give the definition of weak classifier in our method and describe how to assemble multiple weak classifiers in layer-wise fashion by boosting algorithm.

4.1. Introduction to Random Forest

A standard Random Forest [25] is an ensemble of randomized binary decision trees $\{T_t\}_{t=1}^{T}$, which describe a nonlinear mapping from a $K$-dimensional feature space to the label space (although the Random Forest is inherently multiclass, we consider only the binary case for pedestrian detection). Given a sample $x$, each tree $T_t$ returns a score defined by the class probability distribution $p_t(c \mid x)$; the final class label is obtained by maximizing the average score over the $T$ trees:

$$c^{*} = \arg\max_{c} \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid x). \tag{1}$$

Random Forest assembles multiple trees by means of bootstrap [34]. The trees in the forest are constructed independently of each other by recursively splitting the samples at each node such that the class-label purity of the samples reaching the newly created child nodes increases, until one of the following stopping criteria is met: (1) the depth of the node equals the maximum depth; (2) the number of samples reaching the node is too small; (3) the class-label purity of the samples reaching the node is high enough.

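Generally, a splitting function is parameterized by two values: a selected feature dimension $k$ and a threshold $\tau$. The splitting function is then written as

$$\phi(x; k, \tau) = \begin{cases} 0, & x(k) < \tau, \\ 1, & \text{otherwise}, \end{cases} \tag{2}$$

where $\phi$ defines which child node the sample $x$ reaches.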

Each node chooses the best splitting function from a randomly sampled set by maximizing the class-label purity defined by

$$\Delta = \frac{|L|}{|L| + |R|} H(L) + \frac{|R|}{|L| + |R|} H(R), \tag{3}$$

where $L$ and $R$ are the sets of samples that reach the left and right child nodes, respectively, according to $\phi$; $|\cdot|$ denotes the size of a set; and $H(\cdot)$ measures the class-label purity of a sample set. In this paper, we use the negative entropy to calculate $H(\cdot)$, which is defined as

$$H(S) = \sum_{c \in \{-1, 1\}} p(c \mid S) \log p(c \mid S). \tag{4}$$

Here, $p(c \mid S)$ is the probability of class $c$, estimated by the ratio of positive or negative samples in $S$.
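
As an illustration only, the purity-driven choice of an axis-aligned split described in this subsection can be sketched in Python. The sketch reads the purity of a split as the size-weighted negative entropy of the two child sets; the function names, the number of random candidates, and the use of a seeded generator are assumptions of the sketch and are not taken from the implementation evaluated in this paper.

```python
import math
import random


def negative_entropy(labels):
    """Negative entropy sum_c p(c) * log p(c); 0.0 for a pure or empty set."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * math.log(p) for p in probs)


def split_purity(left, right):
    """Size-weighted negative entropy of the two child label lists."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * negative_entropy(left)
            + len(right) * negative_entropy(right)) / n


def best_axis_aligned_split(samples, labels, n_candidates=50, seed=0):
    """Try random (dimension, threshold) pairs and keep the purest split."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_candidates):
        k = rng.randrange(len(samples[0]))           # random feature dimension
        values = [x[k] for x in samples]
        tau = rng.uniform(min(values), max(values))  # random threshold
        left = [y for x, y in zip(samples, labels) if x[k] < tau]
        right = [y for x, y in zip(samples, labels) if x[k] >= tau]
        score = split_purity(left, right)
        if best is None or score > best[0]:
            best = (score, k, tau)
    return best
```

A purity of 0 means that both child sets are pure; every impure split yields a negative value, so maximizing the purity favors clean separations.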

4.2. A Novel Split Function with Adaptive Local DOT Template

The key point of Random Forest based detector is to design a fast and effective splitting function. As discussed in [25], a nonlinear hyperplane outperforms axis-aligned ones. In this section, we propose a novel splitting function which is defined by adaptive local DOT template and nonlinear template matching.

The prerequisite of the proposed splitting function is the computation of the DOT feature, so we first briefly describe DOT feature extraction, following [35]. The DOT feature is a block-based descriptor: each block of pixels encodes the discretized gradient orientations of its 7 strongest gradients into one byte. Using a predefined threshold, the first bit indicates whether the block is uniform, and the dominant orientations are quantized into the remaining 7 bits. To make the matching invariant to small local deformations, translations within a small range are considered for each block. Figure 2(a) illustrates the computation of the DOT feature. To tolerate changes in color and illumination, we also encode the HSV color space in a similar way, as in [7]; that is, each block encodes the discretized H values of the pixels with the 7 strongest V values into one byte. The final binary representation of each block is a 16-bit descriptor formed by concatenating the dominant orientation descriptor and the dominant color descriptor. For simplicity, we refer to the representation built on these two DOT-based descriptors as the DOT feature. The similarity between a DOT template and a training image described by the DOT feature is computed with bitwise AND operations.
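
The bit-level idea behind this descriptor can be illustrated with a small Python sketch. It assumes, purely for illustration, that orientation (or color) bins are numbered 0 to 6, that the uniform flag occupies the highest bit of each byte, and that the orientation byte is stored in the upper half of the 16-bit descriptor; the actual bit layout of the implementation is not specified here.

```python
def encode_bins(bins, uniform=False):
    """Pack a collection of dominant bins (0..6) into one byte.

    Bit 7 is used as the 'uniform block' flag (an assumed layout).
    """
    code = 0x80 if uniform else 0x00
    for b in bins:
        if not 0 <= b < 7:
            raise ValueError("bin index must be in 0..6")
        code |= 1 << b
    return code


def block_descriptor(orientation_code, color_code):
    """Concatenate the orientation byte and the color byte into 16 bits."""
    return ((orientation_code & 0xFF) << 8) | (color_code & 0xFF)


def block_similarity(a, b):
    """Count the bits set in both descriptors (bitwise AND, then popcount)."""
    return bin(a & b).count("1")
```

For example, two blocks that share one dominant orientation and one dominant color obtain a similarity of 2, regardless of how many other bits are set in only one of them.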

We suppose that every training image can be partitioned into $W \times H$ overlapping blocks and that each block is encoded as a 16-bit DOT feature. $\mathcal{X}_n$ is the DOT training set at a node $n$ in tree $T_t$. $\mathcal{E}_t^{\mathrm{DOT}}$ is the DOT exemplar set of $T_t$, and each $e \in \mathcal{E}_t^{\mathrm{DOT}}$ is a holistic DOT template since it describes a whole pedestrian. The adaptive local DOT template is generated by the local template selector illustrated in Figure 2(b). The basic idea is as follows. First, an exemplar $e$ is randomly selected from $\mathcal{E}_t^{\mathrm{DOT}}$. Then, an adaptive local DOT template $P$ is generated by randomly selecting a rectangular region of contiguous blocks in $e$. The top-left coordinate of $P$ is denoted as $(u, v)$, and the width $s_w$ and height $s_h$ are randomly generated subject to the predefined maximum size $S$. Together they form the configuration $\rho = (u, v, s_w, s_h)$. Note that the coordinates here are expressed in blocks.

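With the selected adaptive local template $P$ and configuration $\rho$, $\mathcal{X}_n$ is divided by comparing $P$ with the local DOT features of all training samples, where each local DOT feature is extracted according to $\rho$. Therefore, the splitting function in (2) becomes

$$\phi(x; P, \rho, \tau) = \begin{cases} 0, & f(P, x_{\rho}) < \tau, \\ 1, & \text{otherwise}, \end{cases} \qquad f(P, x_{\rho}) = \sum_{(i,j)} \mathrm{bitcount}\bigl(P(i,j) \otimes x_{\rho}(i,j)\bigr). \tag{5}$$

Here, $x_{\rho}$ is the local DOT feature of training sample $x$ under configuration $\rho$; $f$ is the matching function of the DOT feature; $\otimes$ is the bitwise AND operation; $P(i,j)$ and $x_{\rho}(i,j)$ are the 16-bit DOT features of the block at location $(i,j)$ in $P$ and $x_{\rho}$, respectively; and $\mathrm{bitcount}(\cdot)$ counts the number of 1s in the 16-bit matching result. That is, the similarity is measured by the number of matched dominant orientations and dominant colors in $P$ and $x_{\rho}$. The optimal split is parameterized by the selected exemplar, the configuration $\rho$, and the threshold $\tau$, which are optimized by maximizing the class-label purity of each division. Algorithm 1 gives an overview of the optimization process.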

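Algorithm 1: Learning the optimal splitting function at a node.
Input:
 Samples at a node $n$ in $T_t$: $\mathcal{X}_n$
 Exemplar set of $T_t$: $\mathcal{E}_t^{\mathrm{DOT}}$
 Block size of each sample: $W \times H$
 Maximum size of local template: $S$
Output:
 Optimal splitting parameter: $(e^{*}, \rho^{*}, \tau^{*})$
(1) Initialization: $\Delta^{*} = -\infty$
(2) for each $e \in \mathcal{E}_t^{\mathrm{DOT}}$ do
(3) for $k = 1$ to $K$ (number of candidate configurations) do
(4)  Randomly generate configuration:
    $\rho = (u, v, s_w, s_h)$ with $s_w, s_h \le S$, $u + s_w \le W$, $v + s_h \le H$
(5)  Generate adaptive local DOT template $P$ in $e$ according to $\rho$
(6)  Calculate the maximum and minimum value of $f$:
    $f_{\max} = \max_{x \in \mathcal{X}_n} f(P, x_{\rho})$
    $f_{\min} = \min_{x \in \mathcal{X}_n} f(P, x_{\rho})$
(7)  for $j = 1$ to $J$ (number of candidate thresholds) do
(8)   Randomly select a threshold $\tau \in [f_{\min}, f_{\max}]$
(9)    Divide samples at $n$ into two subsets:
    $L = \{x \in \mathcal{X}_n : f(P, x_{\rho}) < \tau\}$
    $R = \mathcal{X}_n \setminus L$
(10)   Calculate class-label purity $\Delta$ by (3)
(11)     if $\Delta > \Delta^{*}$ then
(12)     $\Delta^{*} = \Delta$
(13)     $e^{*} = e$, $\rho^{*} = \rho$, $\tau^{*} = \tau$
(14)     end if
(15)  end for
(16) end for
(17) end for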
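
The randomized search of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: each sample and exemplar is taken to be a list of rows of 16-bit integers, the purity is the unweighted, size-weighted negative entropy, and the names `learn_split`, `n_configs`, and `n_thresholds` are hypothetical.

```python
import math
import random


def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def match_score(template, sample, u, v):
    """Sum of popcount(template AND sample) with the template placed at (u, v)."""
    return sum(popcount(t & sample[v + j][u + i])
               for j, row in enumerate(template)
               for i, t in enumerate(row))


def neg_entropy(labels):
    """Negative entropy of a label list; 0.0 for a pure or empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * math.log(p) for p in probs)


def learn_split(samples, labels, exemplars, max_size,
                n_configs=10, n_thresholds=10, seed=0):
    """Randomly search exemplar, configuration, and threshold for the purest split."""
    rng = random.Random(seed)
    rows, cols = len(samples[0]), len(samples[0][0])
    best = {"purity": -math.inf}
    for e_idx, exemplar in enumerate(exemplars):
        for _ in range(n_configs):
            # Random template size (in blocks) and top-left position.
            w = rng.randint(1, min(max_size, cols))
            h = rng.randint(1, min(max_size, rows))
            u = rng.randint(0, cols - w)
            v = rng.randint(0, rows - h)
            template = [row[u:u + w] for row in exemplar[v:v + h]]
            scores = [match_score(template, s, u, v) for s in samples]
            lo, hi = min(scores), max(scores)
            for _ in range(n_thresholds):
                tau = rng.uniform(lo, hi)
                left = [y for sc, y in zip(scores, labels) if sc < tau]
                right = [y for sc, y in zip(scores, labels) if sc >= tau]
                purity = (len(left) * neg_entropy(left)
                          + len(right) * neg_entropy(right)) / len(labels)
                if purity > best["purity"]:
                    best = {"purity": purity, "exemplar": e_idx,
                            "config": (u, v, w, h), "tau": tau}
    return best
```

The sketch matches each sample at the same block position from which the template was cropped, which reflects the idea that a local template is tied to a fixed location in the normalized window.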

With the proposed adaptive local DOT template, the splitting function is not only robust to small local transformations, but also very fast to evaluate since the matching function can be further sped up using SSE operations, similar to [35]. More importantly, each tree in the forest provides both discriminative and complementary local information for classification.

4.3. Assembling Weak Classifiers with a Global Loss

Tree-structured methods typically take the splitting functions at each depth as a weak classifier, and multiple weak classifiers are assembled with a boosting algorithm [19–21, 28]. To make full use of the complementary information provided by multiple trees, we generalize this idea to a Random Forest based method. To this end, the forest is treated as a whole and constructed in a layer-wise fashion. Each layer is indexed by the current depth $d$ of the forest. The splitting functions at one depth of the forest constitute a weak classifier. To assemble multiple weak classifiers by boosting, we introduce a global loss. Suppose that the boosting classifier $F_{d-1}$ consisting of the first $d-1$ weak classifiers has been trained. It gives a prediction of the class distribution of each sample. The new weak classifier $h_d$ is learned and added by minimizing the global loss computed from $F_{d-1}$. Each added weak classifier deepens the forest by one layer until the maximum depth is reached.

As described in Section 3, the forest consists of $T$ trees with maximum depth $D$. Each tree $T_t$ has a training set $\mathcal{X}_t$ and an exemplar set $\mathcal{E}_t$. To obtain the final boosting classifier, the boosting training procedure runs $D$ times. Unlike in a standard Random Forest, the samples in our method are weighted, and their weights are updated at each depth. The initial weight vector of each $\mathcal{X}_t$ is set uniformly and denoted as $w_t^{1}$, and $w_{t,i}^{d}$ is the weight of the $i$th sample in $\mathcal{X}_t$ at the current depth $d$. With weighted samples, the class distribution of a node used in (4) should take the weights into account. It is defined as

$$p(c \mid S) = \frac{\sum_{x_i \in S} w_i^{d}\, \delta(y_i = c)}{\sum_{x_i \in S} w_i^{d}}, \tag{6}$$

where $S$ is a sample set, $w_i^{d}$ is the weight of sample $x_i$ in $S$, and $\delta(\cdot)$ is an indicator function that returns 1 if $y_i = c$ and 0 otherwise. The class distribution of each node in the first layer can then be computed by (6).
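
A weighted class distribution of this kind can be computed as in the following Python sketch, in which each class probability is the share of the total sample weight carried by that class. The function name and the fallback to a uniform distribution for a zero total weight are assumptions of the sketch.

```python
def class_distribution(labels, weights, classes=(-1, 1)):
    """Return {class: weighted probability} for the samples at a node."""
    total = sum(weights)
    if total == 0:
        # Degenerate case (assumed handling): no weight mass at this node.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: sum(w for y, w in zip(labels, weights) if y == c) / total
            for c in classes}
```

With uniform weights the result reduces to the plain ratio of positive and negative samples, so the unweighted purity criterion is recovered as a special case.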

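With the initial weights and class distributions, the splitting function of each node at depth 1 can be learned by Algorithm 1. Suppose the forest with current depth $d-1$ has been trained. It can be considered as a boosting classifier, written as

$$F_{d-1}(x; \Theta_{d-1}) = \sum_{k=1}^{d-1} \nu\, h_k(x; \theta_k).$$

Here, $h_k$ is the weak classifier at depth $k$; the trained boosting classifier and each weak classifier are parameterized by $\Theta_{d-1}$ and $\theta_k$, respectively; and $\nu$ is the shrinkage factor [28]. $h_k$ can be estimated by the trained forest at depth $k$ from the class distributions of the sample sets $S_{t,k}(x)$, where $S_{t,k}(x)$ is the sample set at the node to which $x$ is routed by tree $T_t$ at depth $k$. If $d \le D$, $h_d$ will be trained and added to $F_{d-1}$ at depth $d$. The assembled strong boosting classifier becomes

$$F_d(x; \Theta_d) = F_{d-1}(x; \Theta_{d-1}) + \nu\, h_d(x; \theta_d).$$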

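As discussed in [21, 28], training the new weak classifier $h_d$ can be written as a global loss minimization problem:

$$\theta_d = \arg\min_{\theta} \sum_{i} \ell\bigl(y_i,\; F_{d-1}(x_i; \Theta_{d-1}) + \nu\, h_d(x_i; \theta)\bigr), \tag{10}$$

where $\ell(\cdot)$ is a differentiable loss function, $\Theta_{d-1}$ is the parameter set that is already fixed, and $\theta_d$ is the parameter set to be trained at depth $d$. The minimization problem can be solved by updating the weights of the samples in each tree at depth $d$:

$$w_i^{d} = \left| \frac{\partial \ell\bigl(y_i, F_{d-1}(x_i; \Theta_{d-1})\bigr)}{\partial F_{d-1}(x_i; \Theta_{d-1})} \right|. \tag{11}$$

With the updated sample weights at depth $d$, $\theta_d$, which includes the parameters of every splitting function in $h_d$, can be learned by Algorithm 1.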

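In this paper, the nonconvex tangent loss function proposed by Masnadi-Shirazi et al. [23] is adopted. It is defined as

$$\ell(y, F(x)) = \bigl(2 \arctan(y F(x)) - 1\bigr)^{2}. \tag{12}$$

Although any differentiable loss function can be used, the tangent loss has been shown to be more robust to label noise. With the tangent loss function defined above, (11) becomes

$$w_i^{d} = \left| \frac{4 \bigl(2 \arctan\bigl(y_i F_{d-1}(x_i)\bigr) - 1\bigr)}{1 + F_{d-1}(x_i)^{2}} \right|. \tag{13}$$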
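
For reference, the tangent loss and the sample weight derived from it can be sketched in Python. The sketch assumes the standard form of the tangent loss, $(2\arctan(yF) - 1)^2$, labels in $\{-1, 1\}$, and a weight equal to the magnitude of the derivative of the loss with respect to the classifier output; the function names are hypothetical.

```python
import math


def tangent_loss(y, f):
    """Tangent loss for a label y in {-1, +1} and a real-valued output f."""
    return (2.0 * math.atan(y * f) - 1.0) ** 2


def tangent_weight(y, f):
    """|d loss / d f|; uses y * y == 1, so (y * f) ** 2 == f * f."""
    return abs(4.0 * (2.0 * math.atan(y * f) - 1.0) / (1.0 + f * f))
```

A sample that the current classifier gets wrong (for example, $y = 1$ and $F = -0.5$) receives a larger weight than one it gets right ($y = 1$ and $F = 0.5$), while the bounded derivative limits the influence of samples with a very large negative margin, which is the source of the robustness to label noise.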

When the weak classifier at depth $D$ has been trained, the construction of the forest stops. The training procedure is summarized in Algorithm 2.

Algorithm 2: Layer-wise training of the forest.
Input:
 Number of trees: $T$
 Maximum depth: $D$
 Training set for each tree: $\mathcal{X}_t$
 Template set for each tree: $\mathcal{E}_t$
(1) Initialize weights for samples of each tree:
     $w_{t,i}^{1} = 1/N_t$
(2) Compute class distribution of root node of each tree by (6)
(3) for $d = 1$ to $D$ do
(4) Check stopping criteria for nodes in depth $d$
(5) Split nodes in depth $d$ by Algorithm 1
(6) Update weight by (13) for each sample in each $\mathcal{X}_t$
(7) Calculate class distribution of newly created nodes in depth $d + 1$ by (6)
(8) end for

5. Detecting Pedestrian with Proposed Method

To detect objects, we adopt a standard sliding window method on the test image represented by the DOT feature. A test window $x$ is passed through each tree $T_t$ in the trained forest according to the learned split parameters of each node until it reaches a leaf node $l_t(x)$ in $T_t$. The score of window $x$ estimated by $T_t$ is given by the class distribution of $l_t(x)$, which is calculated by (6) with $c = 1$. The final score of window $x$ estimated by the forest is defined by averaging the scores obtained by all trees in the forest:

$$s(x) = \frac{1}{T} \sum_{t=1}^{T} p\bigl(c = 1 \mid l_t(x)\bigr).$$

The test window is classified as a pedestrian if $s(x)$ exceeds the detection threshold $\tau_{\mathrm{det}}$, which is determined during validation.

In particular, we adopt a cascade architecture to speed up the detection procedure. With this approach, windows that can no longer reach the detection threshold, even in theory, are rejected as early as possible. The cascade architecture makes detection significantly faster because, for the large majority of windows in a test image, there is no need to compute the probability for all trees in the forest. Algorithm 3 describes the cascade detection procedure.

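Algorithm 3: Cascade detection for a test window.
Input:
 Trained forest: $\{T_t\}_{t=1}^{T}$
 Test window: $x$
Output:
 Label of $x$
(1) Initialize: $s = 0$
(2) for $t = 1$ to $T$ do
(3) Evaluate $x$ with $T_t$ by (15)
(4) $s = s + p\bigl(c = 1 \mid l_t(x)\bigr)$
(5) $s_{\max} = (s + T - t)/T$
(6) if $s_{\max} < \tau_{\mathrm{det}}$ then
(7)  reject $x$
(8)  return −1
(9) end if
(10) end for
(11) return 1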
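
One way to realize such an early rejection is sketched below in Python. The sketch assumes that each tree returns a score in $[0, 1]$, so that after $t$ trees the best achievable average is the accumulated score plus one for every remaining tree, divided by the number of trees; a window is rejected as soon as this bound falls below the detection threshold. The rejection rule and all names are assumptions of the sketch, not a transcription of Algorithm 3.

```python
def cascade_detect(window, trees, threshold):
    """Return (label, number of trees evaluated) for one test window.

    `trees` is a list of callables that map a window to a score in [0, 1].
    """
    n = len(trees)
    total = 0.0
    for t, tree in enumerate(trees, start=1):
        total += tree(window)
        # Upper bound on the final average if all remaining trees score 1.
        best_possible = (total + (n - t)) / n
        if best_possible < threshold:
            return -1, t
    return 1, n
```

For a forest of 10 trees and a threshold of 0.75, a window that scores 0 in every tree is rejected after the third tree, because the best achievable average is then 0.7.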

6. Experimental Results

We evaluate the proposed method on two challenging pedestrian datasets, TUD pedestrians and INRIA pedestrians, and compare its performance with that of competing detection methods, including the best algorithms in this field that we are aware of. We follow the PASCAL protocol [36] to decide whether a detection is a true positive; namely, the overlap between the detected bounding box and the ground-truth box must exceed 50%. To avoid multiple detections of the same ground truth, we reject detections whose centers lie inside a bounding box detected with a higher score. Additionally, we analyze the two most relevant parameters of our method on the TUD pedestrian dataset: the maximum size $S$ of the local template and the maximum depth $D$. All experiments are performed on a machine with two Intel Core(TM) i5 3.2 GHz CPUs, 16 GB of RAM, and a 64-bit Windows OS.
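
The two evaluation rules can be sketched in Python. The overlap of the PASCAL protocol is the area of intersection divided by the area of union of the two boxes. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the suppression step compares each detection only with higher-scoring detections that were themselves kept, which is one possible reading of the rule; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_true_positive(detection, ground_truth, min_overlap=0.5):
    """PASCAL criterion: the overlap must exceed 50%."""
    return iou(detection, ground_truth) > min_overlap


def center_inside(box, other):
    """True if the center of `box` lies inside `other`."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    return other[0] <= cx <= other[2] and other[1] <= cy <= other[3]


def suppress(detections):
    """Keep (box, score) pairs whose center is outside every kept, higher-scoring box."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if not any(center_inside(box, k[0]) for k in kept):
            kept.append((box, score))
    return kept
```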

6.1. Datasets and Experimental Setup
6.1.1. TUD Pedestrians

The TUD pedestrian dataset is a widely used benchmark for human detection. It includes 400 training images and 250 test images with 311 pedestrians. Because the backgrounds in this dataset are mainly street scenes with little diversity, we collect negative samples from the INRIA dataset. In addition, we randomly select some positive samples from the INRIA dataset.

6.1.2. INRIA Pedestrians

The INRIA pedestrian dataset is also a popular benchmark for pedestrian detection. This dataset is very challenging because of various intraclass variations and cluttered scenes. The training set includes 614 images with 1208 pedestrians and 1218 background images. In order to tolerate changes caused by poses, views, occlusions, and so forth, we flip the 1208 normalized pedestrian windows and get 2416 normalized positive samples. Negative training windows are sampled randomly from 1218 background images. The test set includes 288 images with 1126 pedestrians and 453 images without them.

When training the proposed model on these two datasets, all samples are normalized to the same size in pixels. As mentioned before, every block has the same fixed size in pixels, and neighboring blocks overlap by 4 pixels, so each sample is partitioned into a fixed grid of overlapping blocks. We set the number of trees $T$ as discussed in [22]. The maximum depth $D$ and the maximum size $S$ of the local template are set to the values found by exhaustive optimization on a validation set. Additionally, we find that the discriminating power of small local templates is too low, so we restrict the width and height of each local template to a range that excludes very small sizes. During detection, the test image is partitioned in the same way, and each detection window slides with a step of one block. To handle scale variations of the object, we resize each test image to 20 scales with a scale factor of 1.05 between consecutive scales.
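
The block grid and the scale pyramid can be illustrated with a short Python sketch. The number of overlapping blocks along one axis follows from the window length, the block size, and the overlap; the pyramid is assumed here to shrink the image by a factor of 1.05 per level, starting from the original size. The example numbers in the usage note are placeholders, not the settings of our experiments.

```python
def blocks_along(length, block, overlap):
    """Number of overlapping blocks that fit along one axis."""
    step = block - overlap
    if step <= 0 or length < block:
        raise ValueError("need block > overlap and length >= block")
    return (length - block) // step + 1


def pyramid_scales(n_scales=20, factor=1.05):
    """Resize factors 1, 1/factor, 1/factor**2, ... (downscaling is assumed)."""
    return [factor ** -k for k in range(n_scales)]
```

For instance, a hypothetical axis of 64 pixels with 8-pixel blocks and a 4-pixel overlap contains 15 blocks, and the 20th pyramid level has a resize factor of about 0.396.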

6.2. Experimental Results

Figures 3(a) and 3(b) show some detection examples of our method on the TUD pedestrian and INRIA pedestrian datasets, respectively. They illustrate that the proposed algorithm can detect people with large intraclass variability caused by different poses, sizes, and clothing under varying illumination and in cluttered scenes.

To compare the performance of different methods on the TUD pedestrian test set, Receiver Operating Characteristic (ROC) curves are drawn. We use the definition in [36], in which Recall and Precision are computed as

$$\mathrm{Recall} = \frac{TP}{nP}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where $TP$ and $FP$ are the numbers of true positives and false positives, respectively, during testing and $nP$ is the total number of positives in the test dataset. The goal of every detection method is to improve Recall and Precision at the same time; unfortunately, the two trade off against each other. Generally, the Area Under Curve (AUC) of the corresponding curve is used to measure the performance of a method. The comparison is shown in Figure 4. The proposed method achieves AUC = 0.908 on this dataset, which outperforms the other competing algorithms. Furthermore, we compute the Equal Error Rate (EER) of the different methods, as shown in Table 1. The EER is the point on the ROC curve at which a positive sample and a negative sample are equally likely to be misclassified.

The INRIA pedestrian test set contains pedestrians with large intraclass variability. The methods are compared by their miss rate at 1 false positive per image (FPPI). We follow the definition in [3], in which the miss rate is computed as

$$\text{miss rate} = \frac{FN}{TP + FN},$$

where $FN$ is the number of false negatives during testing. For pedestrian detection, lowering the classifier threshold decreases the miss rate but increases the number of false positives in each test image. To make a fair comparison, we therefore fix one indicator and use the other to evaluate the performance of the different methods. Generally, the miss rate at 1 FPPI is used, as shown in Table 2. The proposed method achieves a miss rate of 0.12 at 1 FPPI, which is not as good as that of the state-of-the-art method [18]. Yet it is still quite competitive and, in particular, performs better than methods built on other block-based descriptors, such as HOG and LBP. This gives direct evidence of the effectiveness of the proposed adaptive local DOT template. The main reason for the gap between the proposed method and the work in [18] is that the detector in [18] is built not only on local features but also on the full object.
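
The evaluation measures used in this section can be written as a few Python helpers. They assume the usual definitions: recall is the fraction of ground-truth positives that are detected, precision is the fraction of detections that are correct, the miss rate is the fraction of ground-truth positives that are missed, and FPPI is the number of false positives divided by the number of test images. The counts in the usage note are made up for illustration.

```python
def recall(tp, n_pos):
    """True positives divided by the total number of ground-truth positives."""
    return tp / n_pos


def precision(tp, fp):
    """True positives divided by all detections; 0.0 if nothing was detected."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def miss_rate(tp, fn):
    """False negatives divided by the number of ground-truth positives."""
    return fn / (tp + fn)


def fppi(fp, n_images):
    """Average number of false positives per test image."""
    return fp / n_images
```

With 90 true positives, 10 false negatives, and 10 false positives, for example, recall and precision are both 0.9 and the miss rate is 0.1; the miss rate always equals one minus the recall.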

Regarding detection speed, we evaluate it on the TUD pedestrian dataset. With fast DOT template matching and the cascade detection architecture, the mean detection time for one test image is 0.18 seconds, which is faster than the 1.11 seconds of HF and the 0.85 seconds of DPM.

6.3. Parameter Evaluation

In this section, we analyze the two most important parameters of the proposed method: the maximum size $S$ of the local template and the maximum depth $D$. The TUD pedestrian dataset is used for evaluation. To evaluate $S$, we fix the maximum depth $D$. Generally, $S$ represents a compromise between discrimination and robustness to local variations. If $S$ is too small, the discriminating power of the selected local template is low; if $S$ is too large, the local DOT template cannot tolerate variations caused by different views and articulated poses, partial occlusion, and changes in illumination. Figure 5(a) depicts the relation between $S$ and the performance measured by the AUC of the PR curve. As expected, the performance increases with $S$ up to a certain limit and decreases beyond it. To evaluate the second parameter $D$, we fix the maximum size $S$ of the local template. From another point of view, $D$ can be considered as the number of local templates of a pedestrian that are assembled for classification. It is affected by two factors: the representativeness of the selected templates and the way in which the weak classifiers built on these templates are assembled. As shown in Figure 5(b), the experimental results indicate that increasing $D$ beyond a certain value is unnecessary for our method.

7. Conclusion and Future Work

We have proposed a novel compositional model for pedestrian detection in cluttered scenes. The key idea of our method is to assemble multiple weak classifiers defined by adaptive local templates, which we achieve with a Random Forest. The forest is built in an iterative, layer-wise manner. The adaptive local templates are used to learn the splitting functions in the forest, and all splitting functions at one depth form a weak classifier. Each new weak classifier is learned and added by minimizing a global loss while the sample weights are updated. The experimental results on two challenging pedestrian datasets indicate that the proposed method achieves state-of-the-art or competitive performance.

The experiments in this paper show that our method is robust and effective, to some extent, for detecting pedestrians with various intraclass variations. However, there is room for improvement, particularly on the challenging INRIA pedestrian dataset. The key to the problem is to explicitly model the combination relations of the selected local templates while learning each weak classifier, which would provide information about poses, views, and occlusions. In future work, we will continue our research to address this problem.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (61375038 and 11401060).