Abstract

This paper presents a solution for automatic multipedestrian tracking and counting. First, a background modeling algorithm is applied to obtain multipedestrian candidates, which are then confirmed by classification. Each pedestrian patch is then handled by real-time TLD (Tracking-Learning-Detection) to obtain a new predicted position according to a similarity measure. The TLD results are further compared with the classification list to decide whether each pedestrian is new, has disappeared, or already exists. Finally, single line counting with a buffer zone is employed to count pedestrians. Experimental results on the public PETS database demonstrate the validity of our solution.

1. Introduction

Pedestrian tracking and counting is an important research topic in computer vision. It plays a crucial role in many applications, including intelligent monitoring and traffic safety. Nevertheless, challenges such as large variations in pedestrian posture, background clutter, partial occlusion, and illumination changes complicate the problem.

Current state-of-the-art tracking algorithms can be roughly divided into two categories: generative and discriminative. Generative methods [1, 2] generally describe the target's characteristics with a generative model and then search the candidate targets for the one that minimizes the reconstruction error. In contrast, discriminative approaches [3-5] use trained classifiers to find a decision boundary between the target and the background. Because they exploit both target and background information, these approaches achieve higher tracking accuracy, and they have developed rapidly in recent years.

TLD is a popular discriminative method [5]. It integrates an online detector that can redetect the target, which is important for continuous tracking when the target disappears during tracking. In addition, TLD achieves accurate long-term tracking through online learning.

TLD has been used in various tracking tasks. Zhang et al. [6] used TLD for dynamic gesture tracking; to initialize the TLD algorithm, a specific gesture is manually marked with a bounding box in the first frame. Chen et al. [7] improved TLD to track a human in unconstrained environments, where the object also needs to be initialized manually. Crane tracking and monitoring with stable TLD also shows strong robustness and accuracy [8]. However, all of these studies address single-target tracking.

Pedestrian flow statistics involves tracking pedestrians and counting them in video [9], for example in the surveillance of crossroads. Unlike in the studies above, many people may appear at the same time, which calls for multiple object tracking. Moreover, a pedestrian may appear at any time. As a result, the targets cannot be marked manually, and automatically finding the objects to track is a key requirement for multipedestrian tracking and counting.

We propose a new method called improved TLD (ITLD), which introduces background subtraction to obtain multiple objects automatically and then uses an updating mechanism for the tracking list to manage them. Figure 1 shows the framework of our system, including the counting module.

2. Multipedestrian Patch Obtaining

There are several ways to obtain multiple objects. Cao et al. [10] use Haar features and the AdaBoost algorithm to detect human faces automatically. Zhou et al. [11] proposed PE-TLD, in which, after processing with ViBe and a variance filter, HOG features and an SVM classifier are used for automatic target recognition. Sharma et al. [12], in contrast, let users manually select the desired tracking objects during initialization.

In our system, a coarse-to-fine strategy is employed to obtain multiple targets. First, a dynamic average background model followed by the Otsu algorithm is used to extract candidate pedestrian patches. Next, we apply pedestrian detection combining Haar-AdaBoost [13, 14] and HOG-SVM [15] to exclude nonpedestrian patches.

2.1. Extraction of Candidate Pedestrian Patch

Our dynamic average background modeling is based on Gaussian statistics and takes into account the differences among the variances of the three color channels. To reduce modeling error, we compute separate statistics in RGB space for every pixel. Once the background model has been constructed, a simple subtraction between the current frame and the latest background yields a rough foreground. In this step the Otsu algorithm is used to obtain the segmentation threshold. Morphological processing is then used to remove small regions and fill the gaps in the foreground. Finally, the minimum bounding rectangle of each foreground region gives a possible pedestrian target. The detailed process is described in [16].
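The subtraction and thresholding steps can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in this paper: images are single-channel lists of 8-bit integers, the morphological step is omitted, and a single bounding rectangle is computed for the whole foreground rather than one per region. The function names `otsu_threshold`, `foreground_mask`, and `bounding_box` are hypothetical.

```python
def otsu_threshold(values, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Values <= t form the background class, values > t the foreground class.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of intensities of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def foreground_mask(frame, background):
    """Subtract the background, then binarize the difference with Otsu."""
    diff = [[abs(f - b) for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]
    t = otsu_threshold([v for row in diff for v in row])
    return [[1 if v > t else 0 for v in row] for row in diff]


def bounding_box(mask):
    """Return (x, y, w, h) of the rectangle enclosing all foreground pixels."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```

In a real system the bounding rectangle would be computed per connected foreground region after morphological cleaning, so that several pedestrians yield several candidate patches.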

Each pixel is described by the frame index n in the video stream, its position, and the values of its three color components. The construction of the background model can be expressed by (1)-(3), which rely on a binary mask computed from the per-component mean and standard deviation of the pixel from frame n-N+1 to the current frame n. In our work, N is set to 300.

The recurrences of background model updating are shown in the following:

2.2. Pedestrian Confirmation

Haar and HOG are two effective features for describing pedestrians. To obtain higher performance, in each frame we combine the HOG feature with an SVM classifier and the Haar feature with an AdaBoost classifier (shown in Figure 2) to determine whether an image patch contains a pedestrian. A patch is considered to contain a pedestrian if either classifier outputs yes.
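The "either classifier" rule amounts to a logical OR over the two detector outputs. The short Python sketch below shows only this fusion step; the two classifiers are passed in as plain callables, and the stand-in rules used in the usage example are hypothetical placeholders, not trained HOG-SVM or Haar-AdaBoost models.

```python
from typing import Callable, List, Tuple

Patch = Tuple[int, int, int, int]  # (x, y, w, h)


def confirm_pedestrians(candidates: List[Patch],
                        hog_svm: Callable[[Patch], bool],
                        haar_adaboost: Callable[[Patch], bool]) -> List[Patch]:
    """Keep a candidate patch if at least one classifier accepts it."""
    return [p for p in candidates if hog_svm(p) or haar_adaboost(p)]


# Usage with hypothetical stand-in classifiers (illustration only).
def tall_enough(p: Patch) -> bool:
    return p[3] >= 2 * p[2]       # accepts patches at least twice as tall as wide


def big_enough(p: Patch) -> bool:
    return p[2] * p[3] >= 2000    # accepts patches with a large area


candidates = [(0, 0, 20, 50), (5, 5, 60, 40), (9, 9, 10, 10)]
confirmed = confirm_pedestrians(candidates, tall_enough, big_enough)
```

The OR rule favors recall: a patch is discarded only when both classifiers reject it, which suits a stage whose false positives can still be filtered by the later tracking list management.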

In this step, a patch set records the pedestrian patches in frame j; each patch is described by its position (x, y) and its size (w, h).

3. Multipedestrian Tracking with TLD

The output of pedestrian classification serves as the input of the TLD algorithm, which solves the automatic selection of multiple objects. Since pedestrians appear in and disappear from the camera view at random, managing the multiobject list, including inserting, deleting, and maintaining entries, becomes important.

3.1. Single Pedestrian Tracking with TLD

Single pedestrian tracking with TLD includes three components, namely, tracking, learning, and detection [5]. Data preparation is the first step in implementing the TLD framework. Given a target bounding box, we use the overlap area under different scales to choose positive and negative samples. Within each scale, we collect patches by scanning from top to bottom and left to right, and then calculate the overlap ratio between each patch and the target. The best 10 patches with the maximum overlap ratio (larger than 0.6) are added to the positive sample set. Patches with an overlap ratio smaller than 0.1 are put into the negative sample set. Before being added, each patch is resized to a normalized size.
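The sample selection rule can be sketched in a few lines of Python. This is a minimal sketch that assumes the overlap ratio is intersection over union, as in the original TLD formulation; the thresholds 0.6 and 0.1 and the limit of 10 positives come from the text, while the function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def select_samples(target, patches, n_pos=10, pos_thr=0.6, neg_thr=0.1):
    """Split scanned patches into positive and negative training samples."""
    scored = [(overlap_ratio(p, target), p) for p in patches]
    # Positives: the n_pos patches with the largest overlap above pos_thr.
    good = [s for s in scored if s[0] > pos_thr]
    good.sort(key=lambda s: s[0], reverse=True)
    positives = [p for _, p in good[:n_pos]]
    # Negatives: every patch whose overlap is below neg_thr.
    negatives = [p for o, p in scored if o < neg_thr]
    return positives, negatives
```

Patches whose overlap lies between the two thresholds are left out of both sets, which keeps ambiguous examples from contaminating the initial model.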

The object model in a given frame can thus be expressed as the collection of positive and negative patches gathered so far, with the positive patches ordered from the first one added to the one added most recently. After this initialization, we use the TLD framework described in [5]. During model update, the relative similarity S(a, b) measures the similarity between objects a and b. The similarity between an image patch and the object model is calculated by (4)-(7), which build on the similarity between the patch and the positive training set and the similarity between the patch and the negative training set.

Expression (7) is used to decide whether a new patch is a positive one.

The output of the integrator is measured by the conservative similarity, which restricts the positive similarity to the first 50% of the positive patches.
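For illustration, the relative and conservative similarities can be sketched in Python. The sketch assumes the definitions of the original TLD formulation [5]: the patch-to-patch similarity is 0.5 (NCC + 1), the similarity to a set is the maximum over its patches, and the conservative variant uses only the first half of the positive patches. Patches are flattened lists of gray values of equal length; all function names are hypothetical, and the equations of this paper may differ in detail.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length patches."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def similarity(a, b):
    """Map NCC from [-1, 1] to [0, 1]."""
    return 0.5 * (ncc(a, b) + 1.0)


def _ratio(s_pos, s_neg):
    total = s_pos + s_neg
    return s_pos / total if total > 0 else 0.0


def relative_and_conservative(patch, positives, negatives):
    """Return (relative similarity, conservative similarity) of a patch."""
    s_pos = max(similarity(patch, p) for p in positives)
    s_neg = max(similarity(patch, q) for q in negatives)
    # Conservative similarity: only the earliest 50% of the positive patches.
    first_half = positives[:max(1, len(positives) // 2)]
    s_pos_half = max(similarity(patch, p) for p in first_half)
    return _ratio(s_pos, s_neg), _ratio(s_pos_half, s_neg)
```

Because the first half of the positive set contains the oldest examples, the conservative similarity is high only when the patch resembles appearances that were learned early, which makes it less sensitive to drift in recently added samples.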

For each tracked pedestrian, we use a single TLD framework to obtain its new position if it is visible in frame j. If n objects are visible in the current frame, we obtain a new list of n bounding boxes, where x, y, w, and h are the position and size of each bounding box given by TLD.

3.2. List Updating for Multiobject Tracking

Because pedestrians appear and disappear at random, dynamic maintenance of the pedestrian list is a key issue in multipedestrian tracking and counting. For example, the system must decide whether a target is new or already exists, especially when a target is occluded or leaves the field of view and then returns. Assuming that all targets are independent, we design a mechanism for updating the tracking list.

We record information such as position, size, and lifespan, from the first appearance to the current frame j, as the trajectory of a tracking object. The tracking list T is the set of all tracking objects. Each tracking target in T stores the position and size currently output by TLD in frame j, the last recorded position and size before the current one (tbold), the counters len, vlen, and ivlen, and two border labels, including rd. Here len, vlen, and ivlen are, respectively, the total number of frames (from the first appearance to the current frame), the number of visible frames, and the number of consecutive invisible frames. The two labels describe whether a pedestrian has passed the left or the right border of the buffer zone (introduced in Section 4). By comparing the current TLD output with the existing object trajectory list T, an updated tracking list is obtained.

Our tracking list updating can be divided into two steps: correlation matching and list adjustment. For each trajectory, the correlation matching step finds the most similar patch in the detection set. The similarity between patches a and b is measured by the Euclidean distance between their centroids. If the similarity satisfies (9), the patch is accepted as the closest match to the trajectory.
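A minimal Python sketch of the correlation matching step follows. Condition (9) is not reproduced here, so the acceptance test is assumed to be a simple upper bound `max_dist` on the centroid distance; that parameter and the function names are hypothetical.

```python
import math


def centroid(box):
    """Center point of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two boxes."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    return math.hypot(ax - bx, ay - by)


def best_match(track_box, detections, max_dist):
    """Index of the detection closest to track_box, or None if too far."""
    best_index, best_dist = None, None
    for i, det in enumerate(detections):
        d = centroid_distance(track_box, det)
        if best_dist is None or d < best_dist:
            best_index, best_dist = i, d
    if best_dist is not None and best_dist <= max_dist:
        return best_index
    return None
```

With this measure a smaller distance means a higher similarity, so the most similar patch is the one with the minimum centroid distance.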

The list adjustment step updates the tracking list. For matched trajectories, we update their records with the new parameters. Unmatched patches in the detection set are regarded as new pedestrian targets and are added to the tracking list as new trajectory elements. Trajectories that have disappeared or remained unmatched for longer than the required threshold are regarded as having left the video and are removed from the tracking list. The detailed updating flow of the tracking list is shown in Algorithm 1.

Algorithm 1: Updating flow of the tracking list.
Input: Detected targets in current frame j; tracking list T;
   current bounding box set given by TLD
Output: Updated tracking list T
1 Trajectory Initialization. Initialize T according to the detected targets. If the frame number is 1, return after initialization.
2 Correlation Computation. For each trajectory, compute the distance similarity to each patch in the detection set, then find its most similar patch according to (9).
3 Trajectory Updating. If a most similar patch is found, update the trajectory based on that patch according to (10). Otherwise, update some parameters of the trajectory.
4 Trajectory Deletion. Search the updated trajectory set T. If a trajectory satisfies the deletion condition, delete it from T.
5 Trajectory Insertion. Insert the unmatched patches in the detection set into T to form new trajectory records. Let m be the index in T of a new record; the parameters of this record are then initialized.
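The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the exact procedure: matching is greedy and uses only the detection set, the acceptance condition is a hypothetical distance bound `max_dist`, and a trajectory is deleted once its consecutive invisible frame count exceeds a hypothetical limit `max_invisible`. Field names follow the text where they are given (`vlen`, `ivlen`, `rd`); `length` stands for len, and `tb`, `tb_old`, and `ld` are assumed names.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    tb: tuple          # current box (x, y, w, h)
    tb_old: tuple      # previous box
    length: int = 1    # total frames since first appearance ("len" in the text)
    vlen: int = 1      # visible frames
    ivlen: int = 0     # consecutive invisible frames
    ld: int = 0        # label: passed the left border of the buffer zone
    rd: int = 0        # label: passed the right border of the buffer zone


def _center_dist(a, b):
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def update_tracking_list(tracks, detections, max_dist=30.0, max_invisible=10):
    """One update of the tracking list for the current frame."""
    unmatched = set(range(len(detections)))
    for t in tracks:
        t.length += 1
        # Correlation computation: nearest still-unmatched detection.
        dist, idx = min(((_center_dist(t.tb, detections[i]), i)
                         for i in unmatched), default=(None, None))
        if idx is not None and dist <= max_dist:
            # Trajectory updating with the matched patch.
            unmatched.discard(idx)
            t.tb_old, t.tb = t.tb, detections[idx]
            t.vlen += 1
            t.ivlen = 0
        else:
            t.ivlen += 1
    # Trajectory deletion: drop targets invisible for too long.
    kept = [t for t in tracks if t.ivlen <= max_invisible]
    # Trajectory insertion: unmatched detections start new trajectories.
    for i in sorted(unmatched):
        kept.append(Trajectory(tb=detections[i], tb_old=detections[i]))
    return kept
```

Keeping a trajectory alive for a few invisible frames before deleting it lets a briefly occluded pedestrian be matched again instead of being counted as a new target.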

4. Pedestrian Flow Counting

Tracking trajectories are the basis of pedestrian counting. Single or double line counting is currently the popular way to count pedestrians (moving left or right). Considering the requirements of real-time operation, stability, and accuracy, we choose single line counting with a buffer zone, shown in Figure 3, to count pedestrians.

The buffer zone is introduced to avoid direction errors caused by pedestrians wandering near the counting line. Each image frame has exactly one buffer zone, a small range centered on the counting line (the center line of the video). The direction count is updated only when a target crosses the whole zone.

Let the buffer zone be delimited by its left and right margins. Right-moving and left-moving pedestrians are counted from the labels that record when a target passes each margin.

Once a target has been counted, its two moving labels are reset to zero immediately.
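A minimal Python sketch of single line counting with a buffer zone is given below. The counting rule is an assumption made for illustration: a label is set when the target's centroid lies beyond one margin, and a crossing is counted when the centroid later lies beyond the opposite margin. The function name and the label names `ld` and `rd` used here are hypothetical.

```python
def count_crossings(xs, left, right):
    """Count full crossings of the buffer zone [left, right] for one target.

    xs is the sequence of centroid x coordinates of the target over time.
    Returns (right_moving_count, left_moving_count).
    """
    ld = rd = 0               # labels: seen beyond the left / right margin
    to_right = to_left = 0
    for x in xs:
        if x < left:
            if rd:            # came from the right side: full crossing
                to_left += 1
                ld = rd = 0   # reset labels once the target is counted
            else:
                ld = 1
        elif x > right:
            if ld:            # came from the left side: full crossing
                to_right += 1
                ld = rd = 0
            else:
                rd = 1
        # Positions inside the buffer zone change nothing.
    return to_right, to_left
```

A pedestrian who wanders back and forth inside the zone never reaches the opposite margin, so no count is produced until the whole zone has been crossed.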

5. Experiments and Analysis

Two experiments are designed to test the performance of our multiobject tracking and pedestrian counting on the PETS database [21]. PETS is a public dataset consisting of multisensor sequences that contain different crowd activities. It has been widely used to test new and existing pedestrian detection and tracking systems in a real-world environment. We select ten clips from PETS for pedestrian counting. The longest clip has 1189 frames and the shortest has 276 frames.

5.1. The Evaluation of Multitarget Tracking

MOTA (multiple object tracking accuracy) and MOTP (multiple object tracking precision) [22] are adopted to evaluate the tracking process. MOTP evaluates the alignment of tracks with the ground truth: it measures the precision with which objects are located, using the intersection of the estimated region with the ground truth region. MOTA combines all missed targets, false positives, and identity mismatches, normalized by the total number of targets (100% corresponds to no errors) [17]. Table 1 compares our method with several state-of-the-art approaches for multiobject tracking.
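The two metrics can be computed from per-frame error counts, as the following Python sketch illustrates. It assumes the common overlap-based form of the CLEAR MOT metrics, in which MOTP is the mean overlap of all matched pairs; the input layout, the function name `clear_mot`, and the example numbers are hypothetical and are not results from this paper.

```python
def clear_mot(frames):
    """Compute (MOTA, MOTP) from per-frame tracking statistics.

    Each frame is a tuple:
    (num_ground_truth, misses, false_positives, mismatches, matched_overlaps)
    where matched_overlaps lists the overlap ratio of every matched pair.
    """
    total_gt = 0
    total_errors = 0
    overlaps = []
    for num_gt, misses, false_pos, mismatches, matched in frames:
        total_gt += num_gt
        total_errors += misses + false_pos + mismatches
        overlaps.extend(matched)
    if total_gt == 0 or not overlaps:
        raise ValueError("need at least one ground truth object and one match")
    mota = 1.0 - total_errors / total_gt
    motp = sum(overlaps) / len(overlaps)
    return mota, motp


# Made-up statistics for two frames, for illustration only.
example = [
    (4, 1, 0, 0, [0.8, 0.7, 0.9]),
    (4, 0, 1, 1, [0.6, 0.8, 0.7, 0.5]),
]
mota, motp = clear_mot(example)
```

Because MOTA subtracts the error rate from one, it can become negative when the tracker makes more errors than there are ground truth objects, whereas the overlap-based MOTP always lies between 0 and 1.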

From Table 1, we observe that our tracking approach achieves competitive results, especially in MOTP. For [17, 18], MOTP and MOTA cannot both reach good values: when one measure increases, the other decreases markedly. Moreover, the MOTA of the Berclaz approach is the highest, because it uses a probabilistic occupancy map (POM) to create background and detection data. Whereas our basic foreground segmentation uses no prior real data, the POM model uses some empty background images from PETS.

As for [19], it achieves a good balance between MOTA and MOTP, but its metric values are slightly lower than those of our method. The difference may be attributed to the performance of the TLD algorithm. TLD combines learning, detection, and tracking in a single tracking task and updates its parameters online. To achieve high detection accuracy, the P-N learning paradigm exploits the temporal and spatial structure in the data, so that missed detections and false alarms compensate for each other and tracking accuracy and precision improve.

We can also see that [20] achieves slightly higher performance than our method. To use the holography information among multiple views, [20] integrates crowd simulation into the traditional single-camera method. However, the two metric values in [20] are computed with the size and position of the initial patch assigned manually for each track. In contrast, our method extracts the foreground and initializes each tracker automatically. This fully automatic step may reduce the performance of our tracking system to some extent. To improve performance, we plan to use a more suitable detection method and a better background modeling method in the initialization stage in future work.

5.2. Statistics of Pedestrian Flow

Table 2 presents our pedestrian flow statistics. L and R denote the numbers of pedestrians walking toward the left and the right, respectively. Among our 10 clips, clips 1 and 2 have inadequate illumination, clips 7 and 8 contain many pedestrians, and clips 9 and 10 have serious occlusion. The other clips were recorded under normal conditions.

From Table 2, we observe that our system obtains higher accuracy under normal conditions (especially clips 3 and 5). For the clips with crowded pedestrians or frequent occlusion (clips 7, 8, 9, and 10), the accuracy is about 81% to 84%. For the clips with inadequate illumination but fewer objects and less occlusion (clips 1 and 2), the accuracy is over 86%. Over all clips, our average statistical accuracy is 87.4%.

6. Conclusions

To realize multiobject tracking, our system combines the TLD tracking algorithm with dynamic average background modeling. The former tracks objects robustly over the long term and in real time. The latter, together with the confirmation module, automatically localizes candidate objects, which are then verified by pedestrian detection. By comparing the TLD output with the pedestrian detection results, we can easily manage pedestrian records: updating parameters, inserting new objects, and deleting objects that have disappeared for a long time. Counting with a buffer zone reduces the influence of pedestrians wandering around the counting line. Experiments and analysis on PETS support the accuracy and stability of our system. Future work will focus on improving tracking accuracy and robustness for high-density crowds.

Data Availability

All the data used in our experiments are from the public database PETS and can be downloaded from http://www.cvg.reading.ac.uk/PETS2009/a.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is partially sponsored by the National Key Development Plan of Fundamental Research 2017YFB1302203.