Abstract
The development of the Internet of Things (IoT) has stimulated much research on Multimedia Communication Systems (MCS), such as human face detection and tracking, and has driven a series of progressively better methods. Among them, deep learning-based methods can locate face patches in an image effectively and accurately. Face tracking is often treated as face detection, but they are two different techniques. Face detection operates on a single image, and its shortcomings are obvious: face positions are unstable and unsmooth when it is applied to a sequence of continuous images, computation is expensive because of its heavy reliance on Convolutional Neural Networks (CNNs), and detection performance on edge devices is limited. To overcome these defects, this paper proposes a novel face tracking strategy that combines CNN and optical flow, namely, C-OF, which yields an extremely fast, stable, and long-term face tracking system. The stability and smoothness of face positions across a sequence of image frames are two key requirements for commercial applications, because they improve the prospects of face biological signal extraction, silent face antispoofing, and facial expression analysis in IoT-based MCS. Our method captures face patterns in every two consecutive frames via optical flow to eliminate the instability and unsmoothness. Moreover, an innovative metric for measuring the stability and smoothness of face motion is designed and adopted in our experiments. The experimental results illustrate that our proposed C-OF outperforms both face detection and object tracking methods.
1. Introduction
With the development of AI technology [1–3], IoT [4–7] is receiving more and more attention from academia. IoT emphasizes that all objects connected to the internet (including people and machines) have unique addresses and communicate through wired and wireless networks, and it has been deeply integrated into people's daily life. For example, a doctor can conduct a diagnosis remotely or even complete surgery via a telemedical system [8, 9]; by collecting personal information, smart devices can provide the recommendations most suitable for each user [3, 10]; and even satellites can be utilized more efficiently to better serve mankind [11]. However, the smarter daily life becomes, the more privacy is at risk. Every smart device is “monitoring” you, so personal data protection and privacy preservation deserve more attention. Especially since the release of the GDPR in the EU and EEA in 2016, more and more researchers have been digging into privacy-related work [12–19].
The development of IoT-based MCS drives a sharp increase in human face-related techniques, such as face detection, face tracking, and face recognition. Applications such as beauty cameras, security access, and surveillance and suspect tracking are now widely used in daily life, for example, in smart cities and smart campuses. There is no doubt that accurately detecting and tracking faces is an essential step for these tasks. Additionally, stably and smoothly tracking face bounding boxes across a sequence of continuous images is required for some special tasks in the field of IoT-based MCS, e.g., face biological signal extraction, silent face antispoofing, and facial expression analysis, because stable and smooth face bounding boxes across frames can reduce signal noise significantly.
Regarding traditional visual methods, many prior face tracking methods [20–25] devote tremendous effort to feature engineering and color spaces. For long-term tracking of human faces in unconstrained video, face tracking has generally been treated as common object tracking; e.g., [26] is a typical method that derives from TLD [27, 28] and is also one of the earliest attempts to apply the tracking-by-detection paradigm to the face tracking task. Although common TLD can also handle face tracking, [26] made it more robust to viewpoint changes. In detail, it adapted a frontal face detector from [29], which was the state-of-the-art method at that time. A validator was deployed on top of the detector to output a confidence that indicates how well the current image patch corresponds to a face. [30] proposed a face tracking approach in which optical flow information is incorporated into the Viola-Jones face detection algorithm [31]. Its outputs are used to build a likelihood map from which face bounding boxes are extracted. FT-RCNN [32] is an effective face tracking method based on Faster R-CNN [33]. A tracking branch is added to Faster R-CNN so that it jointly performs face detection and tracking, but its running time is expensive.
Face detection methods can be used for face tracking. However, face tracking concentrates more on connecting face patterns across frames. Thus, face tracking takes the relationships between the patterns in consecutive frames into consideration rather than naively detecting faces in each individual frame. In this paper, we present a novel method for super real-time and long-term face tracking by combining CNN and optical flow (see Figure 1). There are three principal components: a cascade lightweight face detector that is responsible for generating an initial face bounding box, a face tracker based on optical flow [28], and a face identifier (a very shallow FCN) that provides a face confidence for binary classification. The face identifier guarantees that the face tracker does not lock onto a nonface patch. Because the optical flow field is continuous and uniform, the face bounding boxes generated by our method are extraordinarily stable and smooth. Additionally, C-OF can be easily transferred to other tasks that require stable tracking, such as object tracking and person reidentification. Overall, we make five main contributions:
(i) We combine a lightweight deep CNN with optical flow to substantially reduce the running time, which achieves stable, smooth, super real-time, and long-term face tracking on both CPU and edge computing devices
(ii) Compared with deep CNN face detection methods, C-OF outputs fairly stable and smooth bounding boxes and enhances the performance of many applications, such as face biological signal extraction, silent face antispoofing, and facial expression analysis
(iii) A lightweight FCN that contains only five convolutional layers is designed for face identification to guarantee the tracking accuracy
(iv) We design a metric to quantify bounding box stability and smoothness with respect to the scale and position changes of bounding boxes. The experimental results illustrate that our proposed C-OF outperforms both face detection and object tracking methods
(v) The implementation of C-OF is released on GitHub (https://github.com/HandsomeHans/C-OF) for those who are interested in further research and in application to commercial products

The rest of the paper is organised as follows: in the next section, related work on face detection and object tracking is presented. Our method, which combines optical flow with a CNN face detection method and adds a lightweight FCN to identify the face box, is described in Proposed Method. The details of the experiments and a discussion of the results are provided in Experiment and Discussion, followed by Conclusion.
2. Related Work
2.1. Face Detection
With the huge success of deep learning, traditional methods face a tough situation in some particular tasks. However, they still matter and have many advantages worth learning from. Commonly, they extract hand-crafted features to train a classifier and then deploy a sliding-window method to locate the face. For example, by combining Haar features with AdaBoost [34], Viola and Jones [35] deployed a cascaded face detector that achieves high accuracy and fast running time. Benefitting from [34], many excellent methods [36–38] were proposed afterwards. Felzenszwalb et al. [39] used mixtures of multiscale models to detect objects, which inspired many face detection approaches, e.g., [40–42]. Because the aforementioned methods use hand-crafted features, they all generalize poorly; that is, in complex scenarios, their performance drops sharply.
Since Krizhevsky et al. [43] won the ILSVRC, deep learning and CNNs have made explosive progress on vision tasks. CMS-RCNN [44], Face R-CNN [45], and FDNet [46] adopt many novel strategies for face detection based on Faster R-CNN [33]. SSD [47] is another commonly used basis for face detection. Methods based on SSD usually achieve high accuracy and efficiency; e.g., FaceBoxes [48] designs a tiny network that attains real-time performance on CPU, and FANet [49] uses FPN [50] to merge high-level and low-level features at low computing cost when training a face detector. On the other hand, cascade CNN methods also show their superiority on face detection. MTCNN [51] consists of three lightweight cascaded CNN models that jointly detect faces and landmarks. PCN [52] upgrades MTCNN by adding an orientation branch that outputs the rotation angle of the face. Deng and Xie [53] proposed a nested CNN-cascade learning algorithm that adopts shallow CNN architectures. All of these are face detection methods that focus on single-image representation only. That is why, on a continuous video, the bounding boxes they generate are unstable and unsmooth.
2.2. Object Tracking
Face tracking can be considered a special category of object tracking, and most researchers use object tracking methods as control experiments for face tracking. At the beginning, an initial state such as a bounding box of the target object is given, and then feature extraction and pattern matching are performed in all subsequent frames. Object tracking has progressed steadily thanks to the release of many benchmark datasets and competitions, including RGB-T [54], MOT16 [55], and LaSOT [56], as well as the development of deep learning. CF [57] has been widely used and has inspired a lot of good work on tracking. It proved, for the first time, that there is a connection between ridge regression and classical correlation filters. The work replaced expensive matrix algebra with fast Fourier transforms of $O(n \log n)$ computational complexity. In the meantime, KCF was first presented, together with a solution for computing kernels over shifts based on radial basis and dot-product kernels. [58] proposed MOSSE, which greatly improved the performance of tracking methods with respect to CF. It reduced the computational complexity while increasing the accuracy. However, it only uses gray-scale features, and such a low-dimensional feature space does not provide a good representation. Moreover, it cannot adapt to object scale variation, because it concentrates on the translational motion of the center point of the target object between frames and does not take into account the scale change of the target object as it moves. To this end, Danelljan et al. proposed DSST [59], which improves on MOSSE by deploying fHOG [39] features instead of gray-scale features, increasing the feature dimension from 2 to 28. What is more, object scale variation is handled in DSST.
Apart from traditional object tracking methods, CNN-based methods have made great progress and considerably outperform traditional methods on public benchmarks. [60] introduced a generic object tracking network that uses a regression mechanism learned offline by watching videos of objects moving in the world. Specifically, the regression-based tracker requires only a single feed-forward pass through the network to directly regress the location of the target object. Zhu et al. [61] advanced Siamese networks, which track through a similarity comparison strategy, by learning distractor-aware Siamese networks for accurate and long-term tracking. MDNet [62] is one of the most successful generic object tracking methods. It consists of a shared CNN for feature extraction, which is trained on a large set of videos with tracking ground truths. After training, all the domain-specific binary classification branches are replaced. Then, the model is fine-tuned online during tracking to adapt to the new domain. For bounding box regression, a linear model is trained online to generate the final bounding box.
3. Proposed Method
The details of the proposed C-OF face tracking method are shown in Figure 2. It consists of three principal components: a face detector, a face tracker, and a face identifier. The first part of this section presents the lightweight face detector, the second part gives the implementation logic of the optical flow tracker, and the last part describes the face identifier.

3.1. Lightweight Face Detector
As with other face tracking methods, a face detector is essential to determine an initial face bounding box. We adopt a cascade CNN following [51], used only for face detection without facial landmark prediction. In the first stage, an FCN [63] is deployed to obtain the candidate face bounding boxes. As the network outputs not only a large number of bounding boxes but also confidence scores for the binary classes of face and background, we predefine a threshold to filter out boxes with low confidence. Then, highly overlapped boxes are merged by non-maximum suppression (NMS). In stage two, the candidate faces cropped from the input image are fed into the second CNN. Many false positive faces are discarded, and NMS is conducted again. In the last stage, the output is generated in the same way as in stage two. After NMS, we have the final face box. In our face detector, each stage focuses on a different aspect of the detection problem. The first stage focuses on proposing a large number of face candidates. The second stage has to filter false positive faces, which means it concentrates on face identification. The last stage not only focuses on face identification but also pays a lot of attention to box regression. More details of the three CNNs are shown in Table 1.
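The confidence filtering and NMS step shared by the three stages can be sketched as follows. This is a minimal illustration in plain Python, not the released implementation: the function names `iou` and `filter_and_nms` and both threshold values are assumptions chosen for the example, and the sketch uses the common greedy variant of NMS that keeps the highest-scoring box of each overlapping group.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, score_thr=0.6, iou_thr=0.5):
    """Drop low-confidence boxes, then suppress highly overlapped ones.

    score_thr and iou_thr are illustrative values, not taken from the paper.
    Returns the indices of the kept boxes, highest score first.
    """
    # Step 1: confidence threshold on the face probability.
    candidates = [i for i, s in enumerate(scores) if s >= score_thr]
    # Step 2: greedy NMS, visiting candidates from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a kept one.
        if not any(iou(boxes[i], boxes[j]) > iou_thr for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.90, 0.80, 0.70, 0.30]
print(filter_and_nms(boxes, scores))  # [0, 2]
```

In the example, the fourth box is removed by the confidence threshold and the second box is suppressed because it overlaps heavily with the higher-scoring first box.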
3.2. Optical Flow Face Tracker
Once the initial face bounding box is obtained, the system enters the tracking stage. General face detection methods handle every frame separately and take no temporal information into account, so the resulting face bounding boxes are very unstable. The sharp shaking of face bounding boxes makes critical face analysis tasks more difficult. To this end, a Median-Flow tracker is deployed and collaborates with a bounding box regression module to locate the face in the current frame with respect to the last frame. Basically, a grid of points is uniformly selected from the last face patch, namely, $P = \{p_i\}_{i=1}^{n}$, where $n$ is the number of points. Then, the motions of these points between the last frame and the current frame are estimated by the pyramidal Lucas-Kanade tracker [64] in two directions, giving $P^{f} = \{p_i^{f}\}_{i=1}^{n}$ and $P^{b} = \{p_i^{b}\}_{i=1}^{n}$, where $P^{f}$ and $P^{b}$ represent the predicted points in the forward (from last to current frame) and backward (from current to last frame) directions, respectively. In other words, the forward predicted points $P^{f}$ are calculated from the last frame, the current frame, and the last frame's uniformly selected points $P$; the backward predicted points $P^{b}$ are calculated from the two frames and the current frame's points $P^{f}$. For the pyramidal Lucas-Kanade tracker, we set the search window at each pyramid level to 4 by 4 and use two pyramid levels. The iterative search terminates after at most 20 iterations or when the convergence threshold of 0.03 is reached. In order to filter the points and estimate the offset of the face patch, the normalized cross-correlation between the last and current frames is computed first for each point. Any point whose similarity value is smaller than the median similarity value is dropped. Then, the median of the forward-backward error $e_i = \lVert p_i - p_i^{b} \rVert_2$ is used to further filter the points in $P$ and $P^{f}$: any point whose error is larger than the median is dropped.
With the filtered point sets $\tilde{P}$ and $\tilde{P}^{f}$, the coordinate offset of the current face box relative to the last face box is the median Euclidean distance between the corresponding points of $\tilde{P}$ and $\tilde{P}^{f}$. Hence, from the last face box and the coordinate offsets, we obtain the current face box. In this way, the face boxes are significantly more stable and smoother across frames than those generated by common face detection methods.
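The point filtering and box update described above can be sketched in plain Python. The sketch assumes that the forward and backward points and the per-point normalized cross-correlation scores have already been produced by a pyramidal Lucas-Kanade routine, which is not shown. The function name `median_flow_step`, the per-axis median used for the offset (as in the original Median-Flow tracker), and the omission of any scale update are assumptions made for illustration, not details confirmed by the paper.

```python
from statistics import median


def median_flow_step(box, pts, fwd, bwd, ncc):
    """Shift `box` by the median motion of reliably tracked points.

    box: (x, y, w, h) in the last frame.
    pts: grid points in the last frame; fwd: their forward predictions in
    the current frame; bwd: fwd tracked back to the last frame.
    ncc: per-point similarity between the patches around pts and fwd.
    """
    idx = list(range(len(pts)))

    # Filter 1: drop points whose similarity is below the median similarity.
    ncc_med = median(ncc[i] for i in idx)
    idx = [i for i in idx if ncc[i] >= ncc_med]

    # Filter 2: drop points whose forward-backward error exceeds its median.
    def fb_error(i):
        return ((pts[i][0] - bwd[i][0]) ** 2 + (pts[i][1] - bwd[i][1]) ** 2) ** 0.5

    fb_med = median(fb_error(i) for i in idx)
    idx = [i for i in idx if not fb_error(i) > fb_med]

    # Offset: per-axis median displacement of the surviving points
    # (an assumption; the box size is left unchanged in this sketch).
    dx = median(fwd[i][0] - pts[i][0] for i in idx)
    dy = median(fwd[i][1] - pts[i][1] for i in idx)
    x, y, w, h = box
    return (x + dx, y + dy, w, h)


pts = [(0, 0), (10, 0), (0, 10), (10, 10)]
fwd = [(2, 1), (12, 1), (2, 11), (30, 40)]   # last point is an outlier
bwd = [(0, 0), (10, 0), (0, 10), (25, -3)]
ncc = [0.90, 0.95, 0.92, 0.20]
print(median_flow_step((5, 5, 20, 20), pts, fwd, bwd, ncc))  # (7.0, 6.0, 20, 20)
```

The outlier point has both a low similarity and a large forward-backward error, so it does not influence the estimated offset of (2, 1).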
3.3. Face Identifier
Face recognition has attracted great attention thanks to the success of CNNs on vision tasks, and various methods have already been presented. In order to verify the content of the bounding box generated by the optical flow face tracker, we use a very simple face identifier that works very well in distinguishing faces from background. Table 2 shows the architecture of the proposed face identifier, which differs only slightly from the second stage of the aforementioned face detector. The second stage of the face detector outputs six values: four for the face box coordinates and two for the face and background probabilities. As the face in a sequence of frames may vary a lot, the face tracker may fail and output a wrong object. The face identifier is therefore essential; its main goal is to filter out false positive candidates. Accordingly, we change the number of output values to two, and the probability distribution over face and background is obtained by a Softmax layer that follows the last convolutional layer. In order to make the face identifier capable of dealing with different sizes of face bounding boxes, we replace the two fully connected layers with two convolutional layers so that the model becomes an FCN [63]. FCN is a neural network originally designed for semantic segmentation. By replacing all the fully connected layers with convolutional layers, FCN breaks the limitation of fixed input size. That is because a fully connected layer needs an input with a fixed dimension to match its weights, while a convolutional layer with a 1 by 1 kernel does not require a fixed input dimension. If the number of kernels in a convolutional layer is set to the number of hidden nodes in a fully connected layer, this convolutional layer can replace the fully connected layer directly. This ability to handle input images of arbitrary scale has spread to other vision tasks, such as object classification and detection.
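The equivalence between a fully connected layer and a 1 by 1 convolution can be checked numerically. The sketch below uses plain Python lists instead of PyTorch so that it stays self-contained; `dense`, `conv1x1`, and `softmax` are hypothetical helper names, and the weights are arbitrary numbers rather than trained parameters.

```python
import math


def dense(weights, bias, x):
    """Fully connected layer: x has a fixed length equal to len(weights[0])."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv1x1(weights, bias, fmap):
    """1x1 convolution over a feature map laid out as fmap[row][col][channel].

    The same weights are applied at every spatial location, so the map may
    have any height and width.
    """
    return [[dense(weights, bias, pixel) for pixel in row] for row in fmap]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two output channels (background, face) computed from three input channels.
W = [[0.5, -1.0, 0.25], [-0.5, 1.0, 0.75]]
b = [0.1, -0.1]

# On a 1x1 feature map the convolution reproduces the dense layer exactly.
x = [1.0, 2.0, 3.0]
assert conv1x1(W, b, [[x]])[0][0] == dense(W, b, x)

# On a larger map (2x3 here) the same weights still apply at every location.
fmap = [[[1.0, 2.0, 3.0]] * 3, [[0.0, 1.0, 0.0]] * 3]
out = conv1x1(W, b, fmap)
probs = [[softmax(p) for p in row] for row in out]
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 2
```

The dense layer accepts only a length-3 vector, whereas the convolutional form accepts a map of any height and width and yields one face/background distribution per location.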
4. Experiment and Discussion
4.1. Experimental Setting and Dataset
A number of experiments on face tracking were conducted to evaluate our proposed C-OF. In addition to C-OF, we reproduced MTCNN [51] and MDNet [62] to compare tracking performance. In the Python implementation, all neural network models are implemented with the PyTorch [65] framework, and the source code has been made publicly available (https://github.com/HandsomeHans/C-OF) for reproducing the results. The hyperparameters for MDNet and MTCNN are the same as in their official implementations. We also implement C-OF in C++ with ncnn (https://github.com/Tencent/ncnn), a high-performance neural network inference framework optimized for mobile platforms. The devices used in the experiments are an Intel Core i7-8700K CPU and an Nvidia GTX TITAN X GPU. Details of the running time of the Python and C++ implementations are presented in Running Time.
In terms of the dataset, four long-term videos were recorded for the experiments, as there is no public benchmark aimed at stable face tracking. The four videos were recorded under different conditions: active camera, active human, static human, and active illumination. For the active camera and active illumination videos, the actor stays static while the camera or the light source, respectively, is moved randomly in front of the face. For the active human video, the actor talks to the cameraman and acts freely; for the static human video, the actor stops acting and sits still in front of the camera. We split each video into five clips, each about one minute long. The recording device is an iPhone X with a 12-megapixel rear camera. Each frame is resized before processing. The three methods can detect and track faces throughout each clip, so the only difference is the scale and position changes of the bounding boxes frame by frame. The download link of this testing dataset can be found in the GitHub repository as well.
4.2. Result Discussion
4.2.1. Stability
From the perspective of evaluating the stability and smoothness of face boxes over a frame sequence, comparison with the ground truth is not the principal concern. Besides, to the best of our knowledge, there is no general metric to quantify the stability and smoothness of a bounding box. Hence, we simply consider the moving routes of the bounding box's corner and center points to evaluate how the bounding box changes across frames in scale and position. We first quantify the stability from the changes of the width, the height, and the position of the center point. The metric is defined in terms of the coordinates of the center point, the absolute moving route of the center point, the total number of frames, and the width and height of the bounding box. The summed offsets of width, height, and center point jointly reflect the change of scale and position of the bounding box. A smaller value means that the box moves a short distance or changes little in width or height. Table 3 gives the stability results for MDNet, MTCNN, and C-OF on the aforementioned four videos with different conditions. Our proposed C-OF significantly outperforms the other two methods in all experiments. Faces in the static human videos are the most stable, and the minimum values occur there; for example, MDNet attains its minimum value of 1.425355, MTCNN attains 1.994141, and C-OF attains 0.245097, which is the global minimum as well. Also, the maximum value of C-OF on the static human videos, 0.321575, is even smaller than the minimum values of MDNet and MTCNN, which are 1.425355 and 1.994141, respectively.
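One way to turn this description into code is sketched below. Because the exact normalization is not restated here, the sketch makes an explicit assumption: it averages, over consecutive frame pairs, the absolute changes in width and height plus the Euclidean distance moved by the center point. The function name `stability` and this particular averaging are illustrative and may differ from the formula used to produce Table 3.

```python
import math


def stability(boxes):
    """Mean per-frame change of a box track; smaller means more stable.

    boxes: list of (cx, cy, w, h) tuples, one per frame.
    Assumed form: average over consecutive frames of
    |dw| + |dh| + Euclidean distance moved by the center.
    """
    if len(boxes) < 2:
        return 0.0
    total = 0.0
    for (cx0, cy0, w0, h0), (cx1, cy1, w1, h1) in zip(boxes, boxes[1:]):
        d_center = math.hypot(cx1 - cx0, cy1 - cy0)
        total += abs(w1 - w0) + abs(h1 - h0) + d_center
    return total / (len(boxes) - 1)


steady = [(100, 100, 50, 50)] * 4
jitter = [(100, 100, 50, 50), (103, 104, 52, 49), (100, 100, 50, 50), (103, 104, 52, 49)]
print(stability(steady))  # 0.0
print(stability(jitter))  # 8.0
```

A track that never moves scores zero, while the jittering track accumulates a center shift of 5 pixels plus 3 pixels of size change at every step.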
4.2.2. Smoothness
On the other hand, to further quantify the smoothness, we design another function, which mostly focuses on the scale change of the bounding box. It is defined in terms of the absolute moving routes of the top left, top right, bottom right, and bottom left points. In this function, we empirically consider the ratio of the corner points' and the center point's absolute moving routes. By observation, we found that a bounding box that moves a long distance usually also changes its scale. In Equations (4) and (5), one term is always larger than 1, as it is derived from a nonnegative value. The mean route of the corner points is never smaller than the route of the center point, because the center is the average of the four corners, so the logarithm of their ratio is always a nonnegative value. In conclusion, a box that moves a short distance with little or no scale change obtains a smaller value. We report the smoothness values in Table 4, where our proposed C-OF is superior to MDNet and MTCNN in all cases.
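The claim that the logarithm of the route ratio is nonnegative can be checked with a small script. The sketch computes only the ingredients named in the text (the absolute moving routes of the four corners and of the center, and the logarithm of their ratio); it does not reproduce the full smoothness function, and the name `log_route_ratio` and the small `eps` guard against a motionless center are assumptions added for the example.

```python
import math


def corners(box):
    """Top-left, top-right, bottom-right, bottom-left of (x, y, w, h)."""
    x, y, w, h = box
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def center(box):
    """Center point of (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def route(points):
    """Absolute moving route: summed Euclidean step lengths of one point."""
    return sum(math.hypot(q[0] - p[0], q[1] - p[1])
               for p, q in zip(points, points[1:]))


def log_route_ratio(boxes, eps=1e-12):
    """log(mean corner route / center route) for a sequence of boxes."""
    corner_tracks = zip(*(corners(b) for b in boxes))  # one track per corner
    mean_corner = sum(route(list(t)) for t in corner_tracks) / 4.0
    center_route = route([center(b) for b in boxes])
    return math.log((mean_corner + eps) / (center_route + eps))


translated = [(0, 0, 10, 10), (3, 4, 10, 10)]   # pure shift, no scale change
rescaled = [(0, 0, 10, 10), (3, 4, 14, 12)]     # shift plus scale change
print(log_route_ratio(translated))
print(log_route_ratio(rescaled))
```

For the pure translation every corner travels exactly as far as the center, so the ratio is 1 and its logarithm is 0; once the scale changes, the corners travel farther on average and the logarithm becomes positive.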
4.2.3. Motion Tracking
The stability and smoothness can be observed clearly in the image sequence, but it is not convenient to show an image sequence in a paper. To this end, visualizing the motion track of a specific point of the bounding box is a feasible alternative. Figure 3 shows some visualization examples of the motion tracking of the center point of the bounding box for the three methods. The proposed C-OF clearly has far smoother lines than MDNet and MTCNN, which illustrates that the motion of the face boxes generated by C-OF is smoother. In graphs (k) and (l), the outlier means that a wrong face box was taken. All of the visualization examples are presented in Figures 4–7 for reference.

Figure 3: Motion tracking of the center point of the bounding box for MDNet, MTCNN, and C-OF, panels (a) to (l).
4.2.4. Running Time
Running time is another principal aspect that commercial applications take into account. Benefitting from lightweight models and the optical flow method, our proposed C-OF is super real-time even on CPU, while typical deep learning models commonly depend on a GPU to attain sufficient performance. Table 5 shows the approximate running times of MDNet, MTCNN, and C-OF on different computing devices and resolutions. The input image is resized before processing. Regarding the Python implementations, MDNet needs to fine-tune the model online, so it is far slower than the other two methods. Our proposed C-OF is super real-time at approximately 100 FPS (10 ms per frame on CPU) and spends 5 ms and 6 ms less than MTCNN on CPU and GPU, respectively. Note that neither MTCNN nor C-OF involves massively parallel computing, in which case PyTorch on GPU performs worse than on CPU. That is why MTCNN and C-OF run faster on CPU (15 ms and 10 ms) than on GPU (17 ms and 11 ms). Typically, C++ implementations are more common than Python implementations in industry, so we also provide a C++ version of C-OF, which is also publicly available. We say C-OF is hyper real-time when using C++ and ncnn. Benefitting from C++ and ncnn, MTCNN speeds up from 15 ms to 10 ms on CPU, whereas C-OF speeds up five times, from 10 ms to 2 ms, and achieves 500 FPS. If we increase the input resolution, MDNet and C-OF show little running time increase or even a reduction. For MDNet, all the candidate face boxes are cropped, resized to a fixed resolution, and then fed into the model for fine-tuning. So the size of the input image does not matter, because the number and the size of the candidate face boxes are predefined. For C-OF, the resolution increase does slow down the detector; however, the detector runs only a very limited number of times in one experiment. Apart from the detector, optical flow is insensitive to the size of the image, and the identifier is too light to show a noticeable performance loss. On the other hand, MTCNN shows a normal running time increase from 15 ms to 32 ms on CPU and from 17 ms to 22 ms on GPU as the input resolution changes. For the C++ version, MTCNN shows a normal running time increase as well, from 10 ms to 31 ms. As in the Python version, C-OF runs only 1 ms slower at the higher resolution (3 ms vs. 2 ms). Overall, no matter which implementation language or computing device we use, our proposed C-OF is the fastest, and its running time is more than sufficient for commercial applications.
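Frame rates follow directly from per-frame latencies. A minimal sketch of the conversion, together with a simple way to measure mean latency, is given below; `ms_to_fps` and `mean_latency_ms` are hypothetical helper names, and any callable can stand in for a real tracker call.

```python
import time


def ms_to_fps(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def mean_latency_ms(process_frame, frames, warmup=2):
    """Average wall-clock time per frame, skipping a few warm-up frames."""
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(1, len(frames) - warmup)


print(ms_to_fps(2))   # 500.0
print(ms_to_fps(10))  # 100.0
```

Skipping warm-up frames keeps one-off costs, such as model loading or the initial detector pass, from distorting the steady-state tracking latency.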
5. Conclusion
In this paper, we proposed a stable, smooth, super real-time, and long-term face tracking system using a lightweight CNN and optical flow, namely, C-OF, which consists of a face detector, a face tracker, and a face identifier. The method aims to solve the bounding box shaking problem, which commonly occurs in deep learning methods. We also optimize the system to make it run faster than most face detection and tracking methods. The experimental results show that C-OF can produce stable and smooth face boxes on a long-term face sequence at super or even hyper real-time speed. We design two functions to quantify the stability and smoothness individually, and C-OF is superior to both MDNet and MTCNN on both. Meanwhile, we visualize the center point motion of the face boxes to observe the path each box follows and conclude that C-OF produces a far more stable and smooth path with only slight bends. In the end, we make the Python and C++ implementations of C-OF publicly available for people who are interested in the work.
Data Availability
The testing data (20 mp4 files) used to support the findings of this study are included within the article and can also be downloaded from https://www.dropbox.com/sh/fcks3k2l9xs36ze/AABlXm3FY3pMzStNrPktYKdRa?dl=0.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study is supported by the National Key Technology R&D Program of China (No. 2019YFC1606401), Beijing Natural Science Foundation (No. 4202014), and Humanity and Social Science Youth Foundation of Ministry of Education of China (No. 20YJCZH229).