Abstract

Visual object tracking plays an important role in realistic applications such as video understanding, unmanned vehicles, and autonomous robots. Although Siamese-based trackers have achieved good performance on tracking tasks, existing methods that use only the initial template, or update the template with a simple strategy, suffer performance degradation when the target varies in realistic scenarios involving target occlusion, scale variation, and deformation. In this paper, we propose a visual tracking framework with adaptive template update and spatiotemporal attention, named SiamAttnAT. Specifically, we propose a historical template selecting strategy and an adaptive template generating method for robust tracking. In addition, we apply the proposed mechanisms to the employed baseline SiamRPN++. Extensive experiments and comparisons with state-of-the-art trackers on short-term and long-term visual tracking benchmarks, including VOT2018, OTB-100, UAV123, NFS, and LaSOT, show that the proposed framework achieves outstanding performance at a considerable real-time speed, verifying its efficiency and effectiveness.

1. Introduction

Visual object tracking (VOT) is one of the fundamental tasks in computer vision and is widely applied in automatic driving, human–computer interaction, robot sensing, visual surveillance, and augmented reality. VOT aims to automatically locate a specific target object in each frame of a changing video sequence given its initial location [1]. Recently, VOT has received considerable attention and made significant progress [2]. However, learning an appearance model of an arbitrary target object online remains highly challenging due to complicated scenarios including target occlusion, appearance changes, scale variation, deformation, and environmental aspects such as motion blur or illumination changes [3].

In [4], tracking approaches are roughly categorized into traditional methods and deep learning (DL)-based methods. The former employ various visual frameworks to model the appearance and motion of a target, including discriminative correlation filters (DCF), silhouette tracking, kernel tracking, and point tracking. The latter employ either deep off-the-shelf features or end-to-end networks. The two most prominent paradigms are discriminative correlation filters (DCFs) and deep Siamese tracking methods. DCF-based trackers [5–7] train filters to learn the correlation between the target and background appearance. The target is then detected in consecutive frames by convolving the trained filter via the fast Fourier transform (FFT) [5]. DCF-based trackers have the advantage of updating the appearance model over time through the loss function [5]. However, they also have inherent limitations [6]. In general, DCF-based trackers impose inflexible assumptions on the training samples used to learn an online classifier, which leads to undesirable boundary effects that severely degrade the quality of the target model [4].

With their strong learning capacity and powerful deep features, deep learning technologies have significantly advanced the performance of visual object tracking [8], especially with the emergence of Siamese trackers. Trackers based on Siamese networks [9–14] have received significant attention in the visual tracking community. Siamese-based trackers employ two subnetworks and convert the tracking problem into similarity learning between the target template and the current image. The features of the target template and the search region are extracted by a deep neural network trained offline on large datasets. The trackers then match the template against the search region and output the similarity with the target at each location in the search region. Due to their competitive performance, especially the well-balanced accuracy and speed, Siamese trackers have drawn increasing attention from researchers. Excellent Siamese-based tracking frameworks, such as SiamRPN [15], SiamRPN++ [10], SiamBAN [16], SiamCAR [17], and SiamCorners [18], have demonstrated superior capabilities in different aspects of visual object tracking by introducing semantic networks, cross correlation, region proposal networks, feature pyramid networks, corner prediction, and so on [3, 4]. This work is mainly based on Siamese trackers.

Although Siamese trackers have achieved remarkable progress, tracking remains highly challenging due to the vast variety of challenging scenes [15]. The appearance of the target may change in complicated real scenes, resulting in low similarity between the initial template and the search region, so the model becomes ambiguous when locating the target under scene changes. Some Siamese trackers use an object template that is initialized in the first video frame and kept fixed for the remainder of the video; this approach is prone to early tracker failure [3]. While other Siamese trackers adopt strategies to update the object template for robust tracking, these strategies are inadequate for adapting to the changes of different tracking situations. Additionally, it is important to consider both spatial and temporal information in a video sequence for object tracking; however, previous trackers do not model the spatiotemporal relationships simultaneously.

Inspired by prior works, we propose a novel template updating strategy and integrate it into a Siamese attention network named SiamAttnAT. Specifically, our method fully utilizes the reliable predictions of the model to actively select historical templates for subsequently generating the current template, so the feature representation of the template is enhanced to cope with complicated tracking scenarios. Besides, a spatiotemporal attention mechanism is adopted in the tracker to improve tracking accuracy and robustness. Figure 1 shows a qualitative comparison of our SiamAttnAT tracker with three state-of-the-art trackers on the OTB-100 benchmark. In summary, the main contributions of this work are three-fold:
(i) We present a novel template selecting strategy at the feature level. It determines which historical frame features should be selected for adaptive template generation.
(ii) We propose a learnable and adaptive template generating method, which updates the current template online to adapt to variations of the tracked object.
(iii) We apply the above methods to a well-known tracking network, SiamRPN++. We verify the proposed method on five short-term and long-term tracking benchmarks and achieve obvious improvements over existing methods.

2. Related Work

In this section, we briefly introduce recent developments in object trackers, template update, and attention mechanisms. We focus on Siamese trackers in this work.

2.1. DCF-Based Trackers

With the efficiency and expansibility of correlation filter theory, correlation filter-based object trackers have recently shown excellent performance on several tracking benchmarks. Among these approaches, discriminative correlation filter (DCF)-based trackers are the most prominent. These trackers learn a correlation filter from example patches of the target appearance to discriminate between the target and background appearance [3]. With the adoption of deep learning technologies [12], multichannel formulations, spatial constraints, and attention mechanisms in the correlation filter framework, the performance of correlation filter-based trackers has been notably improved [8]. In ASTCA [5], a learnable and adaptive spatial-temporal context-aware network is proposed to reduce the influence of the boundary effect and improve tracking accuracy. In CFML [19], a metric learning model is presented to solve the target scale problems in the correlation filter framework. The SRDCF [20] tracker proposes a penalty function to reduce the filter coefficients in the boundary regions of the image patch. More specifically, in [1], a probabilistic regression formulation is proposed to predict the conditional probability density of the target state instead of the confidence-based regression strategy shared by the previously dominant DCF paradigm. Integrating this strategy into the DiMP [21] tracker achieves new state-of-the-art performance on six datasets.

2.2. Siamese-Based Trackers

The other dominant tracking framework is the Siamese-based tracker. By learning offline from massive pairs of video frames and employing an effective similarity function, Siamese network-based trackers have made great progress and achieved the best balance between accuracy and efficiency [17]. As pioneering work, [9] first introduced the SiamFC framework for visual object tracking, using Siamese networks to measure the similarity between the target and the search image [8]. Reference [12] further proposed the CFNet architecture and achieved higher performance by integrating a correlation filter into the Siamese network.

Trackers such as SiamFC, CFNet, and DSiam use multiscale bounding boxes to locate the target object in the search area, resulting in a large amount of computation and limited performance. Reference [15] proposed SiamRPN, applying a region proposal network (RPN) to the Siamese architecture to reduce computation. Reference [14] further designed a novel distractor-aware module for incremental learning and applied it to SiamRPN, referred to as DaSiamRPN. Due to limitations such as strict translation invariance, the above trackers built their networks upon architectures similar to AlexNet. Reference [10] overcame the restriction of strict translation invariance and proposed SiamRPN++ with the deep network ResNet. By introducing layerwise and depthwise cross correlations to learn sufficient semantic information and produce similarity maps at multiple levels, tracking performance and robustness were greatly improved. The above Siamese trackers all employ anchor-based object detectors. In addition, anchor-free trackers such as SiamBAN [16], SiamCAR [17], and SiamFC++ [13] have been proposed for higher performance. Furthermore, [22] proposed accurate bounding-box regression with a distance-IoU loss to optimize the objective function and make target estimation more accurate.

2.3. Template Update

It is important for Siamese trackers to possess online model adaptability. However, most Siamese trackers adopt a fixed model trained offline, which cannot cope with appearance changes in tracking scenarios. While some Siamese-based trackers introduce strategies to update the model online for robust tracking, these strategies are insufficient to adapt to the changes of different tracking scenarios. DSiam [23] adopts a fast transformation learning module to update the tracking model online. In [3], a learned update strategy is proposed, which uses the initial template, the accumulated template from all previous frames, and the feature template of the current frame. In [24], the current template is replaced when the thresholds of the dynamic template update interval and the confidence are satisfied. Although template update methods have been introduced into some Siamese trackers, the update strategies either perform simple template replacement or use all historical templates without selection. Without a selecting strategy for historical templates, the template representation is weakened, since some templates have low similarity with the search region. In this work, we propose a selecting strategy for historical templates and an adaptive template update method. Using the selected historical templates as input, a trained subnetwork generates weight matrices to obtain the current tracking template by matrix multiplication.

2.4. Attention Mechanism

Employing attention mechanisms to explore the inherent spatial and temporal relations in a video sequence is a core problem in object tracking. The spatial relations contain object appearance information, which helps localize the target. The temporal dependencies capture the state changes of objects across video frames, which helps cope with challenging scenes such as occlusion, scale variation, and object deformation. Some trackers have introduced attention mechanisms to enhance the template representation and improve tracking robustness. However, some previous Siamese trackers [6, 10, 13, 17, 25] exploit only the spatial relations for tracking, while other methods [3, 21, 26] explore only the temporal relationships by updating the model with historical predictions. In this work, we consider spatial and temporal information simultaneously and leverage both for robust tracking.

3. Siamese Attention Networks with Adaptive Templates

In this section, we describe our proposed tracking architecture SiamAttnAT, with an overview in Figure 2. We then elaborate on the main components of the proposed network, including the backbone, template selection module, memory pool, and template generation module.

3.1. Backbone

We employ the modified ResNet-50, as in [10], as the backbone of the SiamAttnAT architecture to extract the multilevel features of search regions and target templates, respectively. Features from different blocks of the backbone focus on different hierarchical information of the target object. Features from earlier layers contain shallow information such as shape and colour, which helps locate the target object. Features from later layers carry richer and more abstract semantic information, which is beneficial for coping with challenging scenarios such as occlusion, deformation, and target disappearance in visual object tracking.

The backbone consists of two subbranches: a template branch and a search branch. The template branch extracts features from the target patch z, and the search branch extracts features from the current search patch x. The two branches share CNN parameters [15]. In this work, we use blocks 3, 4, and 5 of ResNet-50 to extract multilevel features of the target template and the search region. The sizes of the template patch and the search region are 127 × 127 × 3 pixels and 255 × 255 × 3 pixels, respectively.
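As an illustration, the following is a minimal PyTorch sketch (not the authors' released code) of a shared-weight ResNet-50 extractor that returns the block 3, 4, and 5 features for both branches; the stride and dilation modifications that SiamRPN++ applies to the backbone are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseBackbone(nn.Module):
    """Shared-weight ResNet-50 returning block 3/4/5 features (a sketch)."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # load ImageNet weights in practice
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.block2 = net.layer1  # conv2_x
        self.block3 = net.layer2  # conv3_x: shallow shape/colour features
        self.block4 = net.layer3  # conv4_x
        self.block5 = net.layer4  # conv5_x: deep semantic features

    def forward(self, img):
        x = self.stem(img)
        c2 = self.block2(x)
        c3 = self.block3(c2)
        c4 = self.block4(c3)
        c5 = self.block5(c4)
        return c3, c4, c5

backbone = SiameseBackbone()
z_feats = backbone(torch.randn(1, 3, 127, 127))  # template branch (127x127x3)
x_feats = backbone(torch.randn(1, 3, 255, 255))  # search branch shares weights
```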

3.2. Memory Pool

We construct a memory pool to store historical feature templates. By fusing different feature templates, we capture the temporal dependencies across video frames to enhance the feature representation of the current template, improving tracking robustness. The memory pool is composed of three first-in first-out (FIFO) queues $Q_3$, $Q_4$, and $Q_5$, one for each backbone block. The queue size is a hyperparameter; in this work, we set it to 4. If the template selecting strategy is satisfied, we first crop a template image of 127 × 127 × 3 pixels from the current frame and feed it into the ResNet-50 network. We then store the new feature templates generated from blocks 3, 4, and 5 of the backbone in the corresponding template queues. In general, the latest feature template best reflects the state of the current object, so we adopt the FIFO strategy to update the template queues. Considering that the initial template given in the first frame contains the only ground-truth information, while all other templates are based on predictions, we retain the initial template throughout the tracking process.
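A minimal Python sketch of the memory pool, assuming deque-backed FIFO queues and illustrative names, could look as follows.

```python
from collections import deque

class MemoryPool:
    """Sketch of the memory pool: one FIFO queue per backbone block.

    The initial (ground-truth) template is kept separately and is never
    evicted; the queue size K = 4 follows the paper.
    """

    def __init__(self, init_feats, size=4):
        # init_feats: (c3, c4, c5) features of the first-frame template
        self.init_feats = init_feats
        self.queues = {i: deque(maxlen=size) for i in (3, 4, 5)}

    def push(self, new_feats):
        # new_feats: (c3, c4, c5) of a newly selected template; a deque
        # with maxlen implements the first-in first-out eviction
        for i, f in zip((3, 4, 5), new_feats):
            self.queues[i].append(f)

    def templates(self, i):
        # the initial template first, then the stored historical templates
        return [self.init_feats[i - 3], *self.queues[i]]
```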

In the SiamAttnAT framework, for convenience, we denote $\varphi_i(z_j)$ and $\varphi_i(x)$ as the output feature maps of the template $z_j$ and the search region $x$ from block $i$ of the ResNet-50 ($i \in \{3, 4, 5\}$, denoting blocks 3, 4, and 5), where $j$ represents the $j$-th target template in the template queue. In particular, $\varphi_i(z_0)$ denotes the initial feature template of layer $i$ given in the first frame, which is never deleted throughout the tracking process.

3.3. Template Generation Module

The template generation module adaptively generates the target template online to achieve more accurate and robust tracking. The module takes as input the initial object template and the historical templates stored in the template queues. Since linearly combining these templates leads to information decay over time, we adopt a learnable weighted sum to aggregate them. We first fuse the historical templates by constructing a spatial attention model that generates weight matrices, and then adopt a fixed weighted sum to fuse the initial template with the output of the spatial attention module. Since the initial template provides the most reliable signal, we use a parameter $\lambda$ to denote its weight.

The spatial attention module is composed of four nonshared fully convolutional networks, each of which contains two consecutive 3 × 3 convolutions, one 1 × 1 convolution, and a sigmoid activation function. The spatial attention model generates four weight matrices, denoted as $W_j$ ($j = 1, \ldots, 4$). The output $\tilde{T}_i$ of the template generation module is calculated as

$\tilde{T}_i = \lambda \varphi_i(z_0) + (1 - \lambda) \sum_{j=1}^{4} W_j \odot \varphi_i(z_j)$,  (1)

where $\odot$ denotes element-wise multiplication.
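The following PyTorch sketch illustrates equation (1). Feeding each attention branch its own historical template, and the ReLU activations between the convolutions, are our assumptions; the paper does not spell out these details.

```python
import torch
import torch.nn as nn

class TemplateGeneration(nn.Module):
    """Sketch of the template generation module (equation (1))."""

    def __init__(self, channels, lam=0.5):
        super().__init__()
        def branch():  # two 3x3 convs, one 1x1 conv, sigmoid (per the paper)
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        # four non-shared fully convolutional branches, one per template
        self.branches = nn.ModuleList([branch() for _ in range(4)])
        self.lam = lam  # weight of the initial template (0.5 in the paper)

    def forward(self, init_t, hist_ts):
        # hist_ts: four historical template features, each (B, C, H, W);
        # each branch predicts a spatial weight map W_j for its template
        fused = sum(b(t) * t for b, t in zip(self.branches, hist_ts))
        return self.lam * init_t + (1.0 - self.lam) * fused
```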

When tracking a new frame of the video sequence, the candidate image (denoted as x) of the current frame is fed into the search branch of the backbone to generate its features. We feed the feature maps $\tilde{T}_i$ and $\varphi_i(x)$ into the corresponding RPN individually; the RPN outputs the classification feature $S_i$ and the regression feature $B_i$, respectively. We denote the trainable parameters $\alpha_i$ and $\beta_i$ as the classification weights and regression weights of the $i$-th convolutional layer. The outputs of the RPN modules are combined by a weighted-fusion layer and calculated as the weighted sums

$S_{all} = \sum_{i=3}^{5} \alpha_i S_i, \qquad B_{all} = \sum_{i=3}^{5} \beta_i B_i$.  (2)
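A hedged sketch of the weighted-fusion layer is given below; softmax-normalising the trainable weights follows common SiamRPN++-style implementations, while the text above only states that the weights are trainable.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Combines the three RPN outputs as in equation (2) (a sketch)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(3))  # classification weights
        self.beta = nn.Parameter(torch.zeros(3))   # regression weights

    def forward(self, cls_maps, reg_maps):
        # cls_maps/reg_maps: lists of the three per-level RPN outputs
        a = torch.softmax(self.alpha, dim=0)
        b = torch.softmax(self.beta, dim=0)
        cls = sum(w * m for w, m in zip(a, cls_maps))
        reg = sum(w * m for w, m in zip(b, reg_maps))
        return cls, reg
```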

3.4. Template Selection Module

Which historical templates should be memorized is the core of dynamically generating the current template. In our work, we adopt the confidence score between the current template and the search region to decide whether to update the memory pool.

The selecting strategy is as follows. We denote $\tau_i$ ($i = 3, 4, 5$) as the $i$-th layer confidence threshold and $\tau_f$ as the final similarity threshold. When the conditions $s_i^k \ge \tau_i$ and $s_f^k \ge \tau_f$ are satisfied, where $s_i^k$ is the $i$-th layer confidence score at frame $k$ and $s_f^k$ is the fused confidence score, we crop a template image of 127 × 127 pixels from the current frame and feed it into the ResNet-50 network. The network outputs new feature maps from blocks 3, 4, and 5, and we update the $i$-th template queue with the $i$-th layer feature maps using the FIFO strategy.

The selected historical feature maps are target feature maps of high accuracy. However, if we focus excessively on correctness and set the threshold for entering the memory too high, the memory pool will update slowly and all feature maps in the memory will be highly similar. Therefore, a balance between accuracy and update speed is essential for the selecting strategy. Our strategies are as follows.

We determine the relationship among the thresholds for accessing the memory through a large number of experiments, as shown in equation (3):

$\tau_3 < \tau_4 < \tau_5$.  (3)

We find that the deep features representing high-level semantic information must be highly consistent with the first frame to keep the template accurate. To ensure high diversity of the memory, the shallow features that characterize appearance information such as shape and colour require a lower threshold, so that the target is maintained under various appearance changes.

In addition, the confidence may remain high after the tracker has drifted to a wrong target, but the score between two consecutive frames changes suddenly when semantic interference appears against a clean background. A sudden change between two frames is therefore also judged as a low-confidence situation, and the feature is not allowed to enter the memory pool in this case. We denote the hyperparameter $\delta$ as the change threshold between two consecutive frames, as shown in equation (4):

$|s_i^k - s_i^{k-1}| < \delta$.  (4)

Here, $s_i^k$ denotes the classification confidence of the $i$-th layer at the $k$-th frame output by the RPN module.
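Putting the conditions together, the selection rule can be sketched as follows, using the threshold values reported in Section 4.2.3; the function and variable names are illustrative.

```python
# Per-layer confidence thresholds tau_i, final threshold tau_f, and the
# change threshold delta between consecutive frames (Section 4.2.3).
TAU = {3: 0.98, 4: 0.985, 5: 0.99}
TAU_F = 0.985
DELTA = 0.1

def should_update(scores, fused_score, prev_scores):
    """scores/prev_scores: {layer: confidence} at frames k and k-1."""
    for i in (3, 4, 5):
        if scores[i] < TAU[i]:                        # layer confidence too low
            return False
        if abs(scores[i] - prev_scores[i]) >= DELTA:  # sudden change, eq. (4)
            return False
    return fused_score >= TAU_F                       # fused confidence check
```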

4. Experiments and Results

We conduct experiments on five short-term and long-term benchmark databases: the VOT2018 [27], OTB-100 [28], UAV123 [29], NFS [30], and LaSOT [31] datasets. Furthermore, we conduct an ablation study to verify the effect of each proposed module.

4.1. Datasets
VOT2018. The VOT [27] dataset has been updated every year since 2013 and is composed of high-resolution colour sequences [14]. VOT2018 is a widely used dataset for single object tracking and consists of 60 challenging videos collected from real-life scenarios.

OTB-100. OTB-100 [28] is one of the most authoritative and widely used benchmark datasets. It contains 100 video sequences, extended from OTB-2013 (50 videos), of which about a quarter are grayscale sequences.

NFS. NFS [30] (Need for Speed, NfS) is a higher-frame-rate video dataset and benchmark for visual object tracking. It consists of 100 videos (380K frames) captured with now commonly available higher-frame-rate (240 FPS) cameras in real-world scenarios.

UAV123. The UAV123 [29] dataset, proposed in 2016, contains a set of video sequences for drone tracking. It comprises 123 annotated high-resolution aerial video sequences, totaling more than 110K frames [14].

LaSOT. LaSOT [31] is a large-scale dataset containing 1400 video sequences with more than 3.5 M image frames in total, of which 280 videos form the testing set. Since challenging scenarios such as occlusion, deformation, and target disappearance are common in LaSOT, it is widely used for evaluating long-term object tracking.
4.2. Implementation Details
4.2.1. Experimental Environment

The experiments are conducted on Ubuntu 18. The framework is implemented in PyTorch, and two 11 GB NVIDIA GeForce RTX 2080Ti GPUs are used to conduct the experiments.

4.2.2. Training

We use SiamRPN++ [10] as the baseline. In the proposed SiamAttnAT network, we use ResNet-50, pretrained on ImageNet, as the backbone of the tracking architecture. The whole network is trained and fine-tuned on the training sets of COCO, ImageNet DET, ImageNet VID, the YouTube-BoundingBoxes dataset, and LaSOT. In addition, we use an image of 127 × 127 pixels as the target template and an image of 255 × 255 pixels as the candidate image in both training and testing.

Our model is trained in an end-to-end fashion. We employ stochastic gradient descent (SGD) as the optimizer with a weight decay of 0.0005 and a momentum of 0.9. The learning rate is exponentially decayed from 0.005 to 0.0005. Following SiamRPN [15], the training loss is the sum of the classification loss and the regression loss; we adopt the cross-entropy loss for the former and the smooth L1 loss for the latter.
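A minimal sketch of this objective and optimizer setup is shown below; the anchor handling of the RPN and the exact decay factor are simplified away.

```python
import torch
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()  # classification loss
reg_criterion = nn.SmoothL1Loss()      # bounding-box regression loss

def tracking_loss(cls_logits, cls_labels, reg_pred, reg_target):
    # the total loss is the sum of the classification and regression terms
    return cls_criterion(cls_logits, cls_labels) + reg_criterion(reg_pred, reg_target)

model = nn.Linear(8, 2)  # placeholder for the full SiamAttnAT network
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# gamma is chosen per epoch count so that lr decays from 0.005 to 0.0005
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
```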

4.2.3. Inference

The first template provides the most reliable information, while the latest template provides the appearance most similar to the target in the current frame. In our work, we set the weight of the first template to $\lambda = 0.5$. Furthermore, we set the hyperparameters $\tau_3$, $\tau_4$, $\tau_5$, $\tau_f$, and $\delta$ to 0.98, 0.985, 0.99, 0.985, and 0.1, respectively.

4.3. Comparison with State-of-the-Art Methods

4.3.1. On OTB-100

We compare the proposed SiamAttnAT with nine other state-of-the-art trackers, including SiamFC [9], ECO [32], SiamRPN [15], DaSiamRPN [14], SiamRPN++ [10], ATOM [26], DiMP [21], SiamBAN [16], and GradNet [11], on this dataset. We adopt success plots and precision plots as the evaluation indicators of tracking performance on OTB-100. Figure 3 illustrates the success and precision plots over all 100 testing videos. As shown in Figure 3, SiamAttnAT outperforms all other trackers on both indicators. Specifically, our tracker achieves a success score of 0.715 and a precision score of 0.928. Compared with the baseline SiamRPN++, our tracker improves the success score and precision score by 1.9% and 1.4%, respectively, verifying its effectiveness and efficiency.

4.3.2. On UAV123

To evaluate the tracking performance of the proposed framework, we report comparisons of our tracker with several other state-of-the-art trackers, including Staple [33], SRDCF [20], SiamFC [9], ECO-HC [32], ECO [32], SiamRPN [15], DaSiamRPN [14], SiamRPN++ [10], and SiamCAR [17]. Figure 4 illustrates the precision and success scores of the compared trackers on the UAV123 dataset. Specifically, our tracker achieves an AUC score of 0.636, surpassing DaSiamRPN (0.586) and SiamCAR (0.614) by 5.0% and 2.2%, respectively. Compared with the baseline SiamRPN++, our tracker improves the success score and precision score by 2.3% and 4.6%.

4.3.3. On VOT2018

We conduct our experiment on the VOT2018 dataset [27] in comparison with state-of-the-art methods including UPDT [34], ECO [32], DaSiamRPN [14], DRT [35], RCO [27], SiamRPN [15], ATOM [26], SiamRPN++ [10], and DiMP [21].

Following the evaluation protocol of VOT2018, we use the expected average overlap (EAO), accuracy, and robustness as the evaluation indicators for the different trackers. Table 1 presents the comparison in EAO, accuracy, and robustness with almost all top-performing trackers on the VOT2018 benchmark. As shown in Table 1, our tracker achieves good results in EAO and accuracy. In comparison with the baseline tracker SiamRPN++, our approach obtains gains of 3.2%, 1.8%, and 3.3% in EAO, accuracy, and robustness, respectively.

4.3.4. On NFS

We evaluate the proposed tracker on the 30 fps version of the NFS [30] dataset. As shown in Table 2, our tracker achieves better performance: compared with the baseline SiamRPN++, the AUC score is improved by 2.3%.

4.3.5. On LaSOT

We conduct experiments on LaSOT [31] to further validate the long-term performance of proposed tracker on a large-scale database. As shown in Table 3, our method attains the best normalized precision and AUC score. Compared with SiamRPN++, our tracker has improved AUC and normalized precision by 6.7% and 5.4%, respectively.

4.4. Ablation Study

We provide detailed ablation studies on different strategies, including historical template selection, template queue size, and the fusion methods for the historical templates, to analyze their effects on tracking performance. To achieve a better balance among the different factors, we conduct extensive studies. All ablation experiments are conducted on the VOT2018 benchmark.

4.4.1. Effect of Historical Template Selecting Methods

The historical template selection method is one of the main components of our proposed tracker. We perform extensive experiments to analyze its effect on tracking performance.

We construct a template selection module to determine which templates can be stored in the memory pool. Besides our selecting strategy, there are two alternative methods: random selection and continuous selection. The former randomly selects multiple templates from the last 10 historical video frames, and the latter continuously selects the most recent ones. We set the template queue size to 4 and perform numerous experiments to explore their effects on tracking performance.

Table 4 compares the results of our tracker under the different selecting strategies. From the table, we observe that the tracker with our selecting strategy performs better than the trackers with continuous selection and random selection. Compared with random selection, the tracker with continuous selection performs slightly better.

4.4.2. Effects of Spatial Attention Module

There are two strategies to aggregate the historical templates in the template queues: a fixed weighted sum and a learnable weighted sum. To study their effects on tracking performance, we conduct experiments on the VOT2018 benchmark. We set the fixed weights of the four templates in the template queue ($f_i^1$, $f_i^2$, $f_i^3$, and $f_i^4$, with $i = 3, 4, 5$) to 0.05, 0.1, 0.15, and 0.2, respectively; the template latest to the current tracking frame is assigned the greatest weight. The attention module, composed of convolutional neural networks, generates the learnable weights to fuse the templates; a short contrast of the two strategies is sketched below. Table 5 shows the comparison of the different fusion strategies in terms of accuracy, robustness, and EAO.
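The two aggregation strategies can be contrasted in a short sketch; the names are ours, and `attn_branches` stands for the attention branches described in Section 3.3.

```python
# Fixed weighted sum: later templates receive larger hand-set weights
FIXED_W = [0.05, 0.1, 0.15, 0.2]

def fixed_fusion(hist_ts):
    return sum(w * t for w, t in zip(FIXED_W, hist_ts))

# Learnable weighted sum: each attention branch predicts a spatial
# weight map for its template instead of a hand-set scalar
def learnable_fusion(hist_ts, attn_branches):
    return sum(b(t) * t for b, t in zip(attn_branches, hist_ts))
```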

As shown in Table 5, the tracker with fixed weighted sum or with learnable weighted sum achieves better results than the baseline SiamRPN++. Compared with the fixed weighted sum, the tracker with attention module has improved accuracy, EAO, and robustness by 0.8%, 0.9%, and 2.1%, respectively. It can be interpreted that spatial attention can enhance the representation capability of the template to help improve the tracking performance.

5. Conclusions

In this work, we present a new tracking framework named Siamese Attention Network (SiamAttnAT). We propose a new method to update the template during tracking, composed of a historical template selecting strategy and a dynamic template generating approach. More specifically, our method fully utilizes the reliable predictions of the model to actively select historical templates for subsequently generating the current template. Compared with existing template update methods, our strategy avoids treating all historical templates equally or simply updating the template at a fixed interval, and thus enhances the representation capability of the template to improve the tracking performance of the model. Extensive experiments on five short-term and long-term benchmarks demonstrate that our method achieves competitive performance while running at real-time speed. In the future, we will conduct further studies to reduce the limitations of the tracker in certain tracking scenarios, such as slow motion, target disappearance, and similar distractors.

Data Availability

All data included in this study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.