Research Article

Online Multiplayer Tracking by Extracting Temporal Contexts with Transformer

Figure 1

The transformer-based framework of the proposed method. The CNN module extracts features from each input frame. The global feature maps of the previous and current frames are fed to the encoder, and the combined global feature map of the two consecutive frames is then supplied to the decoders as a common key. The temporal mask helps suppress background changes carried over from the previous frame and concentrates attention on the target players. The object detection features of the current frame and the tracked object features of the previous frame are input into two decoders with a shared structure, producing detection boxes and tracking boxes, respectively. Finally, IoU matching is employed to associate the detection boxes with the tracking boxes.
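To make the data flow in Figure 1 concrete, the following is a minimal PyTorch sketch of the described pipeline: a CNN backbone, an encoder over two consecutive frames, two decoders with a shared structure for detection and track queries, and IoU matching between the resulting boxes. All module names, dimensions, the query count, and the IoU threshold are illustrative assumptions rather than the authors' implementation, and the temporal mask is omitted for brevity.

```python
# Illustrative sketch only; not the paper's exact architecture.
import torch
import torch.nn as nn
from torchvision.ops import box_iou
from scipy.optimize import linear_sum_assignment


class TrackingTransformer(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        # CNN module: extracts a global feature map from each frame (assumed backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        # Encoder: fuses the global feature maps of two consecutive frames.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Two decoders built from the same layer definition ("shared structure"):
        # one for detection queries, one for the previous frame's track features.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.det_decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.trk_decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.det_queries = nn.Embedding(num_queries, d_model)
        self.box_head = nn.Linear(d_model, 4)  # normalized (cx, cy, w, h)

    def forward(self, prev_frame, cur_frame, track_feats):
        # Flatten each CNN feature map into a token sequence: (B, HW, C).
        def tokens(x):
            f = self.backbone(x)
            return f.flatten(2).transpose(1, 2)

        # Combined global feature map of the two consecutive frames,
        # used by both decoders as a common key/value memory.
        memory = self.encoder(
            torch.cat([tokens(prev_frame), tokens(cur_frame)], dim=1)
        )

        b = cur_frame.size(0)
        det_q = self.det_queries.weight.unsqueeze(0).expand(b, -1, -1)
        det_boxes = self.box_head(self.det_decoder(det_q, memory)).sigmoid()
        trk_boxes = self.box_head(self.trk_decoder(track_feats, memory)).sigmoid()
        return det_boxes, trk_boxes


def associate_by_iou(det_boxes, trk_boxes, iou_thresh=0.5):
    """Associate detection boxes with tracking boxes via IoU matching.

    Both inputs are (N, 4) tensors in (x1, y1, x2, y2) format;
    returns matched (detection_index, track_index) pairs.
    """
    iou = box_iou(det_boxes, trk_boxes)
    det_idx, trk_idx = linear_sum_assignment(-iou.numpy())
    return [(int(d), int(t)) for d, t in zip(det_idx, trk_idx)
            if iou[d, t] >= iou_thresh]
```

In this sketch the association step uses Hungarian matching on the IoU matrix; a simpler greedy IoU assignment would also fit the caption's description of "IoU matching".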