Research Article
CNN with Embedding Transformers for Person Reidentification
Figure 3
Architecture of CET. The input image X first passes through TIC to produce the feature map F; F then goes through PGAP to obtain the component features Pt, which are passed through a linear projection. The transformer encoder input Z is formed by prepending the global embedding vector GTE and adding the position encoding PE. Feeding Z to the transformer encoder yields the output O. Finally, O is split into two branches, FFLV and TBL, which are used to fuse the two losses. (a) CNN with embedding transformers (CET architecture); (b) transformer encoder.
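The data flow in Figure 3 can be sketched as tensor-shape bookkeeping. The sketch below is a minimal illustration, not the paper's implementation: the CNN backbone (TIC) is replaced by a random feature map, the number of parts T, channel count C, and embedding dimension D are assumed values, and the transformer encoder is stood in for by a single self-attention layer. Names such as `P`, `GTE`, and `PE` follow the caption's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# TIC (CNN backbone) output: feature map F of assumed shape (C, H, W).
C, H, W = 64, 12, 4
F = rng.standard_normal((C, H, W))

# PGAP: pool each of T horizontal stripes into one component feature Pt.
T = 4                                             # number of parts (assumed)
stripes = np.split(F, T, axis=1)                  # T stripes along height
P = np.stack([s.mean(axis=(1, 2)) for s in stripes])   # (T, C)

# Linear projection of each part feature into the embedding dimension D.
D = 32
W_proj = rng.standard_normal((C, D)) / np.sqrt(C)
P_proj = P @ W_proj                               # (T, D)

# Prepend the global embedding vector GTE and add position encoding PE
# to form the transformer input Z (both randomly initialised here).
GTE = rng.standard_normal((1, D))
PE = rng.standard_normal((T + 1, D))
Z = np.concatenate([GTE, P_proj], axis=0) + PE    # (T + 1, D)

# Single-head self-attention as a stand-in for the transformer encoder;
# the output O keeps the same shape as Z.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
O = softmax(Q @ K.T / np.sqrt(D)) @ V             # (T + 1, D)

# Two branches: FFLV takes the global token, TBL takes the part tokens
# (the branch heads and the two losses are omitted).
fflv_in = O[0]       # (D,)   global-token branch
tbl_in = O[1:]       # (T, D) part-token branch

print(Z.shape, O.shape, fflv_in.shape, tbl_in.shape)
```

The split into `fflv_in` and `tbl_in` mirrors the caption's final step, where O is divided between the FFLV and TBL branches whose losses are then fused.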