PointTransformer: Encoding Human Local Features for Small Target Detection

<div>The overall architecture of our proposed model. The attention backbone (A) utilizes a trained pose estimation model to reconstruct local features based on a transformer encoder. The gated position embedding module (G) uses human skeletal point location information to enhance local feature learning. The head-layer module (H) reconstructs the output layer by weighting the positional encoding feature maps.</div>

Computational Intelligence and Neuroscience

Figure 2: PointTransformer: Encoding Human Local Features for Small Target Detection