Research Article

PF-ViT: Parallel and Fast Vision Transformer for Offline Handwritten Chinese Character Recognition

Table 2: A summary of vision-transformer-related work.

CrossViT [35]
Brief methodology: The architecture consists of a stack of K multiscale transformer encoders. Each encoder uses two branches to process image tokens of different sizes and fuses the tokens at the end with an efficient module based on cross-attention of the CLS tokens (a hedged sketch of this fusion step is given after the table).
Highlights: The dual-branch transformer combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features.
Limitations: The dual-branch design increases FLOPs and the number of model parameters.

ViT and GCN [36]
Brief methodology: First, the scene image is divided into patches, which are encoded with positional encoding and a vision transformer so that long-range dependencies can be mined. In parallel, the scene image is converted into superpixels.
Highlights: Computational efficiency is significantly improved.
Limitations: The dataset is complex, and the model is designed for higher-resolution vision tasks.

ViT [1]
Brief methodology: First, the images under analysis are divided into patches, which are then flattened and embedded to form a sequence. To retain positional information, positional embeddings are added to the patch embeddings. The resulting sequence is fed to several multihead attention layers to generate the final representation (a minimal sketch of this pipeline is given after the table).
Highlights: To boost classification performance, the authors explore several data augmentation strategies to generate additional training data.
Limitations: The number of model parameters is large.

Vision transformer [37]
Brief methodology: In this study, the authors use ViT for the first time to classify breast ultrasound (US) images under different augmentation strategies. The results are reported as classification accuracy and area under the curve (AUC).
Highlights: The results indicate that ViT models are comparable to, or even better than, CNNs in classifying breast US images.
Limitations: The approach relies on transferring pretrained ViT models for further adaptation.

Convolutional vision transformer (CvT) [38]
Brief methodology: The design introduces two primary modifications: a hierarchy of transformers containing a new convolutional token embedding, and a convolutional transformer block leveraging a convolutional projection. These changes bring desirable properties of convolutional neural networks (CNNs) to the ViT architecture.
Highlights: The model has fewer parameters and lower FLOPs.

Re-attention [39]
Brief methodology: The model regenerates the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT architectures.
Highlights: The model has minimal computational and memory overhead.
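To make the ViT pipeline summarized in the table concrete, the sketch below (assuming PyTorch) splits an image into patches, embeds them, adds a CLS token and learnable positional embeddings, and applies standard multihead-attention encoder layers. The class name TinyViT and all hyperparameter values are illustrative assumptions, not the configuration used in [1].

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=32, patch_size=4, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding as a strided convolution (equivalent to flattening each patch and applying a linear layer).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                         # x: (B, 3, H, W)
        x = self.patch_embed(x)                   # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # prepend CLS, add positional embeddings
        x = self.encoder(x)                       # multihead self-attention layers
        return self.head(x[:, 0])                 # classify from the CLS token

logits = TinyViT()(torch.randn(2, 3, 32, 32))     # -> shape (2, 10)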
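A hedged sketch of the CrossViT-style fusion step follows, again in PyTorch: the CLS token of one branch attends, via cross-attention, to the patch tokens of the other branch and is projected back with a residual connection. The class name CLSCrossAttentionFusion, the projection layout, and the token dimensions are illustrative assumptions rather than the exact module of [35].

import torch
import torch.nn as nn

class CLSCrossAttentionFusion(nn.Module):
    def __init__(self, dim_small, dim_large, heads=4):
        super().__init__()
        self.proj_in = nn.Linear(dim_small, dim_large)    # project the small-branch CLS to the large-branch width
        self.attn = nn.MultiheadAttention(dim_large, heads, batch_first=True)
        self.proj_out = nn.Linear(dim_large, dim_small)   # project the fused CLS back

    def forward(self, small_tokens, large_tokens):
        # small_tokens: (B, N_s + 1, dim_small); large_tokens: (B, N_l + 1, dim_large)
        cls = self.proj_in(small_tokens[:, :1])            # small-branch CLS acts as the query
        fused, _ = self.attn(cls, large_tokens, large_tokens)
        cls_out = small_tokens[:, :1] + self.proj_out(fused)   # residual back-projection
        return torch.cat([cls_out, small_tokens[:, 1:]], dim=1)

small = torch.randn(2, 17, 96)     # 16 small-patch tokens + CLS
large = torch.randn(2, 5, 192)     # 4 large-patch tokens + CLS
out = CLSCrossAttentionFusion(96, 192)(small, large)    # -> shape (2, 17, 96)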