Research Article

Multisemantic Level Patch Merger Vision Transformer for Diagnosis of Pneumonia

Figure 4

The input image is split into patches, each of which is linearly embedded, and position embeddings are added to the resulting sequence. An extra learnable classification token is also added to the input sequence, which is then fed through several Transformer encoders.
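The input stage described in this caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the image size (224 × 224 × 3), patch size (16), embedding dimension (64), random weights, and the helper names `patchify` and `build_input_sequence` are all assumptions made for the example. It also assumes, as in the standard Vision Transformer, that the classification token is prepended before the position embeddings are added. The Transformer encoders themselves are omitted.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def build_input_sequence(image, patch_size, w_embed, cls_token, pos_embed):
    """Build the token sequence that would be fed to the Transformer encoders."""
    patches = patchify(image, patch_size)                  # (N, p*p*C)
    tokens = patches @ w_embed                             # linear embedding: (N, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # prepend class token: (N+1, D)
    return tokens + pos_embed                              # add position embeddings


# Hypothetical sizes and randomly initialised parameters, for illustration only.
rng = np.random.default_rng(0)
patch_size, dim = 16, 64
image = rng.random((224, 224, 3))
num_patches = (224 // patch_size) ** 2                     # 14 * 14 = 196

w_embed = rng.normal(size=(patch_size * patch_size * 3, dim)) * 0.02
cls_token = np.zeros((1, dim))                             # learnable in a real model
pos_embed = rng.normal(size=(num_patches + 1, dim)) * 0.02 # learnable in a real model

sequence = build_input_sequence(image, patch_size, w_embed, cls_token, pos_embed)
print(sequence.shape)  # (197, 64)
```

With these assumed sizes the image yields 196 patch tokens plus one classification token, so the encoder would receive a sequence of 197 vectors. In a trained model the embedding matrix, classification token, and position embeddings would be learned parameters rather than random or zero arrays.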