Classification of Diabetic Retinopathy Severity in Fundus Images Using the Vision Transformer and Residual Attention
Algorithm 1: Training the DR classification model.
Require: Fundus images and labels (X, Y), where Y = {y | y ∈ {0, 1, 2, 3, 4}}
Input: fundus image x ∈ X
(1) Initialize the network parameters.
// Feature Extraction Block (FEB); a code sketch of this block is given after the algorithm.
(2) Patch division: the image x is divided into 9 patches of fixed size.
(3) Linear projection of flattened patches: each patch is flattened into a row vector and mapped to the specified embedding dimension through a linear projection.
(4) Patch + position embedding: a learnable CLS token is generated and prepended to the patch embeddings, and position information generated for each patch is added directly, forming the new input tokens.
(5) Transformer Encoder: the Encoder Block is stacked L times for image feature extraction.
// Grading Prediction Block; a code sketch of this block is given after the algorithm.
(6) For the extracted feature matrix I, multiple score tensors are generated via different 1 × 1 convolutions.
(7) These class features are fused by average pooling.
(8) The classification result is obtained from the fused features through a fully connected (FC) classifier.
Output: a trained model that predicts, for an input x, the class probability of each severity grade y.
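To make steps (2)–(5) concrete, the following is a minimal PyTorch-style sketch of the Feature Extraction Block. The class name FeatureExtractionBlock, the image and patch sizes (chosen so the image splits into 9 patches, as in step (2)), the embedding dimension, and the encoder depth L are illustrative assumptions, not values reported for the model.

```python
# Minimal sketch of the Feature Extraction Block (FEB), steps (2)-(5).
# All hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn


class FeatureExtractionBlock(nn.Module):
    def __init__(self, img_size=384, patch_size=128, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 9 patches for 384 / 128, as in step (2)
        # Steps (2)+(3): patch division and linear projection, implemented as a strided conv
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Step (4): learnable CLS token and position embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Step (5): Transformer Encoder stacked L (= depth) times
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, x):                               # x: (B, 3, H, W) fundus images
        patches = self.patch_embed(x)                   # (B, D, H/P, W/P)
        patches = patches.flatten(2).transpose(1, 2)    # (B, N, D) flattened row vectors
        cls = self.cls_token.expand(x.size(0), -1, -1)  # (B, 1, D)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        return self.encoder(tokens)                     # extracted feature matrix I: (B, N+1, D)
```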
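Likewise, a minimal sketch of the Grading Prediction Block in steps (6)–(8). The number of parallel 1 × 1 convolution branches and their application token-wise as Conv1d over the sequence dimension are assumptions made for illustration; the algorithm does not fix these details.

```python
# Minimal sketch of the Grading Prediction Block, steps (6)-(8).
# Branch count and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class GradingPredictionBlock(nn.Module):
    def __init__(self, embed_dim=768, num_branches=3, num_classes=5):
        super().__init__()
        # Step (6): several different 1x1 convolutions produce parallel score tensors
        self.branches = nn.ModuleList(
            [nn.Conv1d(embed_dim, embed_dim, kernel_size=1) for _ in range(num_branches)]
        )
        # Step (8): FC classifier over the fused features (5 DR severity grades)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):                            # feats: (B, N+1, D) from the FEB
        feats = feats.transpose(1, 2)                    # (B, D, N+1) for Conv1d
        scores = [branch(feats) for branch in self.branches]
        # Step (7): fuse the score tensors by average pooling over branches and tokens
        fused = torch.stack(scores, dim=0).mean(dim=0)   # (B, D, N+1)
        fused = fused.mean(dim=2)                        # average pool over tokens -> (B, D)
        return self.fc(fused)                            # logits over the 5 grades
```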
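Finally, a sketch of the training loop implied by Algorithm 1, combining the two blocks above. The optimizer, learning rate, number of epochs, and data loader are placeholders, and cross-entropy over the five severity grades is assumed as the training objective.

```python
# Sketch of the overall training loop for Algorithm 1.
# Uses FeatureExtractionBlock and GradingPredictionBlock from the sketches above.
import torch
import torch.nn as nn


class DRGradingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.feb = FeatureExtractionBlock()
        self.gpb = GradingPredictionBlock()

    def forward(self, x):
        return self.gpb(self.feb(x))


def train(model, loader, epochs=10, lr=1e-4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                   # assumed objective over grades 0-4
    for _ in range(epochs):
        for images, labels in loader:                   # labels y in {0, 1, 2, 3, 4}
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model                                        # softmax of the logits gives class probabilities
```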