Research Article

Classification of Diabetic Retinopathy Severity in Fundus Images Using the Vision Transformer and Residual Attention

Algorithm 1

Training the DR classification model.
Require: fundus images and labels (X, Y), where Y = {y | y ∈ {0, 1, 2, 3, 4}}
Input: fundus image x ∈ X
(1) Initialize the network parameters
 //Feature Extraction Block (FEB)
(2) Patch division: the image x is divided into 9 fixed-size patches.
(3) Linear Projection of Flattened Patches: each patch is flattened into a row vector and mapped to the specified embedding dimension through a linear projection.
(4) Patch + Position Embedding: a CLS token is generated and prepended to the patch embeddings, and a position embedding is generated for each patch and added directly to the corresponding input token.
(5) Transformer Encoder: the Encoder Block is stacked L times for image feature extraction.
 //Grading Prediction Block
(6) From the extracted feature matrix I, multiple class-score tensors are generated via different 1 × 1 convolutions.
(7) These class-score features are fused by average pooling.
(8) The fused features are passed through a fully connected (FC) classifier to obtain the classification result.
Output: a trained model that predicts, for an input x, the class probability for each y ∈ {0, 1, 2, 3, 4}
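
To make the Feature Extraction Block in steps (2)-(5) of Algorithm 1 concrete, the following is a minimal PyTorch sketch. The image size, embedding dimension, encoder depth, head count, and the 3 × 3 patch grid are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of the Feature Extraction Block (Algorithm 1, steps (2)-(5)).
# Image size, embedding dimension, depth, and head count are assumed values.
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    def __init__(self, img_size=384, grid=3, in_ch=3, dim=768, depth=12, heads=12):
        super().__init__()
        patch = img_size // grid                                  # 3 x 3 grid -> 9 patches
        # Steps (2)-(3): patch division and linear projection of flattened patches,
        # implemented here as a single strided convolution.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_tokens = grid * grid + 1                              # 9 patches + 1 CLS token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # step (4): CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))  # step (4): position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # step (5): L stacked blocks

    def forward(self, x):                                         # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)          # (B, 9, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                  # prepend CLS token
        tokens = tokens + self.pos_embed                          # add position information
        return self.encoder(tokens)                               # (B, 10, dim) feature matrix

Here the single convolution with kernel and stride equal to the patch size is equivalent to cutting the image into patches, flattening each patch, and applying the same linear projection to all of them.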
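
A corresponding minimal sketch of the Grading Prediction Block in steps (6)-(8) follows; the number of 1 × 1 convolution branches and the feature dimension are likewise assumptions.

# Minimal sketch of the Grading Prediction Block (Algorithm 1, steps (6)-(8)).
# The number of 1 x 1 convolution branches and the feature dimension are assumptions.
import torch
import torch.nn as nn

class GradingPredictionBlock(nn.Module):
    def __init__(self, dim=768, num_branches=3, num_classes=5):
        super().__init__()
        # Step (6): different 1 x 1 convolutions produce multiple score tensors.
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=1) for _ in range(num_branches)]
        )
        self.fc = nn.Linear(dim, num_classes)                     # step (8): FC classifier over 5 DR grades

    def forward(self, feats):                                     # feats: (B, N, dim) from the FEB
        feats = feats.transpose(1, 2)                             # (B, dim, N) for Conv1d
        scores = [branch(feats) for branch in self.branches]
        fused = torch.stack(scores).mean(dim=0)                   # step (7): fuse the score tensors
        fused = fused.mean(dim=-1)                                # average pooling over tokens
        return self.fc(fused)                                     # logits for classes 0-4

feb_features = torch.randn(2, 10, 768)                            # placeholder features from the FEB
logits = GradingPredictionBlock()(feb_features)                   # shape (2, 5)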