Research Article

Classification of Diabetic Retinopathy Severity in Fundus Images Using the Vision Transformer and Residual Attention

Algorithm 1

Training the DR classification model.
Require: fundus images and labels (X, Y), where Y = {y | y ∈ {0, 1, 2, 3, 4}}
Input: fundus image x ∈ X
(1) Initialize the network parameters
 //Feature Extraction Block (FEB)
(2) Patch division: the image x is divided into 9 fixed-size patches.
(3) Linear Projection of Flattened Patches: each patch is flattened into a row vector and mapped to the specified embedding dimension through a linear projection.
(4) Patch + Position Embedding: a CLS token is generated and prepended to the patch embeddings, and a position embedding is generated for each patch and added directly to the corresponding input token.
(5) Transformer Encoder: the Encoder Block is stacked L times for image feature extraction.
 //Grading Prediction Block
(6) From the extracted feature matrix I, multiple class-score tensors are generated via different 1 × 1 convolutions.
(7) These class-score features are fused by average pooling.
(8) The fused features are passed through a fully connected (FC) classifier to obtain the classification result.
Output: a trained model that predicts, for an input x, the class probability for each y ∈ {0, 1, 2, 3, 4}
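
To make the Feature Extraction Block in steps (2)-(5) of Algorithm 1 concrete, the following is a minimal PyTorch sketch. The image size, embedding dimension, encoder depth, head count, and the 3 × 3 patch grid are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of the Feature Extraction Block (Algorithm 1, steps (2)-(5)).
# Image size, embedding dimension, depth, and head count are assumed values.
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    def __init__(self, img_size=384, grid=3, in_ch=3, dim=768, depth=12, heads=12):
        super().__init__()
        patch = img_size // grid                                  # 3 x 3 grid -> 9 patches
        # Steps (2)-(3): patch division and linear projection of flattened patches,
        # implemented here as a single strided convolution.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_tokens = grid * grid + 1                              # 9 patches + 1 CLS token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # step (4): CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))  # step (4): position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # step (5): L stacked blocks

    def forward(self, x):                                         # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)          # (B, 9, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                  # prepend CLS token
        tokens = tokens + self.pos_embed                          # add position information
        return self.encoder(tokens)                               # (B, 10, dim) feature matrix

Here the single convolution with kernel and stride equal to the patch size is equivalent to cutting the image into patches, flattening each patch, and applying the same linear projection to all of them.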
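
A corresponding minimal sketch of the Grading Prediction Block in steps (6)-(8) follows; the number of 1 × 1 convolution branches and the feature dimension are likewise assumptions.

# Minimal sketch of the Grading Prediction Block (Algorithm 1, steps (6)-(8)).
# The number of 1 x 1 convolution branches and the feature dimension are assumptions.
import torch
import torch.nn as nn

class GradingPredictionBlock(nn.Module):
    def __init__(self, dim=768, num_branches=3, num_classes=5):
        super().__init__()
        # Step (6): different 1 x 1 convolutions produce multiple score tensors.
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=1) for _ in range(num_branches)]
        )
        self.fc = nn.Linear(dim, num_classes)                     # step (8): FC classifier over 5 DR grades

    def forward(self, feats):                                     # feats: (B, N, dim) from the FEB
        feats = feats.transpose(1, 2)                             # (B, dim, N) for Conv1d
        scores = [branch(feats) for branch in self.branches]
        fused = torch.stack(scores).mean(dim=0)                   # step (7): fuse the score tensors
        fused = fused.mean(dim=-1)                                # average pooling over tokens
        return self.fc(fused)                                     # logits for classes 0-4

feb_features = torch.randn(2, 10, 768)                            # placeholder features from the FEB
logits = GradingPredictionBlock()(feb_features)                   # shape (2, 5)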