Abstract

Rheumatoid arthritis (RA) is a progressive systemic autoimmune disease characterized by inflammation of the joints and surrounding tissues, which seriously affects patients' quality of life. The Sharp/van der Heijde method is widely used for clinical evaluation of RA. However, this manual method is time-consuming and laborious, and even when two radiologists evaluate the same locations, their subjective assessments may yield low inter-rater reliability. Here, we developed an efficient model powered by deep convolutional neural networks to address these problems and automate overall scoring on hand X-rays. Because hand X-rays have high resolution, the depthwise separable (Dwise) convolution technique is applied to a ResNet-50 backbone. An inverted residual block is then introduced to devise the ResNet-Dwise50 model and further enhance efficiency. The model was trained and tested using bilateral posteroanterior (two-handed, side-by-side) images of 3818 patients. The experimental results show that the ResNet-Dwise50 model achieved an MAE of 14.90 and an RMSE of 22.01 while maintaining high efficiency. There was no statistically significant difference between the average scores given by two experienced radiologists and the scores predicted by our model.

1. Introduction

Rheumatoid arthritis (RA) is a chronic inflammatory autoimmune disease that primarily attacks the joints of the hands and feet, causing swelling, pain, and stiffness [1], and can lead to joint damage in the form of joint space narrowing and joint erosion [2]. Early diagnosis of RA is important because early treatment may prevent joint damage from worsening or slow its progression. Diagnosis usually relies on physical exams, blood tests, and imaging tests. Radiography is currently the primary tool for evaluating RA because of its wide availability, relatively low cost, and high capability for imaging bones and joints [3]. The commonly used and well-validated method for evaluating RA in hand X-rays is the Sharp/van der Heijde (SvH) method [2], in which a radiologist grades joint space narrowing and joint erosion at specific locations in each hand (and wrist) and foot. However, this manual approach is time-consuming and highly subjective, resulting in low inter-rater reliability; even trained experts often disagree on the final score. We therefore set out to devise a neural network that can automate SvH scoring in radiographs.

With the rapid development of artificial intelligence (AI) technology, deep learning has become a significant trend in medical image analysis. Convolutional neural networks have become the dominant machine learning method for medical image classification [4, 5], medical image segmentation [6, 7], medical image registration [8, 9], medical image fusion [10, 11], and medical image report generation [12–14]. Some researchers have devised models that address the RA scoring challenge by learning to assess joint damage directly from data. One approach uses various clinical data and trains a convolutional neural network on a dataset of radiographs to determine whether a patient has RA [15, 16]. Another mainstream line of work uses a two-step pipeline to detect finger joint destruction [17, 18]: the first step detects the joints, while the second step uses a convolutional neural network to classify the destruction score of each detected joint.

Although these models have shown promising results, they are not ideal for clinical application. The main reason is that the scoring system evaluates specific locations, and detecting these locations accurately is challenging. Some locations lie so close together that, even when one is detected correctly, interference from neighbouring locations is hard to rule out, especially in the carpal bones. Moreover, hand joints may be interlaced in patients with severe RA (Figure 1). Training a model to predict overall scores directly is therefore an attractive alternative.

We adopted a similar methodology and propose an efficient overall scoring framework for RA, which we refer to as ResNet-Dwise50. The model uses a convolutional neural network to extract features from the hand image and a fully connected layer to predict the overall score. It is an end-to-end model that performs regression on overall scores. Traditional image classification uses low-resolution inputs (most commonly 224 × 224), whereas medical image analysis generally requires high resolution, and training on high-resolution images demands substantial computing resources. We used depthwise separable convolution [19] to reduce the number of parameters and multiply-adds (MAdds) required to train on high-resolution images. Our experiments showed that this technique speeds up training at some cost in accuracy. We therefore introduced an inverted residual block, which adds an expansion layer before the depthwise layer, yielding minimal loss of accuracy while maintaining efficiency. In summary, our contributions are as follows:
(i) An efficient CNN model (ResNet-Dwise50) is designed for the overall scoring of RA on hand X-ray datasets by introducing depthwise separable convolution blocks and inverted residual blocks
(ii) Hand X-rays of 3818 patients with RA were collected and evaluated by experts using the SvH method
(iii) A new loss function is used to address the imbalance of Sharp scores

2.1. Sharp/van der Heijde (SvH) Method

The Sharp/van der Heijde (SvH) method [2] quantifies erosion and joint space narrowing (JSN). The method assesses the joints in both hands (all MCPs, the 1st IP, all PIPs, scaphoradial, lunate radial, distal ulna, and trapeziometacarpal) and both feet (all MTPJs and the 1st IPJ). Erosion is scored at sixteen areas in each hand (0 to 5 per area) and six areas in each foot (0 to 10 per area). JSN is scored from 0 to 4 at fifteen areas in each hand and six areas in each foot. The total SvH score therefore ranges from 0 to 448. This paper uses the SvH method to assess the hands only, so overall scores range from 0 to 280.
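As a quick consistency check (a worked breakdown of the figures above, not part of the original SvH formulation), the hand-only maximum combines the per-area maxima as

$$2 \times \big(16 \times 5 + 15 \times 4\big) = 2 \times (80 + 60) = 280.$$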

2.2. Convolutional Neural Network

Medical imaging is the process of imaging the interior of the body for clinical analysis and medical intervention. Recent advances in deep learning, especially convolutional neural network (CNN) technology, have brought many breakthroughs to medical image analysis. CNNs automatically learn and extract image features and have been successfully applied to medical image classification, lesion detection, and medical image segmentation. For example, the VGG, Inception, and ResNet models have been used for medical image classification, the Faster R-CNN and YOLO models for lesion detection, and the FCN and U-Net models for medical image segmentation. Existing radiographic work on rheumatoid arthritis mainly focuses on automatic SvH scoring [20] and on segmentation and classification [21]. For example, Rohrbach et al. [20] used a CNN to automatically predict bone erosion scores on X-ray images of patients with RA.

2.3. Depthwise Separable Convolution

A CNN can automatically detect important features without human supervision and performs well in image analysis. However, because many devices have limited hardware, the computational cost of neural networks still needs to be reduced. Google's MobileNet [19] and Xception [22] are optimized with depthwise separable convolution, a technique that factorizes a standard convolution into a depthwise and a pointwise convolution. First, the depthwise convolution applies a single filter to each input channel. The pointwise convolution then uses a 1 × 1 convolution to combine the outputs of the depthwise convolution. A standard convolution, by contrast, filters and combines the inputs into a new set of outputs in a single step. This factorization significantly reduces computation time and model size. Depthwise separable convolution has become a key technique in many efficient neural network modules [23–25].
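As an illustration, a minimal PyTorch sketch of a depthwise separable convolution block is given below; the layer arrangement and hyperparameters are illustrative rather than taken from [19] or [22].

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups = in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: 1x1 convolution combining the depthwise outputs.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: same spatial size, 64 -> 128 channels.
x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)  # shape: (1, 128, 56, 56)
```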

2.4. Transfer Learning

Transfer learning [26] has emerged as a technique for developing deep learning models that transfer across domains. In this method, the convolutional neural network is trained in two stages: (1) pretraining, where the network is trained on a large-scale labeled dataset, e.g., the CheXpert dataset [27]; (2) fine-tuning, where the pretrained network is further trained on a specific target dataset that has fewer labeled samples than the pretraining stage. This technique has become extremely helpful in many settings, particularly in medical imaging. In our experiments, we used it to fine-tune our model: we initialized our model with parameters pretrained on CheXpert and then continued training on our Sharp score dataset.
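A minimal sketch of this two-stage strategy in PyTorch follows; torchvision's ResNet-50 stands in for ResNet-Dwise50, and the checkpoint path is a placeholder rather than an actual file from this work.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone for illustration; in practice the pretrained ResNet-Dwise50 is used.
model = models.resnet50()

# Hypothetical checkpoint path; strict=False skips the old classification head.
state = torch.load("chexpert_pretrained.pth", map_location="cpu")
model.load_state_dict(state, strict=False)

# Replace the classifier with a single-output regression head for the Sharp score,
# then fine-tune the whole network on the target dataset.
model.fc = nn.Linear(model.fc.in_features, 1)
```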

3. Materials and Methods

3.1. Preprocessing and Dataset Planning
3.1.1. Dataset

To our knowledge, there is currently no open-source RA dataset with SvH scores, so we collected a private RA dataset containing a total of 3,818 sets of X-rays. Over the past two years, we collected X-rays of patients with RA, or with suspected RA, who visited the Department of Rheumatology of the First Clinical Medical College of Anhui University of Traditional Chinese Medicine. Two experienced radiologists were invited to score these X-ray images. Both radiologists read the images with the chronological order hidden and scored each patient's hand JSN and erosion according to the SvH method; the average of the two scores was taken as the Sharp score of each X-ray image. From this dataset, 2700 images were randomly selected for training, 760 of the remaining images were randomly selected for validation, and the last 158 images were used for testing. There was no image overlap between the training set and the test set. Figure 2 shows the frequency distribution of Sharp scores in the RA dataset.

3.1.2. Data Augmentation

It is essential to augment the training data to improve the model's robustness, especially when labeled medical image data are limited. We adopt an approach similar to [30], applying the following steps (see the sketch after this list):
(1) The original image is resized by its shorter side with a scale sampled from the range 0.9–1.0
(2) A 1024 × 1024 crop is randomly sampled from the resized image or its horizontal flip
(3) The brightness, contrast, and saturation of the cropped image are randomly changed, each with a factor of 0.1
(4) The image is randomly rotated by one of the angles [0°, 90°, 180°, 270°]
(5) The tensor image is normalized with the mean and standard deviation computed on the training dataset
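Below is a sketch of this augmentation pipeline using torchvision transforms; it assumes grayscale PIL inputs, uses placeholder normalization statistics, and may differ from the original implementation in its exact parameterization.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class RandomShortSideResize:
    """Step 1: resize the shorter side by a random factor in [0.9, 1.0]."""
    def __init__(self, low=0.9, high=1.0):
        self.low, self.high = low, high
    def __call__(self, img):
        w, h = img.size
        short = int(min(w, h) * random.uniform(self.low, self.high))
        return TF.resize(img, short)

class RandomRotate90:
    """Step 4: rotate by one of 0, 90, 180, or 270 degrees."""
    def __call__(self, img):
        return TF.rotate(img, angle=random.choice([0, 90, 180, 270]))

# Placeholder statistics; in practice they are computed on the training set (step 5).
MEAN, STD = [0.5], [0.25]

train_transform = T.Compose([
    RandomShortSideResize(0.9, 1.0),                               # step 1
    T.RandomCrop(1024, pad_if_needed=True),                        # step 2: 1024 x 1024 crop
    T.RandomHorizontalFlip(),                                      # step 2: or its horizontal flip
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),   # step 3
    RandomRotate90(),                                              # step 4
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),                               # step 5
])
```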

The score distribution of our dataset is roughly exponential. In our experiments, training directly on the raw scores made convergence difficult, so it is essential to standardize the scores before feeding them into the network. We chose z-score normalization, which gave promising results.
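A minimal sketch of z-score normalization of the targets, with the inverse transform applied to predictions, is shown below; the example scores are made up for illustration.

```python
import numpy as np

# Statistics are computed on the training-set scores only (toy example values).
train_scores = np.asarray([0.0, 2.0, 5.0, 14.0, 60.0, 180.0])
mu, sigma = train_scores.mean(), train_scores.std()

def normalize(score):
    """Map a raw Sharp score to a z-score before feeding it to the network."""
    return (score - mu) / sigma

def denormalize(z):
    """Map a network output back to the original Sharp score scale."""
    return z * sigma + mu
```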

3.2. Deep Learning Model
3.2.1. Overall Pipeline

The pipeline for the overall scoring of RA is shown in Figure 3. We use a customized depthwise separable convolutional neural network to extract a feature matrix, followed by a fully connected layer that predicts a score, and a loss function that regresses the score. In depthwise separable convolution, the standard convolutional layer is split into two stages: depthwise convolution performs convolution within each depth slice, while pointwise convolution merges information across the entire depth. The ResNet-Dwise50 model uses inverted residual blocks built on depthwise separable convolution, enhancing information flow through the model.

3.2.2. Customized Convolutional Neural Network

Considering the simplicity and high efficiency of residual networks, we use them as our baseline architecture. He et al. [28] found that full preactivation performs better than the original post-activation ordering and other activation variants. He et al. [29] added a 2 × 2 average pooling layer with a stride of 2 before the 1 × 1 convolution in the downsampling block; this tweak, called ResNet-D, is illustrated in Figure 4 and is effective in practice while adding only a slight computational cost. We therefore adopted full preactivation and ResNet-D in our baseline architecture to improve model performance.
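For reference, the ResNet-D shortcut described above can be sketched in PyTorch as follows; the layer choices are ours and may differ from [29] in detail.

```python
import torch.nn as nn

def resnet_d_shortcut(in_channels, out_channels):
    """ResNet-D downsampling shortcut: a 2x2 average pooling with stride 2 placed
    before a stride-1 1x1 convolution, instead of a strided 1x1 convolution,
    so no activations are discarded by the projection."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```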

The residual network consists of multiple residual blocks, which we formulate first. With $x_l$ as its input, the output of the $l$-th residual block is recursively defined as the following equation:

$$x_{l+1} = x_l + \mathcal{F}(x_l, W_l), \tag{1}$$

where $\mathcal{F}$ is the residual block's forward propagation, often 2 or 3 stacked convolution layers. Each layer consists of a sequence of convolutions, batch normalizations [24], and rectified linear units (ReLUs). For a block of two stacked convolution layers, $\mathcal{F}$ is defined as the following equation:

$$\mathcal{F}(x_l, W_l) = W_{l,2} * \sigma\big(B(W_{l,1} * x_l)\big), \tag{2}$$

where $W_{l,1}$ and $W_{l,2}$ are trainable parameters, $*$ denotes the convolution, $B$ is batch normalization, and $\sigma(x) = \max(x, 0)$.

3.2.3. ResNet-Dwise50 Model

Although the standard residual network performs well, the high resolution of medical images means that training consumes a large amount of memory and reduces efficiency. In this paper, depthwise separable convolution is used instead of standard convolution to reduce the number of parameters and the training time of the model. We also introduced inverted residual blocks [23], shown in Figure 5, which build on depthwise separable convolution to further improve the model's efficiency.

Suppose a standard convolutional layer takes a $D_F \times D_F \times M$ input feature map $\mathbf{F}$ and produces a $D_G \times D_G \times N$ output feature map $\mathbf{G}$, where $D_F$ is the width and height of the square input feature map, $D_G$ is the width and height of the square output feature map, $M$ is the number of input channels, and $N$ is the number of output channels.

In the standard convolutional layer, the convolution kernel $\mathbf{K}$ is parameterized as a $D_K \times D_K \times M \times N$ tensor, where $D_K$ is the spatial dimension of the kernel and $M$ and $N$ are the numbers of input and output channels defined previously. The output feature map of the standard convolution (stride 1 with "same" padding) is calculated as the following equation:

$$\mathbf{G}_{k,l,n} = \sum_{i,j,m} \mathbf{K}_{i,j,m,n} \cdot \mathbf{F}_{k+i-1,\,l+j-1,\,m}. \tag{3}$$

So the computational cost of the standard convolution is given by the following equation:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F. \tag{4}$$

The depthwise separable convolution consists of two layers: depthwise convolution and pointwise convolution. The depthwise convolution applies one filter to each input channel, and the pointwise convolution uses a 1 × 1 convolution to create a linear combination of the depthwise outputs. The output feature map of the depthwise convolution can be written as the following equation:

$$\hat{\mathbf{G}}_{k,l,m} = \sum_{i,j} \hat{\mathbf{K}}_{i,j,m} \cdot \mathbf{F}_{k+i-1,\,l+j-1,\,m}, \tag{5}$$

where $\hat{\mathbf{K}}$ is the depthwise convolution kernel of size $D_K \times D_K \times M$ and $\hat{\mathbf{G}}$ is the output feature map. The $m$-th filter in $\hat{\mathbf{K}}$ is applied to the $m$-th channel of the input $\mathbf{F}$ to produce the $m$-th channel of the output feature map $\hat{\mathbf{G}}$.

So the computational cost of the depthwise separable convolution is given by the following equation:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F, \tag{6}$$

which is the sum of the costs of the depthwise convolution and the 1 × 1 pointwise convolution.

Splitting the standard convolution into a depthwise convolution and a pointwise convolution reduces the computation by the factor

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^{2}}. \tag{7}$$
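For example, with the 3 × 3 kernels used in this paper ($D_K = 3$) and an illustrative channel count of $N = 256$ (an example value of our own), the ratio in equation (7) is

$$\frac{1}{256} + \frac{1}{9} \approx 0.115,$$

i.e., roughly 8 to 9 times less computation than a standard convolution.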

We formulate the total number of multiply-adds of the inverted residual block as shown in the following equation:

$$D_F \cdot D_F \cdot M \cdot t\,\big(M + D_K^{2} + N\big), \tag{8}$$

where $t$ is the expansion factor. This expression has an additional term compared with equation (6) because of the extra 1 × 1 expansion convolution. However, the nature of this block allows us to exploit small input and output dimensions: when a bottleneck layer takes an $M$-channel tensor as input and outputs an $N$-channel tensor, the expansion layer has $t \cdot M$ channels.

We modified the residual blocks in ResNet into inverted residual blocks and used full preactivation in the depthwise separable convolution blocks. All convolutional layers are preceded by batch normalization and a ReLU nonlinearity, except for the final fully connected layer. The ResNet-D tweak is used in the downsampling blocks to enhance information transfer. A final average pooling layer reduces the spatial resolution to 1 before the fully connected layer. The ResNet-Dwise50 model is defined in Table 1. For all experiments using inverted residual blocks, we chose an expansion factor of 6. Counting depthwise convolutions as separate layers, ResNet-Dwise50 has 50 layers.
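For illustration, a simplified PyTorch sketch of a preactivated inverted residual block is given below; it shows the expansion, depthwise, and projection stages, while the exact ResNet-Dwise50 configuration follows Table 1 and may differ in detail.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block with full preactivation (BN + ReLU before each conv):
    1x1 expansion -> 3x3 depthwise -> 1x1 projection, with an identity skip
    when the stride is 1 and the channel counts match."""
    def __init__(self, in_channels, out_channels, stride=1, expansion=6):
        super().__init__()
        hidden = in_channels * expansion
        self.use_skip = (stride == 1 and in_channels == out_channels)
        self.block = nn.Sequential(
            # Expansion: 1x1 convolution raising the channel count to t * M.
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            # Depthwise: one 3x3 filter per expanded channel.
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),
            # Projection: 1x1 convolution back down to the output channel count.
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

# Example with an expansion factor of 6, as used in the paper's experiments.
y = InvertedResidual(64, 64, stride=1, expansion=6)(torch.randn(1, 64, 32, 32))
```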

3.3. Loss Function

Class imbalance is ubiquitous in medical data. When the categories in the training set are highly unbalanced, the performance of the prediction model is significantly affected, with more errors on the classes that have fewer samples. Figure 2 shows the frequency distribution of the scores, from which we can see that our dataset is also severely imbalanced. To mitigate the effect of this imbalance on the model, we devised a new loss function that treats the error of each sample equally, regardless of whether it belongs to the majority or the minority. We combine the advantages of the L1 and L2 losses to obtain a smooth loss function that focuses on hard, poorly predicted examples.

For the overall prediction of the Sharp score, we use the loss in the following equation:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \operatorname{smooth}\big(y_i - \hat{y}_i\big), \tag{9}$$

in which the smooth term is given by the following equation:

$$\operatorname{smooth}(x) = \begin{cases} \alpha x^{2}, & |x| < \beta, \\ |x| - \gamma, & \text{otherwise}, \end{cases} \tag{10}$$

where $\alpha$, $\beta$, and $\gamma$ are manually defined factors.

Intuitively, the modulating factor $\alpha$ reduces the loss contribution from easy examples, and the receiving factor $\beta$ extends the range in which an example receives a low loss. The bias $\gamma$ of the L1 term should not be too large and also plays a receiving role. Two properties of the smooth loss are worth noting. (1) When the predicted score of a sample is close to the true score, the L2 term is used to reduce the loss for well-predicted examples; $\beta$ defines how close the prediction must be, and $\alpha$ regulates how much the loss is reduced. (2) When the predicted score is far from the true score, the L1 term is used for hard examples. We found the best-performing values of $\alpha$, $\beta$, and $\gamma$ empirically in our experiments.
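Under the reconstruction in equations (9) and (10), a PyTorch sketch of the smooth loss might look as follows; the factor values shown are illustrative, not the tuned values from the experiments.

```python
import torch

def smooth_loss(pred, target, alpha=0.5, beta=1.0, gamma=0.5):
    """Piecewise L2/L1 loss: a down-weighted quadratic term for easy examples
    (|error| < beta) and a shifted absolute term for hard examples.
    alpha, beta, gamma correspond to the manually defined factors in eq. (10)."""
    diff = pred - target
    abs_diff = diff.abs()
    l2_part = alpha * diff.pow(2)   # well-predicted (easy) examples
    l1_part = abs_diff - gamma      # poorly predicted (hard) examples
    return torch.where(abs_diff < beta, l2_part, l1_part).mean()

# Example usage on normalized scores.
pred = torch.tensor([0.1, 1.5, -2.0])
target = torch.tensor([0.0, 0.2, -0.5])
loss = smooth_loss(pred, target)
```

With these illustrative values the two branches meet at $|x| = \beta$, so the loss stays continuous at the switch point.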

4. Results and Discussion

4.1. Implementation Details

The assessment of Sharp scores requires high resolution, so we used a 1024 × 1024 resolution in this experiment. The proposed module is implemented in PyTorch 1.7. We initialize the weights as in [31] and train ResNet-Dwise50 from scratch on the hand X-rays of the SvH score dataset. We use stochastic gradient descent with a minibatch size of 4; the learning rate starts from 0.001, with a weight decay of 0.001 and a momentum of 0.9. We do not use dropout.
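The training configuration described above can be sketched as follows; torchvision's ResNet-50 again stands in for ResNet-Dwise50, and the built-in smooth L1 loss stands in for the smooth loss of Section 3.3.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone for illustration; the paper trains ResNet-Dwise50 on 1024x1024 inputs.
model = models.resnet50()
model.fc = nn.Linear(model.fc.in_features, 1)  # single-output regression head

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.001)

# One training step on a minibatch of 4 images (no dropout is used).
images = torch.randn(4, 3, 1024, 1024)
targets = torch.randn(4, 1)  # z-score-normalized Sharp scores
optimizer.zero_grad()
loss = nn.functional.smooth_l1_loss(model(images), targets)
loss.backward()
optimizer.step()
```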

The model was trained on a workstation using the Ubuntu 16.04 operating system. The workstation had an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz, 64 GB RAM, and two NVIDIA Tesla P4 GPUs.

4.2. Evaluation Metrics

We used several metrics to evaluate the performance of the proposed method, including Pearson's correlation coefficient $r$, mean absolute error (MAE), and root-mean-square error (RMSE). These metrics are defined in equations (11)–(13):

$$r = \frac{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)\big(\hat{y}_i - \bar{\hat{y}}\big)}{\sqrt{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^{2}}\,\sqrt{\sum_{i=1}^{n}\big(\hat{y}_i - \bar{\hat{y}}\big)^{2}}}, \tag{11}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|, \tag{12}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2}}, \tag{13}$$

where $y_i$ and $\hat{y}_i$ are the true and predicted scores, $\bar{y}$ and $\bar{\hat{y}}$ are their means, and $n$ is the number of test samples.

We also performed a t-test on the predicted scores and true scores of the test dataset. We used a two-sided paired t-test with the null hypothesis that the two related samples have identical average values; the test measures whether the average scores differ significantly. If the $p$ value is large, for example greater than 0.05, we cannot reject the null hypothesis of identical average scores; if the $p$ value is smaller than the threshold, we reject the null hypothesis of equal averages. We first evaluated the differences between the SvH scores given by the two radiologists using the above metrics, as shown in Table 2. The results show that even with a shared understanding of RA diagnosis, there is still slight inconsistency between the scores they assign.
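A sketch of how these metrics and the paired t-test could be computed with NumPy and SciPy follows; the paper does not specify its implementation, so the function choices are ours.

```python
import numpy as np
from scipy import stats

def evaluate(y_true, y_pred):
    """Pearson's r, MAE, RMSE, and the two-sided paired t-test p value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r, _ = stats.pearsonr(y_true, y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Null hypothesis: the two related samples have identical average scores.
    _, p_value = stats.ttest_rel(y_true, y_pred)
    return {"r": r, "MAE": mae, "RMSE": rmse, "p": p_value}

# Toy example with made-up scores.
print(evaluate([10, 40, 120, 250], [12, 35, 130, 240]))
```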

As described in Section 3.1.1, we used the mean of the two SvH scores assigned by the two readers as the true score. This study therefore measures the degree of agreement between the model's prediction, treated as one rater, and the true score provided by the two readers, treated as the other rater.

4.3. Ablation Studies

We used the ResNet-50 model with bottleneck building blocks as the baseline (B). Too many parameters can introduce excessive variability into the model and lead to overfitting, and our dataset is not large enough to counteract this, so we reduced the number of parameters. We used the depthwise separable convolution block (DSC) to replace the bottleneck building block for acceleration. The inverted residual block (IRB) adds an expansion layer in front of the depthwise separable convolution block, and the results show that this modification improves generalization and brings the predicted Sharp scores closer to the actual values. Table 3 shows that when we used DSC in ResNet-50, the performance dropped somewhat, but the number of trainable parameters was reduced to 1/8 of the original and the training time was halved. On this basis, we replaced the standard residual block with the IRB, and the performance of the model improved greatly while maintaining high efficiency. These ablation studies show that each component of our architecture plays an important role in improving performance.

4.4. Quantitative Comparisons

Table 4 summarizes the performance of different models in predicting overall scores on the RA dataset. Our ResNet-Dwise50 provided the best performance, achieving the smallest RMSE at a modest computational cost. The ResNet-50 model does not perform as well as the lightweight networks, indicating that too many parameters can lead to overfitting on a small dataset. The ResNet-Dwise50 model outperformed MobileNetV2, reducing the RMSE by 3.51 points, which supports the validity of introducing the inverted residual block into the baseline model. It is worth noting that the $p$ values of all models were greater than 0.05, so there was no statistically significant difference between the true and predicted scores, suggesting that these models can predict the overall score with reasonable accuracy.

A small dataset may lead to overfitting or underfitting, and the accuracy of a neural network trained from scratch is generally limited. We therefore pretrained our proposed models on the CheXpert dataset and transferred the feature extractor's parameters to our specific task. The results of these models with transfer learning on our Sharp score dataset are shown in Table 5. Transfer learning substantially improves the accuracy of all models.

To illustrate the advantages of our approach, Table 6 shows four sets of sample images together with the two radiologists' scores and the score predicted by the ResNet-Dwise50 model with transfer learning. Our approach predicts the overall score well, differing only slightly from the average score of the two radiologists.

To further visualize the model's performance, a sample of 100 patients was randomly selected from the test dataset, and the true scores were compared with the predicted scores. Figure 6 shows the predictions of our ResNet-Dwise50 model with transfer learning and confirms its good performance in the overall scoring of RA. It should be acknowledged that the same absolute prediction error does not carry the same clinical weight across the score range: the difference between 250 and 270 is not significant for a severe patient, whereas the difference between 0 and 20 is significant for a milder patient. Future research may aim to find criteria that better predict and adjust the results according to the scoring strategy.

5. Conclusions

It is challenging to use deep learning to automatically assign SvH scores: (1) the amount of training data is limited; (2) the scores are severely unbalanced; (3) high-resolution images are required; (4) data labelling is based on subjective human scores, without a true ground truth. We cooperated with the hospital to collect hand X-ray radiographs of 3818 patients with RA, invited two experienced radiologists to evaluate the radiographs using the SvH method, and took the average score as the ground truth. We proposed a smooth loss to address the imbalance of the scoring distribution in our dataset. We also proposed an efficient ResNet-Dwise50 model for predicting the overall score of RA, improving training efficiency on high-resolution images. We replaced standard convolutions with depthwise separable convolution blocks to reduce the model's computation and introduced inverted residual blocks to preserve information integrity in the bottleneck and improve performance. As a result, the ResNet-Dwise50 model outperformed the MobileNetV2 model at a similar speed. In future work, we will seek better metrics for predicting and adjusting the results according to the scoring strategy, continue automated research on the overall scoring of RA foot X-rays, and apply the results in the clinic.

Data Availability

The data used to support the results of this study are available from the corresponding authors upon request. One can also download the dataset from https://drive.google.com/drive/folders/1O11ROJypqyBksrfYqTIjdXxdc6MBao27?usp=sharing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to acknowledge the Department of Rheumatology of the First Clinical Medical College of Anhui University of Traditional Chinese Medicine for the infrastructure and radiologist evaluation support. This work was supported in part by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2020-015, Natural Science Research Project of Anhui Universities under Grant KJ2020A0392, and Natural Science Research Fund Project of Department of Education of Anhui Province under Grant 2020zrzd16.