Abstract

China has the second largest number of tuberculosis (TB) cases in the world, and Xinjiang has the highest TB incidence in China. Urumqi is the capital city of Xinjiang, so effective TB prevention and control there can serve as an example for other parts of the region. Because predicting TB incidence is a prerequisite for prevention and control, a prediction study of TB incidence in Urumqi is needed. In this paper, based on data on TB incidence and air pollution variables (PM2.5, PM10, SO2, CO, NO2, and O3) in Urumqi, the ARMA (1, (1, 3)) + model was established using the time series ARMA method, cross-correlation analysis, and principal component regression. Its predictive performance was superior to that of the ARMA (1, (1, 3)) model based on historical TB data alone. The modeling approach can serve as a reference for other researchers, and the predictions of the ARMA (1, (1, 3)) + model can support TB prevention and control in Urumqi, China. The analysis also found that the higher the concentration of O3, the higher the incidence of TB. This study suggests that people in Urumqi should pay more attention to the hazards of O3 and take appropriate personal protection.

1. Introduction

Tuberculosis (TB), also known as consumption, is a chronic infectious disease with serious long-term health hazards. It is caused by Mycobacterium tuberculosis and mainly affects the lungs. If the disease is not detected in time and treated thoroughly, it causes serious harm to health, can lead to respiratory failure and death, and places a heavy economic burden on patients and their families [1]. China is one of the 22 countries with a high burden of TB, and its TB burden ranks second in the world [2].

Xinjiang Uygur Autonomous Region is a provincial-level region in northwest China, located in the center of Eurasia at latitude 34°22′∼49°33′N and longitude 73°41′∼96°18′E [3]. In recent years, the annual TB incidence in Xinjiang has consistently been the highest in China, at 2 to 3 times the national level. Urumqi is the capital city of Xinjiang (its position is shown in the red part of Figure 1); its average annual TB incidence was 64.33 from 2014 to 2017.

Because its per capita economic, medical, and environmental health conditions are better, TB prevention and control in Urumqi are much better than in other areas of Xinjiang, and the TB incidence is therefore relatively low. Nevertheless, TB prevention and control work in Urumqi cannot be relaxed.

The prediction of infectious diseases is an important part of public health work: it can detect the development trend of a disease as early as possible, make epidemic prevention work more predictable, and support health decisions [4]. The ARMA model method has been a popular prediction method in recent years. It can reveal the quantitative relationship between the research object and other objects as they develop and change over time, and it has been widely and successfully used in the prediction of infectious diseases [5–8]. Although the ARMA model has good prediction performance, it forecasts only from historical incidence data and does not take into account factors that may affect the incidence of the disease. Many studies have found that infectious diseases are related to air pollution variables [9–13]. Proper consideration of air quality factors can improve both the accuracy and the scientific basis of prediction. The general linear regression model (GLM) can take into account factors that influence the predicted variable [14]. In regression analysis, however, high correlation between independent variables (multicollinearity) leads to the pseudoregression phenomenon. Principal component regression is a good way to avoid this problem: it removes the correlation between variables, yields a valid multiple regression model, and thus gives a sound prediction [15–18].

In this study, the ARMA model method was first used to establish a prediction model of TB incidence in Urumqi based on monthly TB incidence data from January 2014 to December 2017. Second, considering that TB incidence may be affected by air quality factors, cross-correlation analysis was used to analyze the effect of air pollution variables on TB incidence, and the variables that affected TB incidence were incorporated into the ARMA model to improve its prediction accuracy. Then, because the independent variables of the new model were highly correlated, principal component regression was used to avoid pseudoregression. Finally, a prediction model that includes air pollution variables and has high prediction accuracy was established.

2. Materials and Methods

2.1. Data Source

The monthly TB incidence data (per 100,000 population) in Urumqi from January 2014 to December 2017 were obtained from the Center for Disease Control and Prevention (CDC), Urumqi, China. The air pollution data in Urumqi from January 2014 to June 2018 were obtained from the Urumqi Meteorological Bureau. The six air pollution variables are PM2.5, PM10, SO2, CO, NO2, and O3.

2.2. Model Descriptions
2.2.1. ARMA Model Description

The autoregressive moving average (ARMA) model, also known as the Box–Jenkins model, has two parts: the autoregressive part and the moving average part [19]. The autoregressive moving average process is denoted ARMA (p, q) and is expressed as follows:

$$x_t = c + \sum_{i=1}^{p} \varphi_i x_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},$$

where $x_t$ is the value of the series at time $t$, $\varepsilon_t$ is white noise, $c$ is a constant, $\varphi_1, \ldots, \varphi_p$ are the coefficients of the autoregressive model, and $\theta_1, \ldots, \theta_q$ are the coefficients of the moving average model.

If some of the autoregressive coefficients and moving average coefficients in the ARMA (p, q) model are 0, the model is called a sparse model. For example, if the second moving average coefficient ($\theta_2$) is 0 in the ARMA (1, 3) model, the model becomes the sparse model ARMA (1, (1, 3)).
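The sparse ARMA (1, (1, 3)) recursion can be illustrated with a minimal sketch. It assumes the moving average terms enter with a plus sign, and that values before the start of the sample are zero; the function name, the coefficient values, and the simulated noise are hypothetical and are not the fitted model from this study.

```python
import random


def sparse_arma_1_13(eps, c, phi1, theta1, theta3):
    """Generate x_t = c + phi1*x_{t-1} + eps_t + theta1*eps_{t-1} + theta3*eps_{t-3}.

    Pre-sample values of x and eps are taken as 0 (an assumption of this sketch).
    """
    x = []
    for t, e in enumerate(eps):
        prev_x = x[t - 1] if t >= 1 else 0.0
        e1 = eps[t - 1] if t >= 1 else 0.0
        e3 = eps[t - 3] if t >= 3 else 0.0
        # There is no eps[t-2] term: the lag-2 MA coefficient is fixed at 0,
        # which is what makes the model "sparse".
        x.append(c + phi1 * prev_x + e + theta1 * e1 + theta3 * e3)
    return x


# Hypothetical coefficients and simulated white noise, for illustration only.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(48)]
series = sparse_arma_1_13(noise, c=2.0, phi1=0.5, theta1=0.3, theta3=0.2)
print(len(series))
```

Feeding a single unit shock through the function shows the lag structure directly: the response at lag 3 receives an extra contribution from theta3, while lag 2 is driven by the autoregressive term alone.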

In general, ARMA modeling includes five steps.

First step: ARMA modeling requires stationary data. The Augmented Dickey–Fuller (ADF) test is used to test the stationarity of the data, with a test level of 0.05.

Second step: the model is identified by analyzing the autocorrelation and partial autocorrelation functions (ACF and PACF), and tentative values of p and q are determined.

Third step: the parameters of the model are estimated and tested, and the AIC and SC values of the model are calculated; the smaller the AIC and SC, the better the performance of the model.

Fourth step: the model residuals are analyzed with the Box–Jenkins Q test. If the residual sequence of the model is white noise, the prediction stage begins; otherwise, the information extraction is not sufficient, and methods to improve the accuracy of the model need to be considered.

Fifth step: the established model is used to predict the future trend of the sequence.
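The model comparison in the third step can be sketched as follows. This is a minimal illustration that assumes the per-observation form of the two criteria used by some econometrics packages; the model names, log-likelihoods, parameter counts, and sample size are made up and are not results from this study.

```python
import math


def aic(log_likelihood, n_params, n_obs):
    """Akaike information criterion, per-observation form."""
    return (-2.0 * log_likelihood + 2.0 * n_params) / n_obs


def sc(log_likelihood, n_params, n_obs):
    """Schwarz criterion, per-observation form."""
    return (-2.0 * log_likelihood + n_params * math.log(n_obs)) / n_obs


def rank_models(candidates, n_obs):
    """Return (name, AIC, SC) tuples sorted by AIC, then SC (smaller is better)."""
    rows = [
        (name, aic(ll, k, n_obs), sc(ll, k, n_obs))
        for name, (ll, k) in candidates.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]))


# Hypothetical (log-likelihood, number of parameters) pairs, illustration only.
candidates = {
    "model_A": (-50.0, 4),
    "model_B": (-53.0, 3),
}
ranking = rank_models(candidates, n_obs=47)
for name, a, s in ranking:
    print(name, round(a, 3), round(s, 3))
```

Both criteria penalize extra parameters, with SC penalizing them more heavily once the sample has more than about eight observations, so the two can disagree; the sketch sorts by AIC first only as a simple tie-breaking choice.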

2.2.2. Principal Component Regression Prediction

General linear regression models (GLMs) are also often used for prediction; when the correlation between independent variables is high, the GLM exhibits the pseudoregression phenomenon [20–23]. One of the commonly used methods to deal with this phenomenon is principal component regression; its modeling steps are as follows.

First step is to analyze the correlation between independent variables; if the correlation is significant, move on to the second step.

Second step is to extract principal components (PCs) from all independent variables; the number of PCs is determined by the cumulative contribution rate of eigenvalues. Generally, the cumulative contribution rate needs to be more than 85%, and its corresponding eigenvalues are extracted. The correlation between the PCs is zero.

Third step is to use the retained PCs as independent variables to build the GLM and then to estimate and test the parameters of the model. A parameter is considered statistically significant when $P < 0.05$. R2 is used to measure the goodness of fit of the model.

Fourth step is to fit and predict dependent variables based on established models.
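The rule in the second step for choosing the number of PCs can be sketched as follows. This is a minimal illustration that assumes the eigenvalues of the correlation matrix have already been computed; the function name and the eigenvalues are hypothetical.

```python
def components_to_keep(eigenvalues, threshold=0.85):
    """Return (k, rate): the smallest number of leading PCs whose cumulative
    contribution rate reaches the threshold, and the rate actually reached."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k, running / total
    return len(ordered), 1.0


# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
k, rate = components_to_keep([2.4, 1.2, 0.3, 0.1])
print(k, round(rate, 4))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, each eigenvalue divided by that total is the share of variance carried by its PC, which is what the contribution rate measures.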

2.3. The Indexes of Assessing Forecast Accuracy

In this study, we use mean absolute error (MAE) to measure the prediction accuracy. The smaller the MAE is, the better the model is, and the higher the prediction accuracy is.

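The MAE is defined as

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| \hat{x}_t - x_t \right|,$$

where $\hat{x}_t$ is the predicted value at $t$, $x_t$ is the observed value at $t$, and $n$ is the number of observed values.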
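The MAE calculation can be sketched in a few lines. The function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
def mean_absolute_error(predicted, observed):
    """Average of the absolute differences between predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Made-up values: absolute errors are 0.5, 0.0, and 1.0, so the MAE is 0.5.
print(mean_absolute_error([5.0, 4.5, 6.0], [5.5, 4.5, 5.0]))
```

The MAE is expressed in the same unit as the series itself, so for monthly incidence it reads directly as an average error in cases per 100,000 population.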

2.4. Data Processing and Analysis

All analyses were performed using ArcMap 10.1, EViews 7.2, SPSS 17.0, and MATLAB 2016b.

3. Results

From January 2014 to December 2017, there were 8923 TB cases in Urumqi, Xinjiang; the average annual number of TB cases was 2231, and the annual TB incidence was 63.44 (per 100,000 population). The change of TB incidence is shown in Figure 2. The TB incidence increased significantly in 2015 compared with 2014 and changed little from 2015 to 2017, varying only slightly with the season.

Considering the relationship between the TB incidence and air pollution, we collected relevant air pollution data in Urumqi from January 2014 to December 2017, including PM2.5, PM10, SO2, CO, NO2, and O3. The six indicators changed over time as shown in Figure 3, and their statistical descriptions are shown in Table 1.

First, the TB incidence data were analyzed to establish the ARMA prediction model. Because the ARMA model requires stationary data, the ADF test was applied to the original data; the test results are shown in Table 2. The P value was 0.044, which is less than 0.05, indicating that the data were stationary and could be used directly to build the ARMA model. The ACF and PACF diagrams of the data are shown in Figure 4. Based on Figure 4, it was preliminarily judged that p was 1 and that q might be 1, 2, or 3, and various tentative ARMA models were built; these models, with their parameter test results, AIC, and SC, are shown in Table 3. Only two models passed all the parameter tests, namely, the ARMA (1, (1, 3)) model and the ARMA (1, (3)) model. The AIC and SC of the ARMA (1, (1, 3)) model were smaller than those of the ARMA (1, (3)) model, and its R2 was larger; therefore, the ARMA (1, (1, 3)) model was selected as the best model.

The mathematical expression of the ARMA (1, (1, 3)) model was as follows:

Using this model to fit the historical data of TB incidence, the MAE was 0.633.

The residual sequence of the ARMA (1, (1, 3)) model was tested by the Box–Jenkins Q test, and the results are shown in Figure 5. The correlation coefficient of the residual sequence at lag 6 was significantly greater than twice the standard deviation, which indicated that the residual sequence was not white noise and that the accuracy of the ARMA (1, (1, 3)) model needed to be further improved.
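A residual check of this kind can be sketched as follows. The sketch assumes the Ljung–Box form of the Q statistic and the approximate band of plus or minus 2/sqrt(n) for the sample autocorrelations; the function names and the residual values are hypothetical and are not the residuals of the fitted model.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic computed over lags 1..max_lag."""
    n = len(x)
    r = sample_acf(x, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


def lags_outside_band(x, max_lag):
    """Lags whose autocorrelation exceeds the approximate band 2/sqrt(n)."""
    bound = 2.0 / math.sqrt(len(x))
    r = sample_acf(x, max_lag)
    return [k for k, rk in enumerate(r, start=1) if abs(rk) > bound]


# Made-up, strongly alternating "residuals" that are clearly not white noise.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(lags_outside_band(residuals, max_lag=2))
print(ljung_box_q(residuals, max_lag=1))
```

For residuals of a fitted ARMA model, Q is usually compared with a chi-square distribution whose degrees of freedom are the number of lags minus the number of estimated ARMA coefficients; that comparison is left out here to keep the sketch free of external libraries.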

Considering that TB incidence may be affected by air pollution variables, Spearman correlation analysis between TB incidence and the air pollution variables was carried out; the results are shown in Table 4. Only PM10 and SO2 were associated with TB incidence. Cross-correlation analysis was then carried out to account for the lag effect of air pollution variables on TB incidence. Figure 6 shows, for each air pollution variable, the time lag at which its association with TB incidence was strongest, and the corresponding lag correlation coefficients are shown in Table 5.
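The search for the strongest lead of a pollutant can be sketched as follows. This is a minimal illustration that uses the Pearson coefficient on shifted series, which is an assumption about how the cross-correlations were computed; the function names and the data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def lead_correlation(pollutant, incidence, lead):
    """Correlate the pollutant in month t - lead with incidence in month t."""
    if lead == 0:
        return pearson(pollutant, incidence)
    return pearson(pollutant[:-lead], incidence[lead:])


def best_lead(pollutant, incidence, max_lead):
    """Return (lead, r) for the lead with the largest absolute correlation."""
    corr = {k: lead_correlation(pollutant, incidence, k) for k in range(max_lead + 1)}
    lead = max(corr, key=lambda k: abs(corr[k]))
    return lead, corr[lead]


# Made-up data in which incidence copies the pollutant two months later.
pollutant = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0]
incidence = [2.0, 4.0] + pollutant[:-2]
print(best_lead(pollutant, incidence, max_lead=3))
```

The absolute value is used when ranking leads so that a strong negative association is selected as readily as a strong positive one. Each extra month of lead drops one pair of observations, so correlations at long leads rest on fewer data points.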

Then, on the basis of the ARMA (1, (1, 3)) model, the time-lead variables of the air pollution factors, namely PM2.5 (lead 3), PM10 (lead 3), SO2 (lead 1), CO (lead 3), NO2 (lead 2), and O3 (lead 3), were included in a new model for further analysis. A GLM was set up with TB incidence as the dependent variable and AR (1), MA (1), MA (3), PM2.5 (lead 3), PM10 (lead 3), SO2 (lead 1), CO (lead 3), NO2 (lead 2), and O3 (lead 3) as independent variables. To avoid pseudoregression, Pearson's correlation analysis was first performed on these independent variables. The results (see Table 6) showed that PM2.5 (lead 3) was highly correlated with PM10 (lead 3) and CO (lead 3) (correlation coefficients above 0.8), that PM10 (lead 3) was highly correlated with CO (lead 3) (correlation coefficient above 0.8), that PM2.5 (lead 3) and CO (lead 3) were each significantly correlated with NO2 (lead 2) and O3 (lead 3), and that MA (1) was significantly correlated with AR (1). Therefore, the GLM could not be established directly, and principal component regression was used to establish the model. First, principal components (PCs) were extracted from the nine independent variables. Since the cumulative contribution rate of the first six eigenvalues reached 96.88% (see Table 7), six PCs were sufficient for modeling. The six PCs were independent of each other. Their scores were calculated and used as independent variables, with TB incidence as the dependent variable, and the ARMA (1, (1, 3)) + model was finally established by the stepwise regression method.

The test results showed that the ARMA (1, (1, 3)) + model was statistically significant; the R2 of the model was 0.851, and the MAE was 0.36. The parameter test results are shown in Table 8. The ARMA (1, (1, 3)) + model represented by principal components was as follows:

After converting PCs into standardized variables, the ARMA (1, (1, 3)) + model was as follows:

Based on the ARMA (1, (1, 3)) + model, the monthly TB incidence in Urumqi from October 2017 to March 2018 was predicted; the predicted values were 5.15, 4.81, 4.89, 5.56, 5.75, and 6.56, respectively. The fitting and prediction diagram of the ARMA (1, (1, 3)) + model is shown in Figure 7, from which it can be seen that the fitting and prediction effects were good.
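The conversion of the model from principal components to standardized variables mentioned above is a substitution, which can be written in general form. The symbols here are assumptions introduced for illustration: $z_j$ are the nine standardized independent variables, $a_{kj}$ is the coefficient of $z_j$ in the $k$-th PC, $b_k$ are the regression coefficients of the $m$ PCs retained by stepwise regression ($m \le 6$), and $b_0$ is the intercept. No fitted values are implied.

```latex
\hat{y} = b_0 + \sum_{k=1}^{m} b_k \,\mathrm{PC}_k ,
\qquad
\mathrm{PC}_k = \sum_{j=1}^{9} a_{kj}\, z_j
\quad\Longrightarrow\quad
\hat{y} = b_0 + \sum_{j=1}^{9} \Bigl( \sum_{k=1}^{m} b_k\, a_{kj} \Bigr) z_j .
```

The coefficient of each standardized variable is therefore a weighted sum of the PC regression coefficients, which is why a variable can carry a nonzero coefficient in the final model even though it was never entered into the regression directly.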

4. Discussion

Dynamic estimation and prediction of disease epidemics is an important link in the prevention and control of infectious diseases, and it is the main basis on which health management departments formulate prevention and control countermeasures and allocate resources [19, 24]. However, because many uncertain factors influence the prevalence of TB, epidemics are difficult to identify early, which delays the corresponding prevention and control measures. Timely and effective early warning of TB epidemics is therefore an important part of disease prevention and control.

It is generally believed that air pollution factors affect the prevalence of TB by affecting the survival of the pathogen, population activities, individual behavior, and so on [25]. The effect of air pollution variables on the TB epidemic also suggests that TB prediction should include meaningful explanatory variables and should not be limited to historical TB incidence data.

In this study, a single ARMA (1, (1, 3)) model was first established based on the TB incidence in Urumqi from January 2014 to December 2017. Its AIC and SC were the smallest among the seven tentative ARMA models, all of its parameter tests were statistically significant, its R2 was 0.59, and its MAE was 0.633. The only deficiency of the ARMA (1, (1, 3)) model was that its residual sequence was not white noise, which showed that the model had not extracted the information in the data sufficiently.

To take into account the effect of air pollution variables (PM2.5, PM10, SO2, CO, NO2, and O3) on TB incidence, Spearman correlation analysis was performed between the air pollution variables and TB incidence; only PM10 and SO2 were significantly correlated with TB incidence. Some studies suggest that the effect of air pollution variables and weather factors on infectious disease is delayed [26–29], so cross-correlation analysis was used to study the delayed effect of the six air pollution variables on TB incidence. TB incidence was most strongly associated with PM2.5, PM10, CO, and O3 at a delay of three months, with SO2 at a delay of one month, and with NO2 at a delay of two months. In other words, PM2.5, PM10, SO2, CO, NO2, and O3 had a time-leading effect on TB incidence, with leading orders of 3, 3, 1, 3, 2, and 3, respectively. We therefore considered a new model that combines the ARMA (1, (1, 3)) model with the six variables PM2.5 (lead 3), PM10 (lead 3), SO2 (lead 1), CO (lead 3), NO2 (lead 2), and O3 (lead 3), which should improve the prediction accuracy.

A regression model was set up with TB incidence as the dependent variable and AR (1), MA (1), MA (3), PM2.5 (lead 3), PM10 (lead 3), SO2 (lead 1), CO (lead 3), NO2 (lead 2), and O3 (lead 3) as independent variables. However, there were high correlations (correlation coefficient greater than 0.8) and significant correlations ($P < 0.05$) between the independent variables, which might lead to pseudoregression. To avoid this, the principal component regression method was used. Six PCs were extracted from the nine independent variables, and the cumulative contribution rate reached 96.88%, which indicated that the information extraction was sufficient. Using the six PCs as independent variables and TB incidence as the dependent variable, we established the ARMA (1, (1, 3)) + model. Its R2 (0.851) was larger than that of the ARMA (1, (1, 3)) model, and its MAE (0.36) was smaller, which suggested that the ARMA (1, (1, 3)) + model was better than the ARMA (1, (1, 3)) model. With the ARMA (1, (1, 3)) + model, the TB incidences from October 2017 to March 2018 were predicted; the predicted values were 5.15, 4.81, 4.89, 5.56, 5.75, and 6.56, respectively, which captured the trend pattern of TB in Urumqi.

This study found that air pollution variables had a significant time-lag effect on TB incidence in Urumqi. The cross-correlation analysis and the ARMA (1, (1, 3)) + model showed that the effect of PM2.5, PM10, SO2, CO, and NO2 on TB incidence was negative, which seems contrary to some research conclusions [30, 31]. On closer analysis, however, there is no contradiction. People in Urumqi often check the weather forecast before going out or taking part in activities; when the forecast shows serious air pollution and heavy haze, they tend to go out less and participate in fewer activities. This greatly reduces close contact between people and thus the probability of being infected with TB.

The effect of O3 on TB incidence was positive; that is, the higher the concentration of O3, the higher the incidence of TB. There are two main reasons. On the one hand, O3 concentration is relatively high on sunny, high-temperature days, and in such weather people in Urumqi like to go out to gather or take part in activities, which increases close contact and thus the probability of being infected with TB. On the other hand, O3 at high concentration strongly irritates the respiratory tract, causing sore throat, chest tightness, and cough and contributing to bronchitis and emphysema, which reduces the immunity of people infected with TB.

Many studies have found that both air pollution factors and meteorological factors affect TB incidence. Because of limitations in data collection, only air pollution factors were included in the prediction of TB incidence in this study; in further research, we will also consider meteorological factors.

5. Conclusions

Most predictive models of TB incidence are based only on historical TB incidence data and do not consider the effect of air pollution on TB incidence. In this study, the ARMA (1, (1, 3)) + prediction model of TB incidence in Urumqi was established from historical TB incidence data and air pollution variables; its prediction accuracy was higher than that of the ARMA (1, (1, 3)) model, which considers only historical TB incidence data. Based on the ARMA (1, (1, 3)) + model, we predicted the TB incidence from October 2017 to March 2018, and the predictions were consistent with the variation of TB incidence data in Urumqi. The prediction method used in this study can provide reference ideas for other researchers seeking to improve the accuracy of prediction models, and the ARMA (1, (1, 3)) + model can provide TB incidence predictions that serve as a scientific reference for the prevention and control of TB in Urumqi, China. The modeling process also analyzed the effect of air pollution variables on TB incidence, and one finding deserves note: the higher the O3 concentration, the higher the TB incidence. On sunny, high-temperature days, O3 levels exceed the standard, yet people in Urumqi prefer to go out to gather or take part in activities in such weather, ignoring the dangers posed by O3. This study suggests that people in Urumqi pay more attention to O3 hazards and take appropriate personal protection.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by Provincial ministry co-construction of State Key Laboratory Project on the Causes and Control of High Incidence of Diseases in Central Asia, China (grant no. SKL-HIDCA-2017-12).