Abstract
The stochastic frontier model is an important and effective method for calculating industry efficiency. However, when dealing with spatio-temporal industry data, it is difficult to calculate industrial production efficiency accurately because of spatial correlation and time lag effects. If traditional spatial statistical methods are used, the way the spatial weight matrix is specified is often questioned. To address these problems, one possible approach is to design a spatial data mining process based on stochastic frontier analysis. First, the stochastic frontier model is extended to analyze spatio-temporal data. In order to measure technical efficiency accurately under dual correlation in time and space, a spatio-temporal stochastic frontier model is proposed. Based on the generalized method of moments, an estimation method for the spatio-temporal stochastic frontier model is designed, and the consistency of the estimators is proved. To ensure that the most appropriate spatial weight matrix is selected during model construction, a data-driven $K$-fold crossvalidation method is adopted to evaluate predictive performance. This set of spatio-temporal data mining methods is used to measure the technical efficiency of high-tech industries in the provinces of China.
1. Introduction
Stochastic frontier analysis (SFA) is an important method to measure technical efficiency and calculate total factor productivity. The whole process is divided into two steps: the first step is the model estimation process, which can be regarded as a supervised learning process; the second step is to use the estimated model to calculate the technical efficiency, which can be regarded as an unsupervised learning process.
From the perspective of machine learning, supervised processes have three main objectives: (a) feature selection and reduction of the dimension of feature variables; (b) selecting the optimal one from multiple classifiers or prediction models; (c) model evaluation, which estimates the prediction error of the selected classifier or prediction model on the new data.
This paper finds that the traditional stochastic frontier analysis method has the following defects: (a) it is not suited to the special structure of spatial or spatio-temporal data; (b) the modeling process lacks variety, because the traditional analysis process is knowledge-driven and relies entirely on a single theoretical model for estimation and testing. These two characteristics lead to large deviations when the traditional stochastic frontier model is applied to spatio-temporal data and make it impossible to measure accurately the production efficiency of spatially related units. To solve these two problems, this study considers two improvements to the industrial efficiency calculation process based on temporal and spatial data: (1) improve the existing stochastic frontier model so that it is suitable for spatial or spatio-temporal data; (2) turn the modeling process into a spatial data mining process. In view of the unique structure of spatio-temporal data, a more suitable crossvalidation method is proposed for selecting the prediction model.
Stochastic frontier analysis (SFA) was successively proposed by Aigner et al. (1977) [1], Meeusen and van den Broeck (1977) [2], and Battese and Corra (1977) [3]. Over the past 40 years, its theoretical system and methods have been continuously expanded and refined, and it is widely used to measure the operating efficiency of different industries.
The development of spatial statistics provides a theoretical basis for studying spatial interactions in stochastic frontier models. Druska and Horrace (2004) [4] first applied the methods of spatial econometrics to the analytical framework of the stochastic frontier model and started the research on spatial stochastic frontier models. Affuso (2010) [5] established a spatial stochastic frontier model and gave its maximum likelihood estimation in an empirical study. Tonini and Pede (2011) [6] applied the maximum entropy method to parameter estimation of the spatial stochastic frontier model. Vidoli et al. (2016) [7], Tsionas and Michaelides (2016) [8], Carvalho (2018) [9], and Adetutu et al. (2015) [10] considered SF models with local spatial dependence. Jin and Lee (2020) [11] proved the asymptotic properties of a maximum likelihood estimator of a spatial autoregressive stochastic frontier model. Kutlu et al. (2020) [12] proposed a spatial autoregressive stochastic frontier model that allows for endogeneity in both the frontier and environmental variables and discussed a single-stage control function approach to estimate the parameters.
Because spatial stochastic frontier analysis methods take the impact of spatial correlation into account, they give more accurate results in efficiency analysis of data with spatial spillover effects and have been more widely used in recent years. Bergantino et al. (2020) [13] analyzed the potential impact of airport competition on technical efficiency by applying the spatial stochastic frontier. Graaff (2020) [14] used a spatial stochastic frontier model to estimate spatially correlated technical efficiencies within a European regional production function context. Several studies have examined panel spatial stochastic frontier models, for example, Druska and Horrace (2004) [4], Tonini and Pede (2011) [6], and Lin Jia-Xian (2014) [15]. These studies all focus on the static panel spatial stochastic frontier model, which uses the two-dimensional information in panel data; formally, the spatial lag term of the explained variable and the spatial lag term of the error are used to capture the spatial correlation of the production units. The time lag term is not included, which means that the model still cannot fit well when there is significant inertia in the research problem. In input-output analysis, current behavior depends largely on past behavior; for example, the adjustment of the capital stock is often influenced by previous capital. Therefore, a dynamic stochastic frontier model should be established, and the model should describe the double lag effect of space and time so as to reflect the relationships between economic variables more objectively. In addition, the spatial weight matrix in spatial statistics is often considered to be “subjective.” Because there are many ways of specifying the spatial weight matrix, different choices may lead to different model estimation results. Finally, no unified principle has been established for selecting the spatial weight matrix.
Based on the above three points, the spatial weight matrix is often questioned. But in the era of “big data,” such skepticism may end [16].
This paper proposes the spatiotemporal stochastic frontier model. Considering that the model may suffer from endogeneity in both the time and space dimensions, a generalized method of moments (GMM) estimation process is designed to estimate it. When Druska and Horrace (2004) [4] studied the static panel spatial stochastic frontier model, they proposed a generalized moment estimation process for spatial error correlation, following Kelejian and Prucha (1999) [17]. In this paper, the approach of Druska and Horrace (2004) [4] is used to handle the spatial autocorrelation of the model's error term, which differs from the error structure of the standard stochastic frontier model. Following the method of Kapoor et al. (2007) [18], the compound error term is processed, and moment conditions are constructed to estimate the distribution parameters of the error term. Jacobs et al. (2009) [19] is used as a reference for constructing the moment conditions, and Anselin (1988) [20] for selecting the instrumental variables, to obtain the generalized moment estimator. Furthermore, the consistency of the resulting structural parameter estimators is proved by using the extremum consistency theorem and the uniform law of large numbers (ULLN). To solve the problem of selecting the spatial weight matrix, we consider a crossvalidation method suitable for spatio-temporal data. Machine learning provides a series of methods for dimensionality reduction, feature selection, and model generalization. The earliest crossvalidation method was called hold-out; it relied on only one partition of the data and involved no crossover process, so it was also called the verification method [21].
Noting that the hold-out method relies on a single partition of the data and is easily affected by chance, Geisser (2010) [22] proposed a crossvalidation method that averages multiple hold-out estimates, realizing the transition from verification estimation to crossvalidation estimation. In order to reduce the number of data partitions in crossvalidation, Shao (1993) [23] proposed leave-$p$-out crossvalidation (LPOCV), in which the number of test samples in each data partition is the same. In the special case $p = 1$, the method becomes leave-one-out crossvalidation (LOOCV). LOOCV is the simplest and most widely used crossvalidation in traditional analysis. In contrast to LPOCV, which considers all data partitions, Geisser (2010) also proposed a crossvalidation based on only part of the data partitions, which is called the repeated learning-testing (RLT) method. $K$-fold crossvalidation was proposed as an alternative to LOOCV, which has a large computational overhead; it relies on a basic partition of the $n$ observations into $K$ folds, each of which contains $n/K$ observations. With limited samples, $K$-fold crossvalidation is the simplest and most widely used method of generalization error estimation. Each of these crossvalidation methods relies on the randomness of the validation set to assess the generalization ability of the model being tested. However, for special panel data such as spatio-temporal data, there is usually an internal connection between spatial individuals, and the data as a whole also tend to have a time trend. The previous crossvalidation methods do not take this into account and may therefore break the inherent regularity of spatio-temporal data. Based on these considerations, this paper designs a crossvalidation scheme suitable for spatio-temporal data. It is used to select stochastic frontier models, especially models with different weight matrices.
Finally, the technical efficiency of China’s high-tech industry is analyzed by establishing a spatiotemporal stochastic frontier model.
2. Methodology
Previous studies on panel spatial stochastic frontier models mainly involved static panel spatial stochastic frontier models. Spatial lag effect is considered in the process of model building, but the influence of time lag effect is not included. If the time lag term and time-spatial lag term are added into the model, this kind of model can be called spatiotemporal stochastic frontier model. Obviously, the time-space double lag effect will produce stronger endogeneity, and new estimation methods should be considered to solve it.
2.1. Model Specification and Assumption
2.1.1. Model Specification
The general form of the spatiotemporal stochastic frontier model can be stated in matrix form as where , , , and are -dimensional vectors, whose components at time are given by , , , and . The vector consists of the outputs of the production units, and are the composite error vectors corresponding to , is the heterogeneous error vector, and is the vector of time-invariant inefficiency terms. This kind of setting is appropriate when the time span is not large. As is time invariant, it can be regarded as the individual effect, and thus, this paper primarily considers as a fixed effect. is an -dimensional matrix consisting of the exogenous input variables of the production units at time . and are spatial weight matrices which are usually assumed to be different. If , and cannot be distinguished by means of the maximum likelihood method although they can be effectively distinguished by the GMM method [19]. The variables are stacked according to the section and time series in the following matrix form: where represents the Kronecker product of matrices, and are, respectively, the identity matrices of orders and , and is a -dimensional column vector with all the entries equal to 1. The parameter vector of the model to be estimated is , and its dimension is , where is the parameters corresponding to the -dimension exogenous explanatory variables . and together constitute the structural parameters, and are the error term parameters.
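As a rough illustration of the Kronecker-product stacking described above, the following NumPy sketch builds a stacked spatial operator and repeats time-invariant unit effects over periods. The names `W`, `mu`, `N`, and `T`, the toy numbers, and the period-major ordering of the stacked vector are assumptions made for this example and are not taken from the model specification.

```python
import numpy as np

N, T = 3, 2  # hypothetical sizes: 3 spatial units, 2 periods

# Hypothetical row-standardized contiguity matrix for the N units.
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# Stacked spatial operator kron(I_T, W): the same W acts within every period.
W_stacked = np.kron(np.eye(T), W)              # shape (N*T, N*T)

# Time-invariant unit effects repeated over periods: kron(iota_T, I_N) @ mu.
iota_T = np.ones((T, 1))
mu = np.array([0.1, 0.2, 0.3])
mu_stacked = np.kron(iota_T, np.eye(N)) @ mu   # shape (N*T,)

# Toy stacked outcome in period-major order (all units of period 1, then period 2).
y = np.arange(1.0, N * T + 1)
spatial_lag = W_stacked @ y                    # spatial lag computed period by period
```

Under this ordering, multiplying by `W_stacked` is equivalent to applying `W` to each period's cross-section separately, which is why a single matrix expression can cover the whole panel.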
2.1.2. Model Assumption
The assumptions of the spatiotemporal stochastic frontier model are the following:
Assumption 1. The distribution of the error vector is given by .
Assumption 2. The inefficiency term is time invariant, with distribution .
Assumption 3. , , and and have finite fourth moment.
Assumption 4. and are uncorrelated with .
Assumption 5. The spatial weight matrices and satisfy and ; . For arbitrary , , and , the matrices , , and are all nonsingular matrices. For each of the matrices , , , , , , , and , the row sums and column sums are all absolutely uniformly bounded.
Assumption 1 is a classic assumption of the spatial error autocorrelation model. By Assumption 2, the inefficiency term of a given individual remains constant over time. When GMM is used to estimate the structural parameters of the model, the distribution of the error terms can be ignored; nevertheless, in order to improve computational efficiency, a half-normal distribution is usually assumed for the inefficiency term. Assumption 3 ensures that the variance of the error term in this model is bounded, which is an important condition for the consistency of the estimator. Assumption 4 is a classical assumption commonly used in traditional regression analysis, and the moment conditions in the generalized moment estimation of this model are set according to it. Assumption 5 concerns the spatial weight matrices of this model and the properties of the spatial autoregressive coefficient and the spatio-temporal autoregressive coefficient; it also helps to ensure the consistency of the parameter estimators.
2.2. Parameter Estimation
2.2.1. Estimation strategy
There is an endogeneity problem in the model. For the spatial lag term in the model, there is where , and , and can be obtained from the assumptions of the regression model, and is quadratic. In spatial econometrics, is usually not a zero matrix, and so, is not a zero matrix. While taking into account the expected value of the compound error term cannot be 0, it can be considered that the quadratic form is almost impossible to be equal to 0 (see Appendix A for proof). Therefore, in the dynamic panel spatial stochastic frontier models, there is an endogeneity problem which will lead to the inconsistency of traditional estimators. So, we considered GMM as a good way to solve the endogeneity problem.
The parameter vector to be estimated in the model is , where , , , and are the structural parameters of the model, and , , and are the error distribution parameter of the model. The estimation of the model is completed in three steps:
Step 1. Using the GMM to estimate the structure parameter in the model.
Step 2. Making a moment estimation of the parameter that is included in the error term.
Step 3. Using the estimator obtained in Step 2 to modify the result of Step 1.
2.2.2. Estimation of Structural Parameter
(1) Difference Model and Level Model. Anderson and Hsiao (1981) [24] proposed using $y_{i,t-2}$ as the instrumental variable for $\Delta y_{i,t-1}$ and then carrying out 2SLS estimation. This estimator is called the “Anderson-Hsiao estimator.” By the same logic, lagged variables of higher order are also valid IVs. Arellano and Bond (1991) [25] used all possible lagged variables as IVs (so that the number of IVs exceeds the number of endogenous variables) to conduct GMM estimation. This GMM estimator is called the Arellano-Bond estimator or difference GMM. One disadvantage of difference GMM is that any variable that does not change with time is eliminated, so its coefficient cannot be estimated. Another is that if the series $\{y_{it}\}$ is strongly persistent, that is, the first-order autoregressive coefficient is close to 1, then the correlation between $y_{i,t-2}$ and $\Delta y_{i,t-1}$ may be very weak, which leads to the problem of weak instrumental variables. In order to solve these two problems, Arellano and Bover (1995) [26] returned to the level equation and used $\Delta y_{i,t-1}$ as the IV for $y_{i,t-1}$ to carry out GMM estimation of the level equation, which is called “level GMM.” Blundell and Bond (1998) [27] combined difference GMM with level GMM and estimated the difference equation and the level equation as one system by GMM, which is called “system GMM.” The advantage of system GMM is that it can improve the efficiency of estimation (its small-sample properties are better), and it can estimate the coefficients of variables that do not change with time (because the system contains the level equation). In summary, to solve the endogeneity problem of the dynamic panel data model, Arellano and Bond (1991) [25], Arellano and Bover (1995) [26], and Blundell and Bond (1998) [27] worked from the perspectives of the difference model and the level model, respectively, and selected different instrumental variables.
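The instrumental-variable idea behind the Anderson-Hsiao estimator can be sketched on a simple non-spatial dynamic panel. The example below is only an illustration under stated assumptions: the data-generating process $y_{it} = \rho y_{i,t-1} + \alpha_i + e_{it}$, the function names `simulate_panel` and `anderson_hsiao`, and all numerical settings are hypothetical, and the spatial terms of the model in this paper are deliberately left out.

```python
import numpy as np

def simulate_panel(n_units, n_periods, rho, seed=0):
    """Simulate y_it = rho * y_{i,t-1} + alpha_i + e_it (assumed toy process)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)              # time-invariant unit effects
    y = np.empty((n_units, n_periods))
    y[:, 0] = alpha / (1.0 - rho) + rng.normal(size=n_units)
    for t in range(1, n_periods):
        y[:, t] = rho * y[:, t - 1] + alpha + rng.normal(size=n_units)
    return y

def anderson_hsiao(y):
    """Just-identified IV estimate of rho from the first-differenced model."""
    dy = y[:, 2:] - y[:, 1:-1]        # Delta y_it for t = 2, ..., T-1
    dy_lag = y[:, 1:-1] - y[:, :-2]   # Delta y_{i,t-1}, the endogenous regressor
    z = y[:, :-2]                     # instrument y_{i,t-2}
    return float(np.sum(z * dy) / np.sum(z * dy_lag))

y = simulate_panel(n_units=5000, n_periods=8, rho=0.5)
rho_hat = anderson_hsiao(y)
```

Differencing removes the unit effect, and the level $y_{i,t-2}$ is uncorrelated with the differenced error while still being correlated with $\Delta y_{i,t-1}$; with a large cross-section, `rho_hat` should therefore land near the value of `rho` used in the simulation.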
The corresponding difference model and level model of Equation (1) are simplified as (4) and (5) can also be collectively called spatial system model, where Equation (4) is the difference model, and Equation (5) is the level model, is the vector composed of all explanatory variables, and is the vector composed of structural parameters. The expansion of Equation (5) is Equation (1); the expansion of Equation (4) can be expressed as follows:
(2) Moment Condition and Instrumental Variable. Since is a strictly exogenous variable, it is not related to the compound error term , nor is it related to . The moment conditions for identifying in the difference model and the level model are as follows:
The moment condition structure for identifying and in the two models is as follows: since the spatial lag term and time lag term of the dependent variable are both endogenous variables, therefore, it is necessary to find a set of instrumental variables that is related to time lag and space lag and exogenous explanatory variables, but not related to the difference error term . Arellano and Bond (1991) [25] uses all possible level lag variables of as instrumental variables for the time-lag first-order difference term () of the dependent variable. These instrumental variables are related to (), but not to . The moment conditions corresponding to the difference model and the level model are as follows:
The moment conditions for identifying in the two models are as follows.
Construct a spatial lag item as follows; Jacobs et al. (2009) [19] provided a method of finding instrumental variables, that is time lag terms of spatial lag dependent variables, who also proved that the moment condition obtained by this method was as valid as Equation (8). So, corresponding to the difference model and the level model, the following moment conditions can be listed: where is the exponential of matrix and the integer is the maximum order of spatial lag that can be used as the instrumental variable.
In addition, based on the method provided by Kelejian and Robinson (1993) [28], formula (1) shows that depends on , so the instrumental variable can be selected by the first-order difference method for . Since is a strictly exogenous variable, it is not related to the compound error term , so corresponding to the difference model and the level model, the instrumental variables satisfy the following moment conditions:
(3) GMM Estimation. When we estimate the parameters of the spatio-temporal stochastic frontier model, we use the system GMM method similar to the general dynamic panel model to construct the spatial system GMM estimation. Unlike the system GMM, the IVs of the spatial system GMM are composed of time lag variable and spatial lag variable.
For each period of , the moment condition of can be given. The moment conditions corresponding to the difference model and the level model can be abbreviated as
The matrices and are expressed as follows
Then, and
That is, and are matrices with instrumental variables as column vectors, and the subscript means that the matrix depends on the unit number of individuals. Let and be block diagonal matrices composed of block and , respectively. In order to define the GMM (Spatial Blundell Bond, SBB) estimator of the spatial dynamic panel stochastic frontier model, the difference variables and level variables are combined to define the matrix as follows:
The instrument variable matrix is where is the instrumental matrix of spatial difference GMM estimation, and is the instrumental matrix of spatial level GMM estimation. The weight matrix is
The diagonal of this matrix is composed of the weight matrix defined in the process of spatial difference GMM estimation and an identity matrix, where is the weight matrix whose elements are
This weight matrix was proposed by Arellano and Bond (1991) [25]. The weight matrix is further defined as
Through the above process, combining the spatial difference equation with the spatial level equation, we obtain the spatial system GMM estimation process. The objective function of the generalized moment estimation for the spatial system is as follows:
The one-stage SBB estimator of can be obtained by minimizing Equation (20):
Equation (21) can also be called the spatial system GMM estimator.
(4) Improvement of Instrumental Variable Matrix. The instrumental variable matrix constructed by the above method has a high dimension, which grows rapidly as the panel dimensions increase. In order to reduce the dimension of the instrumental variable matrix and avoid overfitting the instrumental variables, we can simplify it by using the “condensed instrumental variable matrix” proposed by Beck and Levine (2004) [29].
We still set and in the GMM instrumental variable matrix of the space system, and the corresponding condensed instrumental variable matrix is where () is the instrumental variable quantum matrix of the level model corresponding to the individual.
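To make the dimension reduction concrete, the sketch below contrasts a full Arellano-Bond-style block of time-lag instruments for one individual with a condensed version. It covers only the time-lag instruments of the difference equation; the spatial-lag instruments used in this paper are omitted, and the function names and the zero-padding layout are assumptions made for illustration.

```python
import numpy as np

def ab_instruments(y_i):
    """Full block: the equation for period t gets its own columns y_0..y_{t-2}."""
    T = len(y_i)
    n_cols = (T - 1) * (T - 2) // 2      # 1 + 2 + ... + (T - 2) columns
    Z = np.zeros((T - 2, n_cols))
    col = 0
    for row, t in enumerate(range(2, T)):
        k = t - 1                        # number of available lags at period t
        Z[row, col:col + k] = y_i[:k]
        col += k
    return Z

def condensed_instruments(y_i):
    """Condensed block: one column per lag distance, shared by all periods."""
    T = len(y_i)
    Z = np.zeros((T - 2, T - 2))
    for row, t in enumerate(range(2, T)):
        k = t - 1
        Z[row, :k] = y_i[:k][::-1]       # column j holds y_{t-2-j}
    return Z

y_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy series for one individual
Z_full = ab_instruments(y_i)
Z_small = condensed_instruments(y_i)
```

For a series of length $T$, the full block has $(T-1)(T-2)/2$ columns while the condensed block has $T-2$, so the column count grows linearly rather than quadratically in the number of periods.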
2.3. Estimation of Error Term Parameters
2.3.1. Estimation of
After the estimator of the structural parameter in model (1) is obtained in the first stage, the model residual can be further obtained, in which the is the vector set composed of all explanatory variables in model (1). Consistent GMM estimation can be obtained by using residual and modifying the moment condition proposed by Kapoor et al. (2007) [18]. The specific process is as follows.
According to the assumptions of the model, the individual effect of the model is the inefficiency term. According to the covariance structure of the compound error term, it can be known that
Introduce transformation matrix: where and are the identity matrix of order and order , is the matrix of order , and the elements of that are all 1. Properties of transformation matrix : (i) ; (ii) ; (iii) (where is any matrix). From the properties (ii), we can further deduce the special properties (iv) of matrix in this paper, where and are the corresponding error terms in model (1).
Let
Then,
Based on the above transformation and referring to the first three of the six moment conditions given by Kapoor et al. (2007) [18] and related properties, the following three moment conditions are given in this paper:
To further integrate the above moment conditions, we can get that
Substitute Equations (26) and (27) into Equation (29) to obtain that
The residual estimated in the first stage is substituted into and in Equation (30) to obtain the sample moment equation. In the sample moment equation, the estimated value of and can be solved by the following objective function: where
2.3.2. Estimation of
The fourth moment condition given by Kapoor et al. (2007) [18] is where . However, considering the characteristics of the stochastic frontier model, the compound error term obeys the asymmetric distribution of the expected nonzero, so the moment condition (33) cannot be directly applied, and the following formula can be proved:
The estimator of parameter can be obtained by combining Equation (34) with Equation (31):
Substitute and into (33) to obtain the moment estimator of .
2.4. Spatial Correction of Estimators
Although it can be proved that the estimator (21) is a consistent estimator, it can also be proved that the consistency of the GMM estimator can be guaranteed even if the model has spatial error autocorrelation. However, the estimator (21) cannot solve the spatial dependence of the error term, and the variance of the estimator is relatively large. After obtaining the consistent estimator of by Equation (31), the consistent estimator can be obtained by a correcting transformation. According to the spatial correction method given by Jacobs et al. (2009) [19], the estimator obtained in the first step was corrected.
The estimator obtained from Equation (31) is used to construct the matrix , which is then used to left-multiply the explained variables and the instrumental variable matrices of the difference GMM and system GMM estimation. Let
The corresponding explanatory variable set is corrected as
The instrumental variable matrix and weight matrix corresponding to the GMM estimation of the spatial system are corrected as follows:
Then, the corrected system GMM estimator is
3. Results and Discussion
3.1. Properties of the Estimators
3.1.1. Properties of the Estimator
According to the Extremum Consistency Theorem [30] (see Appendix B), the estimators and , obtained by Equations (21) and (27), are consistent.
Proof. See Appendix C.
3.1.2. Properties of the Estimators , , , and
It can be proved that the estimators , , , and are consistent. The proof of the consistency of , , and is similar to that in Kapoor et al. (2007) [18], and it is omitted here. The consistency of can be derived from the consistency of , , and . Similar to the consistency of GLS estimates, the correcting transformation does not affect the consistency of the estimator .
3.2. Crossvalidation Scheme and Selection of Spatial Weight Matrix
To prevent the choice of spatial weight matrix from affecting the accuracy of model estimation, the optimal spatial weight matrix is selected by crossvalidation, a model selection and generalization method widely used in machine learning. However, since the data are panel data and the model is a spatio-temporal model, the structural features of the spatio-temporal data may be destroyed if the training set and validation set are generated by the hold-out method, LOOCV, or $K$-fold crossvalidation. Therefore, this paper considers a stratified crossvalidation approach. For the spatio-temporal data, if $N$ and $T$ are assumed to be the number of spatial individuals and the number of periods contained in the observed samples, respectively, and the rest of the conventions on independent variables and dependent variables are the same as in Equation (1), stratified crossvalidation can take the following three forms.
3.2.1. Leave-One-Out Crossvalidation for the Time Dimension (TLOOCV)
Select the data of period $t$ as the validation set and the data of the remaining periods as the training set. Let denote, respectively, the index values of the observations contained in period (), and the number of observations contained in period (). Let denote the number of the observations in part . Do the above for each period in turn and calculate where , and is the fitting value of the th observed value in period .
This crossvalidation method is suitable when the number of periods is not too large.
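The leave-one-period-out loop can be sketched as follows. This is a minimal illustration rather than the paper's procedure: the array layout (`y` of shape `(T, N)`, `X` of shape `(T, N, k)`), the pooled mean squared prediction error used as the score, and the plain OLS stand-in for the frontier model (`ols_fit`, `ols_predict`) are all assumptions.

```python
import numpy as np

def tloocv_mse(y, X, fit, predict):
    """Hold out one period at a time; return the pooled mean squared error."""
    T, k = y.shape[0], X.shape[2]
    sq_err, n_obs = 0.0, 0
    for t in range(T):
        train = np.arange(T) != t                    # every period except t
        model = fit(X[train].reshape(-1, k), y[train].ravel())
        resid = y[t] - predict(model, X[t])          # errors on the held-out period
        sq_err += float(resid @ resid)
        n_obs += resid.size
    return sq_err / n_obs

def ols_fit(X, y):
    # Stand-in model (assumption): ordinary least squares coefficients.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ols_predict(beta, X):
    return X @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10, 2))                      # T=5 periods, N=10 units, k=2 inputs
y = X @ np.array([1.0, 2.0])                         # noise-free toy outcome
score = tloocv_mse(y, X, ols_fit, ols_predict)
```

Because each validation set is a complete cross-section, the spatial structure within the held-out period stays intact; candidate weight matrices could then be compared by running the same loop with different `fit` functions.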
3.2.2. $K$-Fold Pooled Crossvalidation for the Spatial Dimension (SK-Fold PCV)
This method is suitable when both the total number of periods and the number of spatial individuals are large. All observed values in each period $t$ ($t = 1, \dots, T$) were randomly divided into $K$ groups of equal size (i.e., the subsample size of each group was $N/K$), and a group was randomly selected from each period $t$ ($t = 1, \dots, T$) to obtain $TN/K$ observed values combined as the validation set, with the remaining $N - N/K$ observed values in each period combined as the training set. Do the above for each period sequence and calculate where , and is the fitting value of the th observed value in period .
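One way to generate such stratified splits is sketched below. The function name `sk_fold_pcv_splits`, the boolean-mask representation, and the requirement that the number of units be divisible by the number of groups are assumptions made for this illustration.

```python
import numpy as np

def sk_fold_pcv_splits(n_units, n_periods, k, seed=0):
    """Yield (train_mask, val_mask) pairs of boolean arrays with shape (T, N)."""
    if n_units % k != 0:
        raise ValueError("this sketch assumes N is divisible by K")
    rng = np.random.default_rng(seed)
    # Independent random grouping of the N units within each period:
    # a random permutation taken modulo k gives k labels of equal frequency.
    groups = np.stack([rng.permutation(n_units) % k for _ in range(n_periods)])
    for fold in range(k):
        val = groups == fold          # one group per period forms the validation set
        yield ~val, val

splits = list(sk_fold_pcv_splits(n_units=6, n_periods=4, k=3))
```

Setting `k` equal to `n_units` holds out exactly one unit per period in each fold, which corresponds to the leave-one-out variant for the spatial dimension.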
3.2.3. Leave-One-Out Crossvalidation for the Spatial Dimension (SLOOCV)
When $K = N$ in the crossvalidation method presented in Section 3.2.2, the method can be called $N$-fold pooled crossvalidation for the spatial dimension (SN-fold PCV) or stratified leave-one-out crossvalidation (SLOOCV). This method is suitable when the total number of periods is large and the number of spatial individuals is small.
3.2.4. Determination of Weight Matrix and Industrial Efficiency Measure
To discuss industrial efficiency from the perspective of spatial statistics or spatial data mining, a good spatial weight matrix should be determined first. In this paper, the spatial lag production function is chosen as the basic model, and the spatial weight matrix involved in the construction of the model can take various alternative forms. In a data-driven way, the training samples were imported into the model for parameter estimation, and then, the most appropriate spatial weight matrix was determined by stratified crossvalidation. To determine whether the spatial model is selected for analysis, the spatial correlation test is further carried out. If there is a strong spatial correlation, a spatial stochastic frontier model or a spatio-temporal stochastic frontier model will be established; if the spatial correlation is weak, an ordinary panel stochastic frontier model will be selected. After the estimation is completed, the best performing model is used to measure the technical efficiency. The flow chart of the entire analysis process is shown in Figure 1.

4. The Efficiency of the High and New Technology Industries in China
China has developed high and new technology industries for many years in order to transform its mode of economic growth, cultivating knowledge- and technology-intensive new companies with great growth potential and low resource consumption that support sustainable development. As the technology of such industries spreads, some issues emerge, such as spatial technology spillover, the continuity of technological upgrading, and the delay between research and development and market acceptance. In this paper, we analyze the efficiency of these industries using the analysis process described above.
4.1. Introduction to the Model and Data
In the framework of spatio-temporal model analysis, the spatial lag production function model based on the Cobb-Douglas production function is chosen as the basic model for weight selection. The matrix form of the model is as follows: where , , and are, respectively, the logarithm vectors of the main business income, the assets investment, and the mean number of employees of the high and new technology industries in every province in China; is the spatial weight matrix of the 31 provinces in China. Among many spatial weight matrices, we choose the three most commonly used spatial weight matrices in economic problems and their combination forms. Various weight matrices and their interpretations are shown in Table 1.
Before fitting, each weight matrix was row-standardized, and the optimal weight matrix determined by crossvalidation was used to establish the stochastic frontier model. Starting from the ordinary panel stochastic frontier model and considering the spatial correlations, we construct the static panel spatial stochastic frontier model and the spatiotemporal stochastic frontier model. To determine whether the model variables exhibit spatial correlation, we apply a spatial correlation test, and to determine whether the model should contain a time lag term, we test its significance. We compare the results of the three models, select the one that provides the best fit, and use it to estimate the technical efficiency of the high and new technology industries in every province.
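Row-standardization divides each row of a weight matrix by its row sum so that every nonzero row sums to one. A minimal NumPy sketch follows; the function name `row_standardize`, the toy matrix, and the choice to leave rows of isolated units (row sum zero) as zeros are assumptions.

```python
import numpy as np

def row_standardize(W):
    """Scale each row of W to sum to 1; all-zero rows are left unchanged."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums != 0)

# Toy binary contiguity matrix; the third unit has no neighbors.
W_raw = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 0, 0]])
W_std = row_standardize(W_raw)
```

After standardization, the spatial lag of a variable is a weighted average of its values at neighboring units, which keeps the spatial coefficients comparable across different weight matrices.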
The matrix forms of the three models are as follows:
Ordinary panel stochastic frontier model:
Static panel spatial stochastic frontier model:
Spatiotemporal stochastic frontier model: where is a general vector of stochastic error, is the inefficiency term, is the identity matrix of order , and are the spatial correlation coefficients of the corresponding equations, is the spatiotemporal time lag coefficient of the spatiotemporal stochastic frontier model, and is the regression coefficient vector.
The development plan for the high and new technology industry in China was launched in 1988, but because early progress was relatively slow, large-scale development of the industry did not begin until the start of the twenty-first century. For this reason, our research sample is the panel data of the high and new technology industries of the 31 provinces of China from 2001 to 2018. The data on capital and labor inputs are taken from the “China high-tech industry yearbook.” Descriptive statistics are shown in Table 2.
Figure 2 is a bar chart of the within-province means of the data, averaged over the years.

Figure 2 shows the differences in the average input and output levels of high-tech industries across the provinces of China from 2001 to 2018. Guangdong, Jiangsu, and Shandong have the highest input and output levels, yet the differences among these three provinces are themselves large. Over the 18 years, the average output value of the high-tech industry in Guangdong, which ranks first, reached 23.7 billion yuan, while that of Shandong, which ranks third, reached only 6.997 billion yuan, less than one third of that of Guangdong. Nationwide, the gap between provinces is even more pronounced: the average output value of the high-tech industries in Tibet, which ranks last, is only 0.085 billion yuan, about 0.36% (roughly 1/280) of that of Guangdong.
4.2. Crossvalidation Results and the Selection of Weight Matrix
The data span 18 years, which is relatively short and smaller than the number of regions. Therefore, leave-one-out crossvalidation along the time dimension (TLOOCV) was chosen. Each weight matrix in Table 1 was introduced into model (31) in turn, the model was fitted on each training set, and Equation (40) was then evaluated to obtain the CV statistic corresponding to each weight matrix. The results are shown in Table 3.
Comparing the CV statistics on the validation sets in Table 3 identifies the optimal spatial weight matrix, which is used in the rest of this paper.
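The leave-one-year-out selection loop can be sketched as follows. This is a simplified illustration, not the procedure of model (31) and Equation (40): it assumes a single input variable, fits the spatial lag regression by pooled least squares (ignoring the endogeneity of the spatial lag that the GMM estimator of this paper addresses), predicts the held-out year from its observed spatial lag, and uses the mean squared prediction error as the CV statistic. All function names are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def design(W, y_t, x_t):
    """Rows [1, (W y_t)_i, x_ti]: intercept, spatial lag, one input."""
    lag = [sum(w * v for w, v in zip(row, y_t)) for row in W]
    return [[1.0, l, x] for l, x in zip(lag, x_t)]


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def tloocv(W, Y, X):
    """Leave-one-year-out CV score (mean squared prediction error).

    Y[t][i] and X[t][i] are the output and input of region i in year t.
    """
    T, n = len(Y), len(Y[0])
    sse = 0.0
    for held in range(T):
        rows, target = [], []
        for t in range(T):
            if t != held:                      # training years only
                rows += design(W, Y[t], X[t])
                target += Y[t]
        b = ols(rows, target)
        for r, obs in zip(design(W, Y[held], X[held]), Y[held]):
            pred = sum(bi * ri for bi, ri in zip(b, r))
            sse += (obs - pred) ** 2
    return sse / (T * n)


def select_weight_matrix(candidates, Y, X):
    """Return the candidate name with the smallest CV score, plus all scores."""
    scores = {name: tloocv(W, Y, X) for name, W in candidates.items()}
    return min(scores, key=scores.get), scores
```

Calling `select_weight_matrix({"contiguity": W1, "distance": W2}, Y, X)` with row-standardized candidate matrices returns the name of the matrix with the lowest out-of-sample error together with the score of every candidate; the dictionary keys here are placeholders.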
4.3. Empirical Results
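To ensure that spatial econometrics is applicable to the problem under study, we need to test the spatial correlation of the variables of interest. The most popular measure of spatial autocorrelation is Moran’s index (Moran’s $I$):
$$I = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}{S^{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}},$$
where $S^{2}=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}$ is the sample variance, $w_{ij}$ is the $(i,j)$ element of the spatial weight matrix (used to measure the distance between region $i$ and region $j$), and $\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$ is the sum of all spatial weights.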
The value of Moran’s $I$ generally lies between -1 and 1. A value greater than 0 indicates positive autocorrelation: high values are adjacent to high values and low values to low values. A value less than 0 indicates negative autocorrelation: high values are adjacent to low values. If Moran’s $I$ is close to 0, the spatial distribution is random and there is no spatial autocorrelation.
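The sign interpretation can be checked numerically with a minimal sketch of the global Moran's index in its standard form (cross-products of deviations, scaled by the sample variance and the sum of weights). The four-region layout and the values are hypothetical.

```python
def morans_i(x, W):
    """Global Moran's I of the values x under the spatial weight matrix W."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s2 = sum(d * d for d in dev) / n              # sample variance
    s0 = sum(sum(row) for row in W)               # sum of all spatial weights
    cross = sum(W[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return cross / (s2 * s0)


# Four hypothetical regions on a line (0 - 1 - 2 - 3), 0/1 contiguity weights.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
clustered = morans_i([1.0, 1.0, 5.0, 5.0], W)     # like values are neighbours
alternating = morans_i([1.0, 5.0, 1.0, 5.0], W)   # unlike values are neighbours
```

With these inputs the clustered layout gives a positive index (1/3) and the alternating layout gives -1, matching the interpretation of the sign.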
To test for spatial correlation in the variables of the high and new technology industry, we calculate the global Moran’s $I$ of the production values of the industry for each year from 2011 to 2018 (Table 4).
The results show that the p-value of Moran’s index is smaller than 0.01 in every year, so the index is significant at the 1% level in every year, and the average Moran’s index is also significant. Moran’s index reached its minimum value of 0.286 in 2011 and its maximum value of 0.340 in 2013. The production value of the high and new technology industries of the provinces therefore shows significant spatial correlation in every year, and we conclude that the production values of these industries in the provinces of China exhibit a clear spatial agglomeration effect. Furthermore, both the spatial correlation test and the location quotient calculation indicate that the high and new technology industries in different regions of China are clearly spatially correlated. We therefore choose the panel spatial stochastic frontier model for the analysis.
Table 5 presents the estimation results of the static and spatiotemporal stochastic frontier models, where the static model is estimated under both fixed and random effects.
The spatial autoregressive coefficients of the three models all pass the significance test. From this and the spatial correlation test, one can conclude that the spatial stochastic frontier model is more reasonable. The estimated spatial autoregressive coefficients of the three models are all positive, which implies that the spatial effects have a positive impact on the development of the high and new technology industries. The Hausman statistic of the static panel spatial model is negative, which implies that the random effects model should be chosen. In the random effects static panel spatial stochastic frontier model, the value is far greater than the value, and the value of is 0.719, implying that technical inefficiency is clearly present. Comparing the spatiotemporal stochastic frontier model with the random effects estimates of the static panel spatial model, the spatiotemporal lag coefficient of the former passes the 10% significance test, indicating that the spatiotemporal lag term plays a significant role in the model. The distance statistic of the spatiotemporal stochastic frontier model passes the test at the 5% significance level, showing that the model is globally significant. Moreover, the estimated value of for the spatiotemporal stochastic frontier model is higher than that for the static panel spatial stochastic frontier model, indicating that the inefficiency term plays a more significant role in the spatiotemporal model. All the explanatory variables of the technical inefficiency term in the two models pass at least the 5% significance test, the signs of the regression coefficients of the two models are consistent, and their numerical values are relatively close.
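The text does not state how the significance levels in Table 4 were obtained. One common option, shown here purely as an assumption, is a permutation test: the observed values are reshuffled across regions many times, and the pseudo p-value is the share of reshuffled indices that are at least as large as the observed one. The function names, the number of permutations, and the seed are illustrative choices.

```python
import random


def morans_i(x, W):
    """Global Moran's I of the values x under the spatial weight matrix W."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s2 = sum(d * d for d in dev) / n
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return cross / (s2 * s0)


def permutation_p_value(x, W, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = morans_i(x, W)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any spatial pattern
        if morans_i(shuffled, W) >= observed:
            exceed += 1
    # Count the observed arrangement itself among the permutations.
    return (exceed + 1) / (n_perm + 1)
```

A small pseudo p-value means that few random rearrangements of the same values produce an index as large as the observed one, which is the sense in which a positive Moran's index is called significant.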
Taken together, these results indicate that the spatiotemporal stochastic frontier model is the more reasonable specification and that the development of the high and new technology industry is positively correlated in both space and time. Owing to space limitations, this paper does not report the annual technical efficiency of the high-tech industries of each province. Instead, the spatiotemporal stochastic frontier model is used to calculate the average technical efficiency of the high-tech industries of each province from 2001 to 2018 (Table 6).
Table 6 shows that the average technical efficiency of the high-tech industries in every province of China is less than 1, which indicates that actual output has not reached the most efficient output level in any province and that production is technically inefficient. The national average technical efficiency over the sample period was 0.837, and there are obvious regional differences in the technical efficiency values presented in Table 6. Nine provinces (Beijing, Chongqing, Fujian, Gansu, Jiangsu, Liaoning, Shanghai, Tianjin, and Zhejiang) achieved an average technical efficiency of more than 0.9; seven of them are located in the eastern region, one in the central region, and one in the western region. Eleven provinces have an average technical efficiency below 0.8, namely, Guangdong, Guangxi, Guizhou, Hebei, Henan, Jiangxi, Ningxia, Qinghai, Shaanxi, Xinjiang, and Tibet; only one of them is in the east, six are in the central region, and four are in the west.
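The step from an estimated model to province-level efficiency scores can be sketched as follows. The sketch assumes a normal noise term and a half-normal inefficiency term and uses the Battese-Coelli conditional expectation of exp(-u) given the composed residual; the distributional assumption, the function names, and the residual layout are assumptions made for illustration and may differ from the estimator used for Table 6.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def technical_efficiency(eps, sigma_u, sigma_v):
    """E[exp(-u) | eps] for a production frontier with eps = v - u.

    Assumes v ~ N(0, sigma_v^2) and u ~ |N(0, sigma_u^2)| (half-normal).
    """
    s2 = sigma_u ** 2 + sigma_v ** 2
    mu_star = -eps * sigma_u ** 2 / s2              # conditional location
    sig_star = sigma_u * sigma_v / math.sqrt(s2)    # conditional scale
    ratio = (norm_cdf(mu_star / sig_star - sig_star)
             / norm_cdf(mu_star / sig_star))
    return ratio * math.exp(-mu_star + 0.5 * sig_star ** 2)


def mean_efficiency_by_province(residuals, sigma_u, sigma_v):
    """Average the yearly efficiency scores of each province.

    residuals maps a province name to its list of yearly composed residuals.
    """
    return {
        prov: sum(technical_efficiency(e, sigma_u, sigma_v) for e in eps) / len(eps)
        for prov, eps in residuals.items()
    }
```

In a spatial specification the composed residual is the log output minus the fitted spatial lag, time lag, and input terms. Every score lies strictly between 0 and 1, and a more negative residual yields a lower score, which is why an average below 1 is read as technical inefficiency.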
5. Conclusion
In this paper, considering that the explained variable may be affected by its time lag and by space-time interaction, we develop a dynamic model within the framework of the panel spatial stochastic frontier model. Because the model is clearly endogenous, we estimate the parameters by the system GMM method, choose suitable instrumental variables according to the model assumptions and the characteristics of the variables, and construct the spatiotemporal stochastic frontier model. We use the extremum consistency theorem and the uniform law of large numbers (ULLN) to prove the consistency of the estimators of the structural parameters and of the error term distribution parameters. For selecting the spatial weight matrix of the spatio-temporal model, a stratified crossvalidation method is designed that selects the most appropriate matrix in a data-driven way, according to the characteristics of spatio-temporal data. Although the spatial weight matrix selected by supervised learning may not suit every problem, this data-driven model selection method is a valuable and efficient alternative to fixing the matrix in advance.
From the analysis of the stochastic frontier model of high-tech industries in China and the measurement of their technical efficiency, we can draw the following conclusions.
There is a positive spatial correlation in the development of high-tech industries across the regions of China. The global Moran’s index computed for each year shows that the output values of these industries in different regions are positively correlated. The estimates of the spatial panel stochastic frontier model also give a positive spatial autoregressive coefficient, confirming this correlation and indicating that it has a positive impact on the development of high-tech industries. There are also a spatial agglomeration effect and a spatiotemporal lag effect in these industries, which shows that both static spillover and dynamic continuity occur in the development of high-tech industries in China. The technical efficiency of high-tech industries is relatively low. The strategic emerging industries started earlier in the eastern region but have developed more slowly there than in the central and western regions.
The Chinese economy is at a critical stage of replacing old growth drivers with new ones and of transforming and upgrading its industries. The new round of technological and industrial revolution, Industry 5.0, has given rise to new technologies, new industries, new forms of business, and new models. The data mining algorithm based on the stochastic frontier that is used in this study to calculate industrial efficiency is suitable not only for high-tech industries; it can also enrich research on the efficiency of new industries and new business models, and it has practical significance for promoting the steady development of the new round of scientific and technological revolution under Industry 5.0.
Appendix
A. Proof
is an -dimensional nonzero composite random error term vector, and is set as interindividual nonautocorrelation according to classical econometric assumptions for simplicity. where is the element of the matrix . For any , , then the sufficient and necessary condition for the above formula to be 0 is , which is obviously inconsistent with the assumption of spatial weight matrix in this paper. Therefore, it is proved that
B. Extremum Consistency Theorem
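If there is a function $Q_0(\theta)$ such that (i) (identification) $Q_0(\theta)$ is uniquely maximized at $\theta_0$, (ii) (compactness) the parameter space $\Theta$ is compact, (iii) (continuity) $Q_0(\theta)$ is continuous, and (iv) (uniform convergence in probability) $\sup_{\theta\in\Theta}\lvert\hat{Q}_n(\theta)-Q_0(\theta)\rvert\xrightarrow{p}0$, then $\hat{\theta}\xrightarrow{p}\theta_0$, where $\hat{\theta}$ maximizes $\hat{Q}_n(\theta)$ over $\Theta$.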
C. Proof
From (25), we let and , which are continuous functions of the corresponding parameters constructed by sample and global moment conditions, respectively. and are the weight matrix sample constructed and global constructed, respectively, where is the structural parameter vector of the model. The above settings satisfy the following conditions: (i)When , , and and are both positive definite matrices(ii), the parameter space of , is compact, and (iii)For an arbitrary point , is a continuous function(iv). Let
We first prove uniqueness of the maximum value of .
As , one has , which is the maximum value of given that is a positive definite matrix. From the uniqueness of the true parameter value, for , it is satisfied that
and therefore, is the unique maximum of .
According to the uniform law of large numbers (ULLN), when conditions (ii), (iii), and (iv) hold, and are continuous functions, and . Noting the structure of the weight matrices and , and the fact that the instrumental variable matrices and can be regarded as and under the condition , one has , where in the last line we have applied the triangle inequality.
Taking the supremum on both sides of the above inequality, we obtain
According to the Extremum Consistency Theorem, we have .
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Shandong Provincial Social Science Planning Key Project “New Infrastructure” to Promote the Shandong Provincial Economic High-Quality Development Path Research (21BJJJ07), Special Research Project of Shandong Social Science Fund on the Conversion of Old and New Driving Forces “Study on the Space-time Evaluation Mechanism of Production Efficiency of New Industry and New Format in Shandong” (19CDNJ37), Social Science Research Project of Shandong University of Political Science and Law “Study on The Conversion of Old And New Driving Forces and Efficiency Index of Cultural Industry” (2019Q14B), Shandong Province Humanities Social Science Finance Application Key Project “Research on Rural Supply Chain Finance Model and Credit Risk Early Warning in Shandong Province under the Environment of Internet +” (2020-JRZZ-11), and National Natural Science Foundation of China-Shandong Joint Fund (No. U1806203). This material is also based upon work supported by Program for Big Data and Artificial Intelligence Legal Research Collaborative Innovation Center in Shandong University of Political Science and Law.