Abstract

This paper analyzes course failure among students by improving the logistic regression distribution function and by using a stepwise least squares fit as the input of the logistic regression model. Compared with the sigmoid function, the arctan-exp function is more robust and stable, and it pushes the outputs toward 0 and 1, which makes them easier to classify. An improved algorithm based on logistic regression and the least squares method, which combines the advantages of both functions, is studied, and an implementation of the algorithm is presented. In a simulation example, both theoretical analysis and experimental evaluation demonstrate the effectiveness of the proposed approach, in which the nonlinear arctan-exp function, fed by the least squares fit, is used as the distribution function to predict students’ failure. Compared with the other methods evaluated, the algorithm obtains superior accuracy and recall in diagnosing student failure.

1. Introduction

As students enter university, their enthusiasm for learning often drops sharply, and failing exams has become a common phenomenon. For students themselves, failing an exam affects whether they can graduate on time. For teachers engaged in educational administration and student work, it has become a focus of attention.

At present, studies of students’ performance and daily behavior rely mainly on qualitative theoretical guidance [1–4] and quantitative data analysis [5–9]. Vahedi et al. [1] discussed students’ in-class use of information and communication technology, their motivations and perceptions, and their attitudes toward restricting and integrating such technology in the classroom. LaFreniere and Shannon [2] examined parents’ relational maintenance in terms of college students’ resilience, family communal coping, and stress. Their results indicate that college students’ resilience mediates the relationship between parents’ relational maintenance strategies and students’ perceived stress and that communal coping mediates the relationship between parents’ relational maintenance strategies and college students’ resilience. Porru et al. [3] used a between-within linear regression model to investigate whether higher exposure to life challenges is associated with poorer health (between individuals) and whether changes in students’ life challenges are associated with changes in health (within individuals). They found that higher exposure to student life challenges is associated with poorer mental health and self-rated health and that an increase in life challenges within individuals is likewise associated with poorer mental health and self-rated health. Zhang et al. [4] combined social cognitive career theory (SCCT) and the stimulus-organism-response (SOR) model to explore the psychological cognition and attitudes that students develop during learning, discussed how student learning satisfaction is enhanced from a process perspective, and examined the relationships among social support systems, interaction relationships, self-efficacy, generic skills, and learning satisfaction.
In that study, 800 valid questionnaires were collected from 12 universities through purposive sampling, and the structural model was analyzed by partial least squares structural equation modeling (PLS-SEM). Shao-yang et al. [5] used a geographic information system to quantitatively analyze the spatial distribution and main driving factors of the total phenolic acid content in Angelicae Sinensis Radix in Gansu. Mori et al. [6] proposed several stopping rules (SRs) and cost functions (CFs), improved the definition and design of the method, extended the experiments to analyze the characteristics of the methodology, including the influence of the parameters, and included a new state-of-the-art method in the comparative analysis. Liu et al. [7] derived the learning rate for kernel-regularized regression based on reproducing kernel Banach spaces, in both expected mean and empirical mean; their results show that uniform convexity influences the learning rate. To protect users’ commodity viewing privacy on commercial websites, Wu et al. [8] proposed constructing a group of dummy requests on a trusted client and submitting them together with the user’s commodity viewing request to the untrusted server, so as to obscure the user’s preferences. Jing et al. [9] proposed a sliding force prediction model based on belief rule-based (BRB) inference, taking sliding force monitoring on a subsection of the West-East Gas Pipeline Project in China as an example to describe the uncertain and nonlinear relationship between input and output variables.

Quantitative models mainly use machine-learning algorithms, such as deep neural networks [10], logistic regression models [11–16], and various combined models. Yousafzai et al. [10] used an attention-based bidirectional long short-term memory (BiLSTM) network for feature classification and prediction, which helps academics, universities, and government departments predict student performance (grades) early and efficiently from historical data. To explore the influence of students’ gender on their specialties, Hu and Hu [11] used logit regression to analyze the performance of middle school students empirically and argued that schools should formulate different education programs according to students’ gender, so as to teach students in accordance with their aptitude. Nurhasanah et al. [12] built a model of the interest of Banda Aceh city students through a binary logistic regression approach; the results show that the factors influencing students’ interest in continuing their studies at Universitas Syiah Kuala are the ability to affiliate with others, goals, and expectations. Wang et al. [13] conducted a student survey and analyzed professional English results, which provided strong support for teaching reform. Yan and Liu, Sugilar et al., and Okewu et al. [14–16] have also made progress in the quantitative analysis of student data and in research on student behavior and achievement.

Many of the aforementioned methods rely on repeated, multistage surveys of students’ historical achievements and intentions. However, the detailed scores available to most university teachers engaged in student work are difficult to use because students take different elective subjects, among other reasons; only the final total score has a certain value. At the same time, real-name questionnaires are inevitably colored by subjective factors. Therefore, information that these teachers can obtain easily needs to be chosen for analyzing students’ performance. Many factors, such as students’ grade, gender, consumption, dining, book borrowing, and learning in the previous stage, affect students’ performance at the current stage and whether they fail an exam [17–21].

Logistic regression analysis usually uses the sigmoid function as the distribution function, but the arctan function is more stable than the sigmoid function in some regions. Inspired by the aforementioned literature, this paper uses the arctan-exp function and further improves the fitting accuracy of the nonlinear regression model, so that the input of the logistic regression is closer to the actual target and the classification results are more clear-cut. The aim is to analyze students’ failure effectively from easily obtained basic information and daily behavior.

Before ending this section, the main contributions of the study are summarized as follows:
(1) From easily available data, the behavior of each influencing factor is studied by binary correlation analysis.
(2) By taking the arctan-exp function as the distribution function and the regression function fitted by the least squares method as its input, the fitting accuracy of the nonlinear regression model is improved, and a more robust and stable prediction model is obtained.

The work draws on the ideas of the ternary tree and K-means [22, 23] and provides a useful reference for improving classical decision-making problems.

2. Classical Method Introduction

2.1. Logistic Regression (LR)

LR [24, 25] is a classification method based on the multiple regression model and is widely used for judgment and binary classification. The LR model is popular in industry because of its simplicity, parallelizability, and strong interpretability.

Outliers have too much influence on the results of the commonly used multiple regression functions, so logistic analysis requires a robust function as the distribution function. Generally, the sigmoid function is taken as the distribution function, which is expressed as follows:

$$f(z) = \frac{1}{1 + e^{-z}}, \tag{1}$$

where $z$ is the fitting variable of the multiple regression.

When the multiple linear regression function is used as the fitting function, the distribution function can be expressed as equation (2):

$$f(x) = \frac{1}{1 + e^{-(w^{T}x + b)}}, \tag{2}$$

where $w$ is the vector of linear regression coefficients, $b$ is the function offset, and $x = (x_1, x_2, \ldots, x_n)^{T}$ is the independent variable vector.

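Therefore, the probability of a positive prediction is $f(x)$, and the probability of a negative prediction is $1 - f(x)$, namely

$$P(y = 1 \mid x) = f(x), \qquad P(y = 0 \mid x) = 1 - f(x). \tag{3}$$

It follows that the probability of a correct prediction is

$$P(y \mid x) = f(x)^{y}\,\bigl(1 - f(x)\bigr)^{1 - y}, \tag{4}$$

where $f(x)$ is the value of the distribution function and $y$ is the predicted value of a sample, which is 1 or 0.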

The critical threshold of maximum likelihood estimation is used as the decision boundary. When the prediction result is greater than the threshold, the sample is classified as 1; otherwise, it is classified as 0.
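As a minimal illustration of this section, the following Python sketch evaluates the sigmoid distribution function on a linear fitting variable, computes the log-likelihood of the observed labels, and applies a decision threshold. The coefficient values, the toy samples, and the threshold of 0.5 are assumptions chosen for illustration and are not taken from this paper.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid distribution function: maps any real z into (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ w + b)

def log_likelihood(X, y, w, b):
    """Sum over samples of log P(y | x) for labels y in {0, 1}."""
    p = predict_proba(X, w, b)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def classify(X, w, b, threshold=0.5):
    """Label a sample 1 when its probability exceeds the threshold, else 0."""
    return (predict_proba(X, w, b) > threshold).astype(int)

# Toy data (hypothetical): three samples with two features each.
X = np.array([[0.5, 1.0], [-1.5, 0.2], [2.0, -0.3]])
y = np.array([1, 0, 1])
w = np.array([1.0, -0.5])   # assumed regression coefficients
b = 0.2                     # assumed offset

probs = predict_proba(X, w, b)
labels = classify(X, w, b)
ll = log_likelihood(X, y, w, b)
print(probs, labels, ll)
```

In practice the coefficients would be estimated by maximizing the log-likelihood rather than fixed by hand as they are here.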

2.2. Least Square Method (LSR)

LSR [26, 27] is a regression algorithm that models a dependent variable as a function of multiple independent variables. It combines the advantages of principal component analysis (PCA), canonical correlation analysis (CCA), and multiple linear regression and can continuously improve the accuracy of the regression function through repeated iterative fitting.

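The m-order polynomial [26, 27] in the independent variables is used. When the fitted function is nonlinear, it is expressed as

$$L = \lVert Y - XW \rVert^{2}, \tag{5}$$

where $L$ is the loss function, $X$ is the matrix of polynomial terms of the independent variables, $Y$ is the output vector of the samples, and $W$ is the coefficient matrix, which can be obtained by the formula $W = (X^{T}X)^{-1}X^{T}Y$.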
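The following sketch illustrates a least squares fit of an m-order polynomial in one variable with NumPy. The function names and the synthetic data are hypothetical. `numpy.linalg.lstsq` is used in place of an explicit normal-equation inverse because it is numerically more stable; that is a choice of this sketch rather than of the paper.

```python
import numpy as np

def polynomial_design_matrix(x, order):
    """Columns 1, x, x^2, ..., x^order for a 1-D input x."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=order + 1, increasing=True)

def least_squares_fit(A, y):
    """Coefficients w that minimise ||y - A w||^2."""
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Synthetic data generated from a known quadratic, so the fit can be checked.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x - 3.0 * x**2

A = polynomial_design_matrix(x, order=2)
w = least_squares_fit(A, y)
print(w)  # coefficients in increasing order of power
```

Because the synthetic target is itself a quadratic without noise, the recovered coefficients should match the generating ones up to rounding error.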

3. Method Improvement

3.1. arctan-exp Distribution Function

According to the existing results [28–30] and equation (4), the distribution function affects the probability produced by logistic regression. Compared with the sigmoid function, the arctan function has better stability in some regions. The arctan function is as follows, where z is the fitting variable; Figure 1 compares the arctan function with the sigmoid function.

In Figure 1, the green curve shows equation (6), and the red curve shows equation (1). As the figure shows, each function has its own advantages. Where the ordinate is between about 0.2 and 0.8, the curve of equation (6) clearly rises faster, but elsewhere the other function tends to be more stable. To further improve the robustness of the distribution function and combine the advantages of both functions, equation (6) needs to be improved. If the variable inside the arctan function can be made to change rapidly, better robustness and stability can be obtained.
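To make the comparison concrete, the sketch below evaluates the sigmoid function and an arctan function rescaled to the range (0, 1). The rescaling arctan(z)/π + 1/2 is an assumption of this sketch, since only the qualitative behavior of the curves is described here. Under that assumption, the arctan curve is steeper near z = 0, while the sigmoid curve approaches 0 and 1 faster in the tails.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid distribution function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def arctan_unit(z):
    """Assumed normalisation: arctan rescaled from (-pi/2, pi/2) to (0, 1)."""
    return np.arctan(np.asarray(z, dtype=float)) / np.pi + 0.5

def central_slope(f, z0=0.0, h=1e-4):
    """Central-difference estimate of the slope of f at z0."""
    return float((f(z0 + h) - f(z0 - h)) / (2.0 * h))

slope_sigmoid = central_slope(sigmoid)   # analytically 1/4
slope_arctan = central_slope(arctan_unit)  # analytically 1/pi

for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z={z:+.1f}  sigmoid={float(sigmoid(z)):.4f}  arctan={float(arctan_unit(z)):.4f}")
print(slope_sigmoid, slope_arctan)
```

The printed table shows both effects at once: a larger central slope for the rescaled arctan and a slower approach to the limits 0 and 1 away from the center.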

Among the elementary functions, the exponential function grows fastest; therefore, equation (7), which has three coefficients, can reach this goal.

In order to make the value range of equation (7) as consistent with equation (1), suppose . In addition, the rapid growth of is mainly reflected in the range of , but it is not obvious at . Therefore, in order to ensure that equation (7) can achieve the expected effect, let and , where is a positive number.

Substituting the aforementioned values into equation (7) yields equation (8).

For the convenience of calculation, take to obtain equation (9).

Figure 2 plots the curves of the three functions, that is, the curves of equations (1), (6), and (9). The figure shows that the arctan-exp function has the best trend and the best robustness.

To compare the three functions more closely, equations (10), (11), and (12) are introduced for evaluation, and the corresponding curves are shown in Figure 3. The three distribution functions in these equations are the sig, arctan, and arctan-exp functions.

In Figure 3, the constant curve corresponding to equation (12) serves as a reference for equations (10) and (11). The larger the value of a curve, the better the trend and stability of the distribution function in the numerator of the corresponding formula relative to the one in the denominator. Where a curve equals 1, the distribution functions in the numerator and the denominator are equal at that point. The figure shows that the curves of equations (10) and (11) are ≥1 at every point. Therefore, logistic regression with the arctan-exp function as the distribution function can be expected to give better results.

3.2. Curve Fitting

When the fitting function fits better and more accurately, the prediction accuracy of logistic regression also increases notably. Here, drawing on existing ideas [30, 31], multiple nonlinear fitting of the data is realized by the partial least squares method.

Realizing multivariate nonlinear fitting directly with the partial least squares method is not easy and may cause large errors.

Therefore, building on existing research results [31, 32], the multilevel least squares method is used to fit the curve: each term is used to predict the difference between the real value and the value predicted by the preceding term, so as to approximate the target.

To ensure good prediction results, the factors need to be sorted by correlation, so that the factor with the strongest correlation is the first term and the factor with the weakest correlation is the last term.
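One possible reading of the multilevel fitting described above is sketched below: the factors are ranked by the absolute value of their correlation with the target, and each factor in turn is fitted by a low-order polynomial to the residual left by the preceding factors. The polynomial order, the function names, and the synthetic data are assumptions of this sketch and are not specified in the paper.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Column indices of X ordered from strongest to weakest |Pearson r| with y."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(r))

def stagewise_fit(X, y, order=2):
    """Fit one polynomial per factor, each to the residual of the earlier stages."""
    residual = np.asarray(y, dtype=float).copy()
    stages = []
    for j in rank_by_correlation(X, y):
        A = np.vander(X[:, j], N=order + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(A, residual, rcond=None)
        residual = residual - A @ coef   # what the next factor must explain
        stages.append((int(j), coef))
    return stages

def stagewise_predict(X, stages):
    """Sum the contributions of all fitted stages."""
    total = np.zeros(X.shape[0])
    for j, coef in stages:
        total += np.vander(X[:, j], N=len(coef), increasing=True) @ coef
    return total

# Synthetic data (hypothetical): the first factor dominates the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

stages = stagewise_fit(X, y)
pred = stagewise_predict(X, stages)
print([j for j, _ in stages], float(np.mean((y - pred) ** 2)))
```

Fitting each stage to the residual means that the training error cannot increase from one stage to the next, which is why the ordering of the factors matters mainly for how quickly the error falls.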

According to equations (2) and (9), equation (13) can be obtained.

In equation (13), is the independent variable matrix, is the coefficient, and is the function offset.

4. Model Steps

As shown in Figure 4, Part 1 imports the data, and Part 2 performs the nonlinear fitting by the least squares method; after multiple iterations, once the required fitting degree is reached, equation (5) is obtained. Part 3 modifies the distribution function as equation (6) and calculates the critical threshold of logistic regression. Finally, Part 4 performs the logistic regression analysis.

5. Data Processing and Basic Analysis

5.1. Data Introduction

The factors that affect whether engineering students fail can be divided into objective factors and subjective factors. Objective factors include students’ grade, gender, family economic situation, and so on. Subjective factors include students’ major, number of absences, book borrowing, consumption, meals, and grades of the previous semester.

This paper studies whether students fail based on these subjective and objective factors, all of which are easily available to teaching staff.

Suppose that some data are given as follows:

Table 1 contains 576 groups of student information, which are divided into a training set and a test set for the follow-up research. In Table 1, there are 384 groups of data in the training set and 188 groups of data in the test set.

5.2. Basic Analysis

Because many kinds of data are collected, it is inconvenient to build a concise and efficient prediction model directly. Therefore, SPSS is used to analyze the binary correlation between each factor in the training set and whether a student fails, and the factors with weak correlation are eliminated. See Table 2 for the specific results.

Table 2 shows that grade, gender, consumption frequency, breakfast frequency, and grade point of the last semester are strongly correlated with whether a student fails. Therefore, these five factors are chosen to analyze whether students fail.
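The screening step can be sketched as follows. The paper uses SPSS; the sketch instead computes the Pearson correlation between each factor and the binary failure label with NumPy, which for a binary label is the point-biserial correlation. The factor names, the toy values, and the cut-off of 0.1 are hypothetical.

```python
import numpy as np

def screen_factors(X, failed, names, min_abs_r=0.1):
    """Keep the factors whose |Pearson r| with the failure label reaches the cut-off."""
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], failed)[0, 1]
        if abs(r) >= min_abs_r:
            kept.append((name, float(r)))
    return kept

# Toy data (hypothetical): factor_a tracks the label, factor_b is unrelated to it.
X = np.array([
    [3.8, 1.0], [3.5, 2.0], [3.6, 1.0], [3.2, 2.0],
    [2.1, 1.0], [1.9, 2.0], [2.4, 1.0], [2.0, 2.0],
])
failed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
names = ["factor_a", "factor_b"]

kept = screen_factors(X, failed, names)
print(kept)
```

A significance test on each coefficient, as statistical packages report, would be a natural addition; the fixed cut-off is used here only to keep the sketch short.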

6. Result Analysis and Test

6.1. Evaluation Criterion

Logistic regression is a classification method rather than a regression analysis, so the commonly used regression evaluation criteria, root mean square error (RMSE) and coefficient of determination ($R^2$), are not suitable. Therefore, two other indicators are introduced for evaluation: accuracy rate and recall rate.

TP denotes a positive sample predicted as positive, FP a negative sample predicted as positive, FN a positive sample predicted as negative, and TN a negative sample predicted as negative. The accuracy rate and recall rate are expressed as equations (14) and (15), respectively.
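Assuming the standard definitions, accuracy = (TP + TN)/(TP + FP + FN + TN) and recall = TP/(TP + FN), the two indicators can be computed from true and predicted labels as in the following sketch. The label vectors are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy_rate(tp, fp, fn, tn):
    """Share of all samples that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall_rate(tp, fn):
    """Share of actual positives that are found; 0.0 if there are none."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Invented labels: 1 = the student fails, 0 = the student passes.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy_rate(tp, fp, fn, tn)
rec = recall_rate(tp, fn)
print(tp, fp, fn, tn, acc, rec)
```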

6.2. Result Analysis

To verify the optimality of the model, this study compared the diagnosis of students’ failing under different model settings. As shown in Figure 5, Part 1 is the influencing factors, and Part 2 is the six methods compared. The diagnosis results of the hybrid diagnosis model are then compared with those of the other five models. The accuracy and recall parameters are used to assess each model; see equations (14) and (15) and Part 3 of Figure 5.

In general, the diagnostic results of the first two models deviate noticeably from the actual values. The results show that this deviation is reduced in the diagnostic results of two further models, and the error of another model is also reduced. The comparison of diagnostic results shows that the arctan-exp distribution function significantly reduces the diagnostic error of logistic regression. Across the six models, the hybrid diagnostic model has the best-fitting performance and is the final model recommended in this paper.

Table 3 shows the accuracy rate and recall rate of the diagnosis results of the students failing in the examination of the six models compared in this paper. In Table 3, the first two columns are the prediction results of the training set, and the last two columns are the prediction results of the prediction set. From the table, we can get the following information:(a)For weak correlation information, the effect of nonlinear prediction is better than that of linear prediction.(b)Under the linear condition, the prediction set using function as the distribution function is better than , but the training set is worse; has a small improvement on the prediction effect.(c)In the case of nonlinearity, the effect of function is slightly higher than that of function; the function gives the most accurate results.

In addition, the differences among the logistic regression functions can be seen from the completion times in Tables 4 and 5.

7. Conclusion

To analyze the factors affecting engineering students’ failure and to predict failure, this paper adopts and verifies an improved logistic regression model. The key findings of the study are listed below:
(1) Based on easily available data, the characteristic indexes of each influencing factor are extracted by binary correlation analysis. The final selected factors include students’ learning in the previous stage, grade, gender, and breakfast frequency.
(2) Using the data screened in the previous step, and taking the arctan-exp function as the distribution function and the regression function fitted by the least squares method as its input, an efficient and accurate logistic classification prediction model is obtained.

The model proposed in this paper is suitable for the prediction and analysis of engineering students in colleges and universities. With the model, student managers and students themselves can anticipate learning outcomes in advance, which serves the purposes of early warning and prediction. By analyzing the results, teaching staff gain strong support for adjusting teaching management reasonably and exploring new teaching modes.

Data Availability

The data used in the case study cannot easily be published directly because they involve students’ personal privacy, but those who want to use the model can obtain similar data from the student information published by a university’s student and teaching management department or on the Internet.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (project no. 11871116).