Abstract

Background. Imposter syndrome (IS), characterized by self-doubt and fear despite clear accomplishments and competencies, is frequently detected in medical students and has a negative impact on their well-being. This study aimed to predict students’ IS using a machine learning ensemble approach. Methods. This cross-sectional study was conducted among medical students in Bangladesh. Data were collected from February to July 2020 through snowball sampling across medical colleges in Bangladesh. We employed three machine learning techniques, namely neural network, random forest, and ensemble learning, and compared their accuracy in predicting IS. Results. In total, 500 students completed the questionnaire. We used the Young Imposter Syndrome (YIS) scale to determine the presence of IS among medical students. The ensemble model achieved the highest accuracy, 96.4%, while the random forest and neural network achieved 93.5% and 96.3%, respectively. We used several performance metrics to compare the results of the models. Finally, we compared feature importance scores between the neural network and random forest models. The top feature of the neural network model is Y7, and the top feature of the random forest model is Y2, which ranks second among the top features of the neural network model. Conclusions. Imposter syndrome is an emerging mental illness in Bangladesh and requires the immediate attention of researchers. Identifying the key factors responsible for IS is an important step toward reducing its impact, and machine learning methods can be employed to identify these factors. Determining how each factor contributes to IS among medical students could be a potential future direction.

1. Introduction

Imposter syndrome (IS) is defined by a sense of not belonging, of being out of place, and of believing that one's competence and success, as perceived by others, are undeserved. Typically, this is regarded as a personal issue that should be addressed by keeping a record of accomplishments to serve as a reminder of progress [1–3]. IS, which mediates the relationship between perfectionism and anxiety and partially mediates the relationship between perfectionism and depression, was first described by clinical psychologists Pauline Clance and Suzanne Imes in late 1978 [4].

According to a more recent systematic review published in 2020, the prevalence of IS in the general population ranged from 9% to 82% [5], whereas another study conducted in 2020 showed that it varied from 22% to 60% among physicians and from 33% to 40% among trainee physicians [6]. According to recent IS research in the United States, 57% of pharmacy students [7] and 15% of medical students have IS [8]. Indeed, IS is becoming a significant public health concern on a global and regional scale. For instance, the prevalence of IS among medical students has been found to be 30% in the US [9], 45.7% in Malaysia [10], and 47% in Pakistan [11].

Several studies have found a substantial link between IS and overall psychological discomfort [5, 7, 12] as well as demographic and personal characteristics, including age [5, 7, 8], gender [6, 8, 13], and academic year [6, 10, 11]. The evidence indicates that medical students experience moderate-to-strong IS [14], which has psychological and academic consequences [9–11]. Moving into clinical studies can be particularly difficult and can undermine students’ confidence [15, 16]. Furthermore, several studies have found that IS has a detrimental impact on medical students’ physical and intellectual well-being [10, 14]. As a result, individuals may miss out on opportunities because they are unaware of their capabilities.

Machine learning approaches are now used to solve a wide variety of problems, and particular algorithms have proven well suited to particular purposes. Applications include health and medical issues, marketing, weather forecasting, and many more. For predicting IS, statistical analysis and machine learning techniques can be applied to data collected from medical college students, and such methods can be used to build predictive models for classification, regression, and categorization problems [17]. Ensemble learning models have yielded better predictions in many other fields, including sentiment analysis. To the best of the authors’ knowledge, there is no published study on imposter syndrome among Bangladeshi medical students. Hence, this study proposes a solution for predicting students’ IS using an ensemble learning approach that incorporates random forest (RF) and artificial neural network (ANN). We train and test the final ensemble model using the features obtained from RF and ANN. As far as we have found, this type of approach has not been thoroughly explored in the current literature. The results validate the improved performance of the proposed technique on four commonly used performance metrics, namely accuracy, precision, recall, and F1-score. The major contributions of this study can be summarized as follows:
(i) Populating a real-life dataset from Bangladeshi medical students for predicting young imposter syndrome
(ii) Utilizing ensemble learning to forecast whether a student in a given course or school year is likely to suffer from IS
(iii) Evaluating the ensemble model for predicting IS using accuracy, precision, recall, and F1-score

2. Materials and Methods

This section describes the steps followed in our research. The data collection subsection presents the method we used to collect data and the reasoning behind it. The data processing subsection describes the data cleaning and tabulation methods applied before further processing. Finally, we used two machine learning techniques, artificial neural network (ANN) and random forest (RF), to generate an ensemble model for predicting IS. Figure 1 gives an overview of the methodology employed in this study.

2.1. Data Collection
2.1.1. Sample Size Determination

To achieve the desired sample size, we used snowball sampling, a nonprobability sampling method. We chose this method because no sampling frame was available and the desired data were not otherwise accessible. Using Equation (1) [18], we obtained the minimum sample size required for the study:

$$n = \frac{z^{2} p q}{d^{2}} \tag{1}$$

where n is the required sample size, z is the standard normal variate, p is the estimated proportion of the population with the attribute in question, with p + q = 1, and d is the maximum allowed error in estimating a population proportion. We set the degree of accuracy to d = 0.05 and took p = 0.47 as the approximate proportion of YIS, derived from a study conducted in Pakistan [11]. The calculation gave a minimum sample size of 382. However, we included more than 382 samples to reach a more accurate conclusion. Finally, we collected 500 unique samples and discarded any incomplete questionnaires.
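As a rough check of the arithmetic, the sample-size formula for a proportion, n = z²pq/d², can be evaluated directly. The sketch below assumes z = 1.96 (a 95% confidence level, which the text does not state explicitly) and p = 0.47 (the prevalence reported for Pakistan [11]); the function name is illustrative only.

```python
import math


def min_sample_size(p, d, z=1.96):
    """Minimum sample size for estimating a proportion: n = z^2 * p * q / d^2."""
    q = 1.0 - p  # p + q = 1
    return (z ** 2) * p * q / (d ** 2)


# Assumed inputs: z = 1.96 is our assumption; p and d follow the text.
n = min_sample_size(p=0.47, d=0.05)
print(round(n, 1))    # 382.8
print(math.floor(n))  # 382
```

Under these assumptions the formula gives about 382.8, which matches the reported minimum of 382 when truncated; rounding up instead would give 383.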

2.1.2. Study Design and Settings

This cross-sectional survey assessed whether a student had IS. The study was conducted among students in the 1st to 5th year of study at public and private medical colleges in Dhaka city; we did not include samples from any medical college outside Dhaka city. Data were collected from 1st February 2020 to 30th July 2020. We used snowball sampling (a nonprobability sampling method) to achieve the desired sample size because no sampling frame was available and the desired data were not otherwise accessible. Data were collected through face-to-face interviews. Respondents were assured that all information collected in the survey would be kept strictly confidential and would be used only for research purposes, and written consent was obtained from them. We then briefly introduced each student to the study and presented the questionnaire; the response was collected immediately. Initially, we identified some potential IS cases among our participants and asked them to refer others for the study. We contacted the referred students via e-mail or mobile phone and collected their data in face-to-face interviews. The collected data were compiled for further processing and analysis. After cleaning the raw data and discarding incomplete questionnaires, we had 500 complete observations for our study.

2.1.3. Scale and Measures

(1) Dependent Variable. In our research, we considered having IS as the dependent variable. The dependent variable is categorized into two categories, that is, “Yes” (has IS) and “No” (does not have IS). “Yes” is denoted by 1, and “No” is denoted by 0.

The Young Imposter Syndrome (YIS) scale [9] was used to determine whether a person had IS. The scale consists of eight yes/no questions. A student who answered “Yes” to five or more of them (a YIS score of 5 or higher) was considered to have IS; otherwise, the student was considered not to have IS. Detailed information on how respondents answered all eight questions [9] is displayed in Table 1.
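The labelling rule described above (eight yes/no items, IS present when five or more are answered “Yes”) can be sketched as follows. The function and constant names are hypothetical and are not taken from the study's code.

```python
YIS_ITEMS = 8      # number of questions on the YIS scale
YIS_THRESHOLD = 5  # "Yes" answers needed to be classified as having IS


def yis_label(answers):
    """Return 1 (has IS) or 0 (does not have IS) from eight yes/no answers."""
    if len(answers) != YIS_ITEMS:
        raise ValueError("expected exactly eight answers")
    score = sum(1 for a in answers if a)  # count the "Yes" answers
    return 1 if score >= YIS_THRESHOLD else 0


# Hypothetical respondents: True means "Yes", False means "No".
print(yis_label([True] * 5 + [False] * 3))  # 1
print(yis_label([True] * 4 + [False] * 4))  # 0
```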

(2) Independent Variable. We considered respondents’ sociodemographic data and some specific academic data as independent variables. The sociodemographic variables were gender (male, female), age group (18–21, 22–25), body mass index (BMI), smoking status (current, past, never), and living with family (yes, no). The WHO BMI standard scale [19] was used to categorize BMI. The respondents’ economic situation was determined by their monthly family income (MFI), which was divided into four categories: ≤20,000 BDT (approximately US$237), 21,000–30,000 BDT, 31,000–40,000 BDT, and ≥41,000 BDT. We defined these MFI intervals ourselves. The academic year was classified as first, second, third, fourth, or fifth. The reasons for attending medical school were classified as personal preference, family preference, failure to qualify for the department of interest (a department other than medical science in which the student wanted to study), and better job facility (the student chose medical science because of the better job facility it offers). Each response was recorded in two categories (yes, no). All pertinent information was obtained directly from each participant.

2.2. Data Processing

After the interviews, we collected all questionnaires. The collected data were then compiled for processing and analysis using statistical software (SPSS). First, we cleaned the data by removing any duplicate or irrelevant records from the dataset. Then, we fixed structural errors such as strange naming, typos, and incorrect capitalization. Next, we validated the dataset against the aim of the study. Finally, after cleaning and tabulating the raw data, we had 500 complete observations for our study. We did not face any problem with missing values because we discarded all incomplete questionnaires.

2.3. Predictive Model Generation

The predictive model was developed using ensemble learning [20]. The study uses weighted ensemble modelling over two machine learning approaches, namely random forest [21] and neural network [22]. Both models receive the same weight of 1 in the final ensemble, which serves as the predictive model. The parameters and structures of all the models are provided below. For validation, K-fold cross-validation is employed instead of a single train-test split, because a single split can bias the model's performance estimate, which is one of the reasons a model may show poor accuracy. To avoid this problem, five-fold cross-validation is employed in generating the predictive model.
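To illustrate how five-fold cross-validation partitions the data, the sketch below splits 500 sample indices into five folds so that every observation is used for testing exactly once. It is a standard-library illustration only; the shuffling seed and function name are assumptions, and the study itself relied on library implementations.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(500, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # 400 100 for each of the five folds
```

Each model is trained on 400 observations and evaluated on the remaining 100, and the reported score is the average over the five folds.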

2.4. Random Forest

Random forest (RF) is an ensemble learning method built from decision trees [23, 24]. Feature importance was evaluated using out-of-bag (OOB) data in two steps: (1) two-thirds of the training dataset were used to build the predictive classifier, and the remaining data were used to evaluate the classifier's performance; (2) the decrease in prediction performance was used to assess the importance of each feature. We report the performance evaluation in terms of accuracy and the Gini index. In RF, the Gini index is used to determine how well a feature discriminates between classes. The Gini index is defined in Equation (2):

$$\mathrm{Gini}(t) = 1 - \sum_{j} \left[ p(j \mid t) \right]^{2} \tag{2}$$

where p(j|t) is the estimated probability of class j at node t of the decision tree. In our study there are two classes, IS = Yes and IS = No. Because the mean decrease in Gini index (MDGI) is more robust than the mean decrease in accuracy, it was used to select the important IS parameters [25]. The IS parameter with the highest MDGI value is regarded as the most important feature because it has the greatest impact on predictive performance.
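For intuition, the Gini index of a node can be computed as one minus the sum of squared class proportions, and the decrease in Gini produced by a split is the parent impurity minus the size-weighted impurity of the child nodes. The following sketch uses hypothetical labels and helper names; it illustrates the quantity being averaged in MDGI and is not the library's internal implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease of a split, weighting children by their size."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical node with two IS = Yes (1) and two IS = No (0) students.
node = [1, 1, 0, 0]
print(gini_impurity(node))                  # 0.5
print(gini_decrease(node, [1, 1], [0, 0]))  # 0.5 (a perfect split)
```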

The random forest classifier employed in this study comes from the scikit-learn Python package [26]. The model is tuned with the following parameters:
(i) n_jobs: −1. This parameter represents the number of jobs run in parallel. −1 instructs the system to use all available processors rather than a specified number of jobs.
(ii) criterion: gini. This parameter represents the measure of the quality of a data split. In this study, Gini impurity is used as the measure.
(iii) max_features: 0.9. This parameter specifies the number of features to be used in the data split mechanism. Here, 90% of the features are used in each split.
(iv) max_depth: 4. This parameter sets the maximum depth of the tree. In this study, the maximum depth of the tree is 4.
(v) eval_metric_name: logloss. This parameter is used to evaluate the model performance. In this model, logloss is used as the evaluation metric.
(vi) validation_type: kfold. This parameter is used to validate the model prediction on new data. In this model, K-fold is used for validation.
(vii) k_folds: 5. This parameter provides the k value for the K-fold validation method. For this model, a K value of 5 is used.
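A configuration along these lines could be reproduced with scikit-learn directly, as in the sketch below. The data here are synthetic (generated with `make_classification`) and stand in for the survey data. The evaluation settings (logloss, 5-fold validation) are expressed through scikit-learn's `cross_val_score`, which is our substitution for the wrapper-level parameters `eval_metric_name`, `validation_type`, and `k_folds`. The shuffling and random seeds are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 500 survey observations (not the study data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_jobs=-1,          # use all available processors
    criterion="gini",   # Gini impurity as the split-quality measure
    max_features=0.9,   # consider 90% of the features at each split
    max_depth=4,        # limit each tree to depth 4
    random_state=0,     # assumed seed, for repeatability only
)

# Five-fold cross-validation scored by log loss (scikit-learn reports it negated).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(rf, X, y, cv=cv, scoring="neg_log_loss")
mean_logloss = -scores.mean()
print(mean_logloss)
```

Because the data are synthetic, the printed logloss says nothing about the study's results; the sketch only shows how the listed parameters map onto scikit-learn arguments.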

2.5. Neural Network

In our study, we used the artificial neural network (ANN) method to make a primary prediction as part of the ensemble method. The ANN is a mathematical model that mimics aspects of biological neural systems to build a computational model that maps inputs to outputs [27]. Each neuron works as a basic unit that computes the following equation:

$$y = \sum_{i} w_{i} x_{i} + b$$

where y is the neuron output, x_i is the ith neuron input, w_i is the corresponding weight, and b is the bias (deviation). Each neuron gives a single output (y) from all of its inputs (x_i). All neurons are connected through a multilevel architecture [28], in which inputs are mapped to outputs layer by layer:

$$h_{i} = W_{i} h_{i-1} + b_{i}, \quad 1 \le i \le L, \quad h_{0} = x$$

where W_i is a weight matrix and b_i is a vector, both learned from the dataset, and L is the number of layers or levels.
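The weighted-sum computation performed by a single neuron, and its extension to a layer of neurons, can be sketched in a few lines. The weights, inputs, and function names below are hypothetical. The sketch covers only the weighted sum plus bias described in the text; practical networks additionally pass this sum through a nonlinear activation function, which is omitted here.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias term: y = sum(w_i * x_i) + b."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def dense_layer(inputs, weight_rows, biases):
    """One layer: every neuron sees all inputs and produces one output."""
    return [neuron_output(inputs, row, b) for row, b in zip(weight_rows, biases)]


# Hypothetical values for illustration.
print(neuron_output([1.0, 2.0], [0.5, -1.0], 0.25))                    # -1.25
print(dense_layer([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0]))   # [1.0, 3.0]
```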

In our case, the ANN input consists of the normalized values of 9 variables, including institution, age, sex, BMI, and smoking status. The output value is the presence of imposter syndrome (IS = yes or no). The dataset is divided into two parts, a training set and a test set, and 5-fold cross-validation (K = 5) is used as the method of validation. We used the logloss value as the measure of performance for the ANN [29]. Mathematically, logarithmic loss (log loss) is defined by

$$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log\left(p_{ij}\right)$$

where N is the number of samples, M is the number of possible labels, y_ij is the binary indicator of whether label j is the correct classification for instance i, and p_ij is the probability the model assigns to label j for instance i.
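The log loss can be computed directly from its definition: because the indicator is 1 only for the true label, just the probability assigned to the correct class contributes for each sample. The sketch below uses hypothetical predictions; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices; y_prob: list of per-class probability lists.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)


# Hypothetical predictions for two students (class 0 = no IS, class 1 = IS).
confident = log_loss([1, 0], [[0.1, 0.9], [0.8, 0.2]])
uninformed = log_loss([1, 0], [[0.5, 0.5], [0.5, 0.5]])
print(confident, uninformed)
```

A model that always predicts 0.5 scores log(2), about 0.693, so values well below that indicate informative probability estimates.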

For this study, mljar’s AutoML [30] method is used to build a neural network that relies on TensorFlow [31] and Keras [32] for the neural network algorithm. The parameters applied to the neural network are as follows:
(i) n_jobs: −1. This parameter represents the number of jobs run in parallel. −1 instructs the system to use all available processors rather than a specified number of jobs.
(ii) dense_1_size: 32. This parameter represents the size of the first dense layer of the neural network model. For this model, 32 neurons are used in the first dense layer.
(iii) dense_2_size: 16. This parameter denotes the size of the second dense layer of the network, which has 16 neurons.
(iv) learning_rate: 0.05. This parameter sets the learning rate of the model. The learning rate controls how much the weights are changed in response to the estimated error each time they are updated.
(v) validation_type: kfold. This parameter is used to validate the model prediction on new data. In this model, K-fold is used for validation.
(vi) kfolds: 5. This parameter provides the k value for the K-fold validation method. For this model, a K value of 5 is used.
(vii) eval_metric_name: logloss. This parameter is used to evaluate the model performance. In this model, logloss is used as the evaluation metric.

2.6. Ensemble Model

In this study, we employed forward stepwise selection for ensemble modelling. Forward stepwise selection draws from a library of models to build a subset of models that, when averaged together, gives better predictive results than the other models. Three principles are involved in this modelling technique to improve prediction and reduce overfitting [33]:

2.6.1. Selection with Replacement

With model selection without replacement, ensemble performance improves as the best models are added but then quickly declines as the remaining models are included. In this case, model selection with replacement provides better predictions on all performance metrics [34].

2.6.2. Sorted Ensemble Initialization

When the ensemble is small, forward selection can overfit. Instead of starting with an empty ensemble, we sorted the models in the library by performance and initialized the ensemble with the best models, ANN and RF.

2.6.3. Bagging of the Ensemble Models

Ensemble learners are prone to overfitting the data. We therefore drew a sample from the data and trained the models on it, then drew another sample and repeated the process, and so on. Finally, we bagged the ANN and RF model predictions together using a simple average of the predictions.

In this study, the ensemble model is built from a random forest classifier and a neural network, and both models are equally weighted. The random forest classifier is trained first, followed by the neural network, and their predictions are combined to generate the ensemble model. Figure 2 shows how the predictive ensemble model is generated in this study.
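The selection procedure of Section 2.6 can be sketched as a greedy loop over a model library: start from the best-ranked models (sorted initialization), then repeatedly add whichever model, possibly one already selected (selection with replacement), most improves the equally weighted average. The validation labels, probabilities, the extra model named `weak`, and the use of accuracy as the selection metric are all hypothetical choices made for this illustration; the bagging step is omitted.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def average(prob_lists):
    """Simple (equal-weight) average of several models' probabilities."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


def forward_select(library, y_true, n_init=2, max_steps=5):
    # Sorted initialization: start from the best-performing models.
    ranked = sorted(library, key=lambda m: accuracy(y_true, library[m]), reverse=True)
    ensemble = ranked[:n_init]
    best = accuracy(y_true, average([library[m] for m in ensemble]))
    for _ in range(max_steps):
        # Selection with replacement: already-selected models may be added again.
        scores = {
            m: accuracy(y_true, average([library[x] for x in ensemble + [m]]))
            for m in library
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no candidate improves the averaged ensemble
        ensemble.append(candidate)
        best = scores[candidate]
    return ensemble, best


# Hypothetical validation labels and class-1 probabilities.
y_val = [1, 0, 1, 0]
library = {
    "ann": [0.9, 0.2, 0.8, 0.4],
    "rf": [0.7, 0.6, 0.6, 0.3],
    "weak": [0.4, 0.6, 0.3, 0.7],
}
selected, score = forward_select(library, y_val)
print(selected, score)  # ['ann', 'rf'] 1.0
```

With only two strong models in the library, the loop stops at the sorted initialization, which mirrors the equally weighted ANN and RF ensemble described above.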

2.7. Performance Metrics

In this study, we used K-fold cross-validation instead of a single train-test split, as the train-test split method can introduce bias into the model. To evaluate our predictive model, we used several measurements: sensitivity/recall (Sn), accuracy (Ac), precision (Pn), F1-score, logloss, and Matthews correlation coefficient (MCC). The equations for Ac, Sn, Pn, F1-score, and MCC are as follows:

$$
\begin{aligned}
\mathrm{Ac} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{Sn} &= \frac{TP}{TP + FN}, \\
\mathrm{Pn} &= \frac{TP}{TP + FP}, \\
\mathrm{F1} &= \frac{2 \cdot \mathrm{Pn} \cdot \mathrm{Sn}}{\mathrm{Pn} + \mathrm{Sn}}, \\
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},
\end{aligned}
$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. An MCC of +1 represents perfect prediction, 0 represents random prediction, and −1 represents total disagreement between observation and prediction [35]. In addition, we also report the ROC (receiver operating characteristic) curve and the AUC (area under the curve).
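These measures follow directly from the four confusion-matrix counts, as the sketch below shows. The counts used in the example are hypothetical and are not taken from Table 2 or Figure 3; returning 0.0 when a denominator is zero is a convention assumed here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, F1-score, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "mcc": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=50, tn=40, fp=10, fn=0)
for name, value in metrics.items():
    print(name, round(value, 4))
```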

2.8. Experimental Setup

The experiment was implemented in Python 3.6.5 with Anaconda. Keras 2.4.3, scikit-learn 0.23.1, TensorFlow 2.2.0, and mljar-supervised 0.8.9 were used for model training and prediction model design. The operating system was Windows 10 version 21H1 (build 19043.1151), 64-bit. The hardware was an Intel Core i3 6300T with 12 GB RAM and an SSD.

3. Results and Discussion

The experiment shows that the ensemble prediction model provides better results in terms of accuracy, precision, recall, F1, AUC, logloss, and phi coefficient (MCC) than the individual random forest and neural network models for predicting YIS. Table 2 reports all evaluation metrics for all the models.

Table 2 shows that, on all evaluation measures, the neural network performs better than the random forest, and the ensemble model performs better than the neural network. The ensemble model has the highest accuracy, 96.4%, compared with 93.5% for the random forest and 96.3% for the neural network. Although the accuracy gain of the ensemble over the neural network is very slight, the ensemble shows a clear improvement in logloss over the two individual models. Logloss is one of the most important measures for classification problems because it evaluates the predicted probabilities; the lower the logloss value, the better the prediction. The confusion matrices in Figure 3 make this result clearer. In the confusion matrix of the random forest model, the true positive value is 1, which means the model classifies all true positives correctly. However, the true negative score is 0.77, which means this model classifies true negative cases correctly 77% of the time, and the remaining 23% are reported as false negatives. With the neural network model, this false negative value decreases from 23% to 10%, and the ensemble model retains these settings for the final prediction. In the final prediction model, the false negative predictions are reduced at the cost of a false positive rate of 1%.

The performance of the final ensemble predictive model can be observed from its receiver operating characteristic (ROC) curve in Figure 4. This figure contains the ROC curves for the positive and negative classes, together with their micro-average. The area under the curve for class 0 and class 1 is 99%, which indicates strong predictive capability, as 100% is the maximum value for this measure.

Feature importance was also calculated for both the random forest and neural network models. A heatmap of feature importance is presented in Figure 5. The top-valued features are common to both models, and they mostly correspond to the YIS questions. Table 3 provides a side-by-side comparison of the importance values assigned by the random forest and neural network models. For the neural network, the top feature is Y7 (a student believes that they are less smart than other people); for the random forest, the top feature is Y2 (a student sometimes shies away from challenges because of nagging self-doubt), which is the second most important feature for the neural network model. Comparing the top 8 features of the two models, the importance scores of the neural network lie between 0.38 and 1.4, whereas those of the random forest lie between 0.043 and 0.061. In every case, the neural network assigns a relatively higher importance score to the variables than the random forest algorithm. The main goals of feature selection are to avoid overfitting, improve model selection, and gain deeper insight into the underlying process [36]. In line with these goals, better-fitted models give relatively higher feature importance scores than less well-fitted models. In this case, the scores in Table 3 are consistent with the neural network predicting IS more accurately than the random forest model.

The experimental results suggest that an ensemble model predicts outcomes better than a single machine learning model. This study aimed to generate a predictive model with better predictions for young imposter syndrome, and the ensemble model, with an accuracy of 96.4% and an AUC score of 98.8%, provides better predictions than the two single models.

4. Conclusion

In this research, we discussed the accuracy of different machine learning algorithms in predicting IS. We also demonstrated that, when predicting IS, combined methods such as ensemble learning outperform single machine learning methods such as random forest and neural network. Importance scores were used to illustrate how two different machine learning techniques assign different importance to the same variable. The data used in our work were collected from different medical colleges rather than a single institution in order to cover most of the medical students studying in Dhaka. The primary aim of this study was to identify a well-fitted model for predicting IS. To determine whether a student had IS, we used the YIS scale. We compared the results of each machine learning technique using multiple performance metrics. Compared with the individual learners, the ensemble learning model scored higher on all performance metrics.

IS is an emerging mental illness in Bangladesh and requires the immediate attention of researchers. Identifying the key factors responsible for IS is an important step toward reducing its impact, and machine learning methods can be employed to identify these factors. Determining how each factor contributes to IS among medical students could be a potential future direction.

Data Availability

The data and questionnaire can be found in Imposter Syndrome Data Set [37, 38]. Data are available under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0). All the Python code can be found in Data and Code Repository for Young Imposter Paper for Complexity [39].

Ethical Approval

Ethical approval was obtained from the Institutional Review Board (IRB) at North South University, Dhaka, Bangladesh. The study objectives were explained to each participant, and confidentiality and anonymity were assured.

Written consent was obtained from respondents before proceeding.

Conflicts of Interest

The authors declare no conflicts of interest.