Abstract
Excessive mental workload affects human health and may lead to accidents. This study is motivated by the need to assess mental workload in the process of human-robot interaction, in particular, when the robot performs a dangerous task. In this study, the use of heart rate variability (HRV) signals with different time scales in mental workload assessment was analyzed. A humanoid dual-arm robot that can perform dangerous work was used as a human-robot interaction object. Electrocardiogram (ECG) signals of six subjects were collected in two states: during the task and in a relaxed state. Multiple time-scale (1, 3, and 5 min) HRV signals were extracted from ECG signals. Then, we extracted the same linear and nonlinear features from the HRV signals at different time scales. The performance of machine learning algorithms using the different time-scale HRV signals obtained during the human-robot interaction was evaluated. The results show that for the per-subject case with a 3 min HRV signal length, the k-nearest neighbor classifier achieved the best mental workload classification performance. For the cross-subject case with a 5 min time-scale signal length, the gentle boost classifier achieved the best mental workload classification accuracy. This study provides a novel research idea for using HRV signals to measure mental workload during human-robot interaction.
1. Introduction
Nowadays, robots, instead of humans, work in unstructured environments, expanding the scope of human work. Humans interact with robots through visual, tactile, and other feedback [1–4]. The robot can be operated remotely to complete a dangerous task; this operation can be challenging for humans. At present, research in the field of robotics primarily focuses on how robots perform human control instructions, how they perceive environmental information, and how autonomous operation can be achieved [5, 6]. However, this research neglects the robot’s assessment of the human’s psychological activity and the emotions of humans interacting with the robot. Therefore, it is of great significance to accurately measure the mental workload of the operator during their interaction with the robot [7, 8].
Mental workload can be measured continuously and objectively using physiological signals. In particular, heart rate variability (HRV) signals have been widely studied because they are easy to collect. In [9], the relationships between mental workload and time-domain, frequency-domain, and Poincare plot features of 5 min signals were analyzed. In [10], 5 min HRV signal segments were used to detect the mental workload of a worker. Several linear features (time and frequency domains) were utilized. Then, the combination of principal component analysis and support vector machine (SVM) achieved 84.4% accuracy. In fact, the physiological system of the human body can be regarded as a nonlinear system. However, the nonlinear nature of HRV signals cannot be reflected by linear analysis methods [11, 12]. In [13], the mental workload of performing MATA-II tasks was measured using 5 min scale HRV signals. This study extracted the multiscale entropy features of the HRV. Using those, it obtained a higher accuracy for mental workload recognition than using traditional time- and frequency-domain features. In [14], 5 min length HRV signal segments were utilized to evaluate the mental workload of hospital staff. A variety of conventional and multiscale HRV features were extracted, and SVM was used as the classifier. The results showed that the multiscale features obtain a better mental workload recognition effect. In [15], the respiratory and HRV signals were extracted using 5 min scale electrocardiogram (ECG) signals. This study introduced a novel method that fused respiratory and HRV signals to assess subtle variations in sympathovagal balance using ECG recordings during the MATA-II mission. Standard short-term HRV analysis is usually performed on 5 min recordings [16], and shorter recordings of HRV signals are being researched, aiming at a faster detection of mental workload. 
In [17], human HRV signals were collected during human-robot interaction through different types of wearable devices. Using signals of 3 min length, the linear features of HRV signals collected by different wearable devices were extracted, compared, and analyzed under different mental workload levels. In [18], 3 min HRV signals were used, and linear and nonlinear features were utilized. Several machine learning algorithms have been utilized for assessing the mental workload of humans while operating a dual-arm robot. In [19], 2.5 min HRV signals were detected by a consumer smart watch. Subsequently, the mental workload of human interaction with multiple robots was studied. However, analysis of mental workload recognition with HRV signals at different time scales is not sufficiently researched. In [20], a nonparametric statistical test method was utilized to analyze the significant differences between rest and stress phases with time scales of 30 s and 1, 2, 3, and 5 min. However, HRV signals were obtained from healthy subjects during an examination and in a resting condition, not during human-robot interaction.
Humans use visual, haptic, and other feedback information to remotely perceive the environment information during human-robot interaction, and the robot is remotely operated to complete the task. The entire human-robot interaction process requires the joint perception and decision-making of human hands, eyes, ears, brain, and other limbs and organs, which may be very challenging for the operator. At present, there is a lack of mental workload measurement analysis during human-robot interactions using HRV with different time scales. Therefore, in this study, the differences among HRV signals of multiple time scales in measuring mental workload were analyzed; six traditional machine learning methods were used to evaluate the performances of HRV signals with different time scales. Traditional machine learning methods were used because they are more suitable for small sample sizes. Although deep learning methods have been widely studied, many training samples are required.
The contributions of this study can be summarized as follows:
(i) During human-robot interaction, HRV signals were collected based on a single physiological signal. In addition, linear and nonlinear features of different time-scale HRV signals were extracted, and statistical differences between the mental workloads in the two states were analyzed.
(ii) A variety of representative machine learning algorithms were applied. Differences in the performances of the machine learning algorithms with statistically different linear and nonlinear features extracted from HRV signals of different time scales in evaluating mental workload were analyzed.
(iii) Finally, the different performances of the algorithms with HRV signals of various time scales in evaluating mental workload are discussed.
The remainder of this paper is organized as follows. Section 2 introduces the data collection and preprocessing algorithms. The mental workload assessment results of algorithms using different time scales of HRV signals and a discussion of the results are presented in Section 3. The concluding remarks are presented in Section 4.
2. Data and Method
The research block diagram is shown in Figure 1. The ECG signals were obtained from volunteers while they operated the dual-arm robot and in the rest state. HRV signals were then extracted from the ECG signals. Using sliding windows of different time scales (1, 3, and 5 min), the HRV signals were divided to obtain a collection of sample data at each time scale. Then, linear and nonlinear features were extracted at each time scale. In addition, a support vector machine (SVM), a k-nearest neighbor (KNN) classifier, gentle boost (GB), linear discriminant analysis (LDA), naive Bayes (NB), and a decision tree (DT) were utilized to identify the task-performing and rest states. The performance differences of the classifiers in the mental workload evaluation with HRV signals at different time scales were compared and analyzed.

2.1. Data
In this subsection, the subjects and data acquisition processes are described. Then, a preprocessing algorithm is introduced to obtain the HRV signals from the collected ECG signals. In addition, multiple time-scale HRV signal segments are obtained using sliding windows of different time scales.
2.1.1. Participants
The ECG signals used for mental workload assessments were obtained from six male participants. A description of the six subjects is provided in Table 1. They were recruited from the Shenyang Institute of Automation, Chinese Academy of Sciences. Their average age was 25.16 (±2.93). They had normal or corrected vision and were all healthy, with no nervous system diseases. Before starting the experimental data collection, all participants were informed of the entire data collection process and precautions.
2.1.2. Data Acquisition
In this study, the operating object was a dual-arm robot shown in Figure 2. It can be seen that the robot consists of six wheels and two arms. Moreover, each wheel is independent, and each arm has seven degrees of freedom to access all positions in space. In addition, the top of the robot is equipped with a binocular camera for environmental observations. The robot controller is an exoskeleton device that can be worn by an operator (Figure 3). The exoskeleton controller also has two arms, and each arm has seven degrees of freedom, similar to the dual-arm robot. The ECG signal collection process is shown in Figure 4. A portable sensor was placed on the chest of the operator for the acquisition of ECG signals. The captured ECG signals were sent to a computer via Bluetooth for processing. ECG signals were collected in two states of the operator: during the operation of the dual-arm robot and during rest.



2.1.3. Signal Preprocessing
The HRV signal is the time series of intervals between successive heartbeats. To obtain it, the peak and trough values of the ECG signal must be detected, so the Q, R, and S waves of the ECG signal were located using a QRS complex detection method [21]. Because the resulting RR interval sequence may contain abnormal points, a classical median-filtering algorithm was applied to it [22]. The filtered RR interval sequence was regarded as the HRV signal. As shown in Figures 5–7, sliding windows at different time scales (1, 3, and 5 min) were used with an overlap of 30 s. The HRV signals were then divided into six groups: M-1, R-1, M-3, R-3, M-5, and R-5. The M groups represent the operator in the task-performing state, and the R groups represent the operator in the rest state.
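The filtering and windowing steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `median_filter` and `sliding_windows`, the kernel size, and the reading of "overlap of 30 s" as a window step equal to the window length minus 30 s are all assumptions introduced here.

```python
from statistics import median


def median_filter(rr, kernel=5):
    """Replace each RR interval by the median of a centered window.

    The kernel size is an assumed value; windows are truncated at the edges.
    """
    half = kernel // 2
    return [median(rr[max(0, i - half): i + half + 1]) for i in range(len(rr))]


def sliding_windows(rr, window_s, overlap_s=30.0):
    """Split an RR series (in seconds) into segments spanning window_s seconds.

    Consecutive windows share overlap_s seconds (assumed interpretation).
    """
    if not rr:
        return []
    # Beat times are the cumulative sum of the RR intervals.
    times, t = [], 0.0
    for x in rr:
        t += x
        times.append(t)
    step = window_s - overlap_s
    segments, start = [], 0.0
    while start + window_s <= times[-1]:
        end = start + window_s
        segments.append([x for x, bt in zip(rr, times) if start < bt <= end])
        start += step
    return segments


# Hypothetical 10 min recording at exactly 60 beats per minute.
rr_series = median_filter([1.0] * 600)
segments_1min = sliding_windows(rr_series, 60.0)
segments_3min = sliding_windows(rr_series, 180.0)
segments_5min = sliding_windows(rr_series, 300.0)
```

Longer windows yield fewer segments from the same recording, which is the trade-off between information per sample and number of samples that the discussion in Section 3.3 returns to.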



The proposed mental workload assessment preprocessing algorithm is described in Algorithm 1. Its input is the ECG data recorded from each participant. Steps 1 to 6 obtain the HRV signals from the ECG signals, and Steps 7 to 10 segment the HRV signals into segments of different time scales (1, 3, and 5 min).
2.2. Method
Linear and nonlinear analysis methods are the most commonly used HRV signal analysis methods. Therefore, in this subsection, the linear and nonlinear features used in this study are described. The collection of physiological signals during human-robot interaction requires considerable manpower and effort; thus, it is difficult to collect large-scale sample data. However, traditional machine learning algorithms do not require large-scale sample data for efficient feature recognition [23, 24]. Therefore, in this study, several different types of machine learning algorithms (SVM, KNN, GB, LDA, NB, and DT) were used to compare the effects of feature recognition.
2.2.1. Feature Extraction
First, the linear and nonlinear features used in this study are presented. In human-robot interaction, the fluctuation of the operator’s mental workload is related to the fluctuation of the human autonomic nervous system (ANS). The ANS consists of the sympathetic and parasympathetic nervous systems. The time- and frequency-domain features of HRV signals can reflect fluctuations in the sympathetic and the parasympathetic nervous systems. In addition, nonlinear features can reflect the nonlinear dynamic characteristics of the HRV signal [25, 26].
The linear features include time- and frequency-domain features. First, we introduce time features.
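SDNN denotes the standard deviation of all RR intervals:
$$\mathrm{SDNN} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(RR_i - \overline{RR}\right)^2},$$
where $RR_i$ is the $i$th RR interval, $\overline{RR}$ is the mean RR interval, and $N$ is the number of RR intervals in the segment.
RMSSD denotes the root mean square of the differences between adjacent RR intervals:
$$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\left(RR_{i+1} - RR_i\right)^2}.$$
pNN50 denotes the percentage of pairs of adjacent RR intervals that differ by more than 50 ms:
$$\mathrm{pNN50} = \frac{\mathrm{NN50}}{N-1} \times 100\%,$$
where NN50 is the number of adjacent pairs with $\left|RR_{i+1} - RR_i\right| > 50$ ms.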
In addition, the triangular index (the integral of the RR interval density distribution divided by the maximum of that distribution), as well as the mean and median of the HRV signals, were extracted as time-domain features.
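As a rough illustration of the time-domain features named above, the sketch below computes the mean, median, SDNN, RMSSD, and pNN50 of one RR segment. The function name `time_domain_features` is hypothetical, the RR intervals are assumed to be in milliseconds, and SDNN is computed as the sample standard deviation, which is an assumed normalization.

```python
import math
from statistics import mean, median, stdev


def time_domain_features(rr_ms):
    """Return common time-domain HRV features for RR intervals in milliseconds."""
    # Differences between adjacent RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean": mean(rr_ms),
        "median": median(rr_ms),
        # Sample standard deviation of all RR intervals.
        "sdnn": stdev(rr_ms),
        # Root mean square of the adjacent differences.
        "rmssd": math.sqrt(mean([d * d for d in diffs])),
        # Percentage of adjacent pairs differing by more than 50 ms.
        "pnn50": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


# Hypothetical segment alternating between 800 ms and 860 ms.
features = time_domain_features([800, 860, 800, 860, 800])
```

In this toy segment every adjacent difference is 60 ms, so RMSSD equals 60 ms and pNN50 equals 100%.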
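In this study, all frequency-domain features were obtained based on the power spectral density [27]. Furthermore, the basic frequency-domain features are defined as the sum of the power spectrum over different frequency ranges: the total power ($P_{\mathrm{total}}$), the very-low-frequency power (VLF), the low-frequency power (LF), and the high-frequency power (HF). The ratio of LF and HF is defined as
$$\mathrm{LF/HF} = \frac{\mathrm{LF}}{\mathrm{HF}}.$$
The percentages of VLF, LF, and HF are defined as
$$p\mathrm{VLF} = \frac{\mathrm{VLF}}{P_{\mathrm{total}}} \times 100\%, \quad p\mathrm{LF} = \frac{\mathrm{LF}}{P_{\mathrm{total}}} \times 100\%, \quad p\mathrm{HF} = \frac{\mathrm{HF}}{P_{\mathrm{total}}} \times 100\%.$$
The respective ratios of LF and HF to the total power excluding VLF are defined as
$$n\mathrm{LF} = \frac{\mathrm{LF}}{P_{\mathrm{total}} - \mathrm{VLF}}, \quad n\mathrm{HF} = \frac{\mathrm{HF}}{P_{\mathrm{total}} - \mathrm{VLF}}.$$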
Finally, two typical nonlinear analysis methods applied in this study are presented. These are sample entropy (SaEn) and detrended fluctuation analysis (DFA). On the one hand, SaEn is a method for investigating the dynamics of HRV signals. It has the advantages of strong antinoise and antijamming abilities. In addition, it can be used to analyze shorter HRV signals. In the case of large differences in the parameter value range, good consistency is still achieved. On the other hand, DFA is suitable for the analysis of nonstationary time series, and HRV signals have this characteristic. In addition, the DFA method can filter out the trend components in the HRV signal. Therefore, it can effectively avoid the disturbance of false correlations owing to noise and signal instability.
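To make the SaEn computation more concrete, a minimal sketch of sample entropy is given below. It follows the commonly used definition with embedding dimension m and tolerance r. The defaults m = 2 and r = 0.2 times the standard deviation are conventional choices assumed here, not values stated in this paper, and the function name `sample_entropy` is hypothetical.

```python
import math
from statistics import pstdev


def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy of a series x using the Chebyshev distance."""
    n = len(x)
    r = r_factor * pstdev(x)  # tolerance scaled by the signal's spread

    def count_matches(length):
        # Use the same n - m template start points for both lengths.
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)      # matching template pairs of length m
    a = count_matches(m + 1)  # matching template pairs of length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return -math.log(a / b)


# A perfectly periodic series is fully predictable, so its entropy is zero.
periodic_entropy = sample_entropy([1, 2] * 20)
```

Lower values indicate a more regular signal. The quadratic pairwise comparison is acceptable for the short segments considered here, but would be slow on long recordings.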
2.2.2. Mental Workload Recognition
In this subsection, the extracted feature vectors of HRV signals at different time scales are used to evaluate the mental workload. The different time-scale HRV features were analyzed using the t-test to obtain the statistical significance of the difference between the task-performing and relaxed states; a p value below the significance threshold was considered statistically significant [28]. Then, linear and nonlinear features with statistical differences were used to construct feature vectors as inputs to machine learning algorithms. Six different machine learning methods, SVM, KNN, LDA, GB, NB, and DT, were used in this study to exclude the effects of performance differences in machine learning algorithms.
After the initial HRV signal preprocessing, the mental workload can be assessed from the 1, 3, and 5 min time-scale HRV signals using Algorithm 2. The linear and nonlinear features of each subject are extracted in Steps 2 to 5, yielding one feature vector for the task-performing state and one for the relaxed state. Steps 6 to 11 define the process of per-subject mental workload assessment, in which the samples of each subject are divided into a training set and a testing set. Steps 12 to 18 define the process of cross-subject mental workload assessment. The extracted HRV features of all subjects in the task-performing and relaxed states are merged into two matrices in Steps 13 and 14, respectively. Then, in Steps 15 to 17, the merged matrices are prepared for model construction and mental workload assessment.
3. Experimental Results
In this section, the mental workload recognition performance of classifiers with HRV signals of different time scales is presented. The statistical differences of the linear and nonlinear features extracted in this study between the mental workload levels were analyzed via a t-test, and the statistically different features were used to compose the feature vectors for the per-subject and cross-subject mental workload assessments.
To evaluate the performance of mental workload classification with different time scales, accuracy was used, which is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP is true positive, FP is false positive, FN is false negative, and TN is true negative.
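For illustration, accuracy can be computed from the four counts named above as in the following sketch. The function name `accuracy` and the convention that label 1 marks the task-performing state are assumptions made for this example.

```python
def accuracy(y_true, y_pred, positive=1):
    """Fraction of correct predictions, computed from TP, TN, FP, and FN."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 1 = task-performing state, 0 = rest state.
example_accuracy = accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Here two task samples and one rest sample are classified correctly out of five, which gives an accuracy of 0.6.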
3.1. Per-Subject Mental Workload Evaluation
The results of per-subject mental workload evaluation at different time scales (1, 3, and 5 min) are presented in this subsection. The samples of each subject were randomly divided into two sets. One was used for training the machine learning model, and the other was used for testing the model. In addition, to increase the reliability of the results, the random division was repeated 500 times, and the average of the results was regarded as the final classification result.
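The repeated random division described above might be sketched as follows, using a small KNN classifier written from scratch. Several details are assumptions rather than facts from this paper: the 50/50 split ratio, the value k = 3, the choice to split each state separately so that both states appear in the training set, and the helper names `knn_predict` and `repeated_holdout_accuracy`.

```python
import random


def knn_predict(train_x, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def repeated_holdout_accuracy(features, labels, repeats=500,
                              train_fraction=0.5, k=3, seed=0):
    """Average test accuracy over repeated random train/test divisions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    scores = []
    for _ in range(repeats):
        train, test = [], []
        # Shuffle and split each state separately (assumed, see lead-in).
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * train_fraction)
            train += shuffled[:cut]
            test += shuffled[cut:]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        correct = sum(
            knn_predict(train_x, train_y, features[i], k) == labels[i]
            for i in test
        )
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Hypothetical one-dimensional features for two well-separated states.
toy_x = [[0.01 * i] for i in range(10)] + [[10 + 0.01 * i] for i in range(10)]
toy_y = [0] * 10 + [1] * 10
toy_accuracy = repeated_holdout_accuracy(toy_x, toy_y, repeats=20)
```

Averaging over many random divisions reduces the dependence of the reported accuracy on any single split, which matters when each subject contributes only a small number of segments.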
3.1.1. Results of Statistically Significant Features of 1, 3, and 5 min Length
Figure 8 shows the number of features at each significance level for each subject at the different time scales. It can be seen that Subject 1 has more of the most significantly different features at the 3 min time scale, followed by the 1 min and 5 min time scales. Subject 2 showed more significantly different features at the 3 min time scale and at the 1 min time scale; the sum of the most significantly different features and the features at the two lower significance levels was the largest. Subject 3 and Subject 4 have the most significantly different features at the 3 min time scale. Subject 5 and Subject 6 have the most significantly different features at the 5 min time scale.

3.1.2. Classification Accuracy of Different Classifiers with Different Time Scales
Figure 9 shows the classification accuracy of the mental workload using different classifiers with different time scales. Figure 9(a) shows the classification accuracy using SVM. It can be seen that the time scale with which the SVM achieved the highest average recognition accuracy was 3 min. In addition, the average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 95.30%, 97.54%, and 95.11%, respectively. Figure 9(b) shows the classification accuracy using KNN. It can be seen that the time scale with which the KNN obtained the highest average recognition accuracy was 3 min. In addition, the average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 96.09%, 98.77%, and 96.21%, respectively. Figure 9(c) shows the classification accuracy using GB; it achieved the highest average recognition accuracy with the 3 min time scale. In addition, the average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 93.17%, 95.90%, and 90.61%, respectively.

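Figure 9: Per-subject classification accuracy of the mental workload using different classifiers with different time scales. (a) SVM. (b) KNN. (c) GB. (d) LDA. (e) NB. (f) DT.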
Figure 9(d) shows the classification accuracy using LDA; it did not achieve good classification performance with any of the three types of time scales. The average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 52.02%, 52.27%, and 52.28%, respectively. Figure 9(e) shows the classification accuracy using NB. It can be seen that NB achieved the highest average recognition accuracy with the time scale of 3 min. The average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 80.52%, 84.99%, and 80.07%, respectively. Finally, Figure 9(f) shows the classification accuracy using DT. The average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 80.52%, 84.99%, and 80.07%, respectively.
3.2. Cross-Subject Mental Workload Evaluation
The results of cross-subject mental workload evaluation at different time scales (1, 3, and 5 min) are presented in this subsection. The sample data of five of the six subjects were selected to train the machine learning model. At the same time, the sample data of the remaining subject were selected to test the machine learning model.
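The cross-subject protocol described above is a leave-one-subject-out scheme, which can be sketched as below. The function name `leave_one_subject_out` and the dictionary layout mapping a subject identifier to that subject's samples are assumptions made for illustration.

```python
def leave_one_subject_out(samples_by_subject):
    """Yield (held_out_id, train_samples, test_samples) for every subject.

    Each subject is used once as the test set while the samples of all
    remaining subjects are pooled into the training set.
    """
    for held_out in samples_by_subject:
        train = [
            sample
            for subject_id, samples in samples_by_subject.items()
            if subject_id != held_out
            for sample in samples
        ]
        test = list(samples_by_subject[held_out])
        yield held_out, train, test


# Hypothetical data: six subjects with three samples each.
toy_data = {f"S{i}": [(i, j) for j in range(3)] for i in range(1, 7)}
folds = list(leave_one_subject_out(toy_data))
```

Because no sample of the test subject is seen during training, this protocol measures how well a model generalizes to a new operator, which is a harder setting than the per-subject evaluation.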
3.2.1. Statistically Significant Analysis of Features
Table 2 shows the statistical differences between the two groups at the time scales of 1, 3, and 5 min. From Table 2, we can see that there were 17 features in the most significantly different category and 2 features in the significantly different category between groups M-1 and R-1. There were 18 features in the most significantly different category and 2 features in the significantly different category between groups M-3 and R-3. There were 17 features in the most significantly different category between groups M-5 and R-5.
3.2.2. Classification Accuracy of Different Classifiers with Different Time Scales
Figure 10 shows the classification accuracy of the mental workload using different classifiers at different time scales. Figure 10(a) shows the classification accuracy using SVM. It can be seen that when Subject 3 was used as the test subject, the classifier achieved the worst classification accuracy. The average classification accuracies of the classifier across all subjects with the 1, 3, and 5 min time scales were 77.59%, 75.06%, and 78.51%, respectively. Figure 10(b) shows the classification accuracy using KNN. Again, when Subject 3 was the test subject, the worst classification accuracy was achieved. The average classification accuracies of the classifier across all subjects with the 1, 3, and 5 min time scales were 69.24%, 70.40%, and 73.53%, respectively. Figure 10(c) shows the classification accuracy using GB. It can be seen that GB showed the worst classification accuracy with the time scale of 1 min and the best accuracy with the time scale of 5 min, both when Subject 2 was the test subject. The average classification accuracies of Subject 1 to Subject 6 with the 1, 3, and 5 min time scales were 63.53%, 71.55%, and 80.56%, respectively. Figure 10(d) shows the classification accuracy using LDA. It can be seen that the classifier showed the worst classification accuracy with the time scale of 3 min and the best accuracy with the time scale of 5 min, both when the data of Subject 3 were used as the test set. The average classification accuracies with the 1, 3, and 5 min time scales were 44.44%, 35.92%, and 53.92%, respectively. Figure 10(e) shows the classification accuracy using NB. It achieved the worst classification accuracy with Subject 3 as the test subject and the time scale of 5 min. It obtained the best accuracy with Subject 2 and the time scale of 5 min. The average classification accuracies with the 1, 3, and 5 min time scales were 64.53%, 66.48%, and 66.50%, respectively. Figure 10(f) shows the classification accuracy using DT. 
It can be seen that DT showed the worst classification accuracy with Subject 1 and the time scale of 5 min and the best accuracy with Subject 4 and the time scale of 5 min. The average classification accuracies with the 1, 3, and 5 min time scales were 65.03%, 67.91%, and 59.48%, respectively.

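Figure 10: Cross-subject classification accuracy of the mental workload using different classifiers at different time scales. (a) SVM. (b) KNN. (c) GB. (d) LDA. (e) NB. (f) DT.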
3.3. Discussion
Studies have shown that HRV can be used to measure and evaluate the mental workload of operators during human-robot interaction. HRV signals of different time scales have been widely studied for mental workload measurement. However, these studies were not based on a dataset of human-robot interaction. In addition, a mental workload measurement analysis of human-robot interaction that uses HRV signals of different time scales on the same dataset has not been reported, and there is no relevant public dataset. Hence, in this study, ECG signals were collected from six volunteers during task performance and rest. The fluctuation in the mental workload is closely related to the fluctuation state of the ANS, and HRV signals can reflect the fluctuating state of the ANS. HRV signals of different lengths carry different amounts of nervous activity information about the mental workload. This study presented a detailed comparative analysis.
First, the HRV signals at different time scales (1, 3, and 5 min) of the same individual were analyzed. Using a t-test, the statistical differences between the task-performing and rest states were analyzed. The results are shown in Figure 8, which summarizes the t-test results for the 1, 3, and 5 min time-scale HRV signals and the number of statistically significant features per subject at each time scale. It can be seen from Figure 8 that Subject 1 to Subject 4 show the most significantly different features at the 3 min time scale, whereas Subject 5 and Subject 6 have slightly fewer at the 3 min time scale than at the 5 min time scale. Moreover, there were a total of 75, 87, and 78 features with the most significant differences for the 1 min, 3 min, and 5 min time-scale HRV signals of the six subjects, respectively. This shows that at the time scale of 3 min, there are more significantly different features than at the other time scales. The classification analysis of mental workload was performed using the features with statistical differences and six types of classifiers. The results are shown in Figure 9. It can be seen that the average accuracy across the six subjects with the 3 min time scale was the highest, i.e., 98.77% with the KNN classifier. The average accuracies across the six subjects at 1 min and 5 min were 96.09% (KNN) and 96.21% (KNN), respectively. This difference may be because the 1 min time-scale signal contains a limited amount of information. Although the 5 min time-scale signal contains a sufficient amount of information, the number of samples split from the collected signal is relatively small, which affects the training accuracy of the classification model. The signal length of 3 min contains sufficient time- and frequency-domain information, and more samples can be divided from the collected signals. Therefore, at a time scale of 3 min, the HRV signal analysis of the same individual obtained a high average classification accuracy. 
In addition, using 1, 3, and 5 min signals achieved high overall recognition accuracy and further verified that HRV signals can reflect the operator’s mental workload changes during human-robot interaction.
HRV signals between different individuals were then analyzed. Using a t-test, the statistical differences between the task-performing and rest states were analyzed. The results are presented in Table 2. Table 2 shows that 17, 18, and 17 features were the most significantly different for the 1 min, 3 min, and 5 min time-scale HRV signals of the six subjects, respectively. The classification analysis of mental workload was performed using the features with statistical differences and six types of classifiers. The sample data of five of the six individuals were used as the training set, and the sample data of the remaining individual were left as the test set. The results are shown in Figure 10. It can be seen that the average accuracy of cross-subject identification is highest at 80.56% (GB) with the 5 min time scale, and the best accuracies with the 1 and 3 min time scales were 77.59% (SVM) and 75.06% (SVM), respectively. We found that the accuracy of cross-subject mental workload recognition was much lower than that of per-subject mental workload recognition. This is because there are strong individual differences in HRV signals. Although HRV signals can reflect the fluctuating state of the ANS, there are differences in the psychological and physical qualities of different individuals. Therefore, to study cross-subject mental workload recognition, we need to further investigate how the HRV signal reflects the common characteristics of different individuals and to establish a universal mental workload recognition model.
4. Conclusion
In this paper, the differences in the recognition of the mental workload during human-robot interaction using multiple time-scale HRV signals were analyzed. First, ECG signals were obtained from six subjects while they were performing a task and while staying relaxed. Then, HRV signals were extracted based on the ECG signals. Furthermore, the HRV signals were divided into different groups using sliding windows of 1, 3, and 5 min. Then, several linear and nonlinear features of HRV signals were extracted for these different groups. Finally, six different machine learning algorithms were used to assess the mental workload performance. For the per-subject evaluation of mental workload with different time scales, the HRV signals of each individual were used for training, and then this individual’s mental workload was assessed by the trained model. In the case of a 3 min signal length, the KNN method obtained an average accuracy of 98.77%. For the cross-subject mental workload evaluation, the HRV signals of five of six individuals were used to train the model. Then, the trained model identified the mental workload of the remaining individual. The highest average classification accuracy was obtained by the GB algorithm using the 5 min time scale, and its average accuracy was 80.56%. This study explores the problems of the operator’s mental workload recognition during human-robot interaction using different time-scale HRV signals. However, the sample size in this study was limited; in the future, more data will be collected for analysis to provide generalizable experimental results. In addition, online identification of human-robot interaction mental workload will be studied. Furthermore, different machine learning algorithms will be combined to choose the best recognition result of mental workload by voting.
Data Availability
Because human physiological signals involve personal privacy, the experimental data will not be made public for the time being.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (Grant number U20A20201), the Liaoning Province Doctoral Scientific Research Foundation (Grant number 2020-BS-025), the Liaoning Revitalization Talents Program (Grant number XLYC1807018), and the National Key Research and Development Program of China (Grant number 2016YFE0206200).