Abstract

Traditional humanoid-robot dialogue systems are generally template-based: they respond well within the predefined dialogue domain but cannot generate good responses to out-of-domain content. Their rules rely on manual design, and they lack emotion detection of the interacting user. To address these shortcomings, this study designs an emotion analysis model based on a deep neural network to detect the emotion of interacting users and builds an open-domain dialogue system for a humanoid robot. For language processing in affective state analysis, language encoding, feature analysis, and Word2vec are investigated. An affective state analysis model is then constructed to train the emotional state recognition of the humanoid robot, and the training results are summarized.

1. Introduction

With the progress of science and technology, robots have gradually entered every aspect of people's lives. From industrial and military applications to home service, education, and laboratories, robots play a significant role [1]. According to the three laws of robotics [2], the ultimate goal of robot development is to make robots imitate intelligent human behavior so as to help humans complete tasks and achieve goals [3]. When humans and robots cooperate on a task, humans inevitably need to communicate well with the robot in order to work efficiently [4, 5]. In traditional human-computer interaction, humans convey information to the computer mainly through the keyboard, mouse, and other manual input devices, and the computer feeds information back through the display and other peripherals. This form of interaction is inconvenient and requires many peripherals, and in ordinary life it cannot be assumed that everyone is able to use a computer [6]. In contrast, human-robot interaction is carried out through familiar natural channels such as speech, vision, touch, hearing, and proximity [7]. Because this kind of interaction is familiar to most people, it is more concise and efficient, and it helps people and robots cooperate more effectively to complete tasks [8]. The emotion analysis model of a humanoid robot analyzes and identifies the emotional information of the interacting user during interaction and is an important part of the dialogue system [9]. During interaction, the language of the interacting user contains rich emotional information, and its text content is a high-level expression of human thinking.

2. Literature Review

Traditional text sentiment analysis is generally implemented in one of two ways: with a sentiment dictionary or with a machine learning algorithm. Text sentiment analysis is often based on a sentiment dictionary. At present, relatively well-known sentiment dictionaries include HowNet, the Chinese polarity lexicon NTUSD from National Taiwan University, and the English lexical database WordNet from Princeton University [10]. The sentiment analysis process based on a sentiment dictionary is shown in Figure 1.

Machine learning approaches treat sentiment analysis as a text classification task. Commonly used methods include Naive Bayes, SVM, and CRF. Li [11] compared Naive Bayes, the maximum entropy model, and the support vector machine on the sentiment classification of film reviews and found that SVM achieved the best classification results. Huang [12] (2021) used a multistrategy method with a hierarchical SVM structure to classify the sentiment polarity of Chinese microblogs; experiments show that the SVM-based multistrategy method performs best and that introducing topic-related features improves accuracy to some extent. Lu [13] (2018) experimented with SVM, Bayes, and other classification algorithms together with information gain and other feature selection algorithms on Chinese microblog sentiment analysis, taking TF-IDF as the feature weight. The experimental results show that the combination of TF-IDF as the feature weight, SVM as the classifier, and information gain for feature selection achieves the best classification effect.

With the development of deep learning, deep learning models have also been applied to text classification. Law [14] (2013) proposed the reinforcement learning framework DISA based on CNN and LSTM, taking Chinese audio information and pinyin as sentiment analysis features, and achieved good results. Cela [15] (2013) applied Dempster-Shafer evidence theory to fuse emotional information from vision, sound, and other channels and analyzed the emotional-state transfer rules caused by the simultaneous action of the two factors. The emotion model was then applied to an emotional robot system so that the robot can generate emotions in response to external stimuli and make the corresponding expressions; the experimental results show that the affective model is effective. Tidoni [16] (2014) combined the ideas of recurrent and convolutional neural networks to overcome the limitations of CNN in capturing long-distance context and proposed RCNN for text classification. BaTula [17] (2017) proposed a game-based cognitive and emotional interaction model for robots based on the PAD (pleasure-arousal-dominance) emotion space, addressing the lack of emotion and low user participation in existing open-domain human-computer interaction systems. Experimental results show that, compared with other cognitive interaction models, the proposed model reduces the robot's dependence on external emotional stimuli and effectively guides users to participate in human-computer interaction.

Information is transmitted in various forms. Due to technical limitations, a robot cannot obtain complete information about the interacting user, so intention prediction becomes important and essential [18]. Human interaction usually requires continuous prediction of intention; for example, in a conversation, people constantly try to anticipate the direction of the conversation or the reactions of others [19]. Therefore, to make human-robot interaction more like human-human interaction, intention prediction is essential. Depending on the type of human-computer interaction, intention prediction takes different forms and is handled differently [20]. In cooperative human-computer interaction, intention prediction is needed mainly because the interactive information is incomplete. Humans and robots must cooperate to complete tasks, and since the available information is incomplete, predicting human intentions is necessary to complete tasks better and more efficiently.

Because of linguistic phenomena such as polysemy and irony in Chinese, methods based on a sentiment dictionary cannot achieve high accuracy and do not transfer well across domains. With the exponential growth of information, building a data-driven machine learning model for sentiment analysis of informal text has good application prospects.

3. Emotion Recognition Process and Data Acquisition Preprocessing

3.1. Emotion Recognition Process

During interaction, obtaining text information requires recording the voice of the interacting user through a microphone, converting it into audio, and passing it to the speech recognition module, which produces the text. The text is preprocessed and fed into the emotion analysis model, which outputs the emotional state of the interacting user. For the construction of the emotion analysis model, this paper adopts the idea of machine learning to build a data-driven model [21]. The chosen algorithm is trained offline on the data set, the trained model is saved, and the saved model is then loaded for prediction. The machine-learning-based text sentiment analysis process is shown in Figure 2.

3.2. Data Acquisition and Preprocessing

This step includes data acquisition and data preprocessing, described below.

3.2.1. Data Acquisition

In building the emotion analysis model of the humanoid robot, we used the "Microblog Cross-Language Emotion Recognition Dataset" published by the International Conference on Natural Language Processing and Chinese Computing (NLPCC) in 2019 and 2020. The corpus is divided into positive and negative categories: 12,153 positive items and 12,178 negative items. The content comes from microblogs and the sentences are colloquial, which makes the corpus suitable for training the emotion analysis model.

3.2.2. Data Preprocessing

In data preprocessing, to balance the positive and negative categories in the corpus, 25 items were deleted from the negative-label corpus through downsampling, unifying both classes at 12,153 items. Since most of the corpus is taken from Weibo, it contains many emoticons and repeated punctuation marks; moreover, in practical application, sentences produced by speech recognition will not contain multiple repeated punctuation marks [22]. Based on these points, redundant punctuation marks and emoticons were deleted in preprocessing and were not treated as features. For word segmentation, the Jieba toolkit was used in this study. A comparison table of punctuation and emoticons in text processing is shown in Table 1.
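The cleanup and downsampling steps above can be sketched with the standard library alone. This is a minimal illustration, not the paper's actual pipeline: the `clean_text` and `downsample` helpers are hypothetical, it assumes Weibo emoticons appear as bracketed tags such as "[笑]", and Jieba segmentation would be applied afterwards.

```python
import random
import re

def clean_text(text: str) -> str:
    """Remove bracketed Weibo emoticon tags and collapse repeated punctuation."""
    text = re.sub(r"\[[^\[\]]{1,8}\]", "", text)          # emoticon tags such as [笑]
    text = re.sub(r"([!?,.;:!?,。;:~])\1+", r"\1", text)  # "!!!" -> "!"
    return text.strip()

def downsample(larger: list, target_size: int, seed: int = 42) -> list:
    """Randomly keep target_size items so both classes have equal counts."""
    return random.Random(seed).sample(larger, target_size)

print(clean_text("这本书太好了!!![笑][笑]"))  # -> 这本书太好了!
```

After cleaning, each sentence would be segmented (e.g., with `jieba.lcut`) before feature extraction.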

In the subsequent construction of emotion analysis models, we experimented with traditional machine learning models such as Bernoulli Bayes, Polynomial Bayes, and support vector machine (SVM) and neural network models such as Bi-LSTM, Bi-LSTM combined with attention mechanism, and Text-CNN.

4. Affective State Analysis Language Processing

4.1. Language Encoding

In natural language processing (NLP), sentences are generally segmented, with characters, words, and phrases as the minimum units of Chinese. One-hot encoding is the simplest representation of such features: in one-hot encoding, each distinct feature has its own state bit.

For example, consider the sentence "This is a good book with good content!". After word segmentation, the distinct features include ["This," "book," "content," "good," "!"], which can be one-hot encoded in order of first occurrence. The one-hot codes of some of the features can be expressed as follows:

"This": [1, 0, 0, 0, 0, 0, 0, 0, 0]
"book": [0, 1, 0, 0, 0, 0, 0, 0, 0]
"content": [0, 0, 1, 0, 0, 0, 0, 0, 0]
"good": [0, 0, 0, 1, 0, 0, 0, 0, 0]
"!": [0, 0, 0, 0, 0, 0, 0, 0, 1]

In one-hot encoding, each feature occupies a single dimension, and the number of dimensions equals the number of distinct features. One-hot encoding expands the features to a certain extent, but when the dictionary is large, this representation takes up a lot of space and the computational dimensionality is high.

The bag-of-words model is a vector space model in which the count of each word is placed at the position given by its vocabulary index, so that the whole sentence is represented. For the example sentence:

"This is a good book with good content!": [1, 2, 2, 1, 1, 1, 1, 1, 1]
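The two encodings above can be reproduced in a few lines of plain Python. This is an illustrative sketch: the token list is hypothetical, and its segmentation (hence its dimensionality) differs slightly from the worked example in the text.

```python
# A segmented sentence (illustrative token list, not the paper's exact segmentation).
tokens = ["This", "is", "a", "good", "book", "with", "good", "content", "!"]

# Build the vocabulary in order of first occurrence.
vocab = []
for tok in tokens:
    if tok not in vocab:
        vocab.append(tok)

# One-hot: one dimension per distinct feature.
def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Bag-of-words: the count of each word at its vocabulary position.
bow = [tokens.count(w) for w in vocab]

print(one_hot("This"))  # [1, 0, 0, 0, 0, 0, 0, 0]
print(bow)              # [1, 1, 1, 2, 1, 1, 1, 1]  -- "good" appears twice
```

Note that the one-hot dimensionality grows with the vocabulary, which is the space problem the text describes.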

4.2. Feature Analysis

In the text vector space model, commonly used feature selection methods include the chi-square test, information gain, mutual information, and TF-IDF. TF-IDF combines word distribution information across documents with the bag-of-words model, highlighting keywords by calculating absolute term frequency (TF) and inverse document frequency (IDF).

Absolute term frequency (TF) denotes the number of occurrences of feature item $t_i$ in a training text. Important words in a text are often repeated, and absolute term frequency easily highlights them.

The inverse document frequency (IDF) is calculated as in the following formula:

$$\mathrm{IDF}(t_i) = \log \frac{N}{n_i}$$

Here, $N$ represents the total number of documents in the training set, and $n_i$ represents the number of training documents in which feature item $t_i$ appears. IDF highlights words that appear in few documents but have strong discriminative ability. In the actual calculation, IDF is smoothed (for example, $\log\frac{N+1}{n_i+1} + 1$) to avoid problems with words that are missing from the corpus.
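The TF-IDF weight defined above can be computed directly; the sketch below uses the smoothed IDF form mentioned in the text, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

# Three toy "documents" (already segmented into words).
docs = [["good", "book", "good"],
        ["bad", "book"],
        ["good", "content"]]
N = len(docs)

def idf(term):
    n_i = sum(term in d for d in docs)        # documents containing the term
    return math.log((N + 1) / (n_i + 1)) + 1  # smoothed IDF

def tf_idf(doc):
    counts = Counter(doc)                      # absolute term frequency
    return {t: counts[t] * idf(t) for t in counts}

weights = tf_idf(docs[0])
# "good" occurs twice in this document, so TF lifts its weight above "book".
print(weights)
```

Libraries such as scikit-learn implement the same idea (with additional normalization) in `TfidfVectorizer`.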

4.3. Word2vec

Word2vec was proposed by Google in 2013. It represents words through dense feature vectors, also known as distributed representations. Word2vec has two training models, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model, and two accelerated variants, one based on Hierarchical Softmax and the other on Negative Sampling, both of which simplify computation and speed up training. Under Hierarchical Softmax, the projection-layer output of the CBOW model is the mean of the input word vectors, while under the Skip-Gram model it is the same as the input. To avoid computing probabilities over all words, Hierarchical Softmax replaces the Softmax mapping from the projection layer to the output layer with a Huffman tree, while Negative Sampling updates only a small sample of negative words per step. Word2vec is widely used in various natural language models, and word vectors are also a pretraining method that can bring a neural network to a better starting point and make it easier to optimize [23].

Compared with one-hot encoding, dense features are easy to compute, avoid dimension explosion, and have strong generalization ability. Dense feature representations can also provide similarity information between features. This distributed word-vector representation is widely used in natural language processing tasks such as Chinese word segmentation, sentiment analysis, and reading comprehension.
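To make the CBOW idea concrete, here is a minimal full-softmax CBOW trained on a toy corpus with NumPy. It is an illustration of the projection-layer averaging described above under simplifying assumptions, not Google's implementation: it omits Hierarchical Softmax and Negative Sampling, and the corpus and dimensions are invented.

```python
import numpy as np

# Toy corpus (already segmented); vocabulary and sizes are illustrative.
corpus = [["this", "book", "has", "good", "content"],
          ["a", "good", "book"],
          ["good", "content", "in", "this", "book"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                        # vocabulary size, embedding size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # projection-layer embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output-layer weights

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

lr = 0.05
for _ in range(100):
    for sent in corpus:
        for pos, centre in enumerate(sent):
            context = [idx[w] for i, w in enumerate(sent) if i != pos]
            h = W_in[context].mean(axis=0)   # CBOW: mean of context vectors
            p = softmax(h @ W_out)           # full softmax over the vocabulary
            grad = p.copy()
            grad[idx[centre]] -= 1.0         # cross-entropy gradient at the output
            dh = W_out @ grad                # backprop into the projection layer
            W_out -= lr * np.outer(h, grad)
            for c in context:
                W_in[c] -= lr * dh / len(context)

print(W_in.shape)                            # one dense vector per vocabulary word
```

In practice, a library such as gensim would be used instead of hand-rolled training; the point here is only the shape of the computation.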

5. Construction of the Affective State Analysis Model

5.1. Introduction to the Model

Support vector machine (SVM) is a binary classification algorithm that can also be used for text classification. Its basic idea is to find the hyperplane with the largest margin in the feature space. Its advantages are that it is effective in high-dimensional spaces, still works well when the number of dimensions exceeds the number of samples, and allows different kernel functions to be specified. However, when the number of features is much larger than the number of samples, SVM performs poorly.

For the training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $y_i \in \{-1, +1\}$ stands for the negative and positive labels, $x_i$ is a sample (sentence), and $N$ is the number of samples. The optimization problem solved by the support vector machine is shown in the following formula:

$$\min_{w,\, b,\, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i \quad \text{s.t.} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \zeta_i, \ \zeta_i \geq 0 \qquad (3)$$

In formula (3), $w$ is the normal vector of the separating hyperplane, $\zeta_i$ is a slack variable, and $\phi$ is the mapping function. The dual form of the problem can be expressed as

$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \ 0 \leq \alpha_i \leq C \qquad (4)$$

In formula (4), $e$ stands for the all-ones vector, $C$ is the upper bound of the Lagrange multipliers, and $Q$ is a positive semidefinite matrix of shape $N \times N$ with entries

$$Q_{ij} = y_i y_j K(x_i, x_j) \qquad (5)$$

In formulas (3)-(5), $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel, whose values form the Gram matrix. The decision function of the support vector machine is expressed as $\operatorname{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b \right)$.
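The margin-maximization objective in formula (3) can be minimized directly by subgradient descent on the hinge loss. The sketch below is a simplified linear-kernel illustration on hypothetical 2-D toy data, not the paper's dual-form solver.

```python
import numpy as np

# Linearly separable toy data: two classes in 2-D.
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.5],
              [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = np.zeros(2), 0.0
C, lr = 1.0, 0.01
for _ in range(500):
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) < 1:          # margin violated: hinge loss is active
            w -= lr * (w - C * yi * xi)    # subgradient of (1/2)||w||^2 + C * hinge
            b += lr * C * yi
        else:
            w -= lr * w                    # only the regularizer contributes

print(np.sign(X @ w + b))                  # should recover y
```

For text classification with TF-IDF features, a library solver such as scikit-learn's `LinearSVC` would be used instead; the loop above only shows what the optimization is doing.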

5.2. Training Process

To make the support vector machine output category probabilities, Platt scaling was used in this paper. This is a parameterized method that fits a logistic regression model to the decision values: the sigmoid function maps the values into the interval (0, 1), so that the output of the original model becomes a probability, as shown in the following formula:

$$P(y = 1 \mid x) = \frac{1}{1 + \exp\left( A f(x) + B \right)}$$

Here, $f$ is the decision function of the support vector machine, which outputs a decision value for any input $x$, and $A$ and $B$ are trainable parameters.

The objective function of training is the cross-entropy loss, as shown in the following formula:

$$\min_{A,\, B} \ -\sum_{i=1}^{N} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right], \quad t_i = \frac{y_i + 1}{2}, \quad p_i = P(y = 1 \mid x_i)$$

With Platt scaling, the support vector machine can output category probabilities. The intuition is that the closer a point lies to the separating hyperplane, the less certain its class membership, and the farther away it lies, the more certain.
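The Platt scaling fit can be sketched by minimizing the cross-entropy above with gradient descent. The decision values `f` below are hypothetical stand-ins for SVM outputs, and the plain gradient loop replaces the more careful optimizer a library would use.

```python
import numpy as np

# Decision values f(x) from a margin classifier and their true labels in {-1, +1}.
# The two middle points overlap across the boundary, so the optimum is finite.
f = np.array([-2.0, -1.0, 0.4, -0.3, 1.2, 2.5])
y = np.array([-1, -1, -1, 1, 1, 1])
t = (y + 1) / 2.0                             # targets t_i in {0, 1}

A, B, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    z = np.clip(A * f + B, -30, 30)           # guard against overflow in exp
    p = 1.0 / (1.0 + np.exp(z))               # Platt's P(y = 1 | x)
    # Gradients of the cross-entropy loss with respect to A and B.
    dA = np.sum((t - p) * f)
    dB = np.sum(t - p)
    A -= lr * dA
    B -= lr * dB

probs = 1.0 / (1.0 + np.exp(np.clip(A * f + B, -30, 30)))
print(probs.round(3))   # far from the boundary -> probability near 0 or 1
```

In scikit-learn, the same calibration is available via `SVC(probability=True)` or `CalibratedClassifierCV(method="sigmoid")`.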

5.3. Training Results

Support vector machines generally adopt a linear kernel for text classification. With a linear kernel and bag-of-words input, the SVM achieves an F1 score of 0.763, an accuracy of 76.81%, and an AUC of 0.821. With a linear kernel and TF-IDF input, it achieves an F1 score of 0.795, an accuracy of 78.94%, and an AUC of 0.863.

Compared with the bag-of-words model, SVM with TF-IDF features achieves better results on the current data set.
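For reference, the reported accuracy and F1 score are computed from the confusion-matrix counts as follows; the predicted and true labels here are a hypothetical toy example, not the paper's test set.

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy and F1 for a binary task with labels in {-1, +1}."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, 1, 1]
acc, f1 = evaluate(y_true, y_pred)
print(acc, round(f1, 4))   # 0.625 0.6667
```

AUC additionally requires the model's probability or decision scores rather than hard labels, which is one motivation for the Platt scaling step above.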

6. Conclusion

The main steps of building the emotion analysis model include the emotion recognition process, data acquisition and preprocessing, affective state analysis language processing, and the construction and integration of the affective state analysis model [24]. The experiments analyze and explore the effect of TF-IDF features versus the bag-of-words model on training results. The results show that, among single models, the TF-IDF feature combined with the support vector machine achieves the best result. In model integration, the stacking strategy and the soft voting strategy were compared, and stacking with a support vector machine meta-learner performed best; this is the main innovation of this paper. The main shortcoming of this paper is that the methods used in each component are introduced only superficially. This paper studies the design principles of a sentiment analysis model based on the support vector machine, explores the influence of the attention mechanism on neural network models based on the experimental data, and compares traditional machine learning models with different inputs. In the single-model experiments, the support vector machine combined with TF-IDF achieved the best classification effect, with an F1 score of 0.795, an accuracy of 78.94%, and an AUC of 0.863.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest.