Abstract
Emotion recognition is very important for human-computer intelligent interaction. It is generally performed on facial or audio information using artificial neural networks, fuzzy sets, support vector machines, hidden Markov models, and so forth. Although some progress has been made in emotion recognition, several issues remain unsolved. For example, which features are the most important for emotion recognition is still an open problem. This subject has seldom been studied in computer science, although related research has been conducted in cognitive psychology. In this paper, feature selection for facial emotion recognition is studied based on rough set theory. A self-learning attribute reduction algorithm is proposed based on rough set theory and domain-oriented data-driven data mining theory. Experimental results show that the proposed method identifies important and useful features for emotion recognition while achieving a high recognition rate. The features concerning the mouth are found to be the most important geometrical features for facial emotion recognition.
1. Introduction
In recent years, there has been a growing interest in improving all aspects of the interaction between humans and computers. It is argued that to truly achieve effective human-computer intelligent interaction (HCII), computers must be able to interact naturally with users, in a way similar to human-human interaction. HCII is becoming more and more important in applications such as smart homes, smart offices, and virtual reality, and it will be popular in all aspects of daily life in the future. To achieve HCII, it is essential for computers to recognize human emotion and to give suitable feedback. Consequently, emotion recognition attracts significant attention in both industry and academia. There have been several research works in this field in recent years, as well as some successful products such as AIBO, the popular robot dog produced by Sony. Usually, emotion recognition is studied using artificial neural networks (ANN), fuzzy sets, support vector machines (SVM), and hidden Markov models (HMM), based on facial or audio features, and the reported recognition rates range from 64% to 98% [1–3]. Although some progress has been made in emotion recognition, several issues remain unsolved. For example, which features are the most important for emotion recognition is still an open problem. This subject has seldom been studied in computer science, although related research has been conducted in cognitive psychology [4–6].
There have been several research works related to the important features for emotion in cognitive psychology. Based on the results of psychological experiments, Sui and Ren argue that the information conveyed by different facial parts has diverse effects on the facial expression recognition, and the eyes play the most important role [4]. Wang and Fu argue that the low spatial frequency information is important for emotion [5]. White argues that edge-based facial information is used for expression recognition [6].
In our previous works on emotion recognition [7–10], attribute reduction algorithms based on classical rough set theory are used for facial emotional feature selection, and SVMs are taken as the classifiers. Some useful features concerning the eyes and mouth are found, and high correct recognition rates are achieved based on these features. However, classical rough set theory is based on an equivalence relation. Since the measured facial features are continuous values, a discretization step is required before an equivalence relation can be applied. Information might be lost or changed in the discretization process, thereby affecting the result. Several research works have addressed this problem. Shang et al. proposed a new attribute reduction algorithm, which integrates discretization and reduction using information entropy-based uncertainty measures and evolutionary computation [11]. Jensen and Shen proposed a fuzzy-rough attribute reduction algorithm and an attribute reduction algorithm based on a tolerance relation [12]. Although these methods can avoid the discretization process, their parameters must be set according to the prior experience of domain experts, for example, the fuzzy set membership function in Jensen's fuzzy-rough attribute reduction algorithm and the population size in Shang's method. Without the experience of domain experts, these methods are of limited use. In this paper, a novel feature selection method based on a tolerance relation is proposed, which avoids the process of discretization. Meanwhile, based on the idea of domain-oriented data-driven data mining (3DM), a method for finding a suitable threshold for the tolerance relation is introduced. Experimental results show that the proposed method identifies important and useful features for emotion recognition while achieving a high recognition rate. The features concerning the mouth are found to be the most important geometrical features for facial emotion recognition.
The rest of this paper is organized as follows. In Section 2, a novel feature selection method for emotion recognition based on rough set theory is introduced. Simulation results and discussion are given in Section 3. Finally, conclusions and future works are presented in Section 4.
2. Feature Selection for Emotion Recognition Based on Rough Set Theory
2.1. Basic Concepts of Rough Set Theory
Rough set (RS) theory is a mathematical theory for dealing with imprecise, uncertain, and vague information; it was developed by Professor Pawlak in the 1980s [13, 14]. RS has been successfully used in many domains such as machine learning, pattern recognition, intelligent data analysis, and control algorithm acquisition [15–17]. The greatest advantage of RS is its strong ability for attribute reduction (knowledge reduction, feature selection). Some basic concepts of rough set theory are introduced here for the convenience of the following discussion.
Definition 2.1. A decision information system is defined as a quadruple $S = (U, C \cup D, V, f)$, where $U$ is a finite set of objects, $C$ is the condition attribute set, and $D$ is the decision attribute set. With every attribute $a \in C \cup D$, a set of its values $V_a$ is associated. Each attribute $a$ determines a function $f_a : U \to V_a$.
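Definition 2.2. For a subset of attributes $B \subseteq C \cup D$, an indiscernibility relation is defined by $\mathrm{Ind}(B) = \{(x, y) \in U \times U : a(x) = a(y), \forall a \in B\}$, in which $a(x)$ and $a(y)$ are the values of the attribute $a$ of $x$ and $y$.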
The indiscernibility relation defined in this way is an equivalence relation. Obviously, $\mathrm{Ind}(B) = \bigcap_{a \in B} \mathrm{Ind}(\{a\})$. By $U/\mathrm{Ind}(B)$ we mean the set of all equivalence classes of the relation $\mathrm{Ind}(B)$. Classical rough set theory is based on the observation that objects may be indiscernible due to limited available information. The intuition behind the notion of an indiscernibility relation is that selecting a set of attributes $B$ effectively defines a partition of the universe into sets of objects that cannot be discerned using the attributes in $B$ only. The equivalence classes induced by a set of attributes $B$ are referred to as object classes or simply classes. The classes resulting from $\mathrm{Ind}(C)$ and $\mathrm{Ind}(D)$ are called condition classes and decision classes, respectively.
Definition 2.3. A decision information system $S = (U, C \cup D, V, f)$, where $U$ is a finite set of objects, $C$ is the condition attribute set, and $D$ is the decision attribute set, is called a continuous value information system if every $c \in C$ is a continuous value attribute.
A facial expression information system is a continuous value information system according to Definition 2.3.
If a condition attribute is continuous valued, the indiscernibility relation cannot be used directly, since it requires the condition attribute values of two different samples to be exactly equal, which is difficult to satisfy. Consequently, a discretization step must be performed, in which information may be lost or changed, and the result of attribute reduction would be affected. Since all measured facial attributes are continuous values and imprecise to some extent, the process of discretization may affect the result of emotion recognition. We argue that, for continuous value information systems, it is suitable to treat attribute values as equal if they are similar within some range. Based on this idea, a method based on a tolerance relation that avoids the process of discretization is proposed in this paper.
Definition 2.4. A binary relation $R(x, y)$ defined on an attribute set $B$ is called a tolerance relation if it is (1) symmetrical: $\forall x, y \in U\ (R(x, y) = R(y, x))$; (2) reflexive: $\forall x \in U\ R(x, x)$.
From the standpoint of a continuous value information system, a relation could be set up for a continuous value information system as follows.
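Definition 2.5. Let an information system $S = (U, C \cup D, V, f)$ be a continuous value information system; a relation $R$ is defined as
$$R = \{(x, y) \mid x \in U \wedge y \in U \wedge \forall a \in C\ (|a(x) - a(y)| \le \varepsilon)\},$$
where $\varepsilon \ge 0$ is the threshold of the relation.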
Apparently, $R$ is a tolerance relation according to Definition 2.4, since $R$ is symmetrical and reflexive. In classical rough set theory, an equivalence relation constitutes a partition of $U$, whereas a tolerance relation constitutes a cover of $U$; an equivalence relation is a particular type of tolerance relation.
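Definition 2.6. Let $R$ be a tolerance relation based on Definition 2.5. Then $n_R(x_i) = \{x_j \mid x_j \in U \wedge \forall a \in C\ (|a(x_i) - a(x_j)| \le \varepsilon)\}$ is called the tolerance class of $x_i$, and $|n_R(x_i)|$ is the cardinal number of the tolerance class of $x_i$.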
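The difference between a partition and a cover can be seen on a toy example. The following minimal Python sketch builds tolerance classes for three one-attribute samples; the function names `tolerant` and `tolerance_class`, the sample values, and the threshold `eps` are illustrative assumptions, not part of the original method description.

```python
def tolerant(x, y, eps):
    """True when every attribute of x and y differs by at most eps."""
    return all(abs(a - b) <= eps for a, b in zip(x, y))


def tolerance_class(i, samples, eps):
    """Indices of all samples that are tolerant with samples[i]."""
    return [j for j, y in enumerate(samples) if tolerant(samples[i], y, eps)]


# Hypothetical one-attribute samples (values chosen to be exact in binary).
samples = [(0.0,), (0.25,), (0.5,)]
eps = 0.25

classes = [tolerance_class(i, samples, eps) for i in range(len(samples))]
print(classes)
```

The three classes overlap (sample 1 belongs to all of them), and the relation is not transitive: sample 0 is tolerant with sample 1, sample 1 with sample 2, but sample 0 is not tolerant with sample 2. The classes therefore cover the sample set without partitioning it.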
According to Definition 2.6, for all $x_i \in U$, the bigger the tolerance class of $x_i$ is, the more uncertain it is and the less knowledge it contains. Conversely, the smaller the tolerance class of $x_i$ is, the less uncertain it is and the more knowledge it contains. Accordingly, the concepts of knowledge entropy and conditional entropy can be defined as follows.
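Definition 2.7. Let $U = \{x_1, x_2, \ldots, x_{|U|}\}$ and let $R$ be a tolerance relation; the knowledge entropy of relation $R$ is defined as
$$E(R) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log_2 \frac{|n_R(x_i)|}{|U|}.$$
Definition 2.8. Let $R$ and $Q$ be tolerance relations defined on $U$. A relation satisfying $R$ and $Q$ simultaneously can be taken as $R \cup Q$, and it is a tolerance relation too. For all $x_i \in U$, $n_{R \cup Q}(x_i) = n_R(x_i) \cap n_Q(x_i)$; therefore, the knowledge entropy of $R \cup Q$ can be defined as
$$E(R \cup Q) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log_2 \frac{|n_{R \cup Q}(x_i)|}{|U|}.$$
Definition 2.9. Let $R$ and $Q$ be tolerance relations defined on $U$; the conditional entropy of $Q$ with respect to $R$ is defined as $E(Q \mid R) = E(R \cup Q) - E(R)$.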
Let $S$ be a continuous value information system, let relation $K$ be a tolerance relation defined on its condition attribute set $C$, and let relation $L$ be an equivalence relation (a special tolerance relation) defined on its decision attribute set $D$. According to Definitions 2.7, 2.8, and 2.9, we can get
$$E(D \mid C) = E(C \cup D) - E(C) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log_2 \frac{|n_{K \cup L}(x_i)|}{|n_K(x_i)|},$$
where the conditional entropy has a clear meaning; that is, it is a ratio between the knowledge of all attributes (condition attribute set plus decision attribute set) and the knowledge of the condition attribute set.
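As an illustration of the entropy definitions above, the following is a minimal Python sketch, assuming that the knowledge entropy is the average of the negative base-2 logarithm of the relative tolerance class size and that the conditional entropy is the difference between the joint entropy and the entropy of the condition attributes. All function names, the toy data, and the threshold values are hypothetical.

```python
import math


def tolerance_classes(X, eps):
    """Tolerance class (as an index set) of every sample under threshold eps."""
    n = len(X)
    return [{j for j in range(n)
             if all(abs(a - b) <= eps for a, b in zip(X[i], X[j]))}
            for i in range(n)]


def decision_classes(y):
    """Equivalence class of every sample under the decision attribute."""
    return [{j for j, d in enumerate(y) if d == y[i]} for i in range(len(y))]


def entropy(classes):
    """Knowledge entropy: mean of -log2(|class| / |U|) over all samples."""
    n = len(classes)
    return -sum(math.log2(len(c) / n) for c in classes) / n


def conditional_entropy(X, y, eps):
    """E(D|C) = E(C u D) - E(C); joint classes are intersections."""
    K = tolerance_classes(X, eps)
    L = decision_classes(y)
    joint = [k & l for k, l in zip(K, L)]
    return entropy(joint) - entropy(K)


# Hypothetical toy data: one continuous attribute, two emotion labels.
X = [(0.0,), (0.125,), (0.75,), (0.875,)]
y = ["happiness", "happiness", "sadness", "sadness"]

e_small = conditional_entropy(X, y, 0.25)  # classes stay inside one label
e_large = conditional_entropy(X, y, 1.0)   # every class spans both labels
print(e_small, e_large)
```

With the small threshold each tolerance class contains samples of a single label, so the conditional entropy is zero; with the large threshold every class mixes both labels and the conditional entropy becomes positive.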
2.2. Feature Selection Based on Rough Set Theory and Domain-Oriented Data-Driven Data Mining
In this section, a novel attribute reduction algorithm is proposed based on rough set theory and domain-oriented data-driven data mining (3DM) [18, 19].
3DM is a data mining theory proposed by Wang [18, 19]. According to the theory, knowledge can be expressed in different ways; that is, some relationship exists between the different formats of the same knowledge. In order to keep the knowledge unchanged in a data mining process, the properties of the knowledge should remain unchanged during the knowledge transformation process [20]. Otherwise, mistakes may occur in the process of knowledge transformation. Based on this understanding, knowledge reduction can be seen as a process of knowledge transformation in which the properties of the knowledge should be preserved.
In emotion recognition, no two faces are entirely the same, nor are two emotions. For any two different emotion samples, there must be some features that differ between them. Accordingly, an emotion sample belongs to an emotion state according to the features that distinguish it from the others. From this standpoint, we argue that the discernability of the condition attribute set with respect to the decision attribute set can be taken as an important property of knowledge in the course of knowledge acquisition in emotion recognition. Based on the idea of 3DM, the discernability should be unchanged in the process of knowledge acquisition and attribute reduction.
Definition 2.10. Let be a continuous value information system. If , it is certainly discernable for the continuous value information system .
The discernability is taken as a fundamental ability that a continuous information system has in this paper. According to 3DM, the discernability should be unchanged if feature selection is done for a continuous value information system. From Definition 2.10, we can have . Therefore, according to Definition 2.6, we can have . Accordingly, the discernability of a tolerance relation can be defined as follows.
Definition 2.11. Let be a tolerance relation according to Definition 2.5; if , has the certain discernability.
If has certain discernability, according to Definition 2.11, , therefore, .
Theorem 2.12. $E(D \mid C) = 0$ is a necessary and sufficient condition for the condition attribute set to have certain discernability with respect to the decision attribute set in a tolerance relation.
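Proof. Let $S$ be a continuous value information system, let relation $K$ be a tolerance relation defined on the condition attribute set $C$, and let relation $L$ be an equivalence relation (a special tolerance relation) defined on the decision attribute set $D$.
Necessity
If there is certain discernability for the condition attribute set with respect to the decision attribute set in the tolerance relation, then according to Definition 2.11, $n_K(x_i) \subseteq n_L(x_i)$ for all $x_i \in U$, so that $n_{K \cup L}(x_i) = n_K(x_i) \cap n_L(x_i) = n_K(x_i)$. Then
$$E(D \mid C) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log_2 \frac{|n_{K \cup L}(x_i)|}{|n_K(x_i)|} = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log_2 1 = 0.$$
Sufficiency
For all $x_i \in U$, we have $n_{K \cup L}(x_i) \subseteq n_K(x_i)$, so every term $\log_2 (|n_{K \cup L}(x_i)| / |n_K(x_i)|)$ is nonpositive. Since $E(D \mid C) = 0$, every term must be zero, that is, $|n_{K \cup L}(x_i)| = |n_K(x_i)|$, and hence $n_{K \cup L}(x_i) = n_K(x_i)$. Therefore, the decision values are equal for the different samples included in the same tolerance class. Accordingly, $n_K(x_i) \subseteq n_L(x_i)$ for all $x_i \in U$, and there is certain discernability for the condition attribute set with respect to the decision attribute set in the tolerance relation. This completes the proof.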
From Theorem 2.12, $E(D \mid C) = 0$ can be taken as a criterion for whether the condition attribute set has certain discernability with respect to the decision attribute set.
For a given continuous value information system $S$, there can be many different tolerance relations, corresponding to different thresholds, under the condition $E(D \mid C) = 0$, but the biggest knowledge granule and the best generalization are always required for knowledge acquisition. According to this principle, we have the following results.
If the threshold in tolerance relation is 0, then the tolerance class of an instance contains itself only, and we can have , and . It is the smallest tolerance class for the tolerance relation, the smallest knowledge granular, and the smallest generalization.
If the threshold in tolerance relation is increased from 0, both and are increased. If , then, , , and the granular of knowledge is increased.
If the threshold in tolerance relation is increased to a critical point named , both and are increased, and , , and the granular of knowledge is the biggest under the condition that the certain discernability of condition attribute set with respect to decision attribute set in tolerance relation is unchanged.
If the threshold in tolerance relation is increased from and , then , , , and then the certain discernability is changed. If , then , , and . Therefore Since is held and is increased with the threshold of increase, is increased.
If the threshold in tolerance relation is increased to , then and , , so, Since the equivalence class of Q is held, is constant.
The relationship between entropy, conditional entropy, and the threshold of the tolerance relation is shown in Figure 1.

Figure 1: Relationships between entropy, conditional entropy, and the threshold of the tolerance relation (panels (a) and (b)).
From Figure 1 and the discussion above, if the threshold takes the critical value, then $E(D \mid C) = 0$, and therefore the certain classification ability of the condition attribute set with respect to the decision attribute set is unchanged. At the same time, the tolerance class of each sample is the biggest. In this sense, the knowledge granule is the biggest at the critical threshold, and the generalization should then be the best.
In summary, the selection of the threshold parameter is discussed, and based on 3DM, a suitable threshold value, the critical threshold, is found. It preserves the classification ability of the condition attribute set with respect to the decision attribute set while keeping the generalization as high as possible. An advantage of this way of finding the threshold is that the method is based on data only and does not need the experience of domain experts. Therefore, the method is more robust.
In this paper, the critical threshold is searched for within its admissible range using a binary search algorithm.
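The binary search for the critical threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the search interval `[0, 1]` (for example, features scaled to that range), the precision `tol`, the numerical tolerance `1e-12`, and all function names are hypothetical. The search relies on the fact that tolerance classes only grow as the threshold increases, so once the conditional entropy becomes positive it stays positive.

```python
import math


def cond_entropy(X, y, eps):
    """E(D|C): mean of -log2(|joint class| / |tolerance class|) over samples."""
    n = len(X)
    total = 0.0
    for i in range(n):
        # Tolerance class of sample i under threshold eps.
        k = [j for j in range(n)
             if all(abs(a - b) <= eps for a, b in zip(X[i], X[j]))]
        # Members of that class that share the decision value of sample i.
        kl = [j for j in k if y[j] == y[i]]
        total += math.log2(len(kl) / len(k))
    return -total / n


def find_threshold(X, y, lo=0.0, hi=1.0, tol=1e-4):
    """Largest eps in [lo, hi], up to tol, for which E(D|C) is still zero.

    Assumes E(D|C) is zero at lo, i.e. no two identical samples carry
    different decision values.
    """
    if cond_entropy(X, y, hi) <= 1e-12:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cond_entropy(X, y, mid) <= 1e-12:
            lo = mid   # still discernible: try a larger threshold
        else:
            hi = mid   # discernability lost: shrink the threshold
    return lo


# Hypothetical toy data: the closest samples with different labels are
# 0.625 apart, so discernability is lost once eps reaches 0.625.
X = [(0.0,), (0.125,), (0.75,), (0.875,)]
y = ["happiness", "happiness", "sadness", "sadness"]

eps_opt = find_threshold(X, y)
print(eps_opt)
```

On this toy data the returned value lies just below 0.625, the smallest distance between two samples with different labels.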
2.3. Attribute Reduction for Emotion Recognition
The discernability of the condition attribute set with respect to the decision attribute set in a tolerance relation is a fundamental property of the knowledge of a continuous value information system. According to 3DM, the discernability should be unchanged. Since $E(D \mid C) = 0$ is a necessary and sufficient condition for keeping the discernability of the condition attribute set with respect to the decision attribute set in a tolerance relation, a self-learning attribute reduction algorithm (SARA) is proposed for continuous value information systems as follows.
Algorithm 2.13 (Self-learning attribute reduction algorithm (SARA)). Input: a decision table $S = (U, C \cup D, V, f)$ of a continuous value information system, where $U$ is a finite set of objects, $C$ is the condition attribute set, and $D$ is the decision attribute set.
Output: a relative reduction $B$ of $S$.
Step 1. Compute the critical threshold, then set up a tolerance relation $K$ on the condition attribute set $C$.
Step 2. Compute the conditional entropy $E(D \mid C)$.
Step 3. For all $a_i \in C$, compute $E(D \mid \{a_i\})$. Sort the attributes $a_i$ in descending order of $E(D \mid \{a_i\})$.
Step 4. Let $B = C$, and deal with each $a_i$ as follows.
Substep 4.1
Compute $E(D \mid B - \{a_i\})$.
Substep 4.2
If $E(D \mid B - \{a_i\}) = E(D \mid C)$, attribute $a_i$ should be reduced, and $B = B - \{a_i\}$; otherwise, $a_i$ cannot be reduced, and $B$ is kept unchanged.
Let . The time complexity of Step 1 is , the time complexity of Step 2 is , the time complexity of Step 3 is , the time complexity of Step 4 is , and therefore, the time complexity of the algorithm is .
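The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the threshold `eps` is assumed to have been computed already (Step 1), an attribute is removed when the conditional entropy of the remaining attributes equals that of the full attribute set, and the guard that keeps at least one attribute is an added safety assumption. The function names and the toy data are hypothetical.

```python
import math


def cond_entropy(X, y, attrs, eps):
    """E(D|B) for the attribute index list attrs under threshold eps."""
    n = len(X)
    total = 0.0
    for i in range(n):
        # Tolerance class of sample i restricted to the attributes in attrs.
        k = [j for j in range(n)
             if all(abs(X[i][a] - X[j][a]) <= eps for a in attrs)]
        # Members of that class with the same decision value as sample i.
        kl = [j for j in k if y[j] == y[i]]
        total += math.log2(len(kl) / len(k))
    return -total / n


def sara(X, y, eps):
    """Return the indices of the attributes kept after reduction."""
    all_attrs = list(range(len(X[0])))
    # Step 2: conditional entropy of the full condition attribute set.
    target = cond_entropy(X, y, all_attrs, eps)
    # Step 3: sort attributes by single-attribute conditional entropy,
    # largest first, so the least informative ones are tried first.
    order = sorted(all_attrs,
                   key=lambda a: cond_entropy(X, y, [a], eps),
                   reverse=True)
    # Step 4: drop an attribute when the conditional entropy is unchanged.
    B = list(all_attrs)
    for a in order:
        rest = [b for b in B if b != a]
        if rest and abs(cond_entropy(X, y, rest, eps) - target) <= 1e-12:
            B = rest
    return B


# Hypothetical toy data: attribute 0 separates the labels,
# attribute 1 is constant and carries no information.
X = [(0.0, 0.5), (0.125, 0.5), (0.75, 0.5), (0.875, 0.5)]
y = ["happiness", "happiness", "sadness", "sadness"]

reduct = sara(X, y, eps=0.25)
print(reduct)
```

On this toy data the constant attribute is removed and only attribute 0 is kept, since attribute 0 alone already yields the same conditional entropy as the full attribute set.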
3. Experiment Results and Discussion
Since there are few open facial emotional datasets, three facial emotional datasets are used in the experiments. The first dataset comes from the Cohn-Kanade AU-Coded Facial Expression (CKACFE) database [21], and it is more representative of Caucasian people to some extent. The second one is the Japanese female facial expression (JAFFE) database [22], and it is more representative of Asian women. The third one, named CQUPTE [23], is collected from 8 graduate students (four females and four males) at the Chongqing University of Posts and Telecommunications in China. Details of the datasets are listed in Table 1.
Some samples are shown in Figure 2. For each dataset, the samples in Figure 2 show, from left to right, happiness, sadness, fear, disgust, surprise, and anger.

Figure 2: Sample images from (a) the CKACFE database, (b) the JAFFE database, and (c) the CQUPTE database.
Human facial expression is conveyed by the shape and position of facial components such as the eyebrows, eyes, mouth, and nose. Geometric features, appearance features, wavelet features, and mixed facial features have been popular for emotion recognition in recent years. The geometric facial features represent the shape and locations of facial components, and they are used in the experiments since they are obvious and intuitive for facial expression. The geometric facial features are distances between pairs of feature points chosen according to a defined criterion. The MPEG-4 standard is a popular standard for feature point selection. It extends the facial action coding system (FACS) to derive facial definition parameters (FDP) and facial animation parameters (FAP). There are 68 FAP parameters, of which 66 low-level parameters are defined according to the FDP parameters to describe the motion of a human face. The FDP and low-level FAP constitute a concise representation of a face, and they are adequate for basic emotion recognition because of the variety of expressive parameters. In the experiments, 52 low-level FAP parameters are chosen to represent emotion because some FAP parameters have little effect on facial expression, for example, the FAP parameter named raise_l_ear, which denotes the vertical displacement of the left ear. Thus, a feature point set including 52 feature points is defined as shown in Figure 3. Based on the feature points, 33 facial features are extracted for emotion recognition according to [4–7] and listed in Table 2. The 33 facial features can be divided into three groups: 17 features in the first group concern the eyes, 6 features in the second group concern the cheek, and 10 features in the third group concern the mouth. In Table 2, A is the midpoint of points 19 and 23, and B is the midpoint of points 27 and 31.
denotes the Euclid distance between point and ; denotes the horizontal distance between point and ; denotes the vertical distance between and . Since the distance between point 23 and 27 is stable for all kinds of expression, we normalize the distance features in the following way.

Firstly, , is the distance between point 23 and 27.
Secondly, the normalized distance is calculated using the following formula:
3.1. Experiments For SARA as a Feature Selection Method for Emotion Recognition
In this section, there are five comparative experiments to test the effectiveness of SARA as a method of feature selection for emotion recognition.
In the first experiment, SARA is taken as the method of feature selection for emotion recognition. In the second one, an attribute reduction algorithm named CEBARKNC [24] is taken as the method of feature selection. CEBARKNC is selected in this comparative experiment since it is an attribute reduction algorithm based on conditional entropy in an equivalence relation. In this experiment, a greedy algorithm proposed by Nguyen [25] is taken as the discretization method, and it is run on the platform RIDAS [26]. In the third experiment, an attribute reduction algorithm named MIBARK [27] is taken as the method of feature selection. It is a reduction algorithm that uses mutual information as the measure of attribute importance. Again, the greedy algorithm proposed by Nguyen [25] is taken as the discretization method, and it is also run on the platform RIDAS. In the fourth experiment, a traditional feature selection method, the genetic algorithm (GA) [28], is used. This experiment is done in WEKA [29], a well-known machine learning tool, with CfsSubsetEval taken as the evaluator for feature selection. In the fifth experiment, all 33 features are used for emotion recognition, and the feature selection step is omitted. SVM is a machine learning method known for its strong performance in small-sample applications. Therefore, SVMs are taken as the classifiers in all the comparative experiments, with the same parameters in every experiment. 4-fold cross-validation is used for all the experiments.
The results of the comparative experiments are shown in Table 3. CRR is the correct recognition rate in percent, and RAN is the number of attributes remaining after attribute reduction.
Comparing the results of SARA + SVM and SVM in Table 3, we find that SARA uses nearly one third of the features and achieves nearly the same correct recognition rate; therefore, SARA can be taken as a useful feature selection method for emotion recognition. Comparing the results of SARA + SVM and CEBARKNC + SVM in Table 3, we find that SARA selects about as many features as CEBARKNC but achieves a better correct recognition rate. Furthermore, comparing SARA + SVM with MIBARK + SVM and with GA + SVM in Table 3, we find that SARA uses fewer features than MIBARK or GA but achieves a higher recognition rate. Therefore, SARA can be taken as a more effective feature selection method for emotion recognition than CEBARKNC, MIBARK, and GA, since the features selected by SARA have better discernability in emotion recognition.
Common features reserved by the four feature selection methods are listed in Table 4.
From Table 4, we can find that the four feature selection algorithms can select different features for emotion recognition. Among all the experiment results, SARA selects three common features, , and for all the three emotion datasets, meanwhile, CEBARKNC selects two common features, and , and MIBARK selects two common features, and ; however, GA cannot find any common feature for all the three datasets. Since better correct recognition rate can be achieved if SARA is used as a method of feature selection for emotion recognition, therefore, can be seen more important for emotion recognition. Although the features of are normalized features, the importance of original features of is also evident. Since the features of are all concerning mouth, therefore, we can draw a conclusion that the geometrical features concerning mouth are the most important features for emotion recognition. The original selected features of SARA, CEBARKNC, and MIBARK are shown in Figure 4.

Figure 4: Common features selected by (a) SARA, (b) CEBARKNC, and (c) MIBARK.
3.2. Experiments for the Features Concerning Mouth for Emotion Recognition
From the last section, we draw a conclusion that the geometrical features concerning mouth are important for emotion recognition. In this section, there are four experiments for the purpose of testing the importance of the geometrical feature concerning mouth for emotion recognition. In the first experiment, all the 33 facial features are used for emotion recognition. In the second experiment, only the features selected by SARA are used for emotion recognition. In the third experiment, all the features concerning mouth are deleted, and there are 19 features that are used for emotion recognition, in which there are 17 features concerning eyes , and two features concerning cheek but not mouth. In the fourth experiment, all the features concerning eyes are deleted, and there are 12 features that are used for emotion recognition, in which there are 10 features concerning mouth , and two features concerning cheek but not eyes. SVM is taken as classifier in the four experiments and is given the same parameters. Experiment results are listed in Table 5.
From Table 5, we find that the correct recognition rate decreases greatly if no features concerning the mouth are used. Therefore, it is concluded that the features concerning the mouth are the most important geometrical features for emotion recognition. On the other hand, the correct recognition rate is not affected much if no features concerning the eyes are used. Therefore, the geometrical features concerning the eyes do not play an important role in emotion recognition. However, in the psychological experiments of [4], Sui and Ren found that the eyes play an important role in emotion; therefore, we may conclude that, among the geometrical features, those concerning the mouth are the most important for emotion recognition, while those concerning the eyes are less important. The important features concerning the eyes for emotion recognition should be discovered and used in future work. Meanwhile, the correct recognition rate decreases more in CKACFE than in JAFFE and CQUPTE. Therefore, the results suggest that the geometrical features concerning the mouth are more important for emotion expression for Caucasian people than for Eastern people.
4. Conclusion
In this paper, based on rough set theory and the idea of domain-oriented data-driven data mining, a novel attribute reduction algorithm named SARA is proposed for feature selection in emotion recognition. The proposed method is found to be effective and efficient, and the geometrical features concerning the mouth are found to be the most important geometrical features for emotion recognition.
Acknowledgment
This paper is partially supported by the National Natural Science Foundation of China under Grants no. 60773113, Natural Science Foundation of Chongqing under Grants no. 2007BB2445, no. 2008BA2017, and no. 2008BA2041.