Fuzzy Set-Valued Information Systems and the Algorithm of Filling Missing Values for Incomplete Information Systems

Wang, Zhaohao; Zhang, Xiaoping

doi:https://doi.org/10.1155/2019/3213808

Complexity


Research Article | Open Access

Volume 2019 | Article ID 3213808 | https://doi.org/10.1155/2019/3213808


Zhaohao Wang¹ and Xiaoping Zhang¹

Academic Editor: Lingzhong Guo

Received 13 Jul 2019

Accepted 05 Nov 2019

Published 10 Dec 2019

Abstract

How to deal effectively with missing values in incomplete information systems (IISs) according to the research target is still a key issue in the study of IISs. If the missing values in IISs are not handled properly, they destroy the internal connections in the data and reduce the efficiency of data usage. In this paper, in order to establish effective methods for filling missing values, we propose a new information system, namely, the fuzzy set-valued information system (FSvIS). By means of similarity measures of fuzzy sets, we obtain several binary relations in FSvISs and investigate the relationships among them. This provides a foundation for research on FSvISs using the rough set approach. Then, we provide an algorithm that fills the missing values in IISs with fuzzy set values. In fact, this algorithm transforms an IIS into an FSvIS. Furthermore, we construct an algorithm that fills the missing values in IISs with set values (or real values). The effectiveness of these algorithms is analyzed. The results show that the proposed algorithms achieve a higher correct rate than traditional algorithms and have good stability. Finally, we discuss the importance of these algorithms for investigating IISs from the viewpoint of rough set theory.

1. Introduction

The classical rough set model [1] can be used to deal with complete information systems. In practice, however, the lack of some data in IISs [2–9] is inevitable. For example, because the data collection process may be imperfect, human or objective conditions can result in data loss or unavailability. In data mining, these missing data may have a very important impact on the final decision. Therefore, how to infer unknown information from known information has important theoretical and practical significance.

Kryszkiewicz [10] defined a tolerance relation in IISs in order to investigate them with the rough set approach. This tolerance relation takes an optimistic perspective: it assumes that a missing attribute value in an IIS can be represented by the set of all possible values of the corresponding attribute. Based on Kryszkiewicz's research, Leung and Li [11] presented a method for obtaining the relative reduction in IISs. Subsequently, Stefanowski and Tsoukias [9] established a new rough set model based on other relations in IISs. Several authors [8, 11–17] gave different methods to induce binary relations from IISs and studied IISs by means of rough set theory. They treated the missing values in two main ways. One was to delete the missing values, and the other was to take the missing values as generic values.

Based on probability theory, Yuan et al. [18] filled the missing values in IISs by finding the sample that is closest to the sample with missing data in terms of Euclidean distance and correlation. Chen and Shao [19] used the Jackknife variance estimate to investigate the missing values. In addition, there are other methods to handle missing values in IISs. Wang et al. [20] addressed the missing values in IISs by means of the Hopfield neural network approach. Salama et al. [21] proposed a topological method to retrieve missing values in IISs. Clearly, these methods of filling missing values were built on other theories, such as neural networks and topology. In this paper, we establish a new method to fill missing values by means of rough set theory. Next, we state the motivation for this method. The indiscernibility relation is a basic concept in rough set theory. Given a complete information system, we can establish an indiscernibility relation. Two objects are viewed as indiscernible if they have the same value for each attribute. Therefore, we take the view that the more attribute values two objects share, the higher their degree of indiscernibility. Based on this observation, we provide a method to fill missing values. With this method, we convert the missing values into fuzzy set values by evaluating the relationship between the attribute values of different objects, and then we transform the fuzzy set values into set values or real values according to the principle of maximum membership degree in fuzzy set theory. It is worth noting that, in order to construct this method, we establish a new information system, namely, the fuzzy set-valued information system (FSvIS), which plays an important role in the method.

The rest of this paper is organized as follows. In Section 2, some basic concepts and notations of rough sets and fuzzy sets are given. In Section 3, we propose the fuzzy set-valued information system (FSvIS), and we induce some binary relations from FSvISs. Furthermore, we investigate the connections between these binary relations. In Section 4, we provide two methods of filling missing values. One is to fill missing values with fuzzy set values, and the other is to fill missing values with set values (or real values). In Section 5, we perform several experiments to analyze the effectiveness of the proposed methods. In Section 6, we apply the proposed methods of filling missing values to investigate IISs. Section 7 concludes this paper.

2. Basic Concepts and Properties

In this section, we review some basic concepts and notations in rough sets and fuzzy sets.

2.1. Basic Concepts for Rough Sets

In this subsection, we review some basic concepts related to general binary relations and information systems [22–24].

Definition 1 (see [23]). A general binary relation R on a nonempty set U is a subset of $U \times U$. R is called
(1) Reflexive, if $(x, x) \in R$ for any $x \in U$
(2) Symmetric, if for any $x, y \in U$, $(x, y) \in R$ implies $(y, x) \in R$
(3) Transitive, if for any $x, y, z \in U$, $(x, y) \in R$ and $(y, z) \in R$ imply $(x, z) \in R$
Generally, if R is reflexive and symmetric, it is called a similarity relation; if R is reflexive, symmetric, and transitive, then it is called an equivalence relation.
Let R be a general binary relation on U. For $x \in U$, the successor neighbourhood of x with respect to R is defined by
$$R_s(x) = \{y \in U : (x, y) \in R\}. \tag{1}$$
A triple $(U, A, V)$ is called an information system, where U is a finite nonempty set of objects called the universe, A is a finite nonempty set of attributes, and $V = \bigcup_{a \in A} V_a$, where $V_a$, called the domain of a, is a nonempty set of values of attribute $a \in A$. If there exist $x \in U$ and $a \in A$ such that the value of x under a is a missing value (a null or unknown value), denoted as "$*$", that is, $a(x) = *$, then the information system is called an incomplete information system (IIS).
In order to investigate the IIS by using the rough set approach, Kryszkiewicz [13] presented a way to induce a relation in the IIS as follows for $B \subseteq A$:
$$T_B = \{(x, y) \in U \times U : \forall a \in B,\ a(x) = a(y) \text{ or } a(x) = * \text{ or } a(y) = *\}. \tag{2}$$
It is easy to check that $T_B$ is reflexive and symmetric, that is to say, $T_B$ is a similarity relation on U.
In this paper, we call $(U, R)$ a generalized approximation space, where R is a binary relation on a finite nonempty set U.
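The tolerance relation and the successor neighbourhood can be illustrated with a minimal sketch. It assumes that an IIS is stored as a dictionary of rows and that a missing value is written as `"*"`; the names `tolerance_relation`, `successor_neighbourhood`, and the toy table are hypothetical and serve only as an illustration.

```python
MISSING = "*"  # assumed marker for a missing attribute value


def tolerance_relation(table, attrs):
    """Pairs (x, y) that agree, or have a missing value, on every attribute."""
    def compatible(u, v):
        return u == v or u == MISSING or v == MISSING

    return {
        (x, y)
        for x in table
        for y in table
        if all(compatible(table[x][a], table[y][a]) for a in attrs)
    }


def successor_neighbourhood(relation, x):
    """All y such that (x, y) belongs to the relation."""
    return {y for (u, y) in relation if u == x}


# Hypothetical toy IIS with three objects and two attributes.
toy = {
    "x1": {"a1": "H", "a2": "L"},
    "x2": {"a1": "*", "a2": "L"},
    "x3": {"a1": "L", "a2": "L"},
}
T = tolerance_relation(toy, ["a1", "a2"])
```

In this toy table, the object with the missing value is related to every object, whereas `x1` and `x3` are not related to each other, so the relation is reflexive and symmetric but not transitive.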

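Definition 2 (see [1]). Given a generalized approximation space $(U, R)$ and $X \subseteq U$, the lower approximation and upper approximation of X are defined as follows:
$$\underline{R}(X) = \{x \in U : R_s(x) \subseteq X\}, \qquad \overline{R}(X) = \{x \in U : R_s(x) \cap X \neq \emptyset\}, \tag{3}$$
where $R_s(x)$ denotes the successor neighbourhood of x with respect to R. In [23], Wang et al. constructed an uncertainty measure in generalized approximation spaces, which is defined as follows: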

Definition 3. Let be a generalized approximation space. The entropy of R is defined as follows:

Proposition 1 (see [23]). Let and be binary relations on U. If , then .
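The lower and upper approximations of Definition 2 can be sketched as follows, assuming the usual neighbourhood-based forms: an object belongs to the lower approximation when its successor neighbourhood is contained in X, and to the upper approximation when its successor neighbourhood meets X. The function name `lower_upper` and the toy relation are hypothetical.

```python
def lower_upper(universe, relation, target):
    """Lower and upper approximations of `target` in the space (universe, relation)."""
    nbhd = {x: {y for (u, y) in relation if u == x} for x in universe}
    lower = {x for x in universe if nbhd[x] <= target}  # neighbourhood inside target
    upper = {x for x in universe if nbhd[x] & target}   # neighbourhood meets target
    return lower, upper


# Hypothetical similarity relation with two classes {1, 2} and {3, 4}.
U = {1, 2, 3, 4}
R = {(x, x) for x in U} | {(1, 2), (2, 1), (3, 4), (4, 3)}
lower, upper = lower_upper(U, R, {1, 2, 3})
```

Here object 3 lies in the target set but not in its lower approximation, because its neighbourhood also contains object 4.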

2.2. Basic Concepts for Fuzzy Sets

In this section, we introduce some basic concepts and measures about fuzzy sets.

A fuzzy subset A of a nonempty set U is a map from U to $[0, 1]$ [25]. The collection of all fuzzy subsets of U is denoted as $\mathcal{F}(U)$. Similarity measure is an important concept in fuzzy set theory, and it is defined as follows:

Definition 4 (see [26]). A function is called a similarity measure on , if S satisfies the following properties:(1) and for all (2) for all (3)For all , , then and Particularly, a similarity measure S is called a strictly similarity measure if it also satisfies(4) if and only if , for all Let and The most popular similarity measures include:(1)Hamming similarity measure [27]:(2)Euclidean similarity measure [27]:(3)Max-min similarity measure [28]:

Remark 1. In this paper, we always assume that .
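As an illustration of the three measures named in Definition 4, the following sketch assumes their commonly used normalized forms on a finite universe with n elements: one minus the mean absolute difference (Hamming), one minus the root mean squared difference (Euclidean), and the ratio of the summed minima to the summed maxima (max-min). These forms, and the convention of returning 1 when both fuzzy sets are empty, are assumptions made for this sketch.

```python
import math


def hamming_similarity(a, b):
    """1 minus the mean absolute difference of the membership degrees."""
    return 1.0 - sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def euclidean_similarity(a, b):
    """1 minus the root mean squared difference of the membership degrees."""
    return 1.0 - math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def max_min_similarity(a, b):
    """Sum of pointwise minima divided by sum of pointwise maxima."""
    denom = sum(max(p, q) for p, q in zip(a, b))
    if denom == 0:
        return 1.0  # both fuzzy sets are empty; convention assumed here
    return sum(min(p, q) for p, q in zip(a, b)) / denom


# Two hypothetical fuzzy sets on a three-element universe.
A = [1.0, 0.5, 0.0]
B = [0.5, 0.5, 0.0]
```

Each fuzzy set is represented by its list of membership degrees in a fixed order of the universe.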

3. Fuzzy Set-Valued Information Systems (FSvISs)

In this section, we replace the real number in the real-valued information system with fuzzy set and propose a more general information system, that is, the fuzzy set-valued information system. It can be seen as a generalization of the probabilistic set-valued information system defined by Huang et al. [29].

Definition 5. A fuzzy set-valued information system (FSvIS) is a triple $(U, A, V)$, where U is a nonempty set, A is a set of attributes, and V is the basic set of attribute values. In addition, for all $x \in U$ and $a \in A$, the value of x under a is a fuzzy subset of V, that is, $a(x) \in \mathcal{F}(V)$.
In some cases, if the attribute values are uncertain or missing, then it is reasonable to describe them with fuzzy set values. For example, in IISs, we may fill the missing values with fuzzy set values. In this paper, we will investigate IISs by means of FSvISs.

Example 1. Table 1 gives a FSvIS , where , , and . In Table 1, represents the value of the object under attribute . is the grade of membership of in .

3.1. The Similarity Relations in FSvISs

The rough set approach is applied for rule extractions and attribute reductions in information systems. The key problem is how to construct binary relations from information systems. Next, we will establish some similarity relations in FSvISs. Then, we establish the relationships between them.

It is well known that, in fuzzy set theory, similarity measure is an important concept to evaluate the similarity degree between fuzzy sets.

Let be a FSvIS, S be a similarity measure and . There is a common method to construct binary relation in terms of similarity measure as follows:where . Clearly, is a binary relation on U. The successor neighbourhood of can be computed as follows:

In the following section, we limit .

By (1) of Definition 4, is reflexive. In addition, the symmetry of is clear. Therefore, the following result is obvious.

Proposition 2. Let be a FSvIS, S be a similarity measure, , and . Then, the binary relation is reflexive and symmetric.
Proposition 2 shows that is a similarity relation.
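A minimal sketch of how such a relation can be computed is given below. It assumes that two objects are related when the similarity of their fuzzy set values reaches the threshold on every attribute of the chosen subset, and it uses a Hamming-type similarity as an example. The names `similarity_relation`, `hamming`, `lam`, and the toy FSvIS are hypothetical.

```python
def hamming(a, b, values):
    """Hamming-type similarity of two fuzzy sets stored as dicts value -> degree."""
    diff = sum(abs(a.get(v, 0.0) - b.get(v, 0.0)) for v in values)
    return 1.0 - diff / len(values)


def similarity_relation(fsvis, attrs, values, lam, sim=hamming):
    """Pairs (x, y) whose fuzzy values are at least lam-similar on every attribute."""
    return {
        (x, y)
        for x in fsvis
        for y in fsvis
        if all(sim(fsvis[x][a], fsvis[y][a], values) >= lam for a in attrs)
    }


# Hypothetical FSvIS with one attribute over the value set {L, H, N}.
V = ["L", "H", "N"]
fsvis = {
    "x1": {"a": {"L": 1.0}},
    "x2": {"a": {"L": 0.7, "H": 0.3}},
    "x3": {"a": {"N": 1.0}},
}
R = similarity_relation(fsvis, ["a"], V, lam=0.7)
```

Because every fuzzy set has similarity 1 with itself and the measure is symmetric, the resulting set of pairs is reflexive and symmetric, in line with Proposition 2.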

Remark 2. By equations (5)–(8), we can obtain three similarity relations: , , and .

Proposition 3. Let be a FSvIS, S be a similarity measure, , and . The following statements hold:(1)If , then (2)If , then

Proof. (1)We only need to prove that , . , by equation (9), we have that , . By , it is clear that , . It follows from equation (7) that . Hence, . Consequently, .(2)We only need to prove that , . , by equation (9), we have that , . By , it is clear that , . It follows from equation (8) that . Hence, . Consequently, .In the following, we establish the relationships among , , and . Firstly, we provide the connections among the similarity measures given by equations (5)–(7).

Proposition 4. Let U be a nonempty set. The following statements hold:(1), (2),

Proof. (1)We may assume that . Let . It is easy to verify that that is, In addition, it is clear that Therefore, by equations (11) and (12), we have that Thus, By Remark 1, . Therefore, By equations (5) and (7), we conclude that .(2)Let . Next, we will use mathematical induction to prove . If , it is clear that , which implies that is true. Assume that is true when . By equations (5) and (6), we have that where . This implies that Next, we shall prove that the conclusion is true when . By equation (5) and (6), we only need to prove that that is, For simplicity, we write . Hence, we only need to prove that In addition, equation (17) can be written by By equation (21), it is clear that This completes the proof.

According to Proposition 4 and equation (8), the following result is obvious.

Theorem 1. Let be a FSvIS, and . Then, the following statements hold:(1)(2)

3.2. The Uncertainty Measures of FSvISs

In Section 3.1, we established three similarity relations in FSvISs. When the rough set approach is used to investigate FSvISs, a reasonable similarity relation usually has to be chosen according to the actual conditions. Therefore, in this section, we discuss the uncertainty measures of these similarity relations so as to provide evidence for that choice.

Proposition 5. Let be a FSvIS, and . The following statements hold:(1), and (2), and

Proof. It is straightforward from Theorem 1 and Definition 2.

Proposition 6. Let be a FSvIS, and . The following statements hold:(1)(2)

Proof. It is straightforward from Theorem 1 and Proposition 1.

4. Algorithms of Filling Missing Values in IISs

We know that complete information systems can be investigated by the rough set approach. In general, in order to discuss an IIS by means of rough set theory, we need to fill missing values in the IIS. That is to say, we first need to transform the IIS into a complete information system. In this section, we provide some methods to fill missing values in IISs. Note that data are often divided into two types: discrete data and continuous data. Next, we study the issue of filling missing data under two cases.

4.1. Algorithm of Filling Missing Values in IISs of Discrete Data

Clearly, the missing values possess the property of uncertainty; therefore, it is reasonable to use fuzzy set values (or set values) to fill missing values in IISs. In this section, we provide two schemes, namely, replacing the missing values with fuzzy set values and replacing the missing values with set values.

4.1.1. Filling the Missing Values with Fuzzy Set Values

Next, we provide a method to fill missing values in IISs of discrete data. We replace the missing values with fuzzy set values. In fact, this method can transform IISs into FSvISs.

In the IIS given by Table 2, the value domain of is , and the value of under attribute is the missing value, that is, . We think that this missing value may be L or H or N. We cannot determine which one is , but we can find a way to evaluate the degree that L (or H or N) is . That is, we can replace the missing values with fuzzy sets on . Next, we outline the main idea of filling missing data. The indiscernibility relation is a basic concept in rough set theory. Given a complete information system, we can establish an indiscernibility relation. Two objects are viewed as indiscernible if they have the same values for each attribute. Therefore, we think that if two objects possess more the same values of attributes, then they have the higher degree of indiscernibility. For example, in Table 2, . and have the same values of five attributes ; and have the same values of two attributes . Thus, and have the higher degree of indiscernibility. That is to say, the possibility degree of is more than that of . Based on this observation, we obtain Algorithm 1.

Remark 3. In Step 2 of Algorithm 1, describes how many attributes for and have the same value. Thus, it can be used to characterize the degree of indiscernibility of and . In Step 3, can be considered as probability of the elements whose attribute values are t in U.
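The idea described in Remark 3 can be sketched in code. This is only an illustration under stated assumptions, not the exact formulas of Algorithm 1: for each candidate value t, the sketch takes the largest number of other attributes on which an object with value t agrees with the target object, normalizes it by the number of other attributes, and multiplies it by the share of objects whose value is t. This combination rule, the function name `fill_with_fuzzy_set`, and the toy table are all hypothetical.

```python
def fill_with_fuzzy_set(table, target, attr, missing="*"):
    """Assumed scoring: (best agreement / #other attributes) * (share of objects with t)."""
    others = [b for b in table[target] if b != attr]
    known = [x for x in table if x != target and table[x][attr] != missing]

    def agreement(x):
        # Number of other attributes on which x and the target share a known value.
        return sum(
            1 for b in others
            if table[x][b] != missing and table[x][b] == table[target][b]
        )

    membership = {}
    for t in sorted({table[x][attr] for x in known}):
        block = [x for x in known if table[x][attr] == t]   # objects with value t
        best = max(agreement(x) for x in block)             # degree of indiscernibility
        membership[t] = (best / len(others)) * (len(block) / len(known))
    return membership


# Hypothetical toy IIS: the value of x1 under a3 is missing.
table = {
    "x1": {"a1": "H", "a2": "L", "a3": "*"},
    "x2": {"a1": "H", "a2": "L", "a3": "N"},
    "x3": {"a1": "L", "a2": "L", "a3": "H"},
    "x4": {"a1": "H", "a2": "L", "a3": "N"},
}
mu = fill_with_fuzzy_set(table, "x1", "a3")
```

In this toy table, the objects with value N agree with `x1` on both remaining attributes and are more frequent, so N receives the larger membership degree.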

Example 2. In Table 3, . Clearly, . Step 1: Take . It is easy to compute that Step 2: We can compute that Thus, . Step 3: .Similarly, we can compute that and . Therefore, we fill the missing value with the following fuzzy set:

4.1.2. Filling the Missing Values with Set Values

Based on the discussion of Section 4.1.1, we can replace a missing value with a fuzzy set. In fact, we can transform the fuzzy set into a set by means of the maximum membership degree law. Let be an IIS of discrete data. Assume that , where and . By Algorithm 1, we obtain the fuzzy set . Thus, we can use the following set to fill the missing value :

	Algorithm 1: Filling a missing value with a fuzzy set value (discrete data).
	Let be an IIS of discrete data. Assume that , where and . denotes the set , that is, . We shall use a fuzzy set of to represent the missing value , and we denote the fuzzy set by . Thus, , we need to compute the membership degree . Next, we establish the steps of filling the missing value as follows:
	Step 1: , compute
	Step 2: Compute where
	Step 3: Assign a value to

Example 3. In Example 2, we obtain that . Thus, the maximal membership degree M is , that is, . By equation (25), we have that . That is to say, we can fill the missing value with the set . In Table 4, we know that should be N. This coincides with the filling values by our algorithm.
In Table 3, and are also missing. By Algorithm 1, we can obtain thatThus, we have that and .
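The maximum membership degree law used in this subsection can be sketched as follows: the filling set collects every candidate value whose membership degree equals the maximal one. The function name `max_membership_set`, the tolerance `tol`, and the sample degrees are hypothetical.

```python
def max_membership_set(fuzzy, tol=1e-12):
    """Values whose membership degree equals the maximal degree (up to tol)."""
    m = max(fuzzy.values())
    return {t for t, mu in fuzzy.items() if abs(mu - m) <= tol}


# Hypothetical fuzzy set value over the domain {L, H, N}.
filled = max_membership_set({"L": 0.1, "H": 0.2, "N": 0.7})
tied = max_membership_set({"L": 0.5, "H": 0.5})
```

When several values share the maximal degree, the filling set contains all of them, which is why the result is a set value rather than a single value.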

4.2. Algorithm of Filling Missing Values in IISs of Continuous Data

Similar to the discussion of Section 4.1, we investigate the corresponding issues of IISs of continuous data in this section.

4.2.1. Filling the Missing Values with Fuzzy Set Values

Similar to Algorithm 1, we give Algorithm 2 to fill the missing value in IISs of continuous data.

Example 4. In this example, we discuss the Iris information system given by Table 5 from UCI. Suppose that and in Table 5 are missing. We obtain Table 6. Next, we use the IIS given by Table 6 to illustrate Algorithm 2.
In Table 6, . Clearly, . We take the thresholds and . Step 1: Take . It is easy to compute that . Step 2: Since , , and , it follows that , and thus . This implies that . Step 3: .Similarly, we can compute that , , , , , and . Therefore, we fill the missing value with the following fuzzy set:

	Algorithm 2: Filling a missing value with a fuzzy set value (continuous data).
	Let be an IIS of continuous data. Assume that , where and . denotes the set . We shall use a fuzzy set of to represent the missing value , and we denote the fuzzy set by . Thus, , and we need to compute the membership degree . Next, we establish the steps of filling the missing value as follows:
	Step 1: , compute , where is a threshold on
	Step 2: Compute where
	Step 3: Assign a value to ,
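For continuous data, exact equality of attribute values is replaced by closeness within a threshold. The following sketch shows one way to count the attributes on which two objects are close; it assumes one threshold per attribute and treats two values as matching when their absolute difference does not exceed that threshold. The function name `close_attributes`, the attribute keys, and the thresholds are hypothetical.

```python
def close_attributes(row_x, row_y, thresholds, skip):
    """Number of attributes (other than `skip`) on which two rows are within threshold."""
    return sum(
        1
        for b, delta in thresholds.items()
        if b != skip
        and row_x[b] is not None
        and row_y[b] is not None
        and abs(row_x[b] - row_y[b]) <= delta
    )


# Illustrative rows with four numeric attributes; None marks the missing value.
x = {"sl": None, "sw": 3.5, "pl": 1.4, "pw": 0.2}
y = {"sl": 4.9, "sw": 3.0, "pl": 1.4, "pw": 0.2}
thresholds = {"sl": 0.2, "sw": 0.2, "pl": 0.2, "pw": 0.1}  # assumed values
count = close_attributes(x, y, thresholds, skip="sl")
```

This count plays the role that the number of equal attribute values plays in the discrete case.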

4.2.2. Filling the Missing Values with Real Values

Based on the discussion of Section 4.2.1, we can replace a missing value with a fuzzy set. Clearly, we can transform the fuzzy set into real value by means of the maximum membership degree law. Let be an IIS of continuous data. Assume that , where and . By Algorithm 2, we obtain the fuzzy set . Thus, we can use the following real value to fill the missing value :where and .

Example 5. From Example 4, we know thatBy equation (28), we compute thatThus, we obtain that .

Remark 4. By Table 5, we know that should be 5.1. By Example 5, we fill the value under the assumption that and are missing. The deviation of and is within 0.2. This indicates that the method of filling missing value is effective.

5. Experiments and Effectiveness Analysis

In this section, we employ several experiments to show the effectiveness of the algorithms given by Section 4. We compare the proposed methods with a representative algorithm. The summary information of experimental datasets is shown in Table 7. Adult dataset and Abalone dataset are taken from UCI (http://archive.ics.uci.edu/ml/datasets.php).

5.1. Effectiveness Analysis of the Algorithm of Filling Missing Values in IISs of Discrete Data

In this part, we will conduct two groups of experiments. They are used to compare the effectiveness of methods of filling missing values from different points of view. Frequency Estimator-based filling method (Algorithm FE) [30] is a common method of filling missing data. In this section, a comparison of the proposed methods with Algorithm FE is given.

In Section 4.1, we give the method of filling the missing values with fuzzy set values. Furthermore, we obtain the method of filling the missing values with set values. By combining Section 4.1.1 and Section 4.1.2, we design Algorithm FMvSV to fill the missing values with set values (Algorithm 3).

Next, we provide a comparative study of the effectiveness for Algorithms FMvSV and FE. We first give a quantitative index of the effectiveness for filling missing values as follows.

	Algorithm 3: Algorithm FMvSV.
	Input: An IIS of discrete data, where and
	The missing values:
	Output: The filling set values:
(1)	for i = 1 to q do
(2)	for every t in do
(3)	Compute
(4)	for every x in do
(5)	Compute
(6)	end
(7)	Compute
(8)	if then
(9)
(10)	else
(11)
(12)	end if
(13)	end
(14)
(15)
(16)	end

	Algorithm 4: Algorithm FMvRV.
	Input: An IIS of continuous data, where and .
	The missing values: . The threshold: .
	Output: The filling real values: .
(1)	for i = 1 to q do
(2)	for every t in do
(3)	Compute
(4)	for every x in do
(5)	Compute
(6)	end
(7)	Compute
(8)	if then
(9)
(10)	else
(11)
(12)	end if
(13)	end
(14)	;
(15)
(16)	end

Definition 6. Given a complete information system of discrete data, suppose that the values are missing, and the filling set values are denoted as , respectively. Then, the correct rate of filling values is defined bywhere q is the number of the missing values, and

Example 6. In Table 4, , and . In Example 3, suppose that , , and in Table 4 are missing. Then, we obtain that , and . Thus, by Definition 6, we can compute thatTherefore, the correct rate of filling values is .
In this section, we use some subsets of the Adult dataset (see Table 7) for the experiments. Since these experiments require discrete values, we randomly select some subsets of discrete values from the Adult dataset. Table 8 gives three subsets of the Adult dataset.
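The correct rate of Definition 6 can be sketched as the fraction of missing values that are filled correctly. The sketch assumes that a set-valued fill counts as correct when the true value belongs to the filling set; this criterion and the function name `correct_rate_sets` are assumptions made for illustration.

```python
def correct_rate_sets(true_values, filled_sets):
    """Fraction of missing values whose true value lies in the filling set (assumed criterion)."""
    hits = sum(1 for v, s in zip(true_values, filled_sets) if v in s)
    return hits / len(true_values)


# Hypothetical example with q = 3 missing values.
rate = correct_rate_sets(["N", "H", "L"], [{"N"}, {"L"}, {"L", "H"}])
```

In this example, two of the three filling sets contain the true value.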

Experiment 1. The effects of experiment times on correct rates of filling values.
In this experiment, we mainly compare the efficiency of Algorithms FMvSV and FE by the dataset AD200 in Table 8. The steps are as follows:(i) attribute values are randomly selected from AD200 and supposed that they are missing(ii)By means of Algorithms FMvSV and FE, we can fill these missing values, and we can obtain the correct rate of every algorithmThe steps (i) and (ii) are repeated ten times, and the corresponding results are summarized in Table 9. Similarly, we also consider the cases of , and missing values in AD200. The results are shown in Figures 1 and 2.
Figures 1 and 2 show the following facts:(i)The correct rate of Algorithm FMvSV is higher than that of Algorithm FE.(ii)The number of missing values is given, but the missing values in AD200 are not necessarily the same in each experiment. The correct rates of Algorithm FMvSV have little change in each experiment when the number of missing values in AD200 is a fixed value. However, in the similar case, the correct rates of Algorithm FE fluctuate obviously in each experiment. This indicates that Algorithm FMvSV does well in stability.In Table 9, the mean value of ten correct rates related to Algorithm FMvSV can be considered as the correct rate of Algorithm FMvSV for missing values. The correct rate of Algorithm FE for missing values can be obtained similarly. Furthermore, for , and missing values in AD200, we also compute the correct rates of Algorithms FMvSV and FE. The results are shown in Table 10 and Figure 3.
Table 10 and Figure 3 show that the correct rate of Algorithm FMvSV decreases monotonically as the number of missing values increases, whereas the correct rate of Algorithm FE shows no clear monotonic trend. In addition, Table 10 and Figure 3 indicate that Algorithm FMvSV performs better than Algorithm FE. Furthermore, when the missing values are increased to , the correct rate of Algorithm FMvSV still achieves .
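The procedure of Experiment 1 (randomly hide some values, fill them, compute the correct rate, and repeat ten times) can be sketched as a small harness. The filler `mode_fill`, which returns the most frequent observed value of an attribute, is only a stand-in for a frequency-based baseline and is not claimed to be Algorithm FE of [30]; the names `mask_and_score`, `mode_fill`, and the toy rows are hypothetical.

```python
import random
from collections import Counter


def mask_and_score(rows, attrs, k, fill, rng):
    """Hide k randomly chosen cells, fill them, and return the fraction recovered."""
    cells = rng.sample([(i, a) for i in range(len(rows)) for a in attrs], k)
    masked = [dict(r) for r in rows]  # copy so the complete data stay untouched
    for i, a in cells:
        masked[i][a] = None
    hits = sum(1 for i, a in cells if fill(masked, i, a) == rows[i][a])
    return hits / k


def mode_fill(masked, i, a):
    """Stand-in baseline: most frequent observed value of attribute a."""
    counts = Counter(r[a] for r in masked if r[a] is not None)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical complete dataset; ten repetitions with 3 hidden values each.
rows = [{"a": "x", "b": "p"}] * 8 + [{"a": "y", "b": "q"}] * 2
rng = random.Random(0)
rates = [mask_and_score(rows, ["a", "b"], 3, mode_fill, rng) for _ in range(10)]
mean_rate = sum(rates) / len(rates)
```

The mean of the ten rates corresponds to the way a single correct rate per number of missing values is obtained from repeated experiments.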


Experiment 2. The effects of data size on correct rates of filling values.
In this experiment, we use AD200, AD400, and AD800 to discuss the effects of data size on correct rates of Algorithms FMvSV and FE. For missing values, similar to the calculating method of Table 10, we can obtain the correct rate of Algorithm FMvSV (or FE) in this experiment. The results are shown in Figure 4.
Figure 4 reflects the following facts:
(i) When the data size increases under the same missing rate, the correct rate of Algorithm FMvSV remains basically the same and is higher than that of Algorithm FE. Therefore, to improve the efficiency of Algorithm FMvSV, we can divide a dataset into several small datasets and then fill the missing values in each.
(ii) As the data size increases, the difference between the correct rates of Algorithms FMvSV and FE becomes larger. This illustrates that the advantage of Algorithm FMvSV grows with the data size, that is, Algorithm FMvSV has an advantage in processing big datasets and dynamically increasing datasets.


5.2. Effectiveness Analysis of the Algorithm of Filling Missing Values in IISs of Continuous Data

In this section, we also conduct two groups of experiments. They are still used to compare the effectiveness of algorithms of filling missing values from different points of view. Mean-based filling method (Algorithm MEAN) [31] is a common method of filling missing data for an IIS of continuous data. Next, a comparison of the proposed methods with Algorithm MEAN is provided.

In Section 4.2, we obtain the method of filling the missing values with real values. By combining Section 4.2.1 and Section 4.2.2, we design Algorithm FMvRV to fill the missing values with real values (Algorithm 4).

Next, we provide a comparative study of the effectiveness for Algorithms FMvRV and MEAN. We first give a quantitative index of the effectiveness for filling missing values as follows.

Definition 7. Given a complete information system of continuous data, suppose that the values are missing, and the filling set values are denoted as , respectively. Then, the correct rate of filling values is defined bywhere q is the number of the missing values, and , , …, and are thresholds corresponding to , , …, and .

Example 7. From Example 5, we know that the thresholds are , and . Similarly, we can compute that . It is clear that and . Therefore, we can calculate the correct rate: .
In this section, we use some subsets of Abalone dataset (see Table 8) to experiment. Table 11 gives three subsets of Abalone dataset.
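The correct rate of Definition 7 can be sketched in the same spirit as in the discrete case. The sketch assumes that a real-valued fill counts as correct when it deviates from the true value by no more than the threshold of the corresponding attribute; this criterion and the function name `correct_rate_real` are assumptions made for illustration.

```python
def correct_rate_real(true_values, filled_values, thresholds):
    """Fraction of fills within the attribute threshold of the true value (assumed criterion)."""
    hits = sum(
        1
        for v, f, d in zip(true_values, filled_values, thresholds)
        if abs(v - f) <= d
    )
    return hits / len(true_values)


# Hypothetical example with q = 2 missing values and thresholds of 0.2.
rate_real = correct_rate_real([5.1, 3.0], [5.0, 3.6], [0.2, 0.2])
```

Here the first fill is within the threshold and the second is not.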

Experiment 3. The effects of experiment times on correct rates of filling values about continuous data.
In this experiment, we mainly compare the efficiency of Algorithms FMvRV and MEAN by AB200 in Table 11. The steps are as follows:(i) attribute values are randomly selected from AB200 and supposed that they are missing(ii)By means of Algorithms FMvRV and MEAN, we can fill these missing values, and we can obtain the correct rate of every algorithmThe steps (i) and (ii) are repeated ten times, and the corresponding results are summarized in Table 12. Similarly, we also consider the cases of , , , , and missing values in AB200. The results are shown in Figure 5.
Figure 5 shows that Algorithm FMvRV is more stable than Algorithm MEAN. Furthermore, the correct rate of Algorithm FMvRV is higher than that of Algorithm MEAN. This indicates that Algorithm FMvRV can predict missing values more accurately, which is meaningful for the correct classification of incomplete datasets.
In Table 12, the mean value of ten correct rates related to Algorithm FMvRV can be viewed as the correct rate of Algorithm FMvRV for missing values. The correct rate of Algorithm MEAN for missing values can be computed similarly. Furthermore, for , , , , and missing values in AB200, we also compute the correct rates of Algorithms FMvRV and MEAN. The results are shown in Figure 6.
Figure 6 shows that the correct rate of Algorithm FMvRV decreases almost monotonically as the number of missing values increases, whereas the correct rate of Algorithm MEAN shows no clear monotonic trend. In addition, Figure 6 indicates that Algorithm FMvRV performs better than Algorithm MEAN. Furthermore, when the missing values are increased to , the correct rate of Algorithm FMvRV is more than , whereas the correct rate of Algorithm MEAN is less than . This indicates that Algorithm FMvRV is more conducive to predicting the missing values.


Experiment 4. The effects of data size on correct rates of filling values about continuous data.
In this experiment, we use AB200, AB400, and AB800 to discuss the effects of data size on correct rates of Algorithms FMvRV and MEAN. For missing values, similar to the calculating method of Table 12, we can obtain the correct rate of Algorithm FMvRV (or MEAN) in this experiment. The results are shown in Figure 7.
Figure 7 reflects the following facts:
(i) When the data size increases, the correct rate of Algorithm FMvRV is higher than that of Algorithm MEAN.
(ii) When the missing values are less than , the correct rates of Algorithm FMvRV are almost unchanged and close to . In this case, the data size has little effect on the correct rates of Algorithm FMvRV. This illustrates that Algorithm FMvRV has obvious advantages in processing big datasets when the missing values are less than .


6. Application of the Algorithms of Filling Missing Values in Investigating IISs

When we apply the rough set approach to investigate an IIS, a key step is to induce a binary relation from the IIS. For an IIS, we can provide three ways to obtain a binary relation from the IIS. Let be an and . Then, the three ways are as follows:(1)By equation (2), we can obtain the binary relation .(2)By Algorithm 1, we can fill the missing values in IISs with fuzzy set values. Then we can also view the other values of attributes as fuzzy set values, for example, in Table 3 of Example 2, , we can see as the fuzzy set value . Based on this discussion, we can transform an IIS into a FSvIS. Thus, according to equation (8), we can obtain the binary relation .(3)In an IIS of discrete data, if the value of x under attribute a is not missing, we can view as a set value . Based on this consideration, we can use Algorithm FMvSV to transform an IIS into a set-valued information system. Then, we can obtain the following binary relation [32]:

In this section, through a comparative research on these binary relations induced from the same IIS, we further show that our algorithms are meaningful for the studies of IISs. We choose three datasets, i.e., Mammographic dataset, Abalone dataset, and Car dataset, to carry out the comparative research. The summary information of Mammographic dataset and Abalone dataset is shown in Table 13. The Car dataset is shown in Table 14 [33]. Mammographic dataset and Abalone dataset are taken from UCI (http://archive.ics.uci.edu/ml/datasets.php).

Firstly, we introduce a new measure to evaluate the similarity degree between binary relations.

Definition 8. Let and be binary relations on a nonempty set U. The similarity degree of and is defined as

Example 8. For the car dataset given by Table 14, we can obtain the binary relations , , and as follows:where we take . Similarly, for every dataset given by Table 13, we can also compute the corresponding binary relations , , and , where we choose . Then, by Definition 8, we calculate the similarity degrees between , , and . The result is shown in Table 15.
Table 15 reflects the following facts.
We know that the binary relation induced by a dataset can be considered as the classification result of objects, where the elements in a successor neighbourhood with respect to the binary relation are a class. In this example, for Breast cancer dataset, the similarity degrees between relations are almost close to 1. This means that missing data have less impact on the classification of Breast cancer datasets. Thus, we may ignore these missing values in addressing this dataset. In contrast, the relations induced by Car dataset have low similarity degrees. This shows that missing values in Car dataset play an important role in the classification of this dataset. A natural question is which relation is better to investigate Car dataset. In Table 15, we can see that the similarity degree between and is higher than that between and . Furthermore, the similarity degree between and is higher than that between and . This indicates that is a good choice to be used to investigate Car dataset. Note that is determined by using Algorithm FMvSV. This illustrates that Algorithm FMvSV is important for the studies of IISs.
At the end of this section, we apply the uncertainty measure to estimate the importance of the proposed algorithm. In Example 8, we list three binary relations , , and with respect to Car dataset. By Definition 3, we can compute their entropies, which are shown in Table 16.
We know that entropy can measure the granularity of a binary relation. Proposition 1 shows that the finer the binary relation is, the higher its entropy is. Conversely, if the entropy of a binary relation is high, then the relation should be fine. Thus, Table 16 indicates that and are finer than . That is to say, and can provide more information for the studies of IISs. According to the above discussion, we know that and are obtained in terms of the proposed algorithms. This illustrates that the proposed algorithms are useful for investigating IISs.
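The link between fineness and entropy can be checked numerically. Definition 3 is not reproduced in this part of the text, so the sketch below assumes an entropy of the form H(R) = (1/|U|) Σ log2(|U| / |R_s(x)|), where R_s(x) is the successor neighbourhood of x. This form, the function name, and the requirement that every neighbourhood be nonempty are assumptions for illustration only.

```python
import math
from typing import Hashable, Set, Tuple

Relation = Set[Tuple[Hashable, Hashable]]


def relation_entropy(relation: Relation, universe: Set[Hashable]) -> float:
    """Assumed entropy: mean of log2(|U| / |R_s(x)|) over all x in U."""
    n = len(universe)
    sizes = {x: 0 for x in universe}
    for x, _ in relation:
        sizes[x] += 1  # relation is a set of pairs, so each successor is counted once
    if any(size == 0 for size in sizes.values()):
        # The assumed formula is undefined for empty successor neighbourhoods.
        raise ValueError("every object needs a nonempty successor neighbourhood")
    return sum(math.log2(n / size) for size in sizes.values()) / n


# Hypothetical toy relations on a four-object universe.
U = {1, 2, 3, 4}
identity = {(x, x) for x in U}                # finest reflexive relation
blocks = identity | {(1, 2), (2, 1), (3, 4), (4, 3)}
universal = {(x, y) for x in U for y in U}    # coarsest relation

print(relation_entropy(identity, U))   # 2.0
print(relation_entropy(blocks, U))     # 1.0
print(relation_entropy(universal, U))  # 0.0
```

In this toy case the entropy decreases as the relation becomes coarser, which agrees with the direction stated in Proposition 1.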
Finally, a similar discussion can also be made about continuous dataset. We omit it here.

7. Conclusion

This paper established the FSvIS, which is an extension of the PSvIS. By means of the FSvIS, we constructed algorithms to fill missing values in IISs and carried out several experiments to analyze their effectiveness. The experimental results indicate that these algorithms are useful for investigating IISs. Several interesting issues remain open. First, we will further study the relationship between FSvISs and existing information systems, as well as applications of FSvISs. Second, the uncertainty measures for fuzzy relations established in [34] can be applied to investigate the FSvIS defined in this paper. Finally, we will conduct a more comprehensive analysis of the impact of missing values on IISs.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Foundation of Shanxi Normal University (grant no. 872022).

References

[1] Z. A. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[2] G. Cattaneo and D. Ciucci, “Investigation about time monotonicity of similarity and preclusive rough approximations in incomplete information systems,” Rough Sets and Current Trends in Computing, vol. 3066, pp. 38–48, 2004.
[3] J. Dai and Q. Xu, “Approximations and uncertainty measures in incomplete information systems,” Information Sciences, vol. 198, pp. 62–80, 2012.
[4] S. Greco, B. Matarazzo, and R. Słowinski, “Handling missing values in rough set analysis of multi-attribute and multi-criteria decision problems,” in Lecture Notes in Computer Science, vol. 1711, pp. 146–157, Springer, Berlin, Germany, 1999.
[5] J. W. Grzymala-Busse and W. Rzasa, “Local and global approximations for incomplete data,” in Rough Sets and Current Trends in Computing, vol. 4259, pp. 244–253, Springer, Berlin, Germany, 2006.
[6] R. Jensen and Q. Shen, “Interval-valued fuzzy-rough feature selection in datasets with missing values,” in Proceedings of 2009 IEEE International Conference on Fuzzy Systems, pp. 610–615, Jeju Island, South Korea, August 2009.
[7] Z. Meng and Z. Shi, “A fast approach to attribute reduction in incomplete decision systems with tolerance relation-based rough sets,” Information Sciences, vol. 179, no. 16, pp. 2774–2793, 2009.
[8] Y. Qian, J. Liang, D. Li, F. Wang, and N. Ma, “Approximation reduction in inconsistent incomplete decision tables,” Knowledge-Based Systems, vol. 23, no. 5, pp. 427–433, 2010.
[9] J. Stefanowski and A. Tsoukias, “Incomplete information tables and rough classification,” Computational Intelligence, vol. 17, no. 3, pp. 545–566, 2001.
[10] M. Kryszkiewicz, “Rough set approach to incomplete information systems,” Information Sciences, vol. 177, pp. 41–73, 2007.
[11] Y. Leung and D. Y. Li, “Maximal consistent block technique for rule acquisition in incomplete information systems,” Information Sciences, vol. 153, pp. 85–106, 2003.
[12] J. Dai, “Rough set approach to incomplete numerical data,” Information Sciences, vol. 241, pp. 43–57, 2013.
[13] M. Kryszkiewicz, “Rough set approach to incomplete information systems,” Information Sciences, vol. 112, no. 1–4, pp. 39–49, 1998.
[14] D. Liu, T. Li, and J. Zhang, “A rough set-based incremental approach for learning knowledge in dynamic incomplete information systems,” International Journal of Approximate Reasoning, vol. 55, no. 8, pp. 1764–1786, 2014.
[15] J. Liang and Z. Xu, “The algorithm on knowledge reduction in incomplete information systems,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 1, pp. 95–103, 2002.
[16] M.-W. Shao and W.-X. Zhang, “Dominance relation and rules in an incomplete ordered information system,” International Journal of Intelligent Systems, vol. 20, no. 1, pp. 13–27, 2005.
[17] W. Xu, Y. Li, and X. Liao, “Approaches to attribute reductions based on rough set and matrix computation in inconsistent ordered information systems,” Knowledge-Based Systems, vol. 27, pp. 78–91, 2012.
[18] J. Yuan, M. Chen, T. Jiang, and T. Li, “Complete tolerance relation based parallel filling for incomplete energy big data,” Knowledge-Based Systems, vol. 132, pp. 215–225, 2017.
[19] J. Chen and J. Shao, “Jackknife variance estimation for nearest-neighbor imputation,” Journal of the American Statistical Association, vol. 96, no. 453, pp. 260–269, 2001.
[20] S. Wang, “Classification with incomplete survey data: a Hopfield neural network approach,” Computers and Operations Research, vol. 32, no. 10, pp. 2583–2594, 2005.
[21] A. S. Salama and O. G. El-Barbary, “Topological approach to retrieve missing values in incomplete information systems,” Journal of the Egyptian Mathematical Society, vol. 25, no. 4, pp. 419–423, 2017.
[22] Z. Pawlak and A. Skowron, “Rudiments of rough sets,” Information Sciences, vol. 177, no. 1, pp. 3–27, 2007.
[23] C. Wang, Q. He, M. Shao, Y. Xu, and Q. Hu, “A unified information measure for general binary relations,” Knowledge-Based Systems, vol. 135, pp. 18–28, 2017.
[24] Y. Yao, “Constructive and algebraic methods of the theory of rough sets,” Information Sciences, vol. 109, no. 1–4, pp. 21–47, 1998.
[25] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[26] W. Zeng and H. Li, “Inclusion measures, similarity measures, and the fuzziness of fuzzy sets and their relations,” International Journal of Intelligent Systems, vol. 21, no. 6, pp. 639–653, 2006.
[27] Y. Li, K. Qin, and X. He, “Some new approaches to constructing similarity measures,” Fuzzy Sets and Systems, vol. 234, pp. 46–60, 2014.
[28] G. Deng, Y. Jiang, and J. Fu, “Monotonic similarity measures between fuzzy sets and their relationship with entropy and inclusion measure,” Fuzzy Sets and Systems, vol. 287, pp. 97–118, 2016.
[29] Y. Huang, T. Li, C. Luo, H. Fujita, and S.-j. Horng, “Dynamic variable precision rough set approach for probabilistic set-valued information systems,” Knowledge-Based Systems, vol. 122, pp. 131–147, 2017.
[30] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, pp. 110–121, 2011.
[31] B. Twala, M. Cartwright, and M. Shepperd, “Ensemble of missing data techniques to improve software prediction accuracy,” in Proceedings of the 28th International Conference on Software Engineering, pp. 909–912, Shanghai, China, May 2006.
[32] Y. Y. Yao, “Information granulation and rough set approximation,” International Journal of Intelligent Systems, vol. 16, no. 1, pp. 87–104, 2001.
[33] M. Kryszkiewicz, “Rules in incomplete information systems,” Information Sciences, vol. 113, no. 3-4, pp. 271–292, 1999.
[34] C. Wang, Y. Huang, M. Shao, and D. Chen, “Uncertainty measures for general fuzzy relations,” Fuzzy Sets and Systems, vol. 360, pp. 82–96, 2019.

Copyright

Copyright © 2019 Zhaohao Wang and Xiaoping Zhang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
