Abstract
A Boolean-valued information system (BIS) is an application of a soft set in which the data are mapped in binary form; it is used in applications including, but not limited to, decision-making, medical diagnosis, game theory, and economics. A BIS may be lost for several reasons, including virus attacks, improper entry, and machine errors. An earlier concept showed that an entirely lost BIS can be regenerated from four aggregate sets through supposition. Based on that concept, this paper presents an algorithm, named BISGA, that recalculates the entire BIS through a genetic algorithm (GA) and is more general and easier to implement than the supposition method. A solved example explains how BISGA works. Furthermore, BISGA is implemented in Python and evaluated on both UCI benchmark datasets and randomized datasets to check its efficiency and accuracy. Results show that the lost BIS is recovered with significant accuracy; however, the efficiency drops as the size of the BIS increases. This novel approach may help practitioners recalculate an entirely lost BIS, which in turn supports the decision-making process and its conclusions.
1. Introduction
A soft set is used to represent uncertain and vague data as crisp and clear data. Pawlak introduced the concepts of soft sets in 1994 [1]. Molodtsov then defined them in 1999 [2]. Soft sets can be used in decision-based applications such as game theory [3], medical diagnosis [4], and financial problems [5]. A BIS is an application of a soft set. It maps the values of the soft set into a table in binary form, which helps in finding more appropriate choices by weighing all objects. The objects that satisfy more parameters are considered the best choice. Table 1 shows the representation of a BIS.
A wrong decision can be made with an incomplete BIS, which can result in a loss to an organization or to individuals. A BIS having incomplete data is called an incomplete information system (IIS). There are several causes of an IIS, including improper data entry, errors in communication, and virus attacks. Researchers have been trying to solve this problem. Their work falls into two categories: the preprocessed category, which uses parity bits [6], supported sets, or aggregate sets [7], and the unprocessed category. Supported sets, aggregate sets, or parity bits can be extracted from the BIS before the data are lost or corrupted, and the lost data can then be recovered by using these sets [8]. The unprocessed category uses the remaining available data in the BIS to recover the lost data. Techniques in the unprocessed category include weighted average [9], probability [10], DFIS [11], ADFIS [12], and DFPAIS [13]. Table 2 shows an incomplete Boolean-valued information system, where missing data are represented by “.”
Rose et al. [6] first introduced the concept of data filling in soft sets through parity bits, in which 0s and 1s are inserted to make the number of 1s in the data even or odd. They then introduced the concept of aggregate sets [7]. There are four aggregate sets: the row aggregate set, the column aggregate set, the left-right diagonal aggregate set, and the right-left diagonal aggregate set, as shown in Tables 3–5. The sum of the values of each row is recorded in the row aggregate set, and the sum of the values of each column is recorded in the column aggregate set. The left-right and right-left diagonal aggregate sets are calculated in the same way as the row and column aggregate sets. The aggregate method is more accurate than parity bits for filling partially missing values in the BIS.
Khan et al. [8] introduced the concept of recalculating the entire BIS from these four aggregates. They solved a problem manually using nonsimultaneous linear equations as a proof of concept. They inserted 1s and 0s according to universal and empty sets, where a universal set is one in which every value is 1 and an empty set is one in which every value is 0. They also made suppositions when necessary. When a supposition failed, they had to backtrack and start an alternative supposition from the binary domain. Because of these suppositions, that solution is hard to implement, and even if it is implemented, it would not be as generic across BISs as a genetic algorithm (GA). Furthermore, many different BISs can have the same sets of aggregates because of circular patterns in the BIS, as shown in Figure 1, which contains four BISs with the same sets of aggregates. More interestingly, all four tables in Figure 1 have the same aggregates as the example solved manually in our base paper by Khan et al. [8], where the authors recalculated only one table and did not address the implementation of their technique or the other tables that an implementation would have produced. This appears to be a weakness of their approach. It should be noted that each of the different recalculated BISs still satisfies all aggregates and hence can be used for decision purposes. This investigation becomes possible by applying GA to the recalculation of BISs. However, further investigation will be required to recalculate only the original BIS from the set of aggregates.

GA [14] is a bio-inspired metaheuristic algorithm used for search and optimization problems in many domains. The main principle of GA is survival of the fittest: the algorithm tries to select the fittest individuals, or chromosomes, from the available population. Two powerful operators, crossover [15] and mutation [16], are applied to the selected chromosomes to further filter the best alleles and genes from them. The fitness of the crossed and mutated chromosomes is then checked, and the best of them are crossed over and mutated again, coming closer to the fittest solution over several iterations until a satisfactory solution is found.
Therefore, this paper recalculates the entire BIS in soft sets from aggregates using GA and presents an algorithm named BISGA. This article has the following four initial contributions, while the main aim is proposing BISGA:
(1) To identify the constraints which can be applied to chromosomes to narrow down the search space
(2) To find the appropriate genetic operators and customize them to be used in the 2D environment
(3) To derive a four-dimensional fitness function
(4) To analyze the accuracy by comparing with the original BIS and the efficiency through the average number of generations
The rest of the paper is organized as follows: the literature review is provided in Section 2, the proposed algorithm is discussed in Section 3, Section 4 contains the results and discussion, and the paper is concluded in Section 5.
2. Literature Review
This section consists of three subsections. In the first subsection, we provide the literature review for soft sets. In the second subsection, we discuss incomplete soft sets and the techniques for handling incomplete soft sets and, in the last subsection, we provide some discussion on the genetic algorithm.
2.1. Soft Sets
A soft set is defined as follows: “Let U be an initial universal set and P be a set of parameters. A pair (F, P) is called a soft set (over U) if and only if F is a mapping of P into the set of all subsets of the set U.”
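The definition can be illustrated with a minimal Python sketch. The object names, parameter names, and the helper `to_bis` are hypothetical and chosen only for illustration; they are not taken from the tables of this paper. The mapping F is stored as a dictionary from each parameter to the subset of U that satisfies it, and the Boolean table is obtained by testing membership.

```python
# Hypothetical universe and parameters, chosen only for illustration.
U = ["u1", "u2", "u3"]
P = ["p1", "p2"]

# F maps each parameter in P to a subset of U (the objects that satisfy it).
F = {
    "p1": {"u1", "u3"},
    "p2": {"u2"},
}

def to_bis(universe, params, mapping):
    """Tabulate a soft set as a 0/1 table: one row per object, one column per parameter."""
    return [[1 if obj in mapping[p] else 0 for p in params] for obj in universe]

bis = to_bis(U, P, F)
for obj, row in zip(U, bis):
    print(obj, row)
```

Each row of the resulting table corresponds to one object, so an object that satisfies more parameters has more ones in its row.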
For example, let U be a set of phones and P be a set of a few features representing “latest phone,” “budget,” “wireless charging,” “5G,” “high-resolution camera,” and “edge,” respectively.
We suppose that the latest phones are , , , and , the budget phone is , the phones with wireless charging features are , , , and , the 5G phones are , , , and , the phones , , , and have a high-resolution camera, and , , , , and are the edge phones. These data are represented in Table 1. The data tell us that is not a budget phone, but it contains all the features which makes it a more appropriate choice:
Majumdar and Samanta [17] used a soft set to diagnose diseases. Moreover, Kharal [5] used a soft set to point out financial problems. Furthermore, Deli and Cagman [3] demonstrated the soft set theory applications in game theory. They employed some set operations which determine the soft game’s solution, making the game easy to apply.
If the data are not completely available, decisions become difficult to make; such a system is called a BIS with missing information, as shown in Table 2.
2.2. BIS with Missing Information
The information in the BIS can be lost because of errors, improper entry, virus attacks, etc.; the result is called an incomplete information system. Researchers have proposed techniques for finding missing or lost data in an incomplete soft set. These techniques are divided into two main categories: the preprocessed category and the unprocessed category.
The first attempt in the unprocessed category was made by Zou and Xiao [9], who presented the weighted average technique, in which a decision can be made without finding the missing data. Kong et al. [10] presented a probability-based technique to estimate the missing or lost data; this technique yields values in the range between 0 and 1. Qin et al. [11] presented DFIS, which recovers the lost data by using associations between parameters. Consistent associations capture parameters that take the same values, and inconsistent associations capture parameters that take opposite values. First, the associations between all parameters are calculated, and the lost data are then recovered according to them. Khan et al. [12] proposed in ADFIS that the associations must be recalculated after each iteration: when one piece of data is decisively recovered, the associations involving that piece of data are recalculated, and this continues with the recently recovered data until the final piece of data is recovered. Kong et al. [13] presented cases showing that ADFIS does not always work. They suggested in DFPAIS that the decision should not be based on the highest association alone; all associations should take part in the decision process, with the highest association having the largest impact but not sole authority. These works were tested on four UCI benchmark datasets, namely, the zoo, flag, congressional, and heart datasets.
The proposed work belongs to the preprocessed category. Therefore, we discuss that category with the necessary mathematical details, as some of its equations are used in the proposed work. Preprocessed category techniques are usable only when some compressed or extracted data of the lost data are available. Khan et al. [8] proposed that the entire BIS can be regenerated. They used four aggregate sets which were introduced by Mohd Rose et al. [7]. Two of these four aggregate sets are the sum of the 1s of every row and the sum of the 1s of every column, yielding the row aggregate and column aggregate sets, respectively, as shown in Table 3. A row aggregate can be found mathematically as follows: where “R” denotes a row, “u” is the current row, and “” represents all the columns of the BIS.
Similarly, a column aggregate can be found as follows: where “C” denotes a column, “” is the current column, and “u” represents all the rows of the BIS.
Since diagonals can be traversed in two directions, the other two aggregates are the left-right (LR) and right-left (RL) diagonal aggregates. The aggregates of these LR and RL diagonals are calculated as the arithmetic sum of the values in each diagonal, as shown in Tables 4 and 5, where each diagonal is highlighted in a single color. The number of LR and RL diagonals is calculated using the equation as follows: where and are the number of rows and columns in the BIS, respectively.
Both LR and RL diagonal aggregates can be calculated in two steps as given below.
Case 1. When , i.e., the LR diagonals starting from the first row and ending with the last column and the RL diagonals starting from the first row and ending with the first column, we obtain
Case 2. When , i.e., the LR diagonals starting from the first column and ending with the last row and the RL diagonals starting from the last column and ending with the last row, we obtain
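The four aggregate sets described in this subsection can be computed with a short Python sketch. The function name `aggregates` and the ordering of the diagonals are assumptions made here for illustration: cells on the same LR diagonal share the value j - i, cells on the same RL diagonal share the value i + j, and a table with m rows and n columns has m + n - 1 diagonals in each direction. The two cases above are handled together by this indexing rather than separately.

```python
def aggregates(bis):
    """Return (row, column, LR-diagonal, RL-diagonal) aggregate sets of a 0/1 table."""
    m, n = len(bis), len(bis[0])
    rows = [sum(row) for row in bis]
    cols = [sum(bis[i][j] for i in range(m)) for j in range(n)]
    lr = [0] * (m + n - 1)  # top-left to bottom-right diagonals
    rl = [0] * (m + n - 1)  # top-right to bottom-left diagonals
    for i in range(m):
        for j in range(n):
            # Assumed ordering: LR diagonals indexed by j - i (shifted to start at 0),
            # RL diagonals indexed by i + j.
            lr[j - i + (m - 1)] += bis[i][j]
            rl[i + j] += bis[i][j]
    return rows, cols, lr, rl

# Small hypothetical table, not one of the paper's tables.
example = [[1, 0, 1],
           [0, 1, 1]]
rows, cols, lr, rl = aggregates(example)
print(rows, cols, lr, rl)
```

Every cell contributes to exactly one entry of each aggregate set, so all four sets sum to the total number of ones in the table.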
2.3. Genetic Algorithm
The genetic algorithm is a metaheuristic algorithm presented by J. H. Holland in 1975 [14], which mimics the Darwinian theory of evolution. It works on the idea of survival of the fittest. A random set of individuals, also called chromosomes, is generated initially. A fitness function is provided to assess the fitness of all individuals. The population passes through the genetic operators of the algorithm, namely, selection, crossover, and mutation, to generate new individuals. The search is terminated when a solution meets the criteria of the fitness function.
Selection operators are used to select individuals from the population for breeding [18]. In roulette wheel selection, a wheel or pie is divided among the individuals based on their fitness values. Individuals having better fitness values take a more significant slice of the pie.
Crossover operators are used to exchange information among individuals [15]. Two or more individuals are taken, and at least one new individual is produced. Single-point crossover cuts each parent's chromosome at one point and exchanges the resulting segments between the two parents.
Mutation operators mutate the newly born individuals to diversify them from current individuals [16]. In bit flip mutation, bits are flipped randomly. Zero becomes one, and one becomes zero, whereas swap mutation swaps the genes in the chromosome.
Aj and Pd [19] and Larranaga et al. [20] reported that binary genetic algorithms have difficulties with irregular patterns and Hamming cliffs and struggle to attain accuracy. A two-dimensional GA was presented by Tsai et al. [21] for airline scheduling problems; in their study, 2D genomes were treated as a single-dimensional array while applying the crossover operator. A recent work on GA addresses multivariate missing data imputation, including the filling of continuous and discrete missing data [22]; it differs from the proposed work because it fills partially missing information rather than the entire matrix.
3. Proposed Work: BISGA
This section of the proposed work mainly consists of step-by-step BISGA accompanied with a step-by-step solved example. The algorithm for BISGA is presented in pseudocode 2 and visualized with the help of a flowchart shown in Figure 2.

3.1. Population
For BISGA, the population is drawn from all possible BISs of the same size. Each BIS is called a chromosome or individual. A chromosome has the dimensions of the table: the number of columns equals the length of the column aggregate set, and the number of rows equals the length of the row aggregate set.
The initial population is produced by randomly assigning ones and zeros to the table, subject to a constraint of row aggregate satisfaction. This constraint is considered to be the first contribution of BISGA and is stated as follows: “each row must be assigned the number of ones that are equal to the value of its aggregate.” With this constraint, BISGA does not need to calculate and check the fitness of row aggregates at each iteration; only the other three aggregates need to be checked for satisfaction. The constraint not only increases the efficiency of BISGA, apparently by 25%, but also improves its performance, because the overall number of ones in each chromosome is exactly equal to the number in the original BIS and remains fixed throughout the execution. At least two chromosomes, or parents, must be selected for crossover. Selecting as many parents as possible increases the chances of finding the fittest chromosomes at this early stage.
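The row aggregate constraint on the initial population can be sketched in Python as follows. The function names and the `rng` parameter are assumptions introduced for illustration; the sketch simply places the required number of ones in each row and shuffles the row.

```python
import random

def random_individual(row_aggregates, n_cols, rng=random):
    """Build one chromosome whose rows already satisfy the row aggregates."""
    individual = []
    for ones in row_aggregates:
        row = [1] * ones + [0] * (n_cols - ones)
        rng.shuffle(row)  # random placement of the ones within the row
        individual.append(row)
    return individual

def initial_population(row_aggregates, n_cols, size, rng=random):
    """Generate `size` chromosomes, each satisfying the row aggregate constraint."""
    return [random_individual(row_aggregates, n_cols, rng) for _ in range(size)]

# Row aggregates 3, 2, 1, 2 for a 4 by 4 table, as in the solved example of Section 3.6.
population = initial_population([3, 2, 1, 2], n_cols=4, size=4, rng=random.Random(0))
for chromosome in population:
    print(chromosome)
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing the number of generations across experiments.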
3.2. Selection
For selecting parents from the population for crossover, both the roulette wheel selection and tournament selection operators are suitable for the problem, and both have a similar impact. The results reported later use roulette wheel selection.
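A minimal sketch of roulette wheel selection is given below. Because a lower fitness is better in BISGA, the fitness must be converted into a slice size; the weight 1 / (1 + fitness) used here is an assumption made for illustration and is not stated in this paper. The function name `roulette_select` is likewise hypothetical.

```python
import random

def roulette_select(population, fitnesses, k=2, rng=random):
    """Pick k parents with probability proportional to 1 / (1 + fitness).

    Lower fitness means a larger slice of the wheel. The weighting is an
    assumed choice; parents are sampled with replacement.
    """
    weights = [1.0 / (1.0 + f) for f in fitnesses]
    return rng.choices(population, weights=weights, k=k)

# Hypothetical chromosome labels with fitness values 10, 14, 12, and 8.
parents = roulette_select(["c1", "c2", "c3", "c4"], [10, 14, 12, 8],
                          k=2, rng=random.Random(1))
print(parents)
```

Since sampling is with replacement, the same chromosome can be drawn twice; a caller that needs distinct parents would have to redraw in that case.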
3.3. Fitness Function
The fitness function in genetic algorithms is used to assess chromosomes or individuals. The fitness function is newly derived for BISGA and is considered to be its main contribution. Initially, the fitness for an aggregate x of an individual is calculated by finding the absolute difference between the individual’s aggregate sets and the actual aggregate sets using equation (9) as follows: where “ind” indicates the individuals of the current chromosome, N indicates the aggregate set length, and “act” indicates the actual aggregate sets of the BIS.
The fitness values of the individual aggregates are then added together to find the accumulative fitness of the whole individual using equation (10). The individual is considered fitter if the difference is lower or closer to zero. Zero fitness means that the input aggregates have been satisfied by the individual. The fitness function formula is given below. Further discussion of fitness can be found in the relevant part of the Discussion section.
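The two fitness steps described above can be sketched in Python. This is a minimal sketch under stated assumptions: the per-aggregate fitness is taken to be the sum over the N entries of the absolute difference between the individual's aggregate and the actual aggregate, and the accumulative fitness is the sum over the four aggregate sets. The function names and the diagonal indexing are assumptions introduced for illustration.

```python
def aggregates(bis):
    """Row, column, LR-diagonal, and RL-diagonal sums of a 0/1 table."""
    m, n = len(bis), len(bis[0])
    rows = [sum(row) for row in bis]
    cols = [sum(row[j] for row in bis) for j in range(n)]
    lr, rl = [0] * (m + n - 1), [0] * (m + n - 1)
    for i in range(m):
        for j in range(n):
            lr[j - i + (m - 1)] += bis[i][j]  # assumed LR indexing
            rl[i + j] += bis[i][j]            # assumed RL indexing
    return rows, cols, lr, rl

def aggregate_fitness(ind_set, act_set):
    """Fitness of one aggregate set: total absolute difference, entry by entry."""
    return sum(abs(a - b) for a, b in zip(ind_set, act_set))

def fitness(individual, actual_aggregates):
    """Accumulative fitness: sum of the four per-aggregate fitness values."""
    return sum(aggregate_fitness(ind_set, act_set)
               for ind_set, act_set in zip(aggregates(individual), actual_aggregates))

# Hypothetical 2 by 2 tables, not taken from the paper.
actual = aggregates([[1, 0], [0, 1]])
print(fitness([[1, 0], [0, 1]], actual))  # the original itself
print(fitness([[0, 1], [1, 0]], actual))  # same rows and columns, different diagonals
```

The second call shows why rows and columns alone are not enough: the two tables agree on both, and only the diagonal aggregates distinguish them.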
3.4. Crossover
Single-point crossover is proposed initially for BISGA. The first half of the rows from parent 1 and the second half of the rows from parent 2 form the first offspring. Similarly, the second half of parent 1 and the first half of parent 2 form the second offspring. If more than two parents are taken, they are crossed in the same fashion based on their fitness to produce more offspring. It should be noted that this type of crossover maintains the initial constraint applied to row aggregates, because rows are exchanged whole.
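The row-wise single-point crossover described above can be sketched as follows. The function name `row_crossover` and the choice of the midpoint `len(parent1) // 2` as the cut are assumptions for illustration; the two parent tables are hypothetical.

```python
def row_crossover(parent1, parent2):
    """Row-wise single-point crossover at the middle row.

    Rows are copied whole, so each offspring keeps the row aggregates
    that both parents already satisfy.
    """
    cut = len(parent1) // 2
    child1 = [row[:] for row in parent1[:cut] + parent2[cut:]]
    child2 = [row[:] for row in parent2[:cut] + parent1[cut:]]
    return child1, child2

# Two hypothetical parents with row aggregates 3, 2, 1, 2.
p1 = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0]]
p2 = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [1, 0, 0, 1]]
c1, c2 = row_crossover(p1, p2)
print(c1)
print(c2)
```

Copying each row with `row[:]` keeps the offspring independent of the parents, so a later mutation of an offspring does not alter a parent in place.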
3.5. Mutation
A row-wise swap mutation operator is used for mutation in BISGA. For BISs having fewer than one hundred values, one mutation in the whole BIS is enough, while for bigger BISs, mutation of one to two percent of all values is required. When there is one mutation in the whole BIS, the swap must be performed within a single row to maintain the constraint of row aggregate satisfaction. Similarly, when there is more than one mutation in a BIS, the constraint of row aggregate satisfaction must be respected whether the mutations are selected randomly in a single row or in multiple rows of an individual. More discussion related to mutation can be found in the relevant part of the Discussion section.
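A minimal sketch of the row-wise swap mutation is given below. The function name and parameters are hypothetical. One detail is an assumption made here rather than taken from the paper: the swap always exchanges a 1 with a 0 in the chosen row, because swapping two equal bits would leave the chromosome unchanged.

```python
import random

def swap_mutation(individual, n_swaps=1, rng=random):
    """Swap a 1 and a 0 inside randomly chosen rows; row aggregates are preserved."""
    mutated = [row[:] for row in individual]  # leave the input untouched
    for _ in range(n_swaps):
        row = rng.choice(mutated)
        ones = [j for j, v in enumerate(row) if v == 1]
        zeros = [j for j, v in enumerate(row) if v == 0]
        if ones and zeros:  # rows that are all 1s or all 0s cannot be swapped
            a, b = rng.choice(ones), rng.choice(zeros)
            row[a], row[b] = row[b], row[a]
    return mutated

# Hypothetical offspring with row aggregates 3, 2, 1, 2.
offspring = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0]]
mutant = swap_mutation(offspring, n_swaps=1, rng=random.Random(0))
print(mutant)
```

For larger tables, `n_swaps` could be set to one to two percent of the number of cells, in line with the rate suggested above.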
3.5.1. Pseudocode 1: Fitness Function
Requirements are as follows: row aggregate sets, column aggregate sets, left-right diagonal aggregate sets, right-left diagonal aggregate sets, and the individual’s chromosome:
(1) Find the aggregate sets of the individual (using equations (2) to (8))
(2) Find the fitness of each aggregate set for the individual (using equation (9))
(3) Calculate the fitness of the individual (using equation (10))
3.5.2. Pseudocode 2: BISGA
Requirements are as follows: aggregates of the actual BIS, chromosomes, and fitness function:
(1) Generate two or more chromosomes of BIS size
(2) Put the number of ones randomly in each row according to the size of that row aggregate and fill the remaining cells of chromosomes with zeros
(3) Calculate the fitness of each chromosome (using pseudocode 1)
(4) End if at least one chromosome has zero fitness, otherwise go to next step
(5) Crossover two or more fittest chromosomes with each other row wise and other fitter with each other and so on
(6) Mutate two percent of the genes row wise using swapping and go back to step 3
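The steps of pseudocode 2 can be put together in a compact Python sketch. It is an illustration under stated assumptions and not the implementation evaluated in this paper: chromosomes are paired in fitness order as in step (5), survivors are chosen by keeping the best of parents and offspring (an assumed replacement rule), the search stops at a generation cap, and all names, default values, and the diagonal indexing are hypothetical.

```python
import random

def aggregates(bis):
    """Row, column, LR-diagonal, and RL-diagonal sums of a 0/1 table."""
    m, n = len(bis), len(bis[0])
    rows = [sum(r) for r in bis]
    cols = [sum(r[j] for r in bis) for j in range(n)]
    lr, rl = [0] * (m + n - 1), [0] * (m + n - 1)
    for i in range(m):
        for j in range(n):
            lr[j - i + m - 1] += bis[i][j]
            rl[i + j] += bis[i][j]
    return rows, cols, lr, rl

def fitness(ind, target):
    """Total absolute difference between the individual's and the target aggregates."""
    return sum(abs(a - b)
               for got, want in zip(aggregates(ind), target)
               for a, b in zip(got, want))

def random_individual(row_agg, n_cols, rng):
    """Steps (1)-(2): rows filled with the required number of ones, then shuffled."""
    ind = []
    for ones in row_agg:
        row = [1] * ones + [0] * (n_cols - ones)
        rng.shuffle(row)
        ind.append(row)
    return ind

def crossover(p1, p2):
    """Step (5): row-wise single-point crossover at the middle row."""
    cut = len(p1) // 2
    return ([r[:] for r in p1[:cut] + p2[cut:]],
            [r[:] for r in p2[:cut] + p1[cut:]])

def mutate(ind, rate, rng):
    """Step (6): swap a 1 and a 0 within a row, at least once per individual."""
    out = [r[:] for r in ind]
    for _ in range(max(1, round(rate * len(out) * len(out[0])))):
        row = rng.choice(out)
        ones = [j for j, v in enumerate(row) if v]
        zeros = [j for j, v in enumerate(row) if not v]
        if ones and zeros:
            a, b = rng.choice(ones), rng.choice(zeros)
            row[a], row[b] = row[b], row[a]
    return out

def bisga(target, pop_size=20, rate=0.02, max_generations=2000, seed=None):
    """Return (best individual, generations used); the cap is an assumed safeguard."""
    rng = random.Random(seed)
    row_agg, col_agg = target[0], target[1]
    population = [random_individual(row_agg, len(col_agg), rng) for _ in range(pop_size)]
    for generation in range(max_generations):
        population.sort(key=lambda ind: fitness(ind, target))  # step (3)
        if fitness(population[0], target) == 0:                # step (4)
            return population[0], generation
        offspring = []
        for k in range(0, pop_size - 1, 2):  # fittest with next fittest, and so on
            for child in crossover(population[k], population[k + 1]):
                offspring.append(mutate(child, rate, rng))
        # Assumed replacement rule: keep the best pop_size of parents and offspring.
        population = sorted(population + offspring,
                            key=lambda ind: fitness(ind, target))[:pop_size]
    return population[0], max_generations

# Hypothetical 3 by 3 table used only to produce a set of target aggregates.
original = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
solution, generations = bisga(aggregates(original), seed=1)
print(generations, solution)
```

If the cap is reached, the function returns the best chromosome found so far, whose fitness may be nonzero; a returned table with zero fitness may be the original or an equivalent BIS with the same aggregates.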
3.6. Solved Example for BISGA
In this section, we demonstrate an example of BISGA as a proof of concept. Consider the 4 by 4 BIS in Table 3, which is referred to as the actual BIS; its aggregate sets are as follows:
3.6.1. Step 1: Initial population
Four chromosomes are generated randomly as the initial population as given in Figure 3, and the size of each chromosome is equal to the number of row aggregates multiplied by the number of column aggregates.

3.6.2. Step 2: Initial Constraints
The number of ones placed randomly in the first row is 3, which is equal to the first row aggregate. Similarly, two ones, one one, and two ones are inserted in the second, third, and fourth rows, respectively. The remaining cells are filled with zeros.
3.6.3. Step 3: Calculating Fitness
Considering the fitness calculation of chromosome 01, as row aggregates of the actual BIS and chromosome 01 are the same, the fitness for this aggregate is 0. Column aggregates of the actual BIS are , while those of chromosome 01 are . The absolute difference of these column aggregates is , and its total is 4, which is the fitness of the column aggregate for chromosome 01. Similarly, the fitness of the left-right and right-left diagonals for chromosome 01 is 2 and 4, respectively. Adding the fitness of each aggregate set (0 + 4 + 2 + 4) becomes 10 which is the fitness of chromosome 01. In a similar fashion, the fitness of other three chromosomes can be calculated for each chromosome in Figure 3.
3.6.4. Step 4: Selection
Based on the fitness of chromosomes calculated in the previous step, chromosome 01 and chromosome 04 are selected as the fittest among all for further operations. Selected chromosome 01 is shown with a black background and chromosome 04 with a white background, and the bold borders in Figure 1 help in understanding the next step of crossover.
3.6.5. Step 5: Crossover
Chromosome 01 and chromosome 04 are crossed over row wise such that the first two rows of chromosome 01 and the last two rows of chromosome 04 are combined to generate a new offspring, as shown in Figure 4. Similarly, the last two rows of chromosome 01 and the first two rows of chromosome 04 are combined to make the second offspring from the selected chromosomes. In a similar fashion, chromosome 02 and chromosome 03 are crossed over; the result is not shown here for simplicity.

3.6.6. Step 6: Mutation
Because the right numbers of 0s and 1s were already inserted in each chromosome by the initial constraint, only swap mutation can be used at this stage. Suppose the first and the last elements of the third row of offspring 01 are selected for mutation; then the first 1 will become 0 and the last 0 will become 1, as underlined in Figure 3. Hence, the aggregate value of row 3 will not be affected and will remain equal to 1, its previous value. Similarly, all other offspring will be mutated.
Following the flowchart of BISGA given in Figure 2 and pseudocode 2, the fitness of each offspring is now checked as in Step 3. If no offspring has zero fitness, Steps 3 to 6 are repeated until an offspring with zero fitness is found. In our example, offspring 01 after mutation is the same as the actual BIS, and the algorithm terminates after finding its fitness equal to zero.
4. Discussion and Results
In this section, we present some of the initial results obtained through BISGA after implementing it in Python. Second, the results are followed by some necessary discussions related to these results and BISGA operators.
4.1. Results
Two main results of BISGA are already included in this article in the form of Figure 1 and a solved example. Of these, the Figure 1 results are the most important because this example is taken from the base paper, where Khan et al. [8] recalculated only one BIS, while BISGA calculated four different tables, which shows that BISGA is more powerful than the Khan et al. [8] approach. In addition to those four BISs, a fifth BIS was calculated which is not shown in Figure 1. We mention it here in the text for the information of readers, who can verify that the aggregates of all these BISs are the same as those of the original BIS. The fifth BIS is [[1, 1, 0, 1, 0, 1, 1], [1, 0, 0, 1, 0, 1, 0], [0, 1, 1, 1, 0, 0, 1], [0, 0, 0, 0, 1, 1, 1], [1, 0, 0, 1, 0, 1, 0], [0, 1, 1, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1, 1]]. If BIS1 (Figure 1) is considered to be 100 percent accurate as it is the original BIS, then the accuracy of the other BISs is given in Figure 5.
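Readers who wish to check the fifth BIS against the tables of Figure 1 can compute its aggregates with a few lines of Python. The matrix below is copied from the text; the function name and the ordering of the diagonal aggregates are assumptions made for illustration, so the diagonal lists may need to be reversed to match the layout of Tables 4 and 5.

```python
fifth_bis = [
    [1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 1],
]

def aggregates(bis):
    """Row, column, LR-diagonal, and RL-diagonal sums of a 0/1 table."""
    m, n = len(bis), len(bis[0])
    rows = [sum(r) for r in bis]
    cols = [sum(r[j] for r in bis) for j in range(n)]
    lr, rl = [0] * (m + n - 1), [0] * (m + n - 1)
    for i in range(m):
        for j in range(n):
            lr[j - i + m - 1] += bis[i][j]  # assumed LR indexing
            rl[i + j] += bis[i][j]          # assumed RL indexing
    return rows, cols, lr, rl

rows, cols, lr, rl = aggregates(fifth_bis)
print("row aggregates:   ", rows)  # [5, 3, 4, 3, 3, 4, 5]
print("column aggregates:", cols)  # [4, 3, 3, 5, 3, 5, 4]
print("LR diagonals:     ", lr)
print("RL diagonals:     ", rl)
```

Running the same function on each table of Figure 1 and comparing the four lists is a direct way to confirm that the tables are equivalent in terms of their aggregates.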

When BISGA was run ten times for recalculating the BIS of the base example, the original BIS was calculated four times, while the other four equivalent BISs were calculated in remaining tests. The frequency of original and equivalent BIS recalculation through BISGA is given in Figure 6 which also shows the number of iterations used for these calculations. Figure 7 illustrates the progression of solutions through random generations. Wrong values are highlighted in the tables.


The second important result already presented is the self-explanatory step-by-step solved example which elaborates the concept of BISGA.
In addition to these two important results, we ran the Python implementation of BISGA with different sizes of the BIS on UCI benchmark datasets and dummy Boolean datasets. The UCI benchmark comprises the four datasets that were already used by ADFIS for data filling in soft sets [12]. First, 10 ∗ 10 values are taken from each of the four datasets. The average accuracy percentage and efficiency are given in Figure 8. The experiments are also performed on dummy data to obtain more generic results for the proposed algorithm.

Other experiments were conducted on random binary data. BISs of varied sizes, i.e., 5 ∗ 5, 5 ∗ 6, 6 ∗ 6, 6 ∗ 7, 7 ∗ 7, 7 ∗ 8, 8 ∗ 8, 8 ∗ 9, 9 ∗ 9, 9 ∗ 10, and 10 ∗ 10, were randomly created. Twenty random BISs were generated for each of these sizes. Then, their aggregates were extracted and given as input to BISGA. Every BIS was regenerated twenty times by the algorithm. The solution was then compared with the original BIS. The average accuracy achieved for these BISs is given in Figure 9. It should be noted that less than 100% accuracy for larger BISs means that different BISs were recalculated for the same BIS, which are equivalent to the original in terms of their aggregates. The average number of generations it took to reach the solution is given in Figure 10.
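The comparison of a recalculated BIS with the original can be sketched as a cell-by-cell agreement percentage. This metric is an assumption made for illustration, since the exact accuracy formula is not spelled out in the text; the function name and the example tables are hypothetical.

```python
def accuracy(recovered, original):
    """Percentage of cells in which the recovered table matches the original."""
    total = sum(len(row) for row in original)
    same = sum(a == b
               for rec_row, org_row in zip(recovered, original)
               for a, b in zip(rec_row, org_row))
    return 100.0 * same / total

# Hypothetical 2 by 2 tables that differ in one cell.
print(accuracy([[1, 0], [0, 1]], [[1, 0], [1, 1]]))  # 75.0
```

Under this reading, an equivalent BIS with zero fitness can still score below 100 percent, which is consistent with the note above about larger BISs.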


4.2. Discussion
Several questions raised by the results are discussed here. First, the recalculation base paper [8] also uses universal and empty aggregates in addition to supposition, which raises the question of whether the Khan et al. [8] technique may be more accurate in recalculating the original BIS. However, in the example of Figure 1, every equivalent BIS has the same universal and empty aggregates, so even if empty and universal aggregates were considered in BISGA, equivalent BISs would still be calculated.
Second, as the size of the BIS increases, two issues arise. The first issue is that it takes more iterations and more time to reach the solution, and the second issue is that the number of equivalent BISs increases, which makes it difficult to identify the original BIS. Both of these issues are natural for larger BISs, and specialized techniques and operators will be needed to minimize them.
Third, low accuracy in larger BISs does not mean that every BIS calculated has low accuracy, as is apparent from Figure 9. If the original BIS is recalculated, the accuracy is 100% in that case. For instance, three original BISs among 20 were recalculated for a 10 ∗ 10 BIS in our experiments, but the accuracy shown is the average of all 20, among which 17 are equivalent BISs. Therefore, the average has reduced the accuracy to 57%. It is worth noting here that the fitness of all those BISs, including the 17 inaccurate ones, is zero, and that is why they are called equivalent.
As mentioned earlier, we have checked BISGA on benchmark datasets, and we have also used dummy binary data for our experiments that may not reflect any specific soft set. However, BISGA works the same for any binary data regardless of whether it is a soft set or not. Similarly, 2D binary data from any other domain can be recalculated using BISGA.
It is important to discuss the initial constraint of row aggregate satisfaction for each individual. It fixes the total number of ones in the whole BIS to the sum of any one aggregate set, and hence fixes the number of zeros as well. Without that constraint, the initial population of BISs would be generated with random zeros and ones whose counts do not necessarily match any aggregate. It would then take more iterations to fix the numbers of ones and zeros and to arrange them so as to satisfy the fitness function. Further investigation would show whether applying the constraint to row aggregates is best or whether applying it to a diagonal aggregate yields a faster solution. The iterations of BISGA would be further reduced if such constraints could be applied initially to more than one aggregate in the future.
Last but not least is the question of why mutation is used in BISGA even though the values have a very limited binary domain and there is a 99.99% chance that both zeros and ones will be present in any reasonable population selected. Because the crossover used here is single point and row wise, without mutation a cell or gene may never receive its required bit in combination with the other bits in the same row or column. Mutation therefore provides the flexibility to change the bit in any cell where it is needed.
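The effect of the row aggregate constraint on the size of the search space can be made concrete with a short calculation. The sketch below is an illustration: it counts the tables that satisfy given row aggregates as a product of binomial coefficients and compares that with all binary tables of the same size, using the row aggregates 3, 2, 1, 2 of the 4 by 4 solved example in Section 3.6. The function name is hypothetical.

```python
from math import comb

def constrained_search_space(row_aggregates, n_cols):
    """Number of 0/1 tables whose rows contain exactly the required number of ones."""
    count = 1
    for ones in row_aggregates:
        count *= comb(n_cols, ones)  # ways to place `ones` ones in one row
    return count

row_aggregates = [3, 2, 1, 2]
n_cols = 4
unconstrained = 2 ** (len(row_aggregates) * n_cols)
constrained = constrained_search_space(row_aggregates, n_cols)
print(unconstrained)  # 65536
print(constrained)    # 576
```

For this example, the constraint leaves 576 candidate tables out of 65536, so the random initial population already starts inside a much smaller region of the search space.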
5. Conclusion
In this paper, we have presented BISGA, which can recalculate the entire BIS. A constraint is applied to the generation of the initial population; it satisfies one fourth of the fitness and is maintained throughout the algorithm. Appropriate genetic operators are selected, namely, single-point crossover and swap mutation. A fitness function is derived for this problem. Experiments are conducted on UCI benchmark datasets and randomized datasets to determine the performance of the algorithm, which is evaluated in terms of accuracy and efficiency. Results show that when the size of the BIS increases, the chances of circular patterns increase, which affects the efficiency and accuracy of the algorithm. In future work, other aggregates can be added to avoid the circular patterns. There are also some expected applications of BISGA in data integrity and data compression.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through large groups RGP.2/212/1443.