Abstract

The complex and changeable structures of ancient Chinese characters reduce the accuracy of their image retrieval. To resolve this problem, a new retrieval method based on dual hesitant fuzzy sets is proposed. Dual hesitant fuzzy sets, which can express uncertain information more comprehensively, are employed in the feature extraction process of directional line elements. The multiattribute evaluation index of adjacent grids for the current grid and its corresponding membership and nonmembership functions are established, and the weight of each attribute is calculated by the dual hesitant fuzzy entropy, such that the proposed features can fully reflect the topological structure of ancient Chinese characters. The dual hesitant fuzzy correlation coefficient is used to measure the similarity between the ancient Chinese character image to be retrieved and the candidate images, thereby realizing the retrieval of ancient Chinese character images. Experiments show that when the threshold value of the correlation coefficient is 0.9, the average retrieval accuracy is 90.4%.

1. Introduction

Image retrieval of ancient Chinese characters can assist researchers in tracing similar glyphs in the research process. It is an effective tool for the study of Chinese characters in ancient books.

The theoretical basis of image retrieval of Chinese characters in ancient books is content-based image retrieval [1, 2]. The key step is feature extraction and matching: color, texture, shape, and other features [3–5] or their fusion features [6] are extracted through mathematical descriptions, and similarity is calculated from these image features to identify similar images. Traditional Chinese character feature extraction methods mainly adopt structural features [7] and statistical features [8]. Fuzzy set theory represents uncertain problems well; thus, it has been applied to the recognition and retrieval of handwritten Chinese characters. Wei and Guo [9] extracted image features using dual-elastic grid technology and the correlation fuzziness between grid words. Ran et al. [10] proposed a normalized overlapped fuzzy bielastic grid to improve the effectiveness of the extracted features. Li et al. [11] used fuzzy entropy to classify Chinese characters with high accuracy and improved the recognition rate of Chinese characters. Liu and Meng [12] improved the membership function of a fuzzy support vector machine, which improved text classification ability and recognition efficiency.

Fuzzy set theory improves the efficiency of Chinese character recognition and retrieval to some extent. However, a fuzzy set expresses fuzzy information through a single membership value, which causes information loss. Therefore, the hesitant fuzzy set [13] was proposed and applied to multiattribute decision-making (MADM) problems. The dual hesitant fuzzy set (DHFS) [14] is an extended form of the hesitant fuzzy set that combines the membership and nonmembership degrees of multiattribute evaluation information to express the uncertainty problem more comprehensively. Su et al. [15] proposed several distance and similarity measures and illustrated their applications in pattern recognition. Wang et al. [16] proposed a new dual hesitant fuzzy distance metric for MADM problems with completely unknown attribute weights. Zhang [17] proposed several distance and entropy measures for dual hesitant fuzzy sets, which avoided the process of data expansion and overcame the problem of information loss to a certain extent. Combining the DHFS, Wei et al. [18] proposed some dual hesitant Pythagorean fuzzy Hamy mean aggregation operators, which are more valid for dealing with MADM problems. Wang et al. [19] extended the q-rung orthopair fuzzy sets (q-ROFSs) to a dual hesitant fuzzy environment and presented the dual hesitant q-ROFSs. Tang and Wei [20] defined some dual hesitant Pythagorean fuzzy generalized Bonferroni mean operators, which are used to design methods for handling MADM problems. Tang and Wei [21] investigated MADM problems based on Bonferroni mean operators with dual Pythagorean hesitant fuzzy numbers.

In view of the complex and changeable structure of ancient Chinese characters, this study proposes an image retrieval method for Chinese characters in ancient books based on dual hesitant fuzzy sets. The membership and nonmembership degrees between adjacent elastic grids are calculated under the set attribute indexes, and the obtained evaluation values are placed in a dual hesitant fuzzy set, which is used to extract the features of directional line elements. The dual hesitant fuzzy correlation coefficient is used to evaluate the similarity between the query image and the candidate images to realize the image retrieval of ancient Chinese characters.

2. Dual Hesitant Fuzzy Attribute Index

2.1. Elastic Grid Division of Chinese Characters in Ancient Books

The ordinary elastic mesh [22] does not evenly distribute the pixel density in each grid and has a low tolerance for feature mutations caused by stroke offset. In this study, the general elastic mesh is improved and a new elastic mesh division method is designed. The Chinese character image is first divided into elastic grids according to the pixel density in a certain direction. Subsequently, each resulting layer is divided a second time according to its own pixel density. The grid lines after the second division may not join into straight lines, as shown in Figure 1. The grid in Figure 1(a) is used to extract features of horizontal strokes of Chinese characters in ancient books. The grid in Figure 1(b) is used to extract features of vertical strokes. The grid in Figure 1(c) is used to extract features of apostrophe strokes. The grid in Figure 1(d) is used to extract features of downstrokes.

The partition algorithm of Figure 1(a) is shown in Algorithm 1. Similarly, other grid division algorithms can be obtained.

Algorithm 1: Partition algorithm for the grid in Figure 1(a).
Input: Chinese character image (size: M × N)
Output: vertical and horizontal grid lines ( and )
for i = 0 to M−1 // traverse all image pixels
  for j = 0 to N−1
    DV = DV + Density[i][j] // accumulate pixel density
    if DV > Sum ∗ (k + 1)/5 then // the image is divided into 5 areas in the horizontal direction; Sum is the total number of pixels
       = j; k = k + 1 // the initial value of k is 0
    end if
  end for
end for
for i = 0 to M−1 // traverse the pixels of each area of the above partition
  for j =  to
    DH = DH + Density[i][j] // accumulate pixel density
    if DH > Sum ∗ (z + 1)/25 then // the image is divided into 25 areas
       = i; l = l + 1; z = z + 1 // the initial values of l and z are 0
    end if
  end for
end for
return ,
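
The two-stage, density-driven division described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `density_cuts` and `elastic_grid` are hypothetical, the image is assumed to be a list of rows of 0/1 pixel values, a cut is placed once the cumulative density reaches (rather than strictly exceeds) the next equal share, and the second stage splits each vertical band independently into rows.

```python
def density_cuts(profile, parts):
    """Return indices that split a 1-D density profile into `parts` bands
    holding roughly equal shares of the total density."""
    total = sum(profile)
    if total == 0:
        return []  # an empty band cannot be divided by density
    cuts, acc, k = [], 0, 0
    for idx, value in enumerate(profile):
        acc += value
        # Place a cut each time another 1/parts share of the density is reached.
        while k < parts - 1 and acc * parts >= total * (k + 1):
            cuts.append(idx)
            k += 1
    return cuts


def elastic_grid(image, parts=5):
    """First split the columns into `parts` bands by pixel density, then split
    each band into `parts` rows using only the density inside that band."""
    height, width = len(image), len(image[0])
    col_profile = [sum(image[i][j] for i in range(height)) for j in range(width)]
    col_cuts = density_cuts(col_profile, parts)
    # Each cut index is the last column of a band, so the next band starts at c + 1.
    bounds = [0] + [c + 1 for c in col_cuts] + [width]
    row_cuts = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        row_profile = [sum(row[left:right]) for row in image]
        row_cuts.append(density_cuts(row_profile, parts))
    return col_cuts, row_cuts
```

Because the row cuts are computed per band, the horizontal grid lines of neighboring bands generally sit at different heights, which matches the observation that the lines after the second division need not form straight lines.
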
2.2. Dual Hesitant Fuzzy Attribute Index Setting

Considering the correlation between adjacent grids and the grid to be calculated, the corresponding membership and nonmembership functions are provided under three attribute indexes. Using the extraction of horizontal stroke features as an example, the definition process of attribute indexes and their membership and nonmembership degrees is illustrated in Figure 2. In Figure 2, the six neighborhoods extracted from the left and right sides of any grid are (i = 1, 2, ..., 6).

2.2.1. Pixel Distance Index

In Figure 3, a is the pixel point in any horizontal stroke in the grid:

The standard normal distribution is used to characterize the membership and nonmembership of point a to [9]. Subsequently, the membership and nonmembership functions under the pixel distance index arewhere is the width of , is the distance between pixel point a and the left boundary of , and n is the number of horizontal stroke pixel points in .

2.2.2. Stroke Position Index

In Figure 4, stroke b intersects with the left edge of , and stroke c is separated from the left edge of . The membership and nonmembership functions under the stroke position index arewhere is the number of strokes intersecting with and , and is the number of strokes that do not intersect with .

2.2.3. Grid Position Index

In Figure 5, the left boundary of and the right boundary of overlap, and the left boundary of and the right boundary of partially overlap. The membership and nonmembership functions under the grid position index arewhere is the overlap height between the right edge of and the left edge of and is the height of .

Under each attribute index, the membership and nonmembership degrees of each grid are calculated by the aforementioned functions, and the average membership and average nonmembership degrees of the three indexes are then computed over different numbers of Chinese character images.

In Figure 6, hi(i = 1,2,3) are the average membership of three indexes and gj (j = 1,2,3) are the average nonmembership of three indexes. As can be seen from Figure 6, h3 is higher than h1 and h2, and, from Figure 6(b), is higher than and . Through the aforementioned analysis, it can be concluded that the evaluation information of the grid has considerable differences under different indexes, and the three attribute indexes cannot be granted equal weight when the membership and the nonmembership degrees are calculated. In this study, the weight of the grid attribute index is calculated using dual hesitant fuzzy entropy, which improves the authenticity of the evaluation information.

2.3. Determination of the Attribute Index Weight

Dual hesitant fuzzy entropy can effectively describe the degree of uncertainty of dual hesitant fuzzy elements [23]; it is defined aswhere ϕ(d), φ(d), and d refer to the membership, nonmembership, and hesitancy of any dual hesitant fuzzy element, respectively.

The entropy theory can describe the degree of uncertainty of each attribute index to determine the attribute weight. Equation (9) is used to calculate the entropy of the attribute (j = 1,2,3) of the grid :

At this point, and are inversely proportional. The larger the , the smaller the ; the smaller the , the larger the . Therefore, equation (10) can reasonably describe the relationship between and :
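
As a hedged illustration of this inverse relationship, the sketch below uses the widely used entropy-weight normalization, in which each weight is proportional to one minus the entropy of the attribute. This form is an assumption and is not necessarily identical to equation (10); the function name `entropy_weights` is hypothetical, and the entropies are assumed to lie in [0, 1].

```python
def entropy_weights(entropies):
    """Map attribute entropies in [0, 1] to weights that sum to 1.

    Assumed form: w_j = (1 - E_j) / sum_k (1 - E_k), so a larger entropy
    (more uncertainty) yields a smaller weight.
    """
    certainty = [1.0 - e for e in entropies]
    total = sum(certainty)
    if total == 0:
        # Every attribute is maximally uncertain: fall back to equal weights.
        return [1.0 / len(entropies)] * len(entropies)
    return [c / total for c in certainty]
```

For example, entropies of 0.2, 0.5, and 0.8 for the three attribute indexes would give the first index the largest weight and the third the smallest.
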

3. Dual Hesitant Fuzzy Direction Line Element

3.1. Dual Hesitant Fuzzy Set

Based on the hesitant fuzzy set, the dual hesitant fuzzy set fuses the membership and nonmembership degrees of multiple attribute indexes to improve its ability to express uncertain problems in decision-making. Its definition is as follows:where and are sets of some numbers in [0,1], respectively, representing the membership and nonmembership degrees of element x in the nonempty set X under D [12].
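
A dual hesitant fuzzy element can be represented as a small data structure, sketched below. The class name `DualHesitantElement` is hypothetical. The validation follows the usual DHFS constraint that every value lies in [0, 1] and that the largest membership value plus the largest nonmembership value does not exceed 1; whether the authors enforce it in exactly this way is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DualHesitantElement:
    """Possible membership and nonmembership degrees of one element x."""
    membership: Tuple[float, ...]
    nonmembership: Tuple[float, ...]

    def __post_init__(self):
        values = self.membership + self.nonmembership
        if not values or any(v < 0.0 or v > 1.0 for v in values):
            raise ValueError("degrees must be present and lie in [0, 1]")
        top_h = max(self.membership, default=0.0)
        top_g = max(self.nonmembership, default=0.0)
        # Small tolerance so that pairs such as 0.7 and 0.3 are accepted.
        if top_h + top_g > 1.0 + 1e-12:
            raise ValueError("max membership + max nonmembership must not exceed 1")
```

In the setting of this paper, one such element would hold the evaluation values of an adjacent grid under the three attribute indexes.
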

3.2. Feature Extraction of Dual Hesitant Fuzzy Direction Line Element

The traditional feature vector of the directional line element only involves the set of directional attributes of strokes in a single grid; it does not consider the correlation between strokes in adjacent grids and in the grid to be computed, thus, affecting the stability of the proposed feature. Through equations (2)–(6), the evaluation value of (i = 1, 2, …, 6) to under each attribute index is calculated, and the dual hesitation fuzzy set is constructed. The weighted correlation coefficient between (i = 1, 2, ..., 6) and the ideal grid at the corresponding position is calculated [24], as shown in equation (12), which represents the degree of influence of adjacent grids on the current grid:where is the dual hesitant fuzzy set of the ideal mesh pair and the set element is , is the covariance of and , and represent the sth largest element of the membership degree and nonmembership degree of under each attribute [25].
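
The weighted correlation coefficient can be sketched as follows. The code uses a commonly cited form for dual hesitant fuzzy sets: the quantity the text calls the covariance is a weighted sum, over attributes, of the mean products of the s-th largest membership values and of the s-th largest nonmembership values, and the coefficient divides the cross term by the geometric mean of the two self terms. This form is an assumption and may differ in detail from equation (12). All function names are hypothetical, and the two sets are assumed to hold equally many values for each attribute.

```python
from math import sqrt


def mean_sorted_product(a, b):
    """Mean of products of the s-th largest values of two equally long lists."""
    if len(a) != len(b) or not a:
        raise ValueError("value lists must be non-empty and equally long")
    pairs = zip(sorted(a, reverse=True), sorted(b, reverse=True))
    return sum(x * y for x, y in pairs) / len(a)


def covariance(set_a, set_b, weights):
    """Weighted sum over attributes of membership and nonmembership products.

    Each set is a list of (membership_values, nonmembership_values) pairs,
    one pair per attribute index.
    """
    total = 0.0
    for w, (h_a, g_a), (h_b, g_b) in zip(weights, set_a, set_b):
        total += w * (mean_sorted_product(h_a, h_b) + mean_sorted_product(g_a, g_b))
    return total


def weighted_correlation(set_a, set_b, weights):
    """Correlation coefficient in [0, 1]; 1 means the two sets coincide."""
    denom = sqrt(covariance(set_a, set_a, weights) * covariance(set_b, set_b, weights))
    return covariance(set_a, set_b, weights) / denom if denom else 0.0
```

With this form, comparing a set with itself yields 1, and the value decreases as the evaluation values of the adjacent grid move away from those of the ideal grid.
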

The dual hesitant fuzzy direction line element characteristics of horizontal strokes in are as follows:where represents the dual hesitant fuzzy direction line element feature of horizontal strokes in the grid of Chinese characters, is the weighted correlation coefficient between and the ideal grid, card () is the number of elements in the horizontal stroke set in , and sum_H is the sum of the horizontal stroke pixel points in the entire image.

Similarly, the dual hesitant fuzzy direction line element features of vertical, apostrophe, and downstroke in the grid can be calculated, and the four stroke feature combinations in all the grids can be used as Chinese character feature vectors.

4. Image Retrieval Algorithm for Chinese Characters in Ancient Books

The dual hesitant fuzzy direction line element feature is used to calculate the similarity between images of Chinese characters in ancient books, and the retrieval results are obtained and output as described in Algorithm 2.

Algorithm 2: Image retrieval of Chinese characters in ancient books.
Input: image to be retrieved
Output: images of Chinese characters within the threshold of the correlation coefficient
Open the image to be retrieved
Preprocess
Elastic grid division
The dual hesitant fuzzy direction line element characteristics of are extracted
while i < N do // traverse the database
  Calculate the correlation coefficient between and
  if < then
     is added to the result data table R(id, ρ)
  end if
  i = i + 1 // advance to the next database image whether or not it matched
end while
return R(id, ρ)

The correlation coefficient was extended to the similarity measure between Chinese images. Equation (14) is used to calculate the correlation coefficient between the image to be retrieved and the image in the database. Multiple threshold values of correlation coefficients are set to control the number of image outputs:where is the feature vector of the Chinese character image to be retrieved, is the feature vector of the image in the database, N is the number of images in the database, and is the covariance of and .
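
The retrieval loop can be sketched as below. This is an illustration under assumptions rather than the authors' system: the names `correlation` and `retrieve` are hypothetical, the database is assumed to be a dictionary from image id to feature vector, the coefficient is taken as the inner product of the two vectors normalized by their norms (which may differ from equation (14)), and an image is kept when its coefficient is at least the threshold.

```python
from math import sqrt


def correlation(x, y):
    """Normalized inner product of two equally long feature vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0


def retrieve(query, database, threshold=0.9):
    """Return (image_id, rho) pairs whose coefficient reaches the threshold,
    ordered from most to least similar."""
    results = []
    for image_id, features in database.items():
        rho = correlation(query, features)
        if rho >= threshold:
            results.append((image_id, rho))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Raising the threshold shrinks the result list, which is how the threshold controls the number of output images.
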

5. Results and Discussion

5.1. Experimental Parameter Setting

The image samples of Chinese characters in ancient books were collected from “Si ku Quan shu,” an important document recognized in the study of Chinese characters in ancient books. The images were marked according to information such as the cabinet, ministry, and book to which they belong. Owing to the absence of a public retrieval dataset, an experimental dataset for image retrieval of Chinese characters in ancient books was established based on the Chinese character samples collected earlier. Visual Studio 2017 and SQL Server 2017 were used to implement the image retrieval system for Chinese characters in ancient books. The retrieval experiment was conducted on the 92,840 ancient Chinese character samples in the dataset.

Definition 1. Precision (P) is the ratio of the number of similar Chinese images in the output result to the total number of Chinese images in the output result:

Definition 2. Recall rate (R) is the ratio of the number of images of similar Chinese characters in the retrieval results to the number of images of all similar Chinese characters in the experimental data:

Definition 3. F-Measure is the weighted harmonic mean of P and R; when α = 1, P and R have the same weight. The F-Measure combines the results of P and R; the higher its value, the better the retrieval performance.
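
The three definitions can be computed as in the following sketch. The function name `precision_recall_f` is hypothetical, and the weighted form F = (1 + α²)PR / (α²P + R) is assumed; for α = 1 it reduces to the harmonic mean 2PR / (P + R), in which P and R carry the same weight.

```python
def precision_recall_f(retrieved, relevant, alpha=1.0):
    """Return (precision, recall, F-measure) for one query.

    `retrieved` holds the ids output by the system; `relevant` holds the ids
    of all similar images in the experimental data.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = alpha ** 2 * precision + recall
    f_measure = (1 + alpha ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For instance, if four images are output, two of them are similar, and three similar images exist in total, then P = 0.5, R = 2/3, and the F-Measure is 4/7.
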

5.2. Analysis of Experimental Results

5.2.1. Retrieval Performance Analysis

Four groups were selected according to the structure of Chinese characters and 10 samples in each group were retrieved for analysis, as shown in Table 1.

The feature extraction methods in [9, 10] were used for the comparative tests. We use the same dataset and similarity measure to compare these algorithms. The retrieval results of the aforementioned experimental samples were counted and analyzed under the set threshold of the correlation coefficient, and the recall, precision, and were calculated. The final result was obtained from the average of 40 samples.

According to the experimental results, when is 0.7, the average recall rate of three experiments falls below 60%. To ensure the validity of the experiment, the threshold value was selected from [0.7, 0.9] and the interval was set as 0.05.

Tables 2 to 4 show that, under the set correlation thresholds, the retrieval method in this study is superior to the comparison methods in terms of recall rate and precision. This indicates the effectiveness of the method in the image retrieval of ancient Chinese characters.

Figure 7 shows the contrast line chart of , where E1 refers to the algorithm of this study, E2 refers to the algorithm in [9], and E3 refers to the algorithm in [10]. From Figure 7, the performance of the algorithm in this study is higher than other algorithms. The proposed algorithm evaluated the influence of adjacent grids on the current grid and improved the robustness of the features. Therefore, the algorithm in this study is applicable in the retrieval of ancient Chinese character images.

Figure 8 shows the average recall and precision of the four groups of samples in this study.

(j = 1,2,3,4) are the average precisions and (i = 1,2,3,4) are the average recalls. As can be seen from Figure 8, the retrieval accuracy of this method for Chinese characters with left and right structures and a single structure is higher than that of other structures.

5.2.2. Retrieval Result Analysis

A Chinese character image was randomly selected to compare the retrieval results of the three experiments. In Figure 9, the threshold value of the correlation coefficient is set as 0.8. The upper left view shows the first 15 images of the retrieval results of this experiment. The upper right view shows the first 15 images of the retrieval results of the method in [9], and the lower left view shows the first 15 images of the retrieval results of the method in [10].

As can be seen from Figure 9, the first 15 images in this experiment have a higher similarity than the comparison experiments. The similarity of the output images in the comparison experiments is significantly lower than that in this experiment. This indicates that the retrieval method in this study has a high accuracy.

6. Conclusion

This study proposes a method of image retrieval for Chinese characters in ancient books based on dual hesitant fuzzy sets. Dual hesitant fuzzy sets have the advantage of expressing uncertain information more comprehensively. The method introduces this information into the feature extraction of directional line elements, calculates the comprehensive evaluation value of adjacent grids to the current grid under multiple attributes, extracts more complex and robust image features of Chinese characters in ancient books, and improves the retrieval performance. The experimental results show that the average precision and average recall of this method are 1–4 percentage points higher than those of the comparison methods under multiple correlation coefficient thresholds.

Follow-up work will focus on two aspects: (1) improving the attribute indexes according to the topological structure of Chinese characters in ancient books and (2) optimizing the algorithm to improve the retrieval efficiency, because the time complexity of the algorithm is relatively high owing to the membership and nonmembership degrees that must be calculated under multiple attributes.

Data Availability

The data used to support the findings of this study have been deposited at https://github.com/ningmengweidexiaotanke/AecientCC_DHFS.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant no. 61375075), the Natural Science Foundation of Hebei Province of China (Grant no. F2019201329), and the Key Project of the Science and Technology Research Program in University of Hebei Province of China (the Science and Technology Project of Hebei Education Department) (Grant nos. ZD2017208 and ZD2019131).