Abstract
Feature integration theory can be regarded as a theory of perception, but extracting visual features based on such a theory within the content-based image retrieval (CBIR) framework is a challenging problem. To address this problem, we extract color and edge features based on a multi-integration features model and use them for image retrieval. A novel, simple, yet efficient visual feature descriptor, namely the multi-integration features histogram, is proposed for image representation and content-based image retrieval. First, a color image is converted from the RGB to the HSV color space, and the color features and color differences are extracted. Then, the color differences are used to extract edge features through a set of simple integration processes. Finally, the color, edge, and spatial layout features are combined to represent the image content. Experiments show that our method produces results comparable to those of existing, well-known methods on three datasets containing 25,000 natural images in total. Its performance is significantly better than those of the BOW histogram, local binary pattern histogram, histogram of oriented gradients, and multi-texton histogram, and similar to that of the color volume histogram.
1. Introduction
The CBIR technique originated in the 1990s and has become a research hotspot over the last thirty years. CBIR describes the process of retrieving similar images from a large collection based on the image content (color, shape, and texture). The essence of traditional CBIR is similar image retrieval. Several image retrieval techniques have been developed through years of technical innovation, such as content-based and object-based methods, as well as methods based on machine or deep learning. Classic CBIR differs markedly from object recognition in that the retrieved images are similar to the query rather than containing the same objects; thus, training and testing sets are not distinguished in classic CBIR.
Feature extraction and representation is the most important issue in CBIR and is closely related to human perception, where color and edge cues are among the visual search components for perceptual stimuli and can express meaningful characteristics of images or scenes. Feature integration theory is a theory of perception and attention that explains how an individual combines pieces of observable information about an object to form a complete perception [1]. The first stage is preattention, in which we focus on one distinguishing attribute of the object. The second stage is focused attention, in which we take all of the observed features and combine them to make a complete perception [1]. Representing image content by extracting color and edge features based on feature integration theory is a challenging problem. To address this problem, a novel, simple, yet efficient representation based on the multi-integration features model, namely the multi-integration features histogram (MIFH), is proposed for CBIR.
The proposed model targets similar image retrieval rather than object-based image retrieval or object classification. Our main contributions are summarized as follows. First, a new method is proposed to extract low-level features that simulates the preattention stage, where color differences serve as the distinguishing attribute of an object. Second, we take all of the observed edge features and combine them into a complete perception using a set of simple integration processes, providing powerful information for image representation. Third, a new image representation is proposed for image retrieval that combines color, edge, and spatial layout into an integrated feature. These attributes improve the discriminative power for color, texture, and shape features.
The rest of this paper is organized as follows. In Section 2, we review existing visual feature description methods and image retrieval techniques. In Section 3, the multi-integration features model and MIFH method are presented. We conduct CBIR experiments in Section 4, while Section 5 concludes the paper.
2. Related Works
Low-level features are highly popular in pattern recognition techniques. Color, texture, and shape features are widely used attributes in saliency detection [2] and traditional content-based image retrieval [3–10]. In the 2000s, ISO developed the multimedia content description standard MPEG-7 to represent information about multimedia content; color, texture, and shape descriptors are core components of this standard [3]. In the early stages of CBIR development, using a color histogram to describe image content was highly popular, since the color histogram is invariant to orientation and scale. This characteristic makes it powerful for image classification and CBIR [3, 5–7].
Texture is also a highly popular visual feature for CBIR, as it yields information about the spatial arrangement of the color or intensity in an image or region. Texture descriptors can be used to characterize image textures or regions, and the local binary pattern (LBP) is a famous texture descriptor [11]. In recent years, several LBP variants have been proposed for content-based image retrieval [12–14], such as the multichannel decoded LBP method [12] and the edge-local binary pattern (edgeLBP) approach [14]. Many algorithms that combine multiple visual cues have also been proposed [3–10, 12, 13] and utilized for various tasks. Moreover, simulating the mechanisms of the human cortex to extract visual features provides a new avenue for CBIR. A set of Gabor filters with different frequencies and orientations can be used to extract useful visual features from an image, such as texture and edge information; this resembles the perception of the human visual system and has been found to be particularly appropriate for texture representation and discrimination [15–18]. Liu et al. proposed the intensity variation descriptor [19] and the gradient-structures histogram [20] to represent image content and used them for image retrieval. These representation methods have the advantages of being histogram-based and of discriminating spatial layout, color, and edge cues.
Shape features are widely used in object recognition and image retrieval because they carry important semantic information. Humans can recognize objects through shape; however, shape can only be extracted through accurate segmentation, which is a complicated process. In many image processing cases, shape descriptors may also be required to be invariant to translation, rotation, and scale. To avoid the need for accurate segmentation, several feature descriptors exhibit strength in object recognition, such as shape context [21], scale-invariant feature transform (SIFT) descriptors [22], the histogram of oriented gradients (HOG) [23], and SURF descriptors [24]. The HOG is a feature descriptor widely used for object detection. Object-based image retrieval has become increasingly popular since 2004, when Lowe proposed the SIFT descriptors; this type of image retrieval usually uses BOW methods to represent local features. In subsequent years, various BOW-based methods have been proposed for object-based image retrieval, object recognition, and other applications [25–28]. Ahmed et al. proposed a new approach for CBIR that fuses spatial color information with shape and object features [29]. Pradhan et al. proposed new methods for CBIR that use color and shape features [30, 31]. In some cases, representation methods using dimensionality reduction [32] or a structured optimal graph [33] can also capture shape features.
In recent years, deep learning techniques represented by convolutional neural networks (CNNs) have been shown to be effective at various vision tasks [17, 18, 34–39]. Early works applying CNNs to image retrieval used the activations of fully connected layers as global image descriptors [17, 35–37]. Deep learning approaches represented by CNNs have demonstrated superior performance over hand-crafted features. For instance, Qayyum et al. proposed a deep learning framework for CBMIR systems using a deep CNN trained for the classification of medical imagery [38]. Tzelepi and Tefas employed a deep CNN model to obtain feature representations from the activations of convolutional layers using max pooling for CBIR [18]. Alzu’bi et al. introduced compact bilinear CNN-based architectures for several CBIR tasks using two parallel feature extractors without prior knowledge of the semantic metadata of image content [39].
Deep learning techniques generally provide better performance in image retrieval and object recognition but require long training times. These approaches also do not typically mimic the mechanisms of the human brain or of perception. Extracting visual features and representing image content in a way that mimics the perception mechanism within the CBIR framework therefore requires further study.
3. The Multi-Integration Features Model and Representation
In digital image processing, color feature extraction and edge detection play an important role in feature extraction and involve high-level concepts [40, 41]. In CBIR and object recognition applications, color can provide powerful information for feature extraction and representation [2, 9]; even in the absence of shape information, combining color features with other visual features, such as texture, edge cues, and spatial attributes, is a popular technique for improving image retrieval performance [19, 20]. Color and edge cues are visual search components for stimuli perception that can express meaningful characteristics of images or scenes. To represent image content by extracting color and edge features based on feature integration theory, we propose a multi-integration features model and use it for CBIR. An overview of the architecture is shown in Figure 1.

There are three main components in the architecture shown in Figure 1. The first component is the preattention stage, highlighted with the red solid line, in which we focus on one distinguishing attribute of the object using color differences. The second component is focused attention, highlighted with the purple dotted line, in which we take all of the observed edge features and combine them to make a complete perception; three integrations of the edge features are performed using a set of simple integration processes. For the third component, we propose a novel, simple, yet efficient visual descriptor to represent an image, which is used for CBIR. The outputs of the E1, E2, and E3 integrations, the color map, and the log transformation all feed into the histogram.
3.1. Color Feature Extraction and the Initial Feature Integration
The HSV color space mimics human color perception well; therefore, many researchers have used it for color quantization [6, 8, 10, 19, 20]. The HSV color space is based on the cylindrical coordinate system and is defined in terms of three components, hue (H), saturation (S), and value (V), where V is the central rod, S is the distance perpendicular to the rod, and H is the angle around the rod [19, 20, 40, 41]. In the proposed MIFH method, the color feature is extracted using color quantization.
To simplify the color quantization operation in the HSV color space, the H, S, and V color channels are uniformly quantized into 6, 3, and 3 bins, respectively, resulting in a total of 6 × 3 × 3 = 54 color combinations. Let $C(x, y)$ denote the color map, where $C(x, y)$ is the index of the color combination at pixel coordinate $(x, y)$ and $C(x, y) \in \{0, 1, \dots, 53\}$. The map $C(x, y)$ is used to calculate the color features in the proposed model.
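As a concrete illustration, the 54-bin quantization can be sketched in Python as follows; the uniform bin boundaries and the OpenCV HSV value ranges are our assumptions, since the paper does not spell them out.

```python
import cv2
import numpy as np

def color_map(bgr_image):
    """Quantize HSV into 6 x 3 x 3 = 54 bins and return the color map C(x, y)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)  # OpenCV: H in [0,180), S,V in [0,256)
    h = np.minimum(hsv[..., 0].astype(np.int32) * 6 // 180, 5)  # 6 hue bins
    s = np.minimum(hsv[..., 1].astype(np.int32) * 3 // 256, 2)  # 3 saturation bins
    v = np.minimum(hsv[..., 2].astype(np.int32) * 3 // 256, 2)  # 3 value bins
    return h * 9 + s * 3 + v  # index C(x, y) in {0, ..., 53}
```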
To perform the initial feature integration, we focus on one distinguishing attribute of the object: the color difference, which we calculate in the HSV color space for the integration processes. The initial feature integration mimics the characteristics of neurons, which receive a set of input signals and produce an output, and whose activation and excitation states are simulated using a threshold [42]. To simulate these characteristics, the sigmoid function is used as the activation function in the initial feature integration; it maps its input to an output in the range $(0, 1)$.
We then transform the cylindrical coordinate system to Cartesian coordinates. Let $(H, S, V)$ be a point in cylindrical coordinates and $(x', y', z')$ its transformation into Cartesian coordinates, where $x' = S\cos(H)$, $y' = S\sin(H)$, and $z' = V$. In the input image, the central pixel coordinates are denoted as $(x_0, y_0)$, and the coordinates of the eight neighboring pixels are denoted as $(x_i, y_i)$, $i = 1, \dots, 8$. The color difference map is defined as follows:

$$D(x_0, y_0) = \operatorname{sigmoid}\Big(\sum_{i=1}^{8}\sqrt{(\Delta H_i)^2 + (\Delta S_i)^2 + (\Delta V_i)^2} + b\Big), \tag{1}$$

where $\Delta H_i$, $\Delta S_i$, and $\Delta V_i$ denote the color differences between the pixel locations $(x_0, y_0)$ and $(x_i, y_i)$ for the H, S, and V components, computed from the transformed Cartesian coordinates. In the activation function $\operatorname{sigmoid}(\cdot)$, the bias parameter is $b$, and the value of the bias measures how easily a neuron generates positive (negative) excitation. The color difference over local regions contains rich information, as it measures the degree of color difference. Figure 2 shows that edges and important regions can be detected simply using such operations. At the same time, the information from the three color channels is combined into a single color difference map $D$.
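A minimal sketch of this preattention stage follows, assuming H is expressed in radians, S and V are scaled to [0, 1], the eight neighbor differences are summed before the sigmoid, and image borders are handled by wrap-around for brevity; these choices are illustrative rather than the paper's exact settings.

```python
import numpy as np

def color_difference_map(hsv, bias=0.0):
    """hsv: float array (rows, cols, 3) with H in radians, S and V in [0, 1]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Cylindrical -> Cartesian: x' = S*cos(H), y' = S*sin(H), z' = V
    x, y, z = s * np.cos(h), s * np.sin(h), v
    diff = np.zeros(h.shape)
    for dr, dc in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]:
        # Euclidean color difference to each of the eight neighbors
        xs = np.roll(np.roll(x, dr, axis=0), dc, axis=1)
        ys = np.roll(np.roll(y, dr, axis=0), dc, axis=1)
        zs = np.roll(np.roll(z, dr, axis=0), dc, axis=1)
        diff += np.sqrt((x - xs) ** 2 + (y - ys) ** 2 + (z - zs) ** 2)
    return 1.0 / (1.0 + np.exp(-(diff + bias)))  # sigmoid activation
```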

A subsampling operation with a scale of two is applied to the color difference map $D$, and the subsampled image is denoted as $D'$. The size of $D'$ is one quarter that of $D$.
3.2. First Integration of the Edge Features
When extracting edge features from $D'$, we adopt the horizontal and vertical Sobel kernels as the weight values between feature maps. The Sobel operator is less sensitive to noise than other gradient operators or edge detectors while also being relatively efficient [40, 41]. These kernels are defined as follows:

$$W_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad W_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}. \tag{2}$$
We obtain the first edge map using the Sobel operation, which is denoted as

$$E_1 = \sqrt{(D' \ast W_x)^2 + (D' \ast W_y)^2}, \tag{3}$$

where $\ast$ denotes the convolution operation.
To extract more edge features, we perform the same operation three more times to obtain the remaining three edge maps:

$$E_k = \sqrt{(E_{k-1} \ast W_x)^2 + (E_{k-1} \ast W_y)^2}, \quad k = 2, 3, 4. \tag{4}$$
We thus obtain four edge feature maps and combine them into a gradient image $G_1$, which can be calculated as

$$G_1 = \frac{1}{4}\sum_{k=1}^{4} E_k. \tag{5}$$
Then, the subsampling operation with a scale of two is applied to the four edge maps, and the subsampled feature maps are denoted as $\tilde{E}_1, \dots, \tilde{E}_4$. The size of each $\tilde{E}_k$ is one quarter that of $E_k$.
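A sketch of this first integration follows, under our reading that the four edge maps come from repeated Sobel operations (equations (3) and (4)) and are averaged into $G_1$ (equation (5)); the averaging is an assumption.

```python
import numpy as np
from scipy.ndimage import convolve

WX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal Sobel
WY = WX.T                                                          # vertical Sobel

def sobel_magnitude(f):
    """Gradient magnitude of one Sobel pass, as in equation (3)."""
    return np.sqrt(convolve(f, WX) ** 2 + convolve(f, WY) ** 2)

def first_integration(d_sub, repeats=4):
    """Repeat the Sobel operation to get E1..E4, average them into G1,
    and subsample each edge map by a factor of two."""
    edges, f = [], d_sub
    for _ in range(repeats):
        f = sobel_magnitude(f)
        edges.append(f)
    g1 = sum(edges) / len(edges)
    subsampled = [e[::2, ::2] for e in edges]
    return g1, subsampled
```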
3.3. Second Integration of the Edge Features
For each subsampled feature map $\tilde{E}_i$, the same actions as in equations (3) and (4) are performed. As a result, a total of 16 edge maps are obtained. We first combine them into four gradient images using equation (5), which can be calculated as

$$G_{2,i} = \frac{1}{4}\sum_{k=1}^{4} E_{i,k}, \quad i = 1, 2, 3, 4, \tag{6}$$

where $E_{i,k}$ denotes the $k$-th edge map computed from $\tilde{E}_i$.
The four gradient images are further combined into a single gradient image $G_2$, which can be calculated as

$$G_2 = \frac{1}{4}\sum_{i=1}^{4} G_{2,i}. \tag{7}$$
Then, a subsampling operation with a scale of two is applied to the four gradient images $G_{2,i}$, and the subsampled feature maps are denoted as $\tilde{G}_{2,i}$. The size of each $\tilde{G}_{2,i}$ is one quarter that of $G_{2,i}$.
3.4. Third Integration of the Edge Features
We combine the subsampled feature maps $\tilde{G}_{2,i}$ into a single gradient image $G_3$ with

$$G_3 = \frac{1}{4}\sum_{i=1}^{4} \tilde{G}_{2,i}. \tag{8}$$
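Continuing the sketch, the second and third integrations can be expressed by reusing first_integration from the block above; subsampling the four intermediate gradient images before the final combination is our interpretation of the elided equations.

```python
def second_and_third_integration(subsampled_edges, repeats=4):
    """subsampled_edges: the four maps returned by first_integration."""
    g2_parts, g2_subs = [], []
    for m in subsampled_edges:       # 4 maps x 4 Sobel passes = 16 edge maps
        g, _ = first_integration(m, repeats)
        g2_parts.append(g)           # gradient images G_{2,i} (equation (6))
        g2_subs.append(g[::2, ::2])  # subsampled at a scale of two
    g2 = sum(g2_parts) / len(g2_parts)  # equation (7)
    g3 = sum(g2_subs) / len(g2_subs)    # equation (8)
    return g2, g3
```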
To represent the image content, we uniformly quantize the gradient images $G_1$, $G_2$, and $G_3$ with different quantization steps, and their index maps are defined as $L_1$, $L_2$, and $L_3$, respectively.
In the index maps $L_1$, $L_2$, and $L_3$, the index values are $0, 1, \dots, m-1$, where $m$ is the number of bins. In this study, the color index map $C$ and the edge index maps $L_1$, $L_2$, and $L_3$ are used to calculate the visual features in the proposed model.
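A uniform quantizer for the index maps might look as follows; normalizing each gradient image to its own value range is an assumption, since the quantization steps are not given explicitly.

```python
import numpy as np

def uniform_quantize(g, bins=16):
    """Map a gradient image to an index map with values in {0, ..., bins-1}."""
    g_norm = (g - g.min()) / (g.max() - g.min() + 1e-12)
    return np.minimum((g_norm * bins).astype(int), bins - 1)
```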
3.5. Feature Representation
Spatial layout information and histogram-based methods are widely used in CBIR to describe visual features. Here, co-occurrence is adopted for image representation. Let $(x, y)$ and $(x + \Delta x, y + \Delta y)$ be two neighboring pixel locations in an image, where $\Delta x$ and $\Delta y$ denote the offsets along the $x$-axis and $y$-axis, respectively. The two neighboring pixels have a distance of $d = \max(|\Delta x|, |\Delta y|)$. We use $N\{\cdot\}$ to denote the co-occurring number of the two pixel values and then define the co-occurring histograms of the color map and the three edge maps as follows:

$$h_C(i) = N\{C(x, y) = i \wedge C(x + \Delta x, y + \Delta y) = i\},$$
$$h_{L_k}(j) = N\{L_k(x, y) = j \wedge L_k(x + \Delta x, y + \Delta y) = j\}, \quad k = 1, 2, 3. \tag{9}$$
Following the concatenation of $h_C$, $h_{L_1}$, $h_{L_2}$, and $h_{L_3}$, we obtain the histogram $h$, whose bins are indexed by the subscript $k$. A log transformation is applied to adjust the dynamic range of the histogram $h$. We then define the MIFH as follows:

$$\mathrm{MIFH}(k) = \log_2\big(1 + h(k)\big). \tag{10}$$
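Putting the representation together, the following sketch computes the co-occurring histograms and the final MIFH vector; the neighbor offset (1, 1) and the $\log_2(1 + h)$ transform are illustrative choices consistent with, but not guaranteed identical to, the settings above.

```python
import numpy as np

def cooccurrence_histogram(index_map, bins):
    """Count positions where the pixel and its (1, 1)-offset neighbor
    share the same index value, as in equation (9)."""
    a, b = index_map[:-1, :-1], index_map[1:, 1:]
    same = a[a == b]
    return np.bincount(same.ravel(), minlength=bins).astype(float)

def mifh(color_idx, edge_idx_maps, color_bins=54, edge_bins=16):
    """Concatenate the four co-occurring histograms and log-transform them."""
    h = np.concatenate(
        [cooccurrence_histogram(color_idx, color_bins)]
        + [cooccurrence_histogram(m, edge_bins) for m in edge_idx_maps])
    return np.log2(1.0 + h)  # 54 + 16 + 16 + 16 = 102 bins in total
```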
In image processing, the log function is important because it compresses the dynamic range of images with large variations in their pixel values [41]. The log transformation expands the values of lower-level pixels in an image while compressing the higher-level values, thus emphasizing the lower-level pixels. In the proposed method, the log function reduces the effects of errors derived from feature quantization, while the representation retains the spatial information obtained from co-occurring neighboring pixels. Moreover, the proposed method can discriminate color, texture, and shape features.
4. Experimental Results
In the experiments, we follow the traditional CBIR paradigm (similar image retrieval) rather than object-based image retrieval. We compare the MIFH method with the color volume histogram (CVH) [4], local binary pattern histogram (LBPH) [11], multi-texton histogram (MTH) [6], histogram of oriented gradients (HOG) [23], and BOW histogram [25]. In addition, we compare the MIFH method with several extended methods derived from our previous works (the multi-texton histogram [6] and the micro-structure descriptor [8]): the CPV-THF [43], STH [44], and CMSD [45] methods. In the comparison experiments, the vector dimensionalities of the MIFH, LBPH, BOW, MTH, and CVH methods are 102, 256, 1000, 82, and 104 bins, respectively, and those of the CPV-THF [43], STH [44], and CMSD [45] methods are 242, 172, and 88 bins, respectively. For the HOG descriptor, there are nine bins with a block size of three and a cell size of six. For a fair comparison, the MIFH, LBPH, BOW, MTH, and CVH methods adopt the L1 distance as the similarity measure or their original distance measures.
4.1. Benchmark Datasets
The Corel image dataset is the most commonly used dataset for testing CBIR performance. Every category in the Corel gallery contains 100 images, on which digital zoom can be used to obtain various image sizes. To measure image retrieval performance, we selected categories to build the Corel-5K and Corel-10K datasets, where Corel-5K contains 50 categories and 5,000 images, and Corel-10K contains 100 categories and 10,000 images, including all images of the Corel-5K dataset.
The third dataset is the GHIM-10K with 10,000 images. All images in this dataset were collected from the web or taken with a camera by Guang-Hai Liu. The dataset contains 20 categories, such as building, sunset, fish, flower, car, mountain, and tiger. In the GHIM-10K dataset, each category contains 500 images of size 400 × 300 or 300 × 400 pixels in JPEG format. The Corel-10K, Corel-5K, and GHIM-10K datasets can be downloaded from http://www.ci.gxnu.edu.cn/cbir/Dataset.aspx.
Three subsets, each consisting of ten percent of the corresponding dataset, were used as the query images. The performance is evaluated using the average results over all queries in terms of precision and recall.
4.2. Performance Measurements
Selecting the metric for performance evaluation is critical when comparing CBIR experiments. In many CBIR studies [3–10], precision and recall are popular metrics for evaluating the performance of CBIR and other retrieval techniques [46].
On the Corel datasets, each category contains $N = 100$ images, while $N = 500$ for the GHIM-10K dataset. If there are $n$ retrieved images and $r$ images within the retrieved set are relevant to the query, then the precision and recall metrics can be defined as follows [46]:

$$\text{Precision} = \frac{r}{n}, \qquad \text{Recall} = \frac{r}{N}. \tag{11}$$
We set $n = 12$ in our CBIR system, i.e., 12 images are retrieved per query.
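These metrics reduce to a few lines of Python; the helper below assumes category labels are available for the retrieved images (the names are hypothetical).

```python
def precision_recall(retrieved_labels, query_label, category_size):
    """Precision = r/n and recall = r/N for one query, where
    n = len(retrieved_labels) and N = category_size."""
    r = sum(1 for lab in retrieved_labels if lab == query_label)
    return r / len(retrieved_labels), r / category_size
```

For the Corel datasets, category_size = 100 and n = 12, so a query whose 12 returned images are all relevant yields a precision of 1.0 and a recall of 0.12.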
4.3. Retrieval Performance
Feature selection is a very important issue in image representation; for example, edges and colors are widely used features in content-based image retrieval. In the proposed algorithm, low-level features are combined, and quantization is used to extract them. The feature quantization levels are important factors in CBIR experiments: because the proposed MIFH method is histogram-based, the numbers of edge bins and color bins directly influence the retrieval results.
4.3.1. Investigation of Vector Dimensionality
To investigate the performance of the MIFH method with different histogram bins, we set the number of color bins to 54, 72, 108, and 192. For the first, second, and third integrations, the edge maps are quantized into 16, 32, or 64 bins. As shown in Figure 3, in most cases, the CBIR precision decreases with an increasing number of color bins on the Corel-10K dataset.

The GHIM-10K dataset differs: in many cases, the CBIR precision increases as the number of color bins increases. An exception occurs with 32 and 64 edge bins, where the precision increases when more color bins are added. Moreover, varying the edge bins of the first, second, and third integrations does not clearly increase or decrease the precision of the MIFH method.
In Figure 3, the precision of the MIFH method varies from approximately 49% to 53% on the Corel-10K dataset, whereas the precision on the GHIM-10K dataset varies from approximately 54% to 59%. The Corel-10K dataset is older and contains images whose colors are not as bright as those of the relatively new GHIM-10K dataset. The color feature is an important factor in the proposed method, and the degree of color brightness affects its retrieval process. Since the precision of the MIFH method does not present a clear trend on the two datasets, we reduce the vector dimensionality and set the number of color bins and the edge bins of the first, second, and third integrations to 54, 16, 16, and 16, respectively. In this case, the total vector dimensionality is 54 + 16 + 16 + 16 = 102 bins.
4.3.2. Contribution of Edge and Color
Different experiments are implemented using various combinations of visual features to investigate the contributions of edges and colors within the MIFH method, where the quantization levels of the colors and edges are 54 and 16 bins, respectively.
On the Corel-10K dataset, the precision and recall when using only the color feature are 46.48% and 5.58%, respectively. Using only the edge features of the first, second, or third integration gives poor precision and recall. Compared with the edge features, the color features contribute significantly more to the performance of the proposed method. When these features are combined, the performance increases significantly.
Figure 4 shows that the performance on both the Corel-10K and GHIM-10K datasets when using the color and edge features together is better than when using only the color features or only the edge features. E1, E2, and E3 denote the edge features of the first, second, and third integrations, respectively. Compared with using multiple integrations, using the edge map of only a single integration does not improve the performance.

The repetition of the Sobel operation when calculating the edge gradient is an important factor in our method. Table 1 shows that, on the Corel-10K dataset, the precision of the MIFH method increases as the Sobel operation is repeated more times. The GHIM-10K dataset differs: repeating the Sobel operation does not clearly increase or decrease the precision of the MIFH method.
The best precision of the MIFH method on the Corel-10K dataset is achieved when the Sobel operation is repeated four times. More repetitions of the Sobel operation also increase the number of calculations; thus, we set the number of Sobel repetitions to four in the proposed method.
To investigate the performance of the MIFH method with different bias parameters when calculating the color difference map, the sigmoid function $\operatorname{sigmoid}(\cdot)$ was utilized as the activation function. Different bias parameters result in different color difference maps and may further lead to different retrieval performance. As can be seen from Table 2, the best precision of the MIFH method on the Corel-10K dataset is achieved at a particular value of the bias parameter $b$, while on the GHIM-10K dataset the precision differs little between neighboring values of $b$. Thus, this value of the bias parameter was utilized in calculating the color difference map.
4.3.3. Performance Comparisons
In image representation, the color and edge features are extracted using the MIFH method, and the MTH method can likewise represent color and edge orientation features. Therefore, both the MIFH and MTH methods can represent color texture features. The LBPH method is a well-known texture descriptor that can also represent color texture features by extracting LBP features along the R, G, and B color channels simultaneously. The HOG method is a well-known descriptor widely used in object recognition [23]; it counts the occurrences of gradient orientations in localized portions of an image. The image retrieval performances of the compared methods on the Corel-10K, Corel-5K, and GHIM-10K datasets are shown in Table 3.
Based on the experimental data, the performance of the MIFH method is better than those of the BOW, MTH, HOG, and LBPH methods. On the Corel-10K dataset, the precision of the MIFH method is higher than those of the STH, CPV-THF, CMSD, BOW, LBPH, HOG, MTH, and CVH methods by 4.93%, 0.68%, 2.71%, 22.60%, 15.72%, 25.01%, 12.09%, and 4.38%, respectively. On the GHIM-10K dataset, the precision of the MIFH method is higher than those of the BOW, LBPH, HOG, and MTH methods by 16.81%, 9.59%, 32.52%, and 3.35%, respectively; nevertheless, it is lower than that of the CVH method. On the Corel-5K dataset, the precision of the MIFH method is higher than those of the MTH, HOG, and LBPH methods. Considering the discriminative power of the color texture features, we confirm that the MIFH method is better than the LBPH method.
To determine the most suitable distance or similarity metric, the Canberra, χ² statistics, L1, histogram intersection, and Chebyshev distances were adopted in the CBIR experiments. As can be seen from Table 4, the L1 distance gives much better results than the other metrics, such as the Canberra, χ² statistics, Chebyshev, Cosine, and histogram intersection distances; the histogram intersection gives the worst results on the two datasets. The L1 distance is the simplest of these distances and is more suitable for large-scale datasets; thus, we adopted it as the final distance in the proposed MIFH method.
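For reference, minimal implementations of the L1 and χ² distances used in this comparison follow; these are the standard definitions, with a small epsilon added in the χ² version as our guard against division by zero.

```python
import numpy as np

def l1_distance(h1, h2):
    """City-block (L1) distance, the final similarity measure for MIFH."""
    return np.abs(h1 - h2).sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two histogram vectors."""
    return 0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()
```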
5. Results and Discussion
Color and edge cues are among the visual search components for perception stimuli and can express meaningful characteristics of images or scenes, and combining them helps to describe image content. Many descriptors can be used to characterize texture, and the local binary pattern (LBP) is a well-known texture descriptor that can represent the local structures of an image, but it cannot combine color and edge cues well. The MTH method is our prior work developed for traditional CBIR; it can represent the frequencies of color and edge orientation information using a special histogram. The CPV-THF, STH, and CMSD methods were derived from our previous works (the multi-texton histogram and the micro-structure descriptor), but these extended methods do not extract color and edge features based on feature integration theory.
The CVH method incorporates the advantages of histogram-based methods, as it takes the spatial information of neighboring pixels into account, and its performance is similar to that of the MIFH method. It also has the ability to discriminate color and edge features.
The BOW and HOG methods do not achieve satisfactory results on either dataset for content-based image retrieval. The HOG method is a well-known descriptor that was developed for object recognition. Object recognition and content-based image retrieval are two different applications of pattern recognition: the task of object-based image retrieval is to find images that contain the same object or scene as the query image, whereas classic CBIR finds images whose content is similar to that of the query image. In some cases, applying an object recognition-oriented method to content-based image retrieval cannot obtain results as satisfactory as those achieved in object recognition itself.
The HOG method counts the occurrences of gradient orientations to represent the image content. While gradient orientation is a highly important cue in image representation, the discriminative power of this cue alone is a limiting factor in CBIR. In the BOW histogram, the visual words obtained by vector quantization of local feature descriptors (e.g., SIFT, SURF, ORB, and other local descriptors) impose a heavy computational burden and may result in information loss.
Color is the most basic quality of visual content, so the MIFH method is expected to discriminate color features well. In addition, edge cues are also captured by the MIFH method, so the approach should be able to represent general shape and texture features. To illustrate the representational ability of the visual features, we show two retrieval examples of the MIFH method in Figures 5(a) and 5(b). Both examples are intended to show the visual effects of color, texture, and other features, rather than to demonstrate overall performance, because not all queries return such good results.

In Figure 5, most of the returned images are correctly retrieved and ranked within the top 12 images, and all of the top retrieved images match the queries well in both shape and color texture. In addition, texture is a highly popular visual feature for CBIR, as it yields information regarding the spatial arrangement of the color or intensity of an image or region.
6. Conclusions
Feature extraction and representation is an important issue in CBIR and is closely related to human perception. Color and edge cues are visual search components for stimuli perception that can express meaningful characteristics of images or scenes. Representing image content by extracting color and edge features based on feature integration theory is a challenging problem. To address this problem, a novel, simple, yet efficient representation based on the multi-integration features model, called the MIFH, is proposed for CBIR.
The MIFH method provides powerful information for image representation using a set of simple integration processes. These attributes improve the discriminative power of the MIFH method and enable higher performance compared with the BOW, LBP, HOG, and MTH approaches.
Data Availability
The data and code are available at http://www.ci.gxnu.edu.cn/cbir/Dataset.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61866005 and in part by the project of the Guangxi Natural Science Foundation of China under Grant 2018GXNSFAA138017.