Abstract
In this paper, machine learning (ML) techniques are applied at an early stage of Image Processing (IP). Learning procedures are usually applied from the image segmentation level upwards, whereas, in this paper, this is done from a lower processing level: the edge detection (ED) level. The main objective is to solve the edge detection problem through ML techniques. The proposed methodology is based on a pixel-by-pixel classification of edges, but the predictors employed for the ML task include information about the pixel neighborhood and about structures of connected pixels called edge segments. The Sobel operator is employed as input. Making use of 50 images that belong to the Berkeley Computer Vision data set, the average performance on the validation sets when employing our Neural Networks method reached an F-measure significantly higher than that of the Sobel operator. The experimental results show that our post-processing technique is a promising new approach for ED.
1. Introduction
Digital Image Processing or Image Processing (IP) evolved during the 1960s thanks mainly to space programs like Ranger 7, conducted by the Jet Propulsion Laboratory [1]. IP encompasses many different tasks, ranging from simple ones, such as enhancement or smoothing, to the most complex ones related to the understanding of a complete visual scene. This gradation can be roughly represented as a continuum (see Figure 1).

Edge detection (ED) is one of the main IP techniques, and it has found applications in a wide range of tasks, such as pathological diagnostics in medicine [2, 3], with a special focus on tumor detection, as well as remote sensing [4], which is useful for agriculture and biology, and more recently for research related to climate change. Other relevant fields in which ED has been applied successfully are the military industry, surveillance [5], and others [6–10].
In recent decades, machine learning techniques have started to be employed for solving IP problems [11–16]. Nevertheless, it is not easy to find learning approaches employed for low-level processing tasks, that is, IP tasks below contour detection [17, 18]. An example of a technique close to ED is the use of Convolutional Neural Networks (CNN) for low-to-medium-level tasks such as corner detection [19], which is useful in video processing for tracking objects. However, corner detection can be considered a more complex task than ED. An example of research applying supervised techniques to the ED problem can be seen in [20], where the aim was to classify edge segments (linked structures of edges). That research was limited in the sense that the edges could not be classified locally, pixel by pixel, an approach that is specifically explored in this work. Mainly due to that limitation, this paper proposes an algorithm that performs the classification at the pixel level. This approach allows classifying a pixel as "edge" or "nonedge" considering the pixel information combined with other local information, which includes its neighborhood and the segment to which the pixel belongs. Moreover, this paper's approach does not consist of taking the feature vector from the whole image (or a subimage), as has been done in most research so far (and can be considered the state of the art). Instead, a feature vector is created for each pixel.
ML has traditionally been used mainly for applied research, which is logically associated with high-level processing tasks (see Figure 1). As shown in this paper, there are several advantages to using ML for ED purposes.
This paper is structured as follows. The first three subsections of the next section are devoted to the necessary preliminaries: the basics of a digital image, the ED problem, and ML techniques. "The Proposed Methodology" focuses on our proposal, which can be considered a postprocessing technique for ED. First, the problem of building a new ground-truth version is explored (see "Building a Suitable Ground-Truth"); secondly, the construction of the predictor variables is presented (see "Collecting the Predictor Variables"); and finally, the experiment is described ("Using Machine Learning for Edge Detection Problems"). The last sections are devoted to the results ("Results and Discussion") and the conclusions, respectively.
2. Edge Detection with Machine Learning Techniques
2.1. Digital Image
Let us denote a digital image by $I$, and its pixel coordinates in the spatial domain by $(i, j)$. For clarity's sake, the coordinates are integers, where each point $(i, j)$ represents a pixel, with $1 \le i \le N$ and $1 \le j \le M$. Therefore, the size of an image, $N \times M$, is the number of its horizontal pixels multiplied by its number of vertical pixels. The value of the spectral information $I(i, j)$ depends on the type of digital image considered. Some of these image types are:
(i) Binary map: $I(i, j) \in \{0, 1\}$ (it is also usually expressed as $I(i, j) \in \{0, 255\}$).
(ii) Grayscale image: $I(i, j) \in \{0, 1, \ldots, 255\}$.
(iii) Soft image: $I(i, j) \in [0, 1]$. This is also referred to as a normalized grayscale image.
In this paper, we deal with binary maps, grayscale images, and soft images.
2.2. Edge Detection Problem
The main goal of ED algorithms is to detect those pixels in which the intensity change is significant. An ED algorithm transforms an image into a binary image. In this binary image, the white pixels (or "one values," as they are usually presented) represent those pixels that have been identified by the ED algorithm as edge pixels. Conversely, the black pixels represent those identified as nonedge pixels.
In a digital image $I$, a pixel $(i, j)$ is considered an edge if there is a significant change of the intensity function at that pixel [3, 21]. An Edge Detector is an algorithm that takes the image as input and converts it into a binary image, the edge map [22].
A well-known example of an ED algorithm is the Sobel operator [23, 24]. It performs only two of the ED steps (see [11] for more information about the ED steps): the feature extraction and the blending or aggregation of the features (channels). Mathematically, it works as follows: given a pixel $(i, j)$ and a grayscale image $I$, we define
$$G_x(i, j) = (M_x * I)(i, j) \qquad (1)$$
as the horizontal component, and
$$G_y(i, j) = (M_y * I)(i, j) \qquad (2)$$
as the vertical one. The horizontal and vertical masks are, respectively, these two:
$$M_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad M_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix},$$
where $*$ denotes the convolution.
During the blending phase, both components are aggregated into a single value named the gradient. The most common aggregation function for computing the gradient is the Euclidean norm, $G(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2}$.
Finally, this soft image (we consider that the blended image is also normalized, so it is a soft image) is binarized through the last ED step: the scaling. The most common procedure for this last step is the use of a threshold value $\tau$: pixels whose soft value exceeds $\tau$ are labeled as edges, and the rest as nonedges.
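As a minimal sketch of this pipeline, assuming NumPy/SciPy and a grayscale image normalized to $[0, 1]$ (the threshold tau = 0.25 is an illustrative choice, not a value taken from this paper):

import numpy as np
from scipy.ndimage import convolve

def sobel_edge_map(gray, tau=0.25):
    """Sobel pipeline: feature extraction, blending, and scaling."""
    mx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal mask (eq. (1))
    my = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # vertical mask (eq. (2))
    gx = convolve(gray, mx)                 # horizontal component G_x
    gy = convolve(gray, my)                 # vertical component G_y
    grad = np.hypot(gx, gy)                 # blending: Euclidean norm
    soft = grad / max(grad.max(), 1e-12)    # normalization -> soft image
    return soft, soft > tau                 # scaling: threshold binarization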
2.3. Supervised Classification Problem
The goal of supervised learning is to build a concise model that classifies items into known classes in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown [25]. A huge number of classification algorithms can be found, all with the common aim of maximizing the accuracy measures considered for a specific problem or data set.
The classification problem can be expressed mathematically as follows [25]. Let $X_1, \ldots, X_p$ be the predictor variables, with $X_k$ taking values in $\Omega_k$, in such a way that they configure the measure space $\Omega = \Omega_1 \times \cdots \times \Omega_p$. Each predictor can be either a quantitative or a categorical variable. The aim is to classify a certain case $x$ from the sample, with $x \in \Omega$. Every case takes the vector $x = (x_1, \ldots, x_p)$ as its value. This vector provides the information that the classifier uses to assign the case to one of the previously defined classes $C_1, \ldots, C_J$. The key point of this classifying process is to build the classification rule that assigns a specific class to any case.
A classifier is an algorithm that, given the set of classes $\{C_1, \ldots, C_J\}$, assigns any unlabeled case $x$ to a single class, and this classification is made by means of its characteristics.
Classifiers can be supervised or unsupervised. Supervised ones make use of already-labeled example cases belonging to the training set, from which the algorithm builds ("learns") the classification rules [26]. By contrast, in unsupervised algorithms, all cases used for the learning process are unlabeled. Both supervised and unsupervised learning belong to the broader field known as machine learning.
In this paper, we focus on three well-known supervised classifiers: Logistic Regression, Neural Networks (NN), and Random Forest (RF) [27]. These classifiers are used for binary classification, as there are two possible outcomes for a given pixel: "edge" or "nonedge." A novel methodology that combines ML with ED is presented throughout the next two sections.
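As an illustrative sketch of how these three classifiers operate on pixel-level data, using scikit-learn with random placeholder data (the actual experimental parameters are detailed later, in "Using Machine Learning for Edge Detection Problems"):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical pixel-level data: one row per pixel, one column per predictor.
rng = np.random.default_rng(0)
X = rng.random((1000, 36))                  # 36 predictors, as in this paper
y = (rng.random(1000) > 0.9).astype(int)    # 1 = "edge", 0 = "nonedge"

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "nn": MLPClassifier(hidden_layer_sizes=(5,), activation="tanh", max_iter=500),
    "rf": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X, y)
    # The scores (edge probabilities) can later be thresholded, as in the scaling step.
    print(name, model.predict_proba(X[:5])[:, 1])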
2.4. The Proposed Methodology
Firstly, a data set of images with their respective ground-truth is needed. The usual approach is that this ground-truth consists of "true" edges previously drawn by at least one human. The Berkeley Image Segmentation Data Set [28] is a well-known example that works precisely this way. The pixel labels are built with the information provided by the ground-truth, as explained in the Subsection "Building a Suitable Ground-Truth."
For the learning process, the usual approach in the literature for the edge detection problem has been to employ a few characteristics related to the intensity of each pixel if the image consists of a single channel, or a few intensities in the case of color edge detection. One of the advantages of supervised ML techniques over unsupervised methods is that they can deal with more information. Thus, more variables can be included in the process of classifying the edges, so a better classification is expected.
Traditionally, this classification of edges (also called "extraction" of edges) is made by finding a suitable value for the parameters involved in the ED algorithm, which can be done by means of unsupervised or supervised methods. In this research, multiple predictor variables are used together to provide the information needed for making the decision pixel by pixel. These predictors are collected from the soft value provided by an edge detection algorithm used as input. In this sense, the proposed method does not compete with other existing algorithms: it works as a post-processing technique with the capability of improving them. This soft value is obtained right after the blending phase and before the scaling phase. The predictor variables used for the ML models are specified in the "Collecting the Predictor Variables" Subsection.
As is usual in ML, once the models are trained, the true label is no longer needed, as the labels are predicted employing the soft value as input. Because of this, once the model has been trained, the complexity barely increases over the level originally required to generate the input.
2.5. Building a Suitable Ground-Truth
Working with supervised classification requires labeled cases. In IP, the use of ML techniques requires employing a labeled ground-truth as a reference that provides the true classes.
Let $H_k$, $k = 1, \ldots, K$, denote the $K$ different human-made references, i.e., "sketches," that are available for each image as a ground-truth. Then, for a specific human or "draftsman" $k$, a pixel is labeled as an "edge" if it was drawn by him/her, and as a "nonedge" if it was not. As every image has $K$ human references, they must be aggregated into one for ML purposes. In this research, the following aggregated label is used for the learning process: a pixel is considered an edge if it has been drawn by at least $h$ humans. (Obviously, applying this kind of aggregation requires a data set of images with more than one reference per image. This is the case of the Berkeley segmentation data set [19].) This aggregated image tends to avoid the subjective tendency of a single human, minimizing the likelihood of building wrong labels. However, this "raw" aggregated human reference is not the definitive image from which the labels are built. The reason lies in the great difficulty for any human to draw the location of a certain "edge" with absolute precision. During the traditional matching process employed in Image Processing (see, for example, the method presented in [29]) to match a pixel that is being evaluated with its equivalent pixel in the human reference, a window of pixels is allowed. For example, a $3 \times 3$ pixels window (i.e., a deviation of 1 pixel from the central or possible "true edge" pixel) seems a natural range to allow some tolerance for classifying the candidate pixel as an edge in a permissive or "soft" way. Because of this, the raw aggregated reference created above must be expanded or thickened to cover this window of pixels and converted into the definitive label image $GT$. An easy method to build this thickened image is through mathematical morphology [30]. More specifically, a dilation is applied with a $3 \times 3$ square of white pixels acting as a structuring element.
The combination of both steps is made in this way: first, a dilation is applied over the different human references, and then the aggregation of these dilated sketches is built. This process for building the definitive experiment's ground-truth is mathematically expressed by equations (5) and (6), and it is shown in Figures 2 and 3:
$$D_k = H_k \oplus S_{3 \times 3}, \quad k = 1, \ldots, K, \qquad (5)$$
$$GT(i, j) = \begin{cases} 1 & \text{if } \sum_{k=1}^{K} D_k(i, j) \ge h, \\ 0 & \text{otherwise,} \end{cases} \qquad (6)$$
where $\oplus$ denotes the morphological dilation and $S_{3 \times 3}$ the square structuring element.
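The following is a minimal sketch of equations (5) and (6), assuming SciPy's morphology routines and the human sketches given as binary NumPy arrays (function and variable names are illustrative):

import numpy as np
from scipy.ndimage import binary_dilation

def build_ground_truth(human_refs, h=2, size=3):
    """Dilate each human sketch (eq. (5)), then aggregate: a pixel is a
    'true edge' if at least h dilated references mark it (eq. (6))."""
    struct = np.ones((size, size), dtype=bool)          # square structuring element
    dilated = [binary_dilation(ref.astype(bool), structure=struct)
               for ref in human_refs]                   # thicken each sketch
    votes = np.sum(dilated, axis=0)                     # humans drawing each pixel
    return votes >= h                                   # aggregated label image GT

The values h = 2 and a 3 x 3 structuring element match the settings used later in the experiment.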
2.6. Collecting the Predictor Variables
In this research, the predictor variables are extracted from the soft image, which results after the Sobel operator is applied over a grayscale image and before the binarization (scaling) is done. The variables used are divided into four categories, described in the following four subsections.
2.7. Pixel Predictor Variables
These are the variables that use information related to the pixel itself. The following pixel predictor variables have been used (a sketch of their computation is shown after this list):
(i) Intensity: the intensity of the gray channel that results after the blending phase (aggregation of directions) but before the scaling phase (see the ED phases in [31]). This value or intensity can be considered the pixel quality traditionally called "edginess."
(ii) Angle: the potential edge angle, computed as in the case of Sobel (see equations (1) and (2)). It is a circular variable, so a practical way to use the angle information when building the ML classifiers is to decompose this variable into two components: its sine and its cosine.
(iii) Position: the pixel horizontal and vertical positions $(i, j)$, which give spatial information that is relevant for the classification.
(iv) Border distances: two variables for the minimum distances from the pixel position to the horizontal borders and the vertical borders, respectively.
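The pixel predictors could be computed as in the following sketch, assuming the soft image and the Sobel components $G_x$ and $G_y$ are precomputed (the function name and the dictionary keys are illustrative):

import numpy as np

def pixel_features(soft, gx, gy, i, j):
    """Pixel-level predictors for pixel (i, j)."""
    n_rows, n_cols = soft.shape
    theta = np.arctan2(gy[i, j], gx[i, j])      # potential edge angle
    return {
        "edginess": soft[i, j],                 # intensity after the blending phase
        "angle_sin": np.sin(theta),             # circular variable decomposed
        "angle_cos": np.cos(theta),             #   into two linear components
        "row": i, "col": j,                     # spatial position
        "dist_h": min(i, n_rows - 1 - i),       # distance to the horizontal borders
        "dist_v": min(j, n_cols - 1 - j),       # distance to the vertical borders
    }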
2.8. Neighborhood Predictor Variables
These are the predictor variables related to the pixel neighborhood. The most natural neighborhood variable is the soft edge detection value of the neighbor pixels, or any aggregation of these values. Clearly, for creating these variables, a specific window size for the neighborhood must be decided. In this research, a neighborhood grade of 1 (a $3 \times 3$ window) is used; therefore, a pixel has 3, 5, or 8 different neighbors (depending on the pixel position). The neighborhood variables employed were the following (see the sketch after this list):
(i) The neighbors' soft values.
(ii) Maximum, minimum, and average intensity of the 8 neighbor pixels.
(iii) Maximum, minimum, and average intensity of the 3 neighbor pixels of the pixel's row.
(iv) Maximum, minimum, and average intensity of the 3 neighbor pixels of the pixel's column.
(v) Maximum, minimum, and average intensity of the 3 neighbor pixels of both diagonals.
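One possible sketch of these statistics for an interior pixel follows; the exact row/column/diagonal groupings are one plausible reading of the description above, and border pixels (those with 3 or 5 neighbors) would need special handling:

import numpy as np

def neighborhood_features(soft, i, j):
    """Grade-1 (3x3) neighborhood predictors for an interior pixel (i, j)."""
    w = soft[i - 1:i + 2, j - 1:j + 2]
    groups = {
        "all": np.delete(w.ravel(), 4),            # the 8 surrounding pixels
        "row": w[1, [0, 2]],                       # left and right neighbors
        "col": w[[0, 2], 1],                       # upper and lower neighbors
        "diag": w[[0, 0, 2, 2], [0, 2, 0, 2]],     # the four corner neighbors
    }
    feats = {}
    for name, vals in groups.items():
        feats[name + "_max"] = vals.max()
        feats[name + "_min"] = vals.min()
        feats[name + "_mean"] = vals.mean()
    return feats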
2.9. Segment Predictor Variables
This set of variables is obtained from the edge segments that result from connecting the candidate pixels of the soft image. The concept of edge segments was presented in [20, 31]. When a certain pixel does not belong to a segment, these variables are set to 0. Some of a segment's valuable information is related to its length, average position, the rectangle containing it, and so on. For the experimental results of this paper, the following segment variables have been employed:
(i) Length: for each segment $s$, its number of pixels, $|s|$.
(ii) Intensity mean: for each segment $s$, $\mu_s = \frac{1}{|s|} \sum_{(i, j) \in s} E_t(i, j)$, where $E_t(i, j)$ represents the intensity of pixel $(i, j)$ inside the already thinned image $E_t$ (see the ED phases in [31]).
(iii) Maximum edginess: for each segment $s$, we obtain $\max_{(i, j) \in s} E_t(i, j)$.
(iv) Standard deviation of the intensity: for each segment $s$, $\sigma_s = \sqrt{\frac{1}{|s|} \sum_{(i, j) \in s} (E_t(i, j) - \mu_s)^2}$.
(v) "Rule of thirds" position: for each segment $s$, we obtain the coordinates of its gravity center, $(\bar{i}_s, \bar{j}_s)$, where $\bar{i}_s = \frac{1}{|s|} \sum_{(i, j) \in s} i$ is the average vertical position and $\bar{j}_s = \frac{1}{|s|} \sum_{(i, j) \in s} j$ is the average horizontal position of the pixels in $s$.
Once the gravity center is computed, its Euclidean distance to the intersection points following the rule of thirds is obtained; this rule is a standard in photographic composition [32]. It establishes that the most important objects inside an image are usually placed close to the intersections of the lines that divide the image into three equal parts. Following this principle, we computed the minimum of the four distances, as there are four intersection points created by these four lines. The remaining segment variables are listed next (a sketch of all the segment predictors is shown after this list):
(i) Rectangle area: the area of the minimum rectangle that contains the segment to which the pixel belongs.
(ii) Segment membership: the belonging itself of the pixel to a certain segment, taking the value 1 if the pixel belongs to a segment and 0 otherwise.
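The following is a minimal sketch of the segment predictors, assuming each segment is given as a list of (i, j) pixel coordinates over the thinned image (all names are illustrative):

import numpy as np

def segment_features(seg, thinned, height, width):
    """Segment predictors: length, intensity mean/max/std, bounding
    rectangle area, and the rule-of-thirds distance of the gravity center."""
    vals = np.array([thinned[i, j] for (i, j) in seg])
    rows = np.array([i for (i, _) in seg])
    cols = np.array([j for (_, j) in seg])
    center = (rows.mean(), cols.mean())             # gravity center
    thirds = [(height * a / 3.0, width * b / 3.0)
              for a in (1, 2) for b in (1, 2)]      # the 4 intersection points
    d_thirds = min(np.hypot(center[0] - p, center[1] - q) for p, q in thirds)
    rect_area = (np.ptp(rows) + 1) * (np.ptp(cols) + 1)  # minimal bounding rectangle
    return {"length": len(seg), "mean": vals.mean(), "max": vals.max(),
            "std": vals.std(), "thirds_dist": d_thirds, "rect_area": rect_area}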
2.10. General Predictor Variables
These are the variables whose value depends on the whole image, i.e., they provide general information about the image rather than about a certain pixel. These variables must be used carefully: if each image had a different value for a certain general variable, the ML method could wrongly use that value to identify the specific image, which would not be correct for classification purposes. Examples of general variables are the average soft value of the image (or another aggregation function over the soft values), the percentage of pixels above a given soft value, the image dimensions, and so on.
In the experimental results of this paper, this type of variable has not been used, in order to avoid the image-identification tendency that could badly affect the ML classification, especially when the training data set is not big enough.
2.11. Collecting the Predictors and the Labels
The pseudocode below describes the process of collecting the predictor variables and the labels for each pixel:

I = read(imagefile)        # an image I is taken as input
NumHumans = h              # required number of humans for drawing (i, j) as an edge
for every pixel (i, j) of I do
    compute variable 1     # v1 = the intensity of the gray channel
    compute variable 2     # v2 = the potential edge angle
    ...
    compute variable 36    # v36 = the belonging of the pixel to a certain segment
    FeatureVector(i, j) = (v1, ..., v36)
    if (i, j) is an edge in the aggregated ground-truth GT then Label(i, j) = 1
    else Label(i, j) = 0
end
As a result of running the above procedure, a vector combining the feature vector and its respective label is created for each pixel. An example of this can be seen in Figure 4.
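In Python, the collection loop could be sketched as follows, reusing the hypothetical helpers from the previous subsections (segment variables default to 0 for pixels outside any segment, as described above):

import numpy as np

def collect_dataset(soft, gx, gy, thinned, segments, gt):
    """Builds the design matrix X (one row per pixel) and the label vector y."""
    rows, labels = [], []
    n_rows, n_cols = soft.shape
    seg_of = {px: seg for seg in segments for px in seg}   # pixel -> its segment
    for i in range(1, n_rows - 1):                         # interior pixels only
        for j in range(1, n_cols - 1):
            feats = {**pixel_features(soft, gx, gy, i, j),
                     **neighborhood_features(soft, i, j)}
            seg = seg_of.get((i, j))
            if seg is not None:
                feats.update(segment_features(seg, thinned, n_rows, n_cols))
            feats["in_segment"] = int(seg is not None)
            rows.append(feats)
            labels.append(int(gt[i, j]))
    cols = sorted({k for r in rows for k in r})            # fixed column order
    X = np.array([[r.get(c, 0.0) for c in cols] for r in rows])  # missing -> 0
    return X, np.array(labels), cols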

2.12. Using Machine Learning for Edge Detection Problems
The experiment was conducted following these steps:
(1) The first 50 images of the Berkeley segmentation data set [28], sorted by number, were employed (from 100075.jpg to 16052.jpg).
(2) A modified version of the ground-truth of these images was created following the method explained in the "Building a Suitable Ground-Truth" Subsection (see Figure 2), and it was used for creating the pixel labels. The values employed were $h = 2$ (2 humans), with the dilation made with squares of $3 \times 3$ pixels.
(3) The predictor variables explained above were extracted, taking as input the soft value obtained after applying the Sobel operator. As a result of this step, a matrix with all cases (i.e., the pixels) as rows and the 36 predictor variables as columns was created. (The size of this matrix was 7720050 rows × 36 columns.)
(4) The set of 50 images was split into two parts: the first 30 images were used as the training set, while the rest formed the validation set.
(5) Three machine learning (ML) algorithms were employed over the training set: Logistic Regression, Neural Networks (NN), and Random Forests (RF). A fourth algorithm, SVM, was also tested, but its training set was smaller than those of the other algorithms, so it has not been compared with them. Dozens of models were tested, created by changing the values of the parameters of these algorithms. In the case of logistic regression, all predictor variables were included, and three well-known methods for selecting the variables were employed: Forward Selection, Backward Elimination, and their combination, Stepwise Selection [33]. Moreover, these methods were applied using different significance values. Finally, three different link functions were used: Logit, Probit, and Cloglog. The NN models were all built with one single hidden layer, varying the number of nodes from one to seven. Three different activation functions were employed: Arc tangent, Exponential, and Hyperbolic tangent. The independent variables used for these NN models were those chosen by the best logistic model; the main reason for not employing all predictor variables was to remove multicollinearity in the NN models. Finally, the RF models [27] used all possible predictors, as this kind of model deals properly with multicollinearity. For the RF algorithm, the number of trees was limited to 100, the number of variables per tree ranged from 5 to 30, the train fraction was set to 0.6, the leaf size ranged from 30 to 100000, the maximum depth was 50, and three different values were employed: 0.01, 0.05, and 0.1.
(6) For the validation set, the confusion matrix was computed. This was done 15 times, as the classification is made using 15 different scoring thresholds (from 1% to 15% of the highest scorings). For each algorithm, the model that reaches the highest true positive rate is considered the best one.
(7) Following a 10-fold cross-validation method, the 50 images were divided into 5 blocks of 10 images each, which resulted in 10 different combinations of training sets of 30 images and validation sets of 20 images. (In the case of the SVM models, the training size was shortened to 25000 pixels for simplification purposes, as the computational costs of these models are higher than those of the others.)
(8) For the 10 cross-validation combinations of the previous step, and for the best three models plus Sobel's, three different F-measures were computed for each image: "F minimum," representing the most different human; "F mean," representing the average human; and "F maximum," representing the closest human.
The matching between the outputs and the humans was made allowing a tolerance window of $3 \times 3$ pixels, following the procedure proposed in [29]; a simplified sketch of this tolerant matching is shown below.
(9) The different F values of each algorithm were compared.
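As a simplified sketch of the tolerant matching behind these F-measures (a dilation-based approximation, not the exact matching algorithm of [29]), a predicted edge counts as a hit when it falls within the tolerance window of a human edge:

import numpy as np
from scipy.ndimage import binary_dilation

def f_measure(pred, human, tol=1):
    """Tolerant F-measure between a binary edge map and one human sketch."""
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)   # 3x3 window for tol=1
    prec_hits = np.logical_and(pred, binary_dilation(human, structure=struct)).sum()
    precision = prec_hits / max(pred.sum(), 1)
    rec_hits = np.logical_and(human, binary_dilation(pred, structure=struct)).sum()
    recall = rec_hits / max(human.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

# For one image with K human sketches:
# fs = [f_measure(pred, h) for h in humans]
# f_min, f_mean, f_max = min(fs), sum(fs) / len(fs), max(fs)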
We can see the scheme of the whole experiment in Figure 5.

3. Results and Discussion
Table 1 shows that the new proposed methodology, based on post-processing the edge detection output with machine learning techniques, improves Sobel's performance in terms of F-measure. As an example of this, we can see in Figures 6–8 that the Neural Network models outperform both the Random Forest and the logistic models.



Table 2 shows that the Logistic, RF, and NN performances were significantly better than Sobel's. Moreover, among the ML algorithms, NN reached the best performance.
The best logistic model found employed 19 variables. Furthermore, it used the Logit function and the Stepwise method with a significance value of $10^{-14}$, and the best scoring threshold found was 8%. The best RF model found had a leaf size of 100, a maximum depth of 50, and a value of 0.01; its best scoring threshold was 8%. Finally, the best NN model employed the same 19 variables, used the arc tangent as its activation function and 5 nodes, and its best scoring threshold was 10%.
Figure 9 shows an example of three binarized images after learning the edges through the different classifiers. It can be appreciated that the best quality of edge extraction was reached by the Neural Networks method, followed closely by Random Forest. These edges can be considered "better" edges especially because of the strong continuity that they show.

4. Conclusions
The proposed methodology yields interesting and promising results that deserve deeper research. Our post-processing approach based on machine learning techniques showed that it is possible to improve an edge detection output such as Sobel's. Furthermore, the same procedure seems easily adaptable to many other ED algorithms.
The edges extracted by our method proved to be more relevant than those obtained through classic ED, as they managed to keep more valuable information related to the important objects of the image. This improvement was possible thanks to an intelligent modification of the original ground-truth, which can be considered an interesting novelty of the present research. Such an approach became possible after allowing edges thicker than one pixel. The utility of creating these thickened or dilated edges is justified by the fact that human vision is not able to match a certain edge to its exact pixel location.
An immediate extension of this research may explore the ED problem with other ED algorithms, such as Canny's [34]. In the case of the Canny algorithm, this may require the inclusion of predictor variables different from those used with Sobel's.
A natural extension of this research would be to consider more channels/colors, with the extra difficulty of aggregating the information of these channels in an intelligent and useful way. Another extension could consist of running more tests with SVM, so that its results become comparable with those of the other algorithms.
Finally, the idea of developing deep learning methods that make use of single pixels as input seems promising.
Data Availability
The data used to support the findings of this study have been deposited in the data.world repository (https://data.world/pflores/edge-detection-with-machine-lerarning); the instructions to work with the data are included there. The set of images used in this experiment was first presented in [19], and it can be downloaded from the resources available on the Berkeley Computer Vision Group's web page: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html#bsds500.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
In the conduct of this research, the code created by the Kermit Research Unit was very helpful [5]. This research has been partially supported by the Spanish Ministry of Science (PGC2018-096509-BI00).