Abstract
In this paper, we apply kernel methods to deep convolutional neural networks (DCNNs) to improve their nonlinear ability. DCNNs have achieved significant improvements in many computer vision tasks. For an image classification task, the accuracy saturates once the depth and width of the network are sufficient, and the saturation accuracy does not rise even when the depth and width are increased further. We find that improving the nonlinear ability of DCNNs can break through this saturation accuracy. In a DCNN, earlier layers are more inclined to extract features and later layers are more inclined to classify them. Therefore, we apply kernel methods at the last fully connected layer to implicitly map features to a higher-dimensional space, improving the nonlinear ability of the network so that it achieves better linear separability. We name the resulting networks fully kernected neural networks (fully connected neural networks with kernel methods). Our experimental results show that fully kernected neural networks achieve higher classification accuracy and a faster convergence rate than their baseline networks.
1. Introduction
Recently, deep convolutional neural networks (DCNNs) have achieved tremendous success in a series of computer vision tasks, such as image classification [1–4], object detection [5–8], semantic segmentation [9–11], and video captioning [12]. These tasks benefit from very deep models. A question then arises: does adding more layers to a DCNN lead to better performance? Unfortunately, the performance of a DCNN saturates once the depth and width of the network are sufficient, and adding more layers to a suitably deep model leads to higher training error, as reported in [13, 14]. In [2], ResNet is proposed to address this degradation problem. ResNet is able to go deeper and achieve better classification accuracy than the corresponding baseline networks. ResNet achieves its best accuracy of 93.57% on the CIFAR10 dataset [15] at depth 110, whereas a deeper ResNet with depth 1202 achieves only 92.07% classification accuracy on CIFAR10. Wide residual networks (WRNs) [16] increase the width of ResNet and outperform their commonly used thin and very deep counterparts. However, the classification accuracy of WRNs on CIFAR10 indicates that a WRN achieves its best accuracy at a certain width, and increasing the width further causes the accuracy to drop. For example, WRN 28-12 and WRN 28-10 have the same depth, and WRN 28-12 is wider and has 1.44 times as many parameters as WRN 28-10, but the accuracy of WRN 28-12 (95.67%) is slightly lower than that of WRN 28-10 (95.83%). These experimental results on CIFAR10 show that a wider and deeper DCNN is not always better.
Deeper and wider DCNNs indeed extract more features. But why do more features not lead to better classification accuracy? One possible reason is that the performance of a DCNN depends not only on its feature extraction ability [17, 18] but also on its feature utilization ability. A very deep and wide DCNN may extract a rich variety of features and yet be unable to make full use of them, so it cannot reach satisfactory performance. The reasons are manifold. First, deep learning models are criticized as black boxes: because their mechanisms are poorly understood, it is difficult to make targeted adjustments. Moreover, DCNNs are sometimes so complex that they are difficult to train to convergence, and when the training data are extremely complex, they perform poorly even if the network is very deep and wide. Because of these drawbacks, we cannot be sure that the extracted features are utilized sufficiently. Therefore, we seek to enhance the feature utilization ability of DCNNs.
Compared with deep learning methods, traditional machine learning methods [19–23] have better explainability. A typical machine learning process is usually accomplished by selecting a hypothesis space and learning a function estimator. The hypothesis space selection is accomplished by choosing a subset of the original features, which is then optimally reduced according to a certain evaluation criterion. The feature extraction process focuses on designing preprocessing pipelines and data transformations that result in a tractable representation of the data [24]. For example, the support vector machine (SVM) is a typical machine learning method that aims to find the best separating hyperplane in feature space by maximizing the distance between the positive and negative samples of the training set. Besides performing linear classification, SVMs can efficiently perform nonlinear classification by using kernel methods to implicitly map data to a higher-dimensional feature space. Cover's theorem [25] shows that complex data are more likely to be linearly separable in a higher-dimensional space than in a lower-dimensional one. Therefore, in this paper, we seek to enhance the nonlinear ability of DCNNs by introducing kernel methods.
Kernel methods [26] can implicitly map data from the original feature space to a higher-dimensional feature space in which the data have a higher probability of being linearly separable. We find that combining kernel methods with DCNNs is beneficial for improving the nonlinear ability of the network. In a DCNN, earlier layers are more inclined to extract features and later layers are more inclined to classify them. Hidden layers extract distributed features hierarchically as the DCNN grows from shallow to deep. It is worth mentioning that the last fully connected layer not only extracts high-level features but also, acting as a classifier, maps the learned distributed feature representation to the feature space. Compared with features extracted by a convolutional layer, which are local and spatially weak, features extracted by the last fully connected layer are global and contain spatial information. Therefore, in this paper, we apply kernel methods at the last fully connected layer to extract high-level features in a higher-dimensional feature space and thereby achieve better network nonlinear ability. Specifically, we apply kernel functions to the input features and weights of the last fully connected layer. The kernel functions can take different forms, such as the polynomial kernel or the Gaussian kernel. Based on this layer, which we name the fully kernected layer (FKL), we propose fully kernected neural networks (FKNNs).
Machine learning methods and deep learning methods have their respective advantages. Machine learning methods are well established, transparent, and optimized for performance and power efficiency, while deep learning methods offer greater accuracy and versatility at the cost of large amounts of computing resources. Combining them takes advantage of both. On the one hand, deep learning methods extract adaptive features hierarchically through different layers. On the other hand, kernel methods have an explicit classification ability and can speed up the classification process. The contributions of this paper are summarized as follows:
(i) We propose FKNNs, which map high-level global features to a higher-dimensional feature space by applying kernel methods at the last fully connected layer.
(ii) The nonlinear ability of FKNNs is increased compared with the baseline networks.
(iii) The FKL is easy to implant into various deep neural networks, such as ResNet, DenseNet, and GoogLeNet, and achieves better classification accuracy.
The rest of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we describe the details of fully kernected layers. Section 4 presents a series of experiments on different datasets. In Section 5, we present the discussion and conclusions of our study.
2. Related Work
2.1. Activation Functions
Activation functions are widely used in DCNNs to perform nonlinear mappings between inputs and outputs. The rectified linear unit (ReLU) [27] has been the most popular and commonly used nonlinear activation function across various DCNNs. However, ReLU has several shortcomings that sometimes result in inefficient training, such as its negative cancellation property, multilinear structure, and highly positive mean. Several works have been proposed to overcome these shortcomings. Leaky ReLU (LReLU) addresses the negative cancellation property of ReLU and prevents a large number of neurons in the network from becoming permanently inactive. The exponential linear unit (ELU) [28] deals with the bias shift effect of ReLU by pushing the mean activation toward zero. Swish [29] is a non-piecewise activation function based on the Sigmoid function, which achieves better expressive power and faster convergence than ReLU. Parametric Flatten-T Swish (PFTS) [30] is an adaptive nonlinear activation function that manifests higher nonlinear approximation power during training.
The abovementioned activation functions are usually placed after hidden layers of a DCNN, such as convolutional layers and batch normalization (BN) layers [31]. In a classification task, the function applied after the last fully connected layer is typically softmax. The output feature of the last fully connected layer can be regarded as a point in a multidimensional feature space, and softmax maps this point from the real-valued domain to the (0, 1) probability domain for better interpretability and operability. This mapping does not change the rank order of the category probabilities. Different from softmax, our FKL is applied to the input features and weights of the last fully connected layer and induces a higher-dimensional feature space together with the corresponding classification hyperplane. The main purpose of the FKL is to achieve better classification performance by mapping high-level global features to a higher-dimensional feature space, which differs from general activation functions such as ReLU and softmax.
2.2. Kernel Methods
Kernel methods [32, 33] are widely used in machine learning, as most machine learning problems are nonlinear and often require nonlinear methods that map data to a high-dimensional feature space. As long as we can formulate features in terms of kernel evaluations, we do not need to compute in the high-dimensional feature space directly, even when it is infinite dimensional. Predefined kernel functions, such as polynomial and Gaussian kernels, are chosen empirically according to the task. Some approaches [34, 35] take advantage of deep neural networks (DNNs) to learn kernel functions from data for specific tasks. The authors in [34] transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure-exploiting algebra for a scalable kernel representation that replaces standard kernels. The authors in [35] propose a deep kernel method that automatically learns a kernel function from data using deep belief networks, relieving users of the burden of selecting, defining, and configuring a kernel function when applying kernel methods. These learned kernel functions perform competitively with predefined kernel functions and remove the need to select an appropriate predefined kernel. However, these methods focus on specific problems and generalize poorly.
Other approaches attempt to find the inner relation between DNNs and kernel methods. The authors in [36] regard deep networks as a sequence of deeper and deeper kernels and present a method for analyzing DNNs that combines kernel methods with descriptive statistics in order to quantify the layer-wise evolution of the representations in DNNs. Their framework expresses the relation between the representations built in the DNNs and the learning problem. The authors in [37] present a new family of positive-definite kernel functions that mimic the computation in DNNs and can be used in shallow architectures, such as SVMs, or in deep kernel-based architectures, such as multilayer kernel machines (MKMs). However, these approaches only mimic neural networks with simple architectures and have difficulty dealing with popular deep neural networks containing complex components, such as residual blocks [2] and dense blocks [3].
Further approaches combine DNNs with kernel methods to take full advantage of both by replacing the softmax classifier of a DNN with an SVM (especially a linear SVM) after the earlier part of the DNN extracts high-level features. The authors in [38] present a hybrid system in which a convolutional network is trained to detect and recognize generic objects, and a Gaussian-kernel SVM is trained on features learned by the convolutional network. The author in [39] optimizes the primal problem of SVMs by using an L2-SVM, which is differentiable, so that the gradients can be backpropagated from the top L2-SVM layer to learn lower-level features. These approaches improve performance to some extent. However, the networks are not end-to-end, and the lower-level features are generally not fine-tuned. In addition, as DNNs grow deeper and adopt more complex architectures, it becomes more difficult to obtain good features by such training.
2.3. Kervolutional Neural Networks
Kervolutional neural networks [40] introduce a new operation, kervolution (kernel convolution), to generalize convolution via kernel methods. Kervolution enhances model capacity and captures higher-order interactions of features. It yields end-to-end networks and has satisfactory compatibility with popular convolutional neural networks. However, the features learned in a convolutional layer are local and spatially weak; the same features may belong to different categories in different situations, and global information is of significant importance in image classification tasks. Different from kervolutional neural networks, FKNNs map the high-level global features of the last fully connected layer to a higher-dimensional feature space. FKNNs aim to find a more appropriate feature space for feature classification and thereby enhance network nonlinear ability.
3. Fully Kernected Layer
From early single-layer perceptrons to multilayer perceptrons, all the hidden layers are fully connected layers. In recent years, fully connected layers are still widely utilized in most popular deep neural networks, such as ResNet and DenseNet. They are mainly used as classifiers at the end of the network. In a DNN, the last fully connected layer extracts high-level features and, acting as a classifier, maps the learned distributed feature representation to the sample space. A fully connected layer with input $x$ and output $y$ is formulated as follows:

$$y = W \otimes x + b, \quad (1)$$

where $\otimes$ is the matrix multiplication operation, $W$ is the parameter of the fully connected layer, and $b$ is the bias. Formula (1) can be regarded as a linear weighting between the parameter $W$ and the input $x$. In this paper, we introduce fully kernected layers by applying kernel methods to fully connected layers. The output of the fully kernected layer is defined as follows:

$$y = \kappa(W, x) + b, \quad (2)$$

where $\kappa(\cdot, \cdot)$ is the kernel operation.
3.1. Kernel Functions
According to definition (2), we can choose different kernels to derive different forms of the fully kernected layer. The polynomial kernel function is defined as follows:

$$\kappa_{\mathrm{poly}}(W, x) = (W \otimes x + c)^{p}, \quad (3)$$

where $c$ is the balance vector that balances the proportion of the high-order terms, $p$ is the order of the polynomial kernel, and the power is taken element-wise; definition (2) then becomes

$$y = (W \otimes x + c)^{p} + b. \quad (4)$$
A larger $p$ leads to a higher nonlinear order. $c$ is a non-negative parameter that decides the proportion of the different order terms: a smaller $c$ leads to a larger proportion of the high-order terms. If $c = 0$, there are no linear terms and definition (2) becomes

$$y = (W \otimes x)^{p} + b. \quad (5)$$
If $c$ is larger, the proportion of the nonlinear terms becomes smaller. The Gaussian kernel function is defined as follows:

$$\kappa_{\mathrm{gauss}}(w_j, x) = \exp\!\left(-\gamma \lVert w_j - x \rVert^{2}\right), \quad (6)$$

where $\gamma$ is a hyperparameter that controls the smoothness of the decision boundary and $w_j$ is the $j$-th column of $W^{\top}$; definition (2) then becomes

$$y_j = \exp\!\left(-\gamma \lVert w_j - x \rVert^{2}\right) + b_j. \quad (7)$$
In this paper, we recommend the polynomial kernel of order 2. Our experimental results show that the polynomial kernel of order 2 achieves the best accuracy among all the kernel functions we evaluated.
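To make the two kernel operations concrete, the following minimal PyTorch sketch (our own illustration; the function names, shapes, and the element-wise handling of the balance vector are assumptions rather than the authors' released code) computes the fully kernected output of definition (2) with the polynomial kernel of formula (4) and the Gaussian kernel of formula (7).

```python
import torch

def fk_polynomial(x, W, b, c, p=2):
    # Polynomial fully kernected output, cf. formulas (3)-(4):
    # y = (W x + c)^p + b, with the power taken element-wise.
    return (x @ W.t() + c) ** p + b

def fk_gaussian(x, W, b, gamma=1.0):
    # Gaussian fully kernected output, cf. formulas (6)-(7):
    # y_j = exp(-gamma * ||w_j - x||^2) + b_j, evaluated for a batch of inputs.
    dist2 = torch.cdist(x, W).pow(2)   # pairwise squared distances, shape (batch, out_features)
    return torch.exp(-gamma * dist2) + b

# Toy usage: a batch of 8 feature vectors of size 512 mapped to 10 classes.
x = torch.randn(8, 512)
W = torch.randn(10, 512)
b = torch.zeros(10)
c = torch.ones(10)
print(fk_polynomial(x, W, b, c).shape)  # torch.Size([8, 10])
print(fk_gaussian(x, W, b).shape)       # torch.Size([8, 10])
```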
3.2. Model Capacity
A fully connected layer achieves its classification ability by matching features. According to formula (1), the $j$-th element of the output of a fully connected layer is calculated as follows:

$$y_j = \langle x, w_j \rangle + b_j, \quad (8)$$

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors, $w_j$ is the $j$-th column of $W^{\top}$, and $y_j$ is the $j$-th element of $y$. It measures the similarity between the input $x$ and the parameter $w_j$, since

$$\langle x, w_j \rangle = \lVert x \rVert \, \lVert w_j \rVert \cos\theta, \quad (9)$$

which involves the cosine of the angle $\theta$ between the two vectors.
According to definition (2), the $j$-th element of the output of a fully kernected layer is calculated as follows:

$$y_j = \kappa(x, w_j) + b_j. \quad (10)$$
For a polynomial kernel with order 2, formula (10) becomes

$$y_j = \left(\langle x, w_j \rangle + c_j\right)^{2} + b_j = \langle x, w_j \rangle^{2} + 2 c_j \langle x, w_j \rangle + c_j^{2} + b_j, \quad (11)$$

where $c_j$ is the $j$-th element of the balance vector $c$. Thus, in addition to the linear feature matching of fully connected layers, fully kernected layers achieve nonlinear feature matching via kernel methods.
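As a quick numerical sanity check of the expansion in formula (11) (a toy illustration of ours with arbitrary values, not an experiment from the paper), the direct kernel evaluation and the expanded form coincide:

```python
import torch

torch.manual_seed(0)
x, w_j = torch.randn(512), torch.randn(512)
c_j, b_j = 0.5, 0.1

inner = torch.dot(x, w_j)
direct = (inner + c_j) ** 2 + b_j                         # left-hand side of (11)
expanded = inner ** 2 + 2 * c_j * inner + c_j ** 2 + b_j  # right-hand side of (11)
print(torch.isclose(direct, expanded))                    # tensor(True)
```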
Global feature extraction is a core advantage of the fully connected layer, as it greatly reduces the influence of feature location on classification. According to definition (2), the fully kernected layer still uses all of the input information and therefore retains this advantage.
3.3. Inheritance Compatibility
Fully connected layers have been widely used in most DCNNs and have well-defined classes in many frameworks, such as TensorFlow and PyTorch. To be compatible with existing work, we implement the fully kernected layer as a class that inherits from the existing fully connected layer class. Besides the parameters required by the fully connected layer class, such as the number of input features, the number of output features, and the bias, users only need to specify the kernel parameters. According to definition (2), the fully kernected layer introduces no additional parameters beyond the kernel parameters. Taking a polynomial kernel of order 2 as an example, it only introduces the balance vector $c$ as an extra parameter; if $c$ is fixed, the layer introduces no extra parameters at all. In addition, the running efficiency of the inherited layer is similar to that of the original fully connected layer.
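A minimal sketch of such an inherited class is shown below. This is our own illustration of the idea in PyTorch with a polynomial kernel of order 2; the class name, constructor arguments, and the choice to learn the balance vector are hypothetical and do not correspond to the authors' released library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyKernectedLayer(nn.Linear):
    """Fully kernected layer that inherits from nn.Linear (polynomial kernel of order 2)."""

    def __init__(self, in_features, out_features, bias=True, order=2, learn_balance=True):
        super().__init__(in_features, out_features, bias=bias)
        self.order = order
        if learn_balance:
            # Balance vector c of formula (3): one entry per output unit, the only extra parameter.
            self.balance = nn.Parameter(torch.ones(out_features))
        else:
            # A fixed c introduces no additional learnable parameters.
            self.register_buffer("balance", torch.ones(out_features))

    def forward(self, x):
        linear = F.linear(x, self.weight)            # W x, bias added after the kernel
        out = (linear + self.balance) ** self.order  # (W x + c)^p, element-wise
        if self.bias is not None:
            out = out + self.bias
        return out

# Drop-in replacement for a last fully connected layer with 512 inputs and 10 classes.
fkl = FullyKernectedLayer(512, 10)
y = fkl(torch.randn(8, 512))   # shape (8, 10)
```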
4. Experiment
In this section, we present the experimental results of FKNNs. For a fair comparison, we only replace the last fully connected layer of each baseline network with a fully kernected layer, and all other experimental settings are kept the same. First, we evaluate how the depth and width influence the performance of the network. Next, we compare our approach with other representative approaches on several extensively used datasets. Finally, we compare the convergence rate and running time of FKNNs with those of their corresponding baseline networks. All the following experiments are conducted on a single Nvidia Titan XP GPU.
4.1. Depth and Width
The depth and width of DNNs are of crucial importance. The authors in [13] reveal that adding more layers to a suitably deep model leads to higher training error, which is not caused by overfitting. This degradation problem indicates that not all DNNs are similarly easy to optimize. We find that the width of DNNs matters as well. Therefore, finding an appropriate network configuration is crucial for enhancing the performance of DNNs.
This section aims to find an appropriate network architecture for FKNN. We evaluate accuracy on the CIFAR10 dataset [15]. The CIFAR10 and CIFAR100 datasets consist of 32 × 32 colored natural images in 10 and 100 classes, respectively; each dataset contains 50k images for training and 10k images for testing. We choose ResNet as the baseline network, since ResNet has, to some extent, solved the degradation problem and provides architectures across a large range of depths. We experiment on a series of ResNets with depths 10, 18, 34, 50, and 110. In addition, we search for the best width for each depth. For all architectures, we replace the last fully connected layer with a fully kernected layer to construct the corresponding fully kernected ResNet. We apply a stochastic gradient descent (SGD) optimizer for training, with a mini-batch size of 64, a momentum of 0.9, and a weight decay of . We train the networks for 200 epochs with an initial learning rate of 0.01, and the learning rate decays by a factor of 0.1 at epochs 100 and 150.
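For reference, this training configuration corresponds roughly to the following PyTorch setup (a sketch under the stated settings; the model is a placeholder, and the weight decay value, which is not reproduced above, is left as an argument).

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

def make_optimizer(model, weight_decay):
    # SGD with momentum 0.9; the mini-batch size of 64 is handled by the DataLoader.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=weight_decay)
    # Initial learning rate 0.01, decayed by a factor of 0.1 at epochs 100 and 150.
    scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
    return optimizer, scheduler

# Example with a stand-in module; replace it with the fully kernected ResNet-18.
optimizer, scheduler = make_optimizer(torch.nn.Linear(512, 10), weight_decay=0.0)
```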
The series of ResNets with depths 10, 18, 34, 50, and 110 have four stages with widths 64, 128, 256, and 512, respectively. We change the width of the fully kernected ResNet by doubling or halving the width of each stage. Table 1 shows the detailed accuracy of the fully kernected ResNet at different depths and widths. Due to GPU memory limitations, we do not conduct experiments on ResNet-50 and ResNet-110 with widths 256, 512, 1024, and 2048. At each depth, the classification accuracy of the fully kernected ResNet tends to rise first and then fall as the width increases. Considering the best accuracy at each depth, ResNet-18 achieves the best accuracy among all the depths, and the accuracy drops gradually as the depth increases from 18 to 110. To sum up, an appropriate network architecture is crucial to the network performance, and we recommend an architecture with depth 18 and widths 128, 256, 512, and 1024 for the fully kernected ResNet, as it achieves the best accuracy among all the architectures.
4.2. Constructing FKNN on Representative Network Architectures
In this section, we apply our fully kernected layers to several representative DNN architectures, namely LeNet-5, ResNet, DenseNet, and GoogLeNet. We choose kervolutional neural networks (KNNs) as a comparison, since they show significant improvements on these architectures. We evaluate accuracy on the CIFAR10 and CIFAR100 datasets.
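The construction only touches the classifier of each backbone. The sketch below (our own illustration, reusing the hypothetical FullyKernectedLayer class from Section 3.3 and torchvision model definitions as example backbones) shows how the last fully connected layer is replaced.

```python
import torchvision

# FullyKernectedLayer: the class sketched in Section 3.3 (assumed to be in scope).

# ResNet and GoogLeNet expose their last fully connected layer as `fc`.
resnet = torchvision.models.resnet18(num_classes=10)
resnet.fc = FullyKernectedLayer(resnet.fc.in_features, 10)

# DenseNet exposes it as `classifier`.
densenet = torchvision.models.densenet121(num_classes=10)
densenet.classifier = FullyKernectedLayer(densenet.classifier.in_features, 10)
```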
For the ResNet series of experiments, including ResNet, ResNet with the KNN method (applying kernel methods at the convolutional layers) [40], and ResNet with our FKNN method (applying kernel methods at the fully kernected layer), we train the networks for 200 epochs with an initial learning rate of 0.01, and the learning rate decays by a factor of 0.1 at epochs 100 and 150. We apply a stochastic gradient descent (SGD) optimizer for training, with a mini-batch size of 64, a momentum of 0.9, and a weight decay of . For the GoogLeNet series, we train the networks for 200 epochs with an initial learning rate of 0.1, except for GoogLeNet with the FKNN method, which uses 0.01, since a large initial learning rate leads to nonconvergence in that case; the learning rate decays by a factor of 0.1 at epochs 75, 125, and 150. The optimizer is the same as in the ResNet series of experiments. For the DenseNet series, we train the networks for 300 epochs with an initial learning rate of 0.1, except for DenseNet with the FKNN method, which uses 0.01, and the learning rate decays by a factor of 0.1 at epochs 150 and 225. The optimizer is also the same as in the ResNet series of experiments.
We show the best results of ResNet, DenseNet, and GoogLeNet, as well as their corresponding KNNs and our FKNNs with their respective best architectures, in Tables 2 and 3. The results on ResNet and CIFAR10 show that our FKNNs achieve 0.42% higher accuracy than KNNs and an even larger improvement of 4.55% over the CNN baselines. The results on ResNet and CIFAR100 show that our FKNNs achieve 2.08% higher accuracy than KNNs and a 7.67% accuracy improvement over the CNN baselines. KNN shows accuracy degradation in two groups of experiments, GoogLeNet on CIFAR10 and DenseNet on CIFAR100. Our FKNN still shows improvement in these two groups, including a 3.59% accuracy improvement in the DenseNet-CIFAR100 experiment.
4.3. Convergence Rate and Running Time
We evaluate the convergence rate on LeNet-5 and the MNIST [41] dataset. The MNIST dataset contains 60k training handwritten digits and 10k testing handwritten digits in 10 classes (0–9). The accuracy of most DNNs on MNIST has saturated close to 100%; thus, we adopt the evaluation criterion proposed in DAWNBench [42], which jointly considers computational effort and precision. It measures the time to a target validation accuracy (98.5%), which is a trade-off between efficiency and accuracy. We apply a stochastic gradient descent (SGD) optimizer for training, with a mini-batch size of 128 and a momentum of 0.9. We train the networks for 20 epochs with a learning rate of 0.003.
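The time-to-accuracy measurement itself can be sketched as follows (our own illustration; train_one_epoch and evaluate are hypothetical helpers standing in for the usual LeNet-5 training and validation loops under the settings above).

```python
import time

def time_to_accuracy(model, train_one_epoch, evaluate, target=0.985, max_epochs=20):
    # Train until the validation accuracy first reaches the target (98.5%) and
    # report the elapsed wall-clock time, which includes evaluation time.
    start = time.time()
    for epoch in range(max_epochs):
        train_one_epoch(model, epoch)   # SGD, lr 0.003, batch size 128, momentum 0.9
        accuracy = evaluate(model)
        if accuracy >= target:
            return epoch + 1, time.time() - start
    return None, time.time() - start
```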
We compare FKNNs with the baseline LeNet-5 network and LeNet-5 with the KNN method. For KNNs, we choose polynomial kernels and set the hyperparameters to the values that their paper indicates achieve the best accuracy. For our FKNNs, we apply polynomial kernels of order 2 and set the parameter $c$ to different values to control the proportion of the different order terms. As shown in Figure 1, our FKNNs converge to a validation accuracy of 98.5% much faster than the baseline LeNet-5 network and slightly faster than KNNs. A smaller $c$ leads to a larger proportion of high-order terms and a faster convergence rate, which further indicates that the high-order terms accelerate the convergence of the network. Table 4 shows the training time per epoch and the training time to the target accuracy (98.5%); the reported training time includes the testing and checkpoint-saving time.

We further evaluate the training time and convergence time of ResNet and DenseNet on the CIFAR10 and CIFAR100 datasets, comparing FKNNs with the baseline ResNet and DenseNet and their corresponding KNNs. As shown in Tables 5 and 6, although the training time per epoch of FKNN is only slightly smaller than that of the baseline CNN and KNN, the convergence rate of FKNN is 1.31∼1.48 times faster than that of the baseline CNN and 1.15∼1.30 times faster than that of KNN.

5. Conclusion
This paper introduces fully kernected neural networks (FKNNs), which apply kernel methods at the last fully connected layer of DCNNs to extract global features in a high-dimensional space and enhance the nonlinear ability of the network. FKNNs not only retain the advantage of global feature extraction but also achieve nonlinear feature matching, in addition to linear feature matching, via kernel methods. The experimental results show that FKNNs achieve a significant improvement over the baseline networks, converge faster, and consume less training time. In future work, we will explore different kernel functions.
Data Availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61773367, 61903358, and 61821005, in part by the Natural Science Foundation of Liaoning Province under Grant no. 2021-BS-023, and in part by the Youth Innovation Promotion Association of the Chinese Academy of Sciences under Grant no. Y202051.