Abstract

With fast learning speed and high accuracy, the extreme learning machine (ELM) has achieved great success in pattern recognition and machine learning. Unfortunately, it fails when the labeled samples available for training are insufficient, and such labeled samples are often difficult to obtain because of their high cost. In this paper, we address this problem with transfer learning and propose the joint transfer extreme learning machine (JTELM). First, it applies cross-domain mean approximation (CDMA) to minimize the discrepancy between domains, thus obtaining one ELM model. Second, subspace alignment (SA) and weight approximation are jointly introduced into the output layer to enhance the capability of knowledge transfer and learn another ELM model. Third, the prediction of test samples is jointly determined by the two learned ELM models. Finally, a series of experiments are carried out to investigate the performance of JTELM, and the results show that it efficiently accomplishes transfer learning tasks and performs better than the traditional ELM and other transfer or nontransfer learning methods.

1. Introduction

The rapid development of the mobile Internet, the Internet of Things, and high-performance computing has caused a huge amount of data to emerge. How to mine the information in these data to help people make decisions has become a challenge. Machine learning uses numerous labeled data to train a statistical model for automatic prediction, and it has become a hot topic in artificial intelligence (AI). As a high-performance model in machine learning, ELM has achieved success in pattern recognition, computational science, and machine vision. It has the following two merits [1–4]: fast learning speed and outstanding generalization performance. There is no need for ELM to tune its input weights and biases; it only needs to optimize the output weights by solving a least-squares problem. Therefore, it has been widely adopted for classification and regression in various fields, including industrial fault diagnosis [5, 6], medical diagnosis [7], hyperspectral imagery classification [8, 9], facial expression recognition [10], and brain-computer interfaces [11, 12]. However, like traditional machine learning models, ELM performs less satisfactorily when the training samples are insufficient.

Transfer learning (TL) can handle this problem: labeled samples (data) from other domains (the source domain) related to the current domain (the target domain) are adopted to train an efficient model that helps the target tasks [13–15]. TL not only reduces the cost of collecting training samples through data reuse but also enhances the generalization performance of the model; it is an expression of advanced intelligence. TL is commonly divided into three categories [14, 16], namely, instance-based transfer [17–19], feature-based transfer [20–23], and classifier (or parameter)-based transfer [24–26]. Moreover, with the success of deep learning and adversarial networks in computer vision and machine learning, deep transfer learning [27] and transfer adversarial learning approaches [28] have appeared, further enriching transfer learning in both theory and application.

TL can help ELM cope with the shortage of available training samples, and many variant ELMs with the ability of knowledge transfer have appeared. Depending on how adaptation between domains is performed, we divide transfer ELMs (TELMs) into the following three types. (1) Target-supervised methods: these usually require a few labeled samples from the target domain to adjust a model trained on the source domain. The domain adaptation extreme learning machine (DAELM) [29] was put forward to enable ELM to handle domain adaptation problems in the E-nose system. The online domain adaptation extreme learning machine (ODAELM) [30] and the online weighted domain transfer extreme learning machine (OWDTELM) [31] extend DAELM to online tasks. To further improve DAELM, Xia et al. [32] proposed boosting for DAELM (BDAELM), which introduces boosting technology to ensemble DAELMs. (2) Parameter transformation or approximation: this approach realizes knowledge transfer across domains through a transformation matrix or output weight approximation, such as the transfer extreme learning machine with output weight alignment (TELM-OWA) [33], the parameter transfer ELM (PTELM) [34], and extreme learning machine (ELM)-based domain adaptation (EDA) [35]. Li et al. [36] designed a transfer learning algorithm based on ELM (TL-ELM) by adding a constraint that forces the output weights of the two domains to be close to each other. (3) Statistical adaptation: this approach usually introduces a statistical distribution metric, such as MMD [37], into ELM to reduce the domain shift. Many methods, including cross-domain extreme learning machines (CdELMs) [38], the extreme learning machine based on maximum weighted mean discrepancy (ELM-MWMD) [39], and the domain space transfer ELM (DST-ELM) [40], apply MMD to reduce the distribution discrepancy between the hidden-layer outputs of the source and target domains.

In this paper, we propose a novel ELM called the joint transfer extreme learning machine (JTELM) for transfer learning. It first obtains one ELM model by introducing cross-domain mean approximation (CDMA) [41] into ELM, in which CDMA effectively minimizes the marginal and conditional distribution differences between the two domains. Second, we apply subspace alignment technology [42] to align the output weights of the two domains and simultaneously add an approximation term to force the output weights to be close to each other, which boosts knowledge transfer; we thereby obtain the other ELM model. Finally, the target samples are tested by the two learned ELMs. JTELM is illustrated in Figure 1. We carry out experiments on public datasets for transfer learning tasks to evaluate the performance of JTELM, and the results demonstrate the superiority of our method.

We summarize our contributions as follows:
(1) The CDMA measure is added to the objective function of ELM to reduce the distribution discrepancy between the hidden-layer outputs of the source and target domains, which yields one transfer ELM model.
(2) We apply output weight alignment and the approximation of the output weights from the two domains to improve the efficiency of knowledge transfer and simultaneously obtain the other transfer ELM.
(3) We use the two obtained transfer ELMs to jointly predict test samples, which enhances the robustness of JTELM. To evaluate the performance of our approach, we conduct classification experiments on object recognition and text datasets, and the results demonstrate that JTELM has a remarkable knowledge transfer ability.

We organize the rest of this paper as follows. ELM, CDMA, and SA are briefly described in Section 2. JTELM is described in detail in Section 3. Then, the experiments are analyzed in Section 4, and the conclusion of this paper is presented in Section 5.

2. Preliminaries

In this section, we briefly introduce ELM, CDMA, and SA.

2.1. Extreme Learning Machine (ELM)

ELM, as a single-hidden-layer feedforward network, randomly initializes its input weights and biases and then solves for the optimal output weight, which leads to its fast learning speed and high accuracy. Given a labeled dataset with samples $x_i$, $i = 1, \ldots, N$, and corresponding labels $t_i$, we can construct a classic ELM model with $L$ nodes in the hidden layer in the following manner:

$$f(x_i) = h(x_i)\beta = g(x_i W + b)\beta, \quad (1)$$

where $f(x_i)$ is the output of ELM for the input sample $x_i$, $g(\cdot)$ is the activation function, and $W$ and $b$ are the input weights and biases, which are often randomly initialized. $\beta$ is a vector representation of the output weight. If we want an optimal $\beta$, then the following loss function is solved:

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|e_i\|^{2} \quad \text{s.t. } h(x_i)\beta = t_i - e_i, \quad (2)$$

where $\|\beta\|^{2}$ is a sparsity constraint on the parameters that avoids model overfitting, $e_i$ is the classification error, and $C$ is its tradeoff parameter. We then convert equation (2) into the following matrix form:

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\|E\|^{2} \quad \text{s.t. } H\beta = T - E, \quad (3)$$

where $H = [h(x_1)^{\top}, \ldots, h(x_N)^{\top}]^{\top}$, $T = [t_1^{\top}, \ldots, t_N^{\top}]^{\top}$, and $E = [e_1^{\top}, \ldots, e_N^{\top}]^{\top}$.

According to [2], we get the optimal $\beta$ as

$$\beta = \left(\frac{I}{C} + H^{\top}H\right)^{-1}H^{\top}T. \quad (4)$$

Finally, we predict a testing sample $x$ as

$$y = h(x)\beta, \quad (5)$$

where $h(x) = g(xW + b)$ is the hidden-layer output of $x$.
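To make the training and prediction steps concrete, here is a minimal NumPy sketch of equations (1)–(5). The sigmoid activation, the one-hot label matrix T, and the variable names are illustrative assumptions rather than fixed choices of the paper.

```python
import numpy as np

def elm_train(X, T, n_hidden=500, C=1.0, seed=0):
    """Train a regularized ELM: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_hidden))     # random input weights (never tuned)
    b = rng.standard_normal(n_hidden)          # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer outputs, sigmoid activation
    # Closed-form solution of equation (3): beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Equation (5): propagate through the fixed random hidden layer and read out."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)         # predicted class index per sample
```

Training amounts to a single linear solve, which is where the fast learning speed of ELM comes from.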

2.2. Cross-Domain Mean Approximation (CDMA)

The distribution discrepancy measure is critical in transfer learning. Zang et al. [41] presented CDMA, which is nonparametric, easy to understand, efficient, and beneficial for mining local information. In transfer learning, there are two datasets: $D_s = \{(x_i^{s}, y_i^{s})\}_{i=1}^{n_s}$ from the source domain and $D_t = \{x_j^{t}\}_{j=1}^{n_t}$ from the target domain, where $n_s$ ($n_t$) is the number of samples in $D_s$ ($D_t$) and $y_i^{s}$, taking values in the shared classes, is the label of $x_i^{s}$. Then, we can get the CDMA measure as

$$d_{\mathrm{CDMA}} = \sum_{i=1}^{n_s}\|x_i^{s} - \bar{u}_t\|^{2} + \sum_{j=1}^{n_t}\|x_j^{t} - \bar{u}_s\|^{2}, \quad (6)$$

where $\bar{u}_t$ ($\bar{u}_s$) is the mean vector of the target (source) domain samples. If we further consider the label information of the samples, CDMA can also be represented as

$$d_{\mathrm{CDMA}}^{c} = \sum_{c}\Bigl(\sum_{x_i^{s} \in D_s^{c}}\|x_i^{s} - \bar{u}_t^{c}\|^{2} + \sum_{x_j^{t} \in D_t^{c}}\|x_j^{t} - \bar{u}_s^{c}\|^{2}\Bigr), \quad (7)$$

where $D_s^{c}$ ($D_t^{c}$) denotes the samples of category $c$ in the source (target) domain and $\bar{u}_t^{c}$ ($\bar{u}_s^{c}$) is the mean vector of the target (source) domain samples with category $c$.
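For illustration, the sketch below evaluates the marginal measure of equation (6) and the class-conditional measure of equation (7) directly from feature matrices; the unweighted squared-distance form follows the description above and should be treated as an assumed formulation of [41].

```python
import numpy as np

def cdma_marginal(Xs, Xt):
    """Marginal CDMA (equation (6)): each sample measured against the other domain's mean."""
    mu_s, mu_t = Xs.mean(axis=0), Xt.mean(axis=0)
    return np.sum((Xs - mu_t) ** 2) + np.sum((Xt - mu_s) ** 2)

def cdma_conditional(Xs, ys, Xt, yt_pseudo):
    """Conditional CDMA (equation (7)): the same distances, accumulated per shared class."""
    total = 0.0
    for c in np.unique(ys):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt_pseudo == c]
        if len(Xs_c) == 0 or len(Xt_c) == 0:
            continue                         # skip classes absent from either domain
        total += np.sum((Xs_c - Xt_c.mean(axis=0)) ** 2)
        total += np.sum((Xt_c - Xs_c.mean(axis=0)) ** 2)
    return total
```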

2.3. Subspace Alignment (SA)

In transfer learning, especially feature-based transfer, SA usually aligns two feature subspaces of the source and target domains obtained by other feature extraction methods, thereby making the distributions of the two domains consistent. If we have learned the two subspace transformation matrices $X_s$ and $X_t$, then a transformation matrix $M$ is obtained by solving the following function:

$$M^{*} = \arg\min_{M}\|X_s M - X_t\|_{F}^{2}, \quad (8)$$

where $\|\cdot\|_{F}$ is the Frobenius norm. We add the orthogonalization operation $X_s^{\top}X_s = I$ into equation (8) and get

$$M^{*} = \arg\min_{M}\|X_s^{\top}X_s M - X_s^{\top}X_t\|_{F}^{2} = \arg\min_{M}\|M - X_s^{\top}X_t\|_{F}^{2}. \quad (9)$$

From equation (9), we can see that $M^{*} = X_s^{\top}X_t$. We set $X_a = X_s M^{*} = X_s X_s^{\top}X_t$, and from this it is clear that the sample distribution in the subspace $X_a$ is more similar to the one in $X_t$ than the one in $X_s$ is, which facilitates knowledge transfer.
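The whole alignment step therefore reduces to a matrix product once the subspaces are orthonormal. The sketch below illustrates this with PCA bases and an assumed subspace dimension k.

```python
import numpy as np

def subspace_alignment(Xs, Xt, k=20):
    """SA: align the source PCA subspace to the target one, M* = Xs_b^T Xt_b (equation (9))."""
    # Orthonormal bases of the two domains (top-k principal directions).
    Xs_b = np.linalg.svd(Xs - Xs.mean(axis=0), full_matrices=False)[2][:k].T
    Xt_b = np.linalg.svd(Xt - Xt.mean(axis=0), full_matrices=False)[2][:k].T
    M = Xs_b.T @ Xt_b                  # closed-form minimizer of ||Xs_b M - Xt_b||_F
    Xa = Xs_b @ M                      # aligned source basis, Xa = Xs_b Xs_b^T Xt_b
    return Xa, Xt_b                    # project source data onto Xa, target data onto Xt_b
```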

3. Joint Transfer Extreme Learning Machine (JTELM)

In response to the shortcoming that ELM has no ability of knowledge transfer, we propose a novel transfer ELM, abbreviated as JTELM, for handling unsupervised transfer learning tasks in which no labeled target samples appear. In unsupervised transfer learning, the source domain $D_s = \{(x_i^{s}, y_i^{s})\}_{i=1}^{n_s}$ and the target samples $X_t = \{x_j^{t}\}_{j=1}^{n_t}$ are given, but the target labels are unavailable, so we expect that a JTELM learned from $D_s$ and $X_t$ can precisely predict the samples in the target domain.

3.1. Extreme Learning Machine with CDMA

To equip ELM with the ability of knowledge transfer, we first use the random hidden-layer mapping $h(\cdot)$ to map the source and target samples into the hidden-layer outputs $H_s$ and $H_t$, and then construct ELM with CDMA by introducing CDMA into the loss function of ELM. CDMA minimizes the distribution difference between the source- and target-domain data in the output layer. We then get the loss function of ELM with CDMA as follows:

In equation (10), the first two terms are the loss of ELM, and the third term is the loss of CDMA in the output layer; $\lambda$ is a tradeoff parameter between the two losses. $\hat{y}_j^{t}$ is the pseudo label of the target sample $x_j^{t}$ obtained in the label refinement process, $H_s^{c}$ ($H_t^{c}$) represents the samples of category $c$ in $H_s$ ($H_t$), $\bar{u}_s$ and $\bar{u}_t$ are the mean vectors of the outputs of $H_s$ and $H_t$, respectively, and $\bar{u}_t^{c}$ ($\bar{u}_s^{c}$) is the mean vector of the target (source) domain outputs with category $c$.

By solving equation (10) according to [2], we can obtain the output weight $\beta_1$ of one ELM with knowledge transfer ability as follows:
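Because equation (11) is not reproduced above, the sketch below gives one plausible reconstruction of this step: the marginal CDMA term on the output layer is expressed as a quadratic penalty and folded into a ridge-style solve analogous to equation (4). The matrix A, the variable names, and the resulting formula are assumptions, not the authors' exact derivation.

```python
import numpy as np

def elm_cdma_train(Hs, Ts, Ht, C=1.0, lam=0.1):
    """Closed-form output weight for an ELM with a marginal-CDMA penalty on its outputs.

    Assumed objective (a reconstruction, not necessarily equation (11)):
        min_b 0.5*||b||^2 + 0.5*C*||Ts - Hs @ b||^2 + 0.5*lam*||A @ H @ b||^2
    where ||A @ H @ b||^2 equals the marginal CDMA measure on the output layer.
    """
    ns, nt = Hs.shape[0], Ht.shape[0]
    L = Hs.shape[1]
    H = np.vstack([Hs, Ht])
    # Each row of A subtracts the other domain's mean: source row i yields
    # h_i^s @ b - mean(Ht @ b); target row j yields h_j^t @ b - mean(Hs @ b).
    A = np.block([
        [np.eye(ns),               -np.ones((ns, nt)) / nt],
        [-np.ones((nt, ns)) / ns,   np.eye(nt)],
    ])
    G = H.T @ A.T @ A @ H                       # CDMA penalty as a quadratic form in b
    beta1 = np.linalg.solve(np.eye(L) + C * Hs.T @ Hs + lam * G,
                            C * Hs.T @ Ts)
    return beta1
```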

3.2. Extreme Learning Machine with Output Weight Alignment and Approximation

Suppose that there is an output weight in the target domain; then we can construct a loss function as follows (equation (12)), where the first term denotes the classification error in the source domain, the second term denotes the output weight approximation that forces the output weights of the two domains to be close to each other, facilitating knowledge transfer, and the remaining coefficients are the balance parameters.

Next, we apply SA to align the output layer of the source ELM to the target one. First, we obtain a transformation matrix $M$ as in equation (9), set the aligned source output weight accordingly, replace the original weight with it, and substitute it into equation (12) to get equation (13).

At this point, the classification error term changes into the source classification error under output weight alignment. We substitute it into equation (13) and get equation (14).

Consequently, equation (14) can be rewritten as equation (15).

By setting two auxiliary variables, equation (15) can be simplified as equation (16).

Letting the derivative of equation (16) with respect to the output weight be zero, we obtain the closed-form solution in equation (17).
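Equations (12)–(17) likewise lead to a closed-form weight. The sketch below is a hypothetical reconstruction in which the alignment matrix M of equation (9) acts between the source hidden outputs and the target output weight, and the approximation term pulls that weight toward the β1 learned in Section 3.1; the objective in the docstring is an assumption, not the paper's exact equation (16).

```python
import numpy as np

def elm_owa_wa_train(Hs, Ts, M, beta1, C=1.0, gamma=0.1):
    """Closed-form target output weight with output weight alignment + approximation.

    Assumed objective (a hypothesis standing in for equations (12)-(17)):
        min_b 0.5*||b||^2 + 0.5*C*||Ts - Hs @ M @ b||^2 + 0.5*gamma*||b - beta1||^2
    where M is the alignment matrix of equation (9) and beta1 comes from Section 3.1.
    """
    HsM = Hs @ M                                   # source hidden outputs after alignment
    L = HsM.shape[1]
    lhs = (1.0 + gamma) * np.eye(L) + C * HsM.T @ HsM
    rhs = C * HsM.T @ Ts + gamma * beta1
    return np.linalg.solve(lhs, rhs)               # beta2, the target-domain output weight
```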

3.3. Joint Decision

Up to now, we have learned two transfer ELMs with output weights $\beta_1$ and $\beta_2$, and the target samples are predicted jointly by both of them according to equation (5). We summarize JTELM in Algorithm 1.

Input: Source dataset $D_s$ and target dataset $X_t$, number of hidden-layer nodes $L$, and the tradeoff parameters.
Output: Predicted result $\hat{Y}_t$.
Step 1: Use $D_s$ and $X_t$ to calculate $\beta_1$ according to equation (11).
Step 2: Solve the output weight $\beta_2$ according to equation (17).
Step 3: Use $\beta_1$ and $\beta_2$ to predict $X_t$ and get its labels $\hat{Y}_t$.
Step 4: Repeat Steps 1–3 until no change on $\hat{Y}_t$.
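Since the combination rule of the joint decision is not written out explicitly above, the sketch below assumes a simple equal-weight sum of the two models' output scores before the arg-max, and outlines the refinement loop of Algorithm 1 using the hypothetical helpers from the previous sketches.

```python
import numpy as np

def jtelm_predict(Ht, beta1, beta2):
    """Joint decision: fuse the output scores of the two learned transfer ELMs."""
    scores = Ht @ beta1 + Ht @ beta2       # assumed equal-weight score fusion
    return np.argmax(scores, axis=1)

# Sketch of the outer loop of Algorithm 1 (pseudo-label refinement), reusing the
# hypothetical helpers elm_cdma_train and elm_owa_wa_train defined earlier:
#   y_old = None
#   while True:
#       beta1 = elm_cdma_train(Hs, Ts, Ht, C, lam)            # Step 1, equation (11)
#       beta2 = elm_owa_wa_train(Hs, Ts, M, beta1, C, gamma)  # Step 2, equation (17)
#       y_new = jtelm_predict(Ht, beta1, beta2)               # Step 3
#       (a conditional-CDMA variant would also pass y_new back into Step 1)
#       if y_old is not None and np.array_equal(y_new, y_old):
#           break                                             # Step 4: labels stable
#       y_old = y_new
```
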
3.4. Discussion

Inspired by TELM-OWA [33], we put forward JTELM to address the problem of unsupervised transfer learning. It has the following characteristics:
(1) Similar to TELM-OWA, output weight alignment (equation (8)) and weight approximation are used to learn a transfer ELM parameter $\beta_2$, but JTELM is an unsupervised TL method in which no labeled samples exist in the target domain. Therefore, JTELM faces a higher degree of difficulty and challenge.
(2) The authors in [41] have proved that CDMA is a more efficient distribution discrepancy metric than MMD. We apply it to ELM to add the ability of transferring knowledge from the source domain to the target domain. Thus, $\beta_1$ in equation (11) is a parameter of the model shared between the domains.
(3) JTELM utilizes $\beta_1$ and $\beta_2$ to jointly make decisions for test samples, which not only unifies statistical adaptation and parameter transformation into one learning framework to improve knowledge transfer but also enhances the robustness of our approach, similar to ensemble learning.

4. Experiment and Analysis

In this section, we demonstrate the validity of our JTELM by performing experiments on image and text datasets commonly used in transfer learning for the classification task. All experiments are run on a PC with 8 GB memory, the Windows 10 operating system, and MATLAB 2017b. Every experiment is run 20 times, and the average value is recorded. We evaluate all algorithms in the experiments with the accuracy metric, as in [21].

4.1. Datasets Description

Office31 + Caltech256 (shown in Figure 2): These datasets were first published in the year referred to in [43]. The benchmark contains two domains, namely, Office31 and Caltech256. Office31 consists of 4,652 images in 31 categories, collected from 3 subdomains, that is, Amazon (A), DSLR (D), and Webcam (W). Caltech (C) is also an object image dataset, consisting of 30,607 images from 256 categories.

During the experiment, we select 1,410 images with 10 categories from Office31 and 1,123 images with 10 categories from Caltech. SURF features with 800 dimensions are extracted from every image. Any two subdomains among A, W, D, and C are chosen as the source and target domain datasets, and 12 cross-domain tasks are built, i.e., C⟶A, C⟶W, C⟶D, …, and D⟶W (shown in Table 1).

USPS + MNIST (shown in Figure 3): USPS and MNIST are two image datasets depicting the digits 0 to 9, so they share 10 categories but have different distributions. USPS consists of 9,298 images with 16 × 16 pixels, and MNIST has 70,000 images with 28 × 28 pixels. During the experiment, 1,800 pictures from USPS and 2,000 pictures from MNIST are selected as the source domain and the target domain (shown in Table 1). Every image is converted into 16 × 16 pixels, and two cross-domain tasks, i.e., USPS vs. MNIST and MNIST vs. USPS, are constructed for transfer learning.

MSRC + VOC2007 (shown in Figure 4): MSRC is an object image dataset consisting of 4,323 images from 18 categories. VOC2007 is an image dataset of photos from Flickr, consisting of 5,011 images from 20 categories. The two datasets have similar but different distributions, as can be seen in Figure 4. In this experiment, we collect samples from the 6 categories shared by the two datasets, including aircraft, birds, cows, family cars, sheep, and bicycles. Then, we construct two transfer learning tasks, MSRC vs. VOC and VOC vs. MSRC, in which 1,269 images are selected from MSRC and 1,530 images are selected from VOC2007. In addition, we rescale all images to 256 gray pixels in length and extract 240-dimensional features as a new representation (shown in Table 1).

Reuters-21578: Reuters-21578 is a text dataset commonly used for text data mining and analysis. It has 21,577 news documents organized into 5 top categories, namely, “exchanges,” “orgs,” “people,” “places,” and “topics.” In this experiment, we select the three largest categories, “orgs,” “people,” and “place,” and construct 6 transfer learning tasks, i.e., orgs vs. people, people vs. orgs, orgs vs. place, place vs. orgs, people vs. place, and place vs. people, as shown in Table 1.

4.2. Experimental Settings

We choose the following classifiers to compare with JTELM:
1NN: one-nearest-neighbor classifier.
SVM: support vector machine with a linear kernel; its penalty parameter is selected from a predefined range.
ELM: standard extreme learning machine.
SSELM: semisupervised ELM with graph regularization [24].
TCA1(2): TCA [20] + NN (SVM).
JDA1(2): JDA [21] + NN (SVM). We set the dimension of the feature subspace in TCA and JDA, and the sparsity constraint parameter of the projection matrix is selected from a predefined range.
DAELM_S, DAELM_T [29]: domain adaptation ELMs.
AELM [44]: ELM with feature augmentation (AELM); its results in this section are cited from [44].
ARRLS [45]: a general transfer learning framework; we set its parameters according to [45].
TELM-OWA [33]: supervised transfer ELM; we set its parameters as referred to in [33].
CdELM-C [38]: unsupervised transfer ELM using MMD; we cite its results reported in [38] for comparison.

In addition, we set the penalty parameter in ELM, SSELM, DAELM_S, DAELM_T, and TELM-OWA to an appropriate value. In JTELM, the tradeoff parameters and the number of hidden-layer nodes are set separately for the Office + Caltech, USPS + MNIST, Reuters-21578, and MSRC + VOC2007 datasets.

To evaluate DAELM_S, DAELM_T, and TELM-OWA in the experiments for unsupervised transfer learning, we select a few labeled target samples to train the models: 0.5% of the target samples on the USPS + MNIST, MSRC + VOC2007, and Reuters-21578 datasets and 1% on the Office + Caltech dataset.

4.3. Results and Analysis

To investigate the performance of JTELM, we carry out classification tasks on image and text datasets, including the Office + Caltech, USPS + MNIST, MSRC + VOC2007, and Reuters-21578 datasets, and the results are reported in Tables 2 and 3. From the results, we make the following observations:
(1) JTELM has the highest total average accuracy of all algorithms in Tables 2 and 3. It gains improvements of 10.54% and 8.73% over the baseline ELM in Tables 2 and 3, respectively, indicating that our method has a better ability of knowledge transfer with the help of CDMA, output weight alignment, and weight approximation. It enriches ELM in both theory and application.
(2) TELM-OWA and DAELM, as supervised transfer learning mechanisms that require part of the labeled target samples, are not ideal under unsupervised learning. TCA, JDA, ARRLS, and CdELM-C apply MMD to reduce the distribution discrepancy between the two domains and gain good results. SSELM utilizes graph regularization to exploit the information in unlabeled target samples, and it performs well.
(3) TCA1(2) and JDA1(2) implement the classification task by combining the transfer feature extraction methods (TCA and JDA) with baseline classifiers; therefore, they outperform 1NN and SVM. ELM performs slightly better than 1NN and SVM owing to its good generalization ability.

In Table 4, we record the running time of JTELM and some compared algorithms. It can be seen that (1) ELM has the least running time among all methods since it does not tune the input weights and biases. (2) DAELM_S and DAELM_T cost slightly more time than ELM because part of the target samples participates in training the model. TELM-OWA consumes more time than ELM, DAELM_S, and DAELM_T because it solves two weights, similar to JTELM, and JTELM needs even more time to solve the two weights and refine the target pseudo labels. (3) JDA has the longest running time because it requires the construction of the MMD matrix, feature extraction, and label refinement. TCA1(2) and JDA1(2) cost more time than 1NN and SVM because they extract cross-domain features, and JDA spends more time than TCA on label refinement. (4) Due to the construction of the Laplacian matrix, SSELM costs more time than ELM.

4.4. Ablation Study

In JTELM, three mechanisms, namely, CDMA, output weight alignment (OWA), and weight approximation (WA), are applied to realize knowledge transfer. Since joint decision making must be executed under output weight alignment, we regard output weight alignment and joint decision making together as OWA; its influence is mainly adjusted by the corresponding tradeoff parameter. The impact of different combinations of the three mechanisms on JTELM's accuracy is shown in Table 5, which shows that (1) both CDMA and OWA can independently enhance ELM's knowledge transfer capability; (2) combining CDMA and OWA for knowledge transfer is better than using either alone; and (3) WA is also beneficial to knowledge transfer. Moreover, we compare ELM, ELM-CDMA, and JTELM in time consumption. ELM-CDMA is a special case of JTELM without output weight alignment and weight approximation, and the results in Table 6 show that CDMA, OWA, and WA help ELM perform better in transfer learning, but they require more time.

4.5. Parameter Analysis

We investigate the sensitivity of JTELM to its tradeoff parameters, to the number of hidden-layer nodes, and to its convergence, and we perform experiments on the orgs vs. people, MSRC vs. VOC, MNIST vs. USPS, and A vs. D datasets. The results are shown in Figures 5(a) to 5(f), and we make the following observations: (1) As the four tradeoff parameters increase, the accuracy of JTELM first rises and then falls on the 4 datasets, as shown in Figures 5(a) to 5(d). We can see that CDMA, the source classification error with OWA, and the output weight approximation, when adjusted to the appropriate range, can improve the accuracy and knowledge transfer ability of ELM in the transfer learning setting. (2) As shown in Figure 5(e), the accuracy first increases and then slightly decreases on the 4 datasets as the number of hidden-layer nodes grows. When this number increases, the nonlinear approximation ability of the network improves. However, for some datasets, too many hidden nodes may enlarge the distribution discrepancy of the hidden-layer outputs of the two domains, leading to the model's poor performance. (3) We also observe how the accuracy varies with the iteration number in Figure 5(f). It shows that the accuracy of JTELM gradually becomes stable and finally converges after about 10 iterations.

5. Conclusion

In this paper, we propose JTELM to address the problem that ELM degrades in transfer learning scenarios. It first applies CDMA to ELM, and one transfer ELM model is learned. Then, similar to TELM-OWA, it uses output weight alignment and output weight approximation to learn the other transfer ELM on the source domain. Finally, it adopts the two learned transfer ELMs to jointly predict the samples from the target domain. Extensive experiments have been performed on open image and text datasets, and the results show that JTELM achieves higher accuracy and stronger knowledge transfer ability than several state-of-the-art classifiers.

Data Availability

The data used to support the findings of this study are found in https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation for Distinguished Young Scholars of China under Grant No. 11905244, in part by the Key Scientific Research Projects of Universities in Henan Province under Grant No. 22A120005, in part by the National Aviation Fund Projects under Grant No. 201701420002, and in part by the Henan Province Key Scientific and Technological Projects under Grant Nos. 222102210095 and 212102210153.