Abstract

Extreme Learning Machine (ELM) is widely used in various fields because of its fast training and high accuracy. However, it does not work well for Domain Adaptation (DA), in which there are abundant annotated data in the auxiliary (source) domain but few or even no annotated data in the target domain. In this paper, we propose a new variant of ELM called Discriminative Extreme Learning Machine with Cross-Domain Mean Approximation (DELM-CDMA) for unsupervised domain adaptation. It introduces Cross-Domain Mean Approximation (CDMA) into the hidden layer of ELM to reduce the distribution discrepancy between domains and eliminate domain bias, which makes it possible to train a high-accuracy ELM on annotated data from auxiliary domains for target tasks. Linear Discriminant Analysis (LDA) is also adopted to improve the discrimination of the learned model and obtain higher accuracy. Moreover, we further provide a Discriminative Kernel Extreme Learning Machine with Cross-Domain Mean Approximation (DKELM-CDMA) as the kernelized extension of DELM-CDMA. Experiments are performed to investigate the proposed approach, and the results show that DELM-CDMA and DKELM-CDMA effectively extend ELM to domain adaptation and outperform ELM and many other domain adaptation approaches.

1. Introduction

Extracting valuable information from the massive data generated by mobile devices, the Internet, and industrial sensors is a major challenge, and classifiers based on machine learning play an important role in data and information processing systems. Extreme Learning Machine (ELM) [1] attracts attention due to its faster learning and higher accuracy compared with k Nearest Neighbor (kNN) [2], Back-Propagation (BP) [3], Naive Bayes (NB) [4], Support Vector Machine (SVM) [5], and Decision Tree (DT) [6], and has been widely applied in many fields including image classification [7], traffic systems [8], COVID-19 detection [9], fault diagnosis [10, 11], hyperspectral remote sensing images [12, 13], industrial sensors [14], facial expression recognition [15], and brain-computer interfaces (BCI) [16, 17].

ELM owes this attention to its fast learning and strong generalization capability. It randomly selects the input weights and the biases of the hidden-layer neurons without using the training data, and obtains the optimal output weights by minimizing the training error and the norm of the output weights simultaneously [18]. Since the hidden-layer parameters are randomly initialized and the output weights are solved as a least-squares solution, the training time of the model is greatly reduced [1]. Recently, many variants of ELM have emerged to improve its performance; they can be divided into three categories: supervised, semi-supervised, and unsupervised ELMs. Supervised ELMs need numerous labeled data to ensure high performance, such as Kernel Extreme Learning Machine (KELM) [19], Weighted Extreme Learning Machine (WELM) [20], Twin Extreme Learning Machine (TELM) [21], and Adaptive Regularized Extreme Learning Machine (A-RELM) [22]. Semi-supervised ELMs usually require unlabeled data together with labeled data to train well, including Laplacian Twin Extreme Learning Machine (Lap-TELM) [23], Semi-Supervised Extreme Learning Machine (SS-ELM) [24], Robust Semi-Supervised Extreme Learning Machine (RSS-ELM) [25], and Adaptive Safe Semi-Supervised Extreme Learning Machine (AdSafe-SSELM) [26]. In cases where no labeled data are available, unsupervised ELM algorithms have been proposed for clustering, dimension reduction, or data representation, such as Unsupervised Extreme Learning Machine (USELM) [24], Extreme Learning Machine as an Auto-Encoder (ELM-AE) [27], Enhanced Unsupervised Extreme Learning Machine (EUELM) [28], and Unsupervised Feature Selection based Extreme Learning Machine (UFSELM) [29]. Moreover, as deep learning has been successful in many fields, deep ELMs have also been developed to extract more abstract and expressive features, such as Kernel-based Multi-Layer Extreme Learning Machine (ML-KELM) [30], Hierarchical ELM (H-ELM) [31], Deep and Stable Extreme Learning Machine (DS-ELM) [32], and Deep Residual Compensation Extreme Learning Machine (DRC-ELM) [33]. Although the above algorithms expand the application of ELM in various scenarios, they do not perform well when the training data are not related to the testing data.

In practice, it is difficult to collect a large amount of training data whose distribution is consistent with the test data because of condition changes, data noise, view variations, and so on. To address this issue, Domain Adaptation (DA) [34–36], an important transfer learning technology, can remedy this shortcoming of ELM: abundant labeled data from another domain (the source domain) are applied to help the current domain (the target domain), despite the distribution inconsistency between domains. Consequently, L. Zhang and D. Zhang [37] provided two ELMs with domain adaptation, Source Domain Adaptation Extreme Learning Machine (DAELM-S) and Target Domain Adaptation Extreme Learning Machine (DAELM-T), which improve the generalization ability of ELM across multiple domains. Similar to DAELM, Zhao and Chen [38] developed One-Stage-Transfer-Learning ELM (OSTL-ELM) and Two-Stage-Transfer-Learning ELM (TSTL-ELM) to handle the problem of insufficient labeled data from target domains in aero-engine fault diagnosis tasks. With the help of subspace alignment and weight approximation, Zang et al. [39] put forward Transfer Extreme Learning Machine with Output Weight Alignment (TELM-OWA) for domain adaptation. TL-ELM (Transfer Learning-based ELM) was proposed by Li et al. [40], in which the output weights from the two domains are forced to be close to each other for knowledge transfer. The methods mentioned above all require a few labeled data in the target domain. However, annotating data is costly, laborious, and time-consuming. Consequently, some unsupervised models have appeared for unsupervised domain adaptation. Chen et al. [41] presented a transfer ELM in which output weight alignment is applied to reduce domain bias and an L2,1-norm is imposed on the output weights to enhance feature selection. To minimize the distribution discrepancy between source and target domains, Li et al. [42], Chen et al. [43], and Zang et al. [44] utilized Maximum Mean Discrepancy (MMD) [45] to promote knowledge transfer in their respective models. Because target labels are unavailable, the performance of unsupervised models is usually lower than that of supervised models, but labeled target samples are hard to collect.

In this paper, we propose a novel ELM called Discriminative Extreme Learning Machine with Cross-Domain Mean Approximation (DELM-CDMA) for unsupervised domain adaptation. It introduces Cross-Domain Mean Approximation (CDMA) [36] into the objective function of ELM to jointly adapt the marginal and conditional distributions discrepancy between the output of hidden layers from source and target samples. CDMA could enhance the ability of transferring knowledge across domains. Moreover, to improve the discrimination, Linear Discriminant Analysis (LDA) [46] is also added into the objective function. It separates the samples with different categories and clusters the samples with the same category, which can improve the accuracy of ELM. Finally, we solve the designed objective function and obtain an optimal ELM model for domain adaptation. Moreover, we further present a Discriminative Kernel Extreme Learning Machine with Cross-Domain Mean Approximation (DKELM-CDMA) which not only kernelize DELM-CDMA suitable for nonlinear data but also eliminate the sensitivity of its accuracy to parameter initialization. DELM-CDMA is illustrated in Figure 1. Extensive experiments are carried out to verify the effectiveness and superiority of DELM-CDMA and DKELM-CDMA for unsupervised domain adaptation.

Contributions of this paper are as follows:
(1) Joint distribution adaptation based on the CDMA measure is introduced into ELM to narrow the distribution differences between domains. We also apply LDA to boost the class separability of DELM-CDMA and thus improve its accuracy.
(2) DKELM-CDMA is proposed as a kernel version of DELM-CDMA to enhance robustness to the initialization of the hidden-layer parameters and the adaptability to nonlinear data.
(3) We efficiently solve a least-squares problem for the objective function and obtain an optimal ELM for unsupervised domain adaptation. Moreover, classification experiments on object recognition and text data sets are performed to investigate our approaches, and the results show that DELM-CDMA and DKELM-CDMA can efficiently transfer knowledge across domains.

The rest of this paper is organized as follows: We briefly review domain adaptation, CDMA, and ELM in Section 2. Then, we present DELM-CDMA and DKELM-CDMA in Section 3. Next, experiments are performed in Section 4 to verify the validity of DELM-CDMA. Finally, Section 5 concludes this paper.

2. Related Work

Since our method mainly extends ELM to handle domain adaptation, we briefly introduce domain adaptation, CDMA, and ELM in this section.

2.1. Domain Adaptation

Transfer learning [34, 36, 47, 48], as one of the important research branches of machine learning, helps target tasks learn a high-quality model with knowledge from source domains that have rich labeled data but a different distribution from the target domain. Domain adaptation [36] is a hot topic in transfer learning and aims to adjust the distributions between domains in order to eliminate domain shift. In recent years, a large number of domain adaptation approaches have emerged; according to what is adapted, they can be roughly divided into five categories: instance adaptation, feature adaptation, parameter adaptation, deep network adaptation, and adversarial domain adaptation.

(1) Instance Adaptation. Most instance-adaptation methods attempt to find a strategy in which each instance is assigned a weight to balance the distribution difference between domains. Li et al. [49] proposed Prediction Reweighting for Domain Adaptation (PRDA), which first reweighs a classifier trained on the source domain and then adopts manifold regularization to diffuse labels from high-confidence samples to low-confidence ones. TrAdaBoost [50], a classic domain adaptation method, weighs each sample in the source and target domains by a dynamic mechanism to minimize the distribution discrepancy between domains. Moreover, bad and good samples from the source domain are distinguished by their weights, which promotes knowledge transfer across domains.

(2) Feature Adaptation. These methods usually find a shared feature subspace to reduce the domain distribution bias and transfer knowledge across domains. Technologies such as K-L divergence [51], Bregman divergence [52], MMD [53] (a toy MMD computation is sketched after this list), and subspace alignment [54–56] are used to minimize the distribution discrepancy and seek optimal common features between domains. Transfer Component Analysis (TCA) [57] and Joint Distribution Adaptation (JDA) [53] solve for a projection matrix that minimizes the MMD between source and target domains and extract shared features so that an effective target learner can be trained on the source domain.

(3) Parameter Adaptation. These approaches do not transfer shared knowledge at the instance or feature level but find optimal shared learner parameters across domains. DAELM-S(T), TELM-OWA, TL-ELM, and the other methods mentioned above belong to this category.

(4) Deep Network Adaptation. This method combines traditional domain adaptation technology with deep learning models. It not only uses the former to reduce the distribution differences between domains but also applies the deep network structure of the latter to extract high-level representative and semantic features across domains, enabling more efficient knowledge transfer. Domain Adaptation Network (DAN) [58], Joint Adaptation Network (JAN) [59], and Residual Transfer Network (RTN) [60] all utilize MMD to reduce the distribution differences between domains and the data set bias.

(5) Adversarial Domain Adaptation. Based on the generative adversarial network, this method produces target samples using domain-invariant feature generators and then applies a domain discriminator trained on source samples to find the weaknesses of the generated samples and adapt the generator, which reduces the domain gap. As an unsupervised domain adaptation method based on a deep feed-forward architecture, the Domain Adversarial Neural Network (DANN) [61] introduces adversarial learning for domain adaptation. It simultaneously learns classifiers, feature extractors, and domain discriminators and obtains domain-invariant feature representations by minimizing the classifier error and maximizing the discriminator error. The Collaborative and Adversarial Network (CAN) [62] first uses collaborative learning to identify domain-informative features that reveal whether a sample belongs to the source or target domain, and then utilizes adversarial learning to learn hard-to-distinguish, domain-uninformative features. Finally, domain-invariant features can be extracted.
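Several of the feature-adaptation and deep-adaptation methods above rely on MMD to compare domains. As a point of reference only (not taken from any cited implementation), the empirical MMD with a linear kernel reduces to the squared distance between the domain means; a minimal NumPy sketch:

import numpy as np

def linear_mmd(Xs, Xt):
    # Empirical MMD with a linear kernel: squared distance between the domain means.
    mean_s = Xs.mean(axis=0)
    mean_t = Xt.mean(axis=0)
    return float(np.sum((mean_s - mean_t) ** 2))

# toy usage: two domains whose means are shifted by 0.5 in every dimension
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 50))
Xt = rng.normal(0.5, 1.0, size=(150, 50))
print(linear_mmd(Xs, Xt))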

Among the methods reviewed in Section 2.1, instance adaptation transfers knowledge efficiently, but it usually requires high-quality samples to ensure good performance. Feature adaptation offers more flexibility and wider applicability, but finding shared features with good generalization is a challenge. Because it does not directly exploit sample or feature information, parameter adaptation usually transfers knowledge less efficiently than the former two, but it makes more diversified use of domain knowledge. Deep network adaptation and adversarial domain adaptation perform well owing to higher-level feature extraction; strictly speaking, they belong to feature adaptation because obtaining good features is their primary task. However, deep network adaptation usually requires plenty of labeled samples, more training time, and more memory, while simultaneously finding efficient feature extractors and discriminators remains a challenge for adversarial domain adaptation. In this paper, DELM-CDMA belongs to the parameter adaptation category, and DKELM-CDMA is a shallow feature adaptation method.

2.2. Cross-Domain Mean Approximation (CDMA)

To address the problem that MMD ignores individual sample differences and thus insufficiently mines the local information of the data, Zang et al. [36] designed Cross-Domain Mean Approximation (CDMA), as shown in Figure 2. Given a source domain $D_s=\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with samples $x_i^s$ and labels $y_i^s$, and a target domain $D_t=\{(x_j^t, y_j^t)\}_{j=1}^{n_t}$ with samples $x_j^t$ and labels $y_j^t$, the distribution discrepancy between the source and target domains based on CDMA can be defined as

$$d_{CDMA}(D_s, D_t)=\sum_{i=1}^{n_s}\left\|x_i^s-\bar{u}_t\right\|^2+\sum_{j=1}^{n_t}\left\|x_j^t-\bar{u}_s\right\|^2, \qquad (1)$$

where $\bar{u}_t$ ($\bar{u}_s$) is the mean vector of the target (source) domain samples. Moreover, to adapt the marginal and conditional distributions together, the label information of each sample is added into CDMA, and (1) is modified as

$$d_{CDMA}^{c}(D_s, D_t)=\sum_{c=1}^{C}\left(\sum_{x_i^s\in D_s^c}\left\|x_i^s-\bar{u}_t^c\right\|^2+\sum_{x_j^t\in D_t^c}\left\|x_j^t-\bar{u}_s^c\right\|^2\right), \qquad (2)$$

where $\bar{u}_t^c$ ($\bar{u}_s^c$) is the mean vector of the target (source) domain samples of category $c$.
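To make the CDMA measure concrete, the following NumPy sketch implements equations (1) and (2) as we read them: every source sample is compared with the target mean and every target sample with the source mean, with a class-wise variant when (pseudo) labels are available. The function names are ours, not from [36].

import numpy as np

def cdma_marginal(Xs, Xt):
    # Equation (1): distance of each source sample to the target mean
    # plus distance of each target sample to the source mean.
    mu_s, mu_t = Xs.mean(axis=0), Xt.mean(axis=0)
    return np.sum((Xs - mu_t) ** 2) + np.sum((Xt - mu_s) ** 2)

def cdma_conditional(Xs, ys, Xt, yt):
    # Equation (2): the same comparison restricted to each shared category.
    dist = 0.0
    for c in np.unique(ys):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt == c]
        if len(Xs_c) == 0 or len(Xt_c) == 0:
            continue  # skip categories missing from one domain
        dist += np.sum((Xs_c - Xt_c.mean(axis=0)) ** 2)
        dist += np.sum((Xt_c - Xs_c.mean(axis=0)) ** 2)
    return dist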


CDMA is an efficient evaluation metric of the distribution discrepancy between domains, and it is adopted in our DELM-CDMA and DKELM-CDMA to reduce the marginal and conditional distribution discrepancies. Different from Joint Distribution Adaptation based on Cross-Domain Mean Approximation (JDA-CDMA) [36], which uses CDMA to construct the objective function and incorporates subspace learning to extract shared features between the source and target domains, DELM-CDMA applies CDMA to the hidden layer of ELM to eliminate domain bias and enhance the generalization of the learned ELM parameters, while DKELM-CDMA applies CDMA together with a reconstruction error for shared feature extraction within the KELM framework.

2.3. Extreme Learning Machine (ELM)

ELM is a single-layer neural network with fast learning and high accuracy that does not tune its input weights and biases; it efficiently obtains the optimal output weights by solving a least-squares problem. Given a data set $X=\{x_i\}_{i=1}^{n}$ with labels $Y=\{y_i\}_{i=1}^{n}$, a classic ELM network is given by

$$f(x)=\sum_{i=1}^{L}\beta_i\, g(w_i\cdot x+b_i)=h(x)\beta, \qquad (3)$$

where $x$ is an input sample, $w_i$ and $b_i$ are the input-layer weights and bias, respectively, $g(\cdot)$ represents the nonlinear activation function, $L$ represents the number of nodes in the hidden layer, and $\beta$ represents the hidden-layer output weight. We can obtain an optimal $\beta$ by solving the objective function

$$\min_{\beta,\,\xi}\ \frac{1}{2}\|\beta\|^2+\frac{C}{2}\sum_{i=1}^{n}\|\xi_i\|^2\quad \text{s.t.}\ h(x_i)\beta=y_i-\xi_i,\ i=1,\dots,n, \qquad (4)$$

where $\|\beta\|^2$ is applied to avoid model overfitting and $\xi_i$ is the error vector corresponding to the empirical risk of sample $x_i$, with tradeoff parameter $C$. Removing the constraint, equation (4) changes into

$$\min_{\beta}\ \frac{1}{2}\|\beta\|^2+\frac{C}{2}\left\|Y-H\beta\right\|^2, \qquad (5)$$

where $H=[h(x_1);\dots;h(x_n)]$ and $Y=[y_1;\dots;y_n]$.

The objective function can be considered a ridge regression or a regularized least-squares problem. Setting its derivative with respect to $\beta$ to zero, the optimal solution is obtained according to [18] as

$$\beta^{*}=\left(\frac{I}{C}+H^{T}H\right)^{-1}H^{T}Y. \qquad (6)$$

Then, the prediction result of a test sample $x_{te}$ can be determined by

$$y_{te}=\arg\max_{k}\ \big(h(x_{te})\beta^{*}\big)_{k}, \qquad (7)$$

where $h(x_{te})$ is the hidden-layer output of $x_{te}$.
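As a quick illustration of equations (3)–(7), the following minimal NumPy sketch trains and applies a plain ELM; the sigmoid activation, the variable names, and the one-hot label matrix Y are our choices, not prescribed by the paper.

import numpy as np

def train_elm(X, Y, n_hidden=500, C=1.0, seed=0):
    # Random, untrained hidden layer (eq. (3)) and ridge solution for beta (eq. (6)).
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # sigmoid activation g
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    # Predicted label = arg max of the output-layer response (eq. (7)).
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)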

3. Proposed Methodology

In recent decades, ELM has outperformed traditional machine learning classifiers in speed and accuracy because of its generalization and global approximation abilities. Thus, many variants of ELM have appeared for classification and regression in machine learning and pattern recognition to improve it in both theory and application. Nevertheless, ELM fails in the case of insufficient labeled samples [37]. In response to this issue, we develop the Discriminative Extreme Learning Machine with Cross-Domain Mean Approximation (DELM-CDMA) for unsupervised domain adaptation, which uses CDMA to narrow the marginal and conditional distributions between domains and adopts LDA to enhance category separability. We introduce it in detail in Section 3.1.

3.1. Objective Function of DELM-CDMA

In unsupervised domain adaptation, we are given two different but related data sets: a source domain $D_s=\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with labeled samples and a target domain $D_t=\{x_j^t\}_{j=1}^{n_t}$ with unlabeled samples. We hope that DELM-CDMA trained on $D_s$ performs well on $D_t$.

First, we transform $X_s$ and $X_t$ into the hidden-layer outputs $H_s$ and $H_t$ using the activation function $g(\cdot)$, and then construct the objective function of DELM-CDMA as

$$\min_{\beta}\ L_{ELM}(\beta)+\lambda\, L_{CDMA}(\beta)+\mu\, L_{LDA}(\beta). \qquad (8)$$

From equation (8), we can see that there are three parts in the objective function of DELM-CDMA. $L_{ELM}$ represents the loss function of regularized ELM, $L_{CDMA}$ denotes the marginal and conditional CDMA measure between the source and target domains, and $L_{LDA}$ is the LDA term. $\lambda$ and $\mu$ are the tradeoff parameters of $L_{CDMA}$ and $L_{LDA}$, respectively.

3.1.1. Loss Function of Regularized ELM

In this paper, we carry out unsupervised domain adaptation tasks in which no labeled samples are available in the target domain, so we obtain a regularized ELM on the source domain as

$$L_{ELM}(\beta)=\frac{1}{2}\|\beta\|^2+\frac{C}{2}\left\|Y_s-H_s\beta\right\|^2. \qquad (9)$$

In equation (9), the first term is a regularizer to avoid overfitting, and the second term is the classification error of the samples from the source domain; $C$ is a parameter that balances the two terms.

In the iterative refinement process of the labels, once the pseudo labels $\hat{Y}_t$ of the target samples are obtained, we can add the classification error from the target domain, and equation (9) becomes

$$L_{ELM}(\beta)=\frac{1}{2}\|\beta\|^2+\frac{C}{2}\left\|Y-H\beta\right\|^2, \qquad (10)$$

where $H=[H_s; H_t]$ and $Y=[Y_s; \hat{Y}_t]$.
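The move from equation (9) to equation (10) is just a stacking of the source data with the pseudo-labeled target data; a small sketch under our notation (Hs and Ht are the hidden-layer outputs, Ys a one-hot source label matrix):

import numpy as np

def stack_with_pseudo_labels(Hs, Ys, Ht, yt_pseudo, n_classes):
    # Build H = [Hs; Ht] and Y = [Ys; Yt_pseudo] for equation (10).
    Yt = np.eye(n_classes)[yt_pseudo]   # one-hot encode the target pseudo labels
    H = np.vstack([Hs, Ht])
    Y = np.vstack([Ys, Yt])
    return H, Y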

3.1.2. Cross-Domain Mean Approximation Measure

In unsupervised domain adaptation, since the source and target domains have a distribution discrepancy, the prediction error of a model trained on $D_s$ increases on the target domain. CDMA is an effective strategy for evaluating inter-domain distribution differences. We apply it to measure the distribution discrepancy of the output-layer data of ELM on the source and target domains, and compute it as

$$L_{CDMA}(\beta)=\left(\sum_{i=1}^{n_s}\left\|h(x_i^s)\beta-\bar{o}_t\right\|^2+\sum_{j=1}^{n_t}\left\|h(x_j^t)\beta-\bar{o}_s\right\|^2\right)+\sum_{c=1}^{C}\left(\sum_{x_i^s\in D_s^c}\left\|h(x_i^s)\beta-\bar{o}_t^c\right\|^2+\sum_{x_j^t\in D_t^c}\left\|h(x_j^t)\beta-\bar{o}_s^c\right\|^2\right), \qquad (11)$$

where $\bar{o}_s$ ($\bar{o}_t$) is the mean vector of the source (target) output-layer data, $\bar{o}_s^c$ ($\bar{o}_t^c$) is the corresponding mean of category $c$, and the target samples are assigned to categories by their pseudo labels. In equation (11), the first term measures the marginal distribution discrepancy, and the second term measures the conditional distribution discrepancy.
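Reusing the cdma_marginal and cdma_conditional helpers sketched in Section 2.2, equation (11) as we read it simply evaluates CDMA on the network responses H*beta instead of on the raw features, with the target categories taken from pseudo labels:

def cdma_on_outputs(Hs, Ht, beta, ys, yt_pseudo):
    # CDMA of eq. (11), evaluated on the output-layer data of both domains.
    Os, Ot = Hs @ beta, Ht @ beta
    return cdma_marginal(Os, Ot) + cdma_conditional(Os, ys, Ot, yt_pseudo)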

3.1.3. Linear Discriminant Analysis on Output Layer

To further improve the discrimination of ELM, we add Linear Discriminant Analysis (LDA). LDA forces samples of the same class closer together and samples of different classes farther apart so as to enhance the category separability of ELM in unsupervised domain adaptation. We apply LDA to the output-layer data, for which $S_b$ and $S_w$, respectively, denote the between-class scatter and the within-class scatter, computed as

$$S_b=\sum_{c=1}^{C} n_c\left(\bar{o}^c-\bar{o}\right)\left(\bar{o}^c-\bar{o}\right)^{T}, \qquad (12)$$

$$S_w=\sum_{c=1}^{C}\sum_{o_i\in O^c}\left(o_i-\bar{o}^c\right)\left(o_i-\bar{o}^c\right)^{T}. \qquad (13)$$

In equations (12) and (13), $O^c$ represents the output-layer samples of category $c$ from $H_s\beta$ and $H_t\beta$ (target categories given by their pseudo labels), $n_c$ is their number, $\bar{o}$ is the mean vector of all output-layer samples, and $\bar{o}^c$ is the mean vector of category $c$. Then, we construct the loss of LDA on the output layer as

$$L_{LDA}(\beta)=\operatorname{tr}(S_w)-\operatorname{tr}(S_b). \qquad (14)$$
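A sketch of the scatter matrices of equations (12) and (13) on the output-layer data follows; the trace-based combination at the end is one plausible reading of equation (14), not necessarily the authors' exact form.

import numpy as np

def lda_scatters(O, y):
    # Between-class scatter Sb (eq. (12)) and within-class scatter Sw (eq. (13))
    # of the output-layer data O with (pseudo) labels y.
    mu = O.mean(axis=0)
    d = O.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Oc = O[y == c]
        mu_c = Oc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        Sb += len(Oc) * diff @ diff.T            # spread of the class means
        Sw += (Oc - mu_c).T @ (Oc - mu_c)        # spread inside each class
    return Sb, Sw

# one possible scalar loss for eq. (14): small within-class, large between-class scatter
# loss_lda = np.trace(Sw) - np.trace(Sb)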

3.1.4. Total Objective Function

To improve the knowledge transfer ability and discrimination of ELM, DELM-CDMA joins CDMA and LDA to ELM. Therefore, we bring equations (9) or (10), (11), and (14) into (8), and the objective function of DELM-CDMA is established as

$$\min_{\beta}\ \frac{1}{2}\|\beta\|^2+\frac{C}{2}\left\|Y-H\beta\right\|^2+\lambda\, L_{CDMA}(\beta)+\mu\, L_{LDA}(\beta). \qquad (15)$$

3.2. Model Learning

To solve for the optimal output weight, we set the derivative of equation (15) with respect to $\beta$ to zero, which yields equation (16).

According to [18], equation (16) has a closed-form solution, given by equation (17).

After the optimal $\beta$ is obtained, the test samples are classified by equation (7). The complete classification procedure of DELM-CDMA is summarized in Algorithm 1 (a simplified code sketch of the loop follows the algorithm).

Input: Data sets $D_s$ and $D_t$, tradeoff parameters $C$, $\lambda$, and $\mu$. Output: Output-layer weight $\beta$.
Step 1: Randomly initialize the input weights and biases of the hidden layer with $L$ nodes, and set the tradeoff parameters $C$, $\lambda$, and $\mu$.
Step 2: Transform $X_s$ and $X_t$ into $H_s$ and $H_t$ using $g(\cdot)$, and set the conditional CDMA and LDA terms to zero.
Step 3: Calculate $\beta$ according to equation (17).
Step 4: Use $\beta$ to predict $H_t$ and get its pseudo labels $\hat{Y}_t$.
Step 5: Compute $S_b$ and $S_w$ using $H_s\beta$, $H_t\beta$, and $\hat{Y}_t$ according to equations (12) and (13).
Step 6: Repeat Steps 3–5 until $\beta$ is unchanged.
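As a scaffold for the structure of Algorithm 1, the loop below is runnable but deliberately simplified: solve_beta_simplified drops the CDMA and LDA terms of equation (17) and keeps only the regularized least-squares part, so it shows the pseudo-label refinement cycle rather than the authors' full solver.

import numpy as np

def solve_beta_simplified(H, Y, C=1.0):
    # Stand-in for eq. (17): plain regularized least squares without the CDMA/LDA terms.
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)

def delm_cdma_loop(Hs, Ys, Ht, n_classes, n_iters=10, C=1.0):
    # Steps 2-6 of Algorithm 1: alternate between solving beta and refreshing pseudo labels.
    H, Y = Hs, Ys                                   # first pass uses source data only (Step 2)
    beta, yt_pseudo = None, None
    for _ in range(n_iters):
        beta = solve_beta_simplified(H, Y, C)       # Step 3 (simplified eq. (17))
        yt_pseudo = np.argmax(Ht @ beta, axis=1)    # Step 4: target pseudo labels
        Yt = np.eye(n_classes)[yt_pseudo]           # Step 5: rebuild label-dependent terms
        H, Y = np.vstack([Hs, Ht]), np.vstack([Ys, Yt])
        # (the conditional CDMA and LDA terms of eq. (15) would also be updated here)
    return beta, yt_pseudo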
3.3. Kernelization

Although DELM-CDMA can improve the performance of ELM when labeled samples are scarce, it has two shortcomings: (1) the random initialization of the hidden-layer parameters causes performance instability; (2) it cannot enhance the high-dimensional separability of nonlinear data. Therefore, we present a kernel version of DELM-CDMA, named Discriminative Kernel Extreme Learning Machine with Cross-Domain Mean Approximation (DKELM-CDMA), to address these problems. DKELM-CDMA has two stages: feature extraction and classification.

3.3.1. Feature Extraction

At this stage, we let $K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$, where $K(\cdot,\cdot)$ is a kernel function such as the sigmoid kernel, linear kernel, or radial basis kernel. The objective function of DKELM-CDMA is given by equation (18), where $K_s$ and $K_t$ are the kernel maps of $X_s$ and $X_t$, $K_s^c$ and $K_t^c$ represent the kernel-mapped samples of category $c$ in $X_s$ and $X_t$, $\bar{k}_s$ and $\bar{k}_t$ are the mean vectors of $K_s$ and $K_t$, respectively, and $\bar{k}_t^c$ ($\bar{k}_s^c$) is the mean vector of the target (source) domain samples of category $c$.

The first term of equation (18) is a regularized reconstruction loss, and the second and third terms are the CDMA measure and the LDA term, respectively; $\lambda$ and $\mu$ are the tradeoff parameters of the CDMA and LDA terms.

The first part of the reconstruction loss is a regularization term that prevents model overfitting, and the second part denotes the reconstruction error; a balance parameter weighs the two parts. Setting the derivative of equation (18) with respect to the projection to zero, we get the closed-form solution of equation (20).

After obtaining the solution of equation (20), we can learn new feature representations of $X_s$ and $X_t$ by projecting their kernel maps $K_s$ and $K_t$.
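The kernel map itself is standard; a sketch with an RBF kernel follows, where the projection solved from equation (20) is left abstract (the placeholder P below is ours, not derived here).

import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Kernel map K(X, Z): each row of X is compared with the reference points Z.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

# feature-extraction stage (sketch): map both domains against all training samples,
# then project with the matrix P obtained from eq. (20).
# X_all = np.vstack([Xs, Xt])
# Ks, Kt = rbf_kernel(Xs, X_all), rbf_kernel(Xt, X_all)
# Zs, Zt = Ks @ P, Kt @ P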

3.3.2. Classification

At this stage, we transform the new feature representations of the source and target data into kernel maps using $K(\cdot,\cdot)$ and obtain the standard form of kernel ELM according to [19]:

$$f(x)=\left[K(x, z_1), \dots, K(x, z_n)\right]\left(\frac{I}{C}+\Omega\right)^{-1}Y, \qquad (21)$$

where $f(x)$ is the output of a test sample $x$, $\Omega$ is the kernel matrix over the training samples $z_1,\dots,z_n$, and $C$ is the tradeoff parameter of the empirical risk in kernel ELM. We summarize the DKELM-CDMA procedure in Algorithm 2 (a sketch of this classification stage follows Algorithm 2).

Input: Data sets $D_s$ and $D_t$, tradeoff parameters. Output: Target output $\hat{Y}_t$.
Step 1: Map $X_s$ and $X_t$ to the kernel maps $K_s$ and $K_t$ using $K(\cdot,\cdot)$, and set the conditional CDMA and LDA terms to zero.
Step 2: Calculate the projection according to equation (20).
Step 3: Project $K_s$ and $K_t$ to obtain the new feature representations $Z_s$ and $Z_t$, and map them to kernel maps using $K(\cdot,\cdot)$.
Step 4: Predict the target samples and get their labels $\hat{Y}_t$ according to equation (21).
Step 5: Compute $S_b$ and $S_w$ using $Z_s$, $Z_t$, and $\hat{Y}_t$ according to equations (12) and (13).
Step 6: Repeat Steps 2–5 until $\hat{Y}_t$ is unchanged.
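The classification stage uses the standard kernel ELM of [19]; a minimal sketch of equation (21) follows, with the rbf_kernel helper from the previous sketch assumed and the variable names ours.

import numpy as np

def kelm_fit_predict(K_train, Y_train, K_test, C=1.0):
    # Kernel ELM [19]: f(x) = k(x, X) (I/C + Omega)^(-1) Y, with Omega = K(X, X).
    n = K_train.shape[0]
    alpha = np.linalg.solve(np.eye(n) / C + K_train, Y_train)
    return np.argmax(K_test @ alpha, axis=1)

# usage sketch on the extracted features Zs (labeled) and Zt (to classify):
# Omega  = rbf_kernel(Zs, Zs, gamma)
# K_test = rbf_kernel(Zt, Zs, gamma)
# labels = kelm_fit_predict(Omega, Ys_onehot, K_test, C)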

4. Experiment and Analysis

In this section, we perform classification experiments under domain adaptation on object recognition and text data sets (described in Tables 1 and 2) to evaluate DELM-CDMA and DKELM-CDMA. For fairness, all experiments were conducted on a PC with 8 GB memory running the Windows 10 operating system, and the algorithms are implemented in MATLAB 2017b. Each experiment is run 20 times, and the average is recorded. The accuracy rate [36, 39, 53] is adopted to evaluate each algorithm.

4.1. Data Set Description

USPS + MNIST [53]: USPS and MNIST (shown in Figure 3 and Table 1) are two image data sets with different but related distributions for handwritten number recognition. They share 10 categories from 0 to 9. USPS data set contains 9,298 images with 16 × 16 pixels, and MNIST data set collects 70,000 images with 28 × 28 pixels. In this section, 1800 images from USPS data set and 2000 images from MNIST data set are selected for domain adaptation, and then we construct two domain adaptation tasks: USPS ⟶ MNIST, MNIST ⟶ USPS. Moreover, all the images are uniformly transformed into grayscale images with 16 × 16 pixels, and each image is represented by a 256-dimensional vector encoding the grayscale values of all pixels.

MSRC + VOC2007 [63]: MSRC and VOC2007 (shown in Figure 4 and Table 1) are two common object recognition image data sets. MSRC has 4,323 images from 18 categories, and VOC2007 contains 5,011 images from 20 categories; the two are different but related. From the MSRC and VOC2007 data sets, we choose 1,269 and 1,530 images, respectively, from 6 shared classes (airplane, bird, cow, family car, sheep, and bicycle), and set up two domain adaptation tasks: MSRC ⟶ VOC and VOC ⟶ MSRC. In this experiment, all images are converted to grayscale and rescaled to 256 pixels in length, 128-dimensional dense SIFT (DSIFT) features are extracted, and a 240-word codebook is created using K-means clustering, so each image is finally represented by a 240-dimensional feature vector.

Office + Caltech [53]: The Office + Caltech data set (shown in Figure 5 and Table 1) was released by Gong et al. [51] for visual cross-domain object recognition. The Office data set includes three sub-data sets, Amazon (A), Webcam (W), and DSLR (D), in which 4,652 images with 31 categories are collected. Caltech-256 (C) is also a benchmark data set for object recognition, containing 30,607 images from 256 categories. In this experiment, we select 10 shared categories across the four domains C (Caltech-256), A (Amazon), W (Webcam), and D (DSLR), for a total of 2,533 images with 800-dimensional SURF features, as shown in Table 1. From these domains we construct 12 groups of domain adaptation tasks, such as C ⟶ A, C ⟶ W, C ⟶ D, …, and D ⟶ W.

Office31 [64]: Office-31 (shown in Table 1) is a standard benchmark for domain adaptation, which contains 4,652 images with 31 categories and consists of three real-world object domains: amazon, webcam, and dslr. In this paper, we use 2,048-dim ResNet-50 [65] features to conduct 6 cross-domain tasks, that is, amazon vs dslr, amazon vs webcam, dslr vs amazon, dslr vs webcam, webcam vs amazon, webcam vs dslr.

Reuters-21578 [66]: The Reuters-21578 text data set is commonly used for text categorization in domain adaptation; it contains news articles collected and annotated by Reuters and organized into 5 top categories: exchanges, orgs, people, places, and topics. Each article is represented by a feature vector in which every word appearing in the corpus is treated as a feature. In this section, we adopt articles from the 3 largest classes (shown in Table 2), orgs, people, and place, and construct 6 domain adaptation tasks: orgs vs people, people vs orgs, orgs vs place, place vs orgs, people vs place, and place vs people.

4.2. Compared Algorithm and Setting

To investigate our method, we choose several classification algorithms for comparison.

4.2.1. Non-Adaptation Domain Classifiers

1NN: Nearest neighbor classifier with k = 1.

SVM: Support vector machine with a linear kernel; the SVM penalty parameter is selected from a candidate set.

ELM: Standard extreme learning machine.

SSELM: Semi-Supervised ELM with graph regularization [24].

4.2.2. Shallow Adaptation Domain Classifiers

TCA1, TCA2: Classifier combining TCA [57] with 1NN (TCA1) and classifier combining TCA with SVM (TCA2).

JDA1, JDA2: Classifier combining JDA [53] with 1NN (JDA1) and classifier combining JDA with SVM (JDA2). We set parameters in TCA and JDA according to [53].

DAELM_S, DAELM_T [37]: Supervised ELMs for domain adaptation.

KMM [67]: Kernel Mean Matching with resampling weights. Its result in this section is cited from [42].

LMPROJ [68]: A large margin hyperplane classifier based on SVM. We cite its result from [42].

ARRLS [66]: A general transfer learning framework referred to adaptation regularization based on transfer learning using squared loss. We set its parameters according to [66].

TELM-OWA [39]: Supervised ELM using output weight alignment for domain adaptation. We set its parameters referred [39].

CDELM-M, CDELM-C: Unsupervised ELMs using MMD and manifold regularization for domain adaptation. We cite their results from [42] for comparison.

JUC-SDELM [69]: An unsupervised ELM using MMD and the scalable factor for domain adaptation. We cited its result in [69] for comparison.

4.2.3. Deep Domain Adaptation Classifiers

DAN [58]: A deep domain adaptation network with multi-kernel MMD for domain adaptation. We cited its result in [59] for comparison.

JAN [59]: A deep domain adaptation network for domain adaptation by aligning the joint distributions of multiple domain-specific layers across domains. We cited its result in [59] for comparison.

DANN [61]: An adversarial domain adaptation network with domain-invariant feature extractor. We cited its result in [62] for comparison.

CAN [62]: An adversarial domain adaptation network adopting collaborative learning and adversarial learning for domain adaptation. We cite its result from [62] for comparison.

ELM-CDMA, DELM-CDMA, and DKELM-CDMA are our methods. ELM-CDMA is a special case of DELM-CDMA when the LDA term is 0.

It is noteworthy that the same penalty parameter $C$ setting is used in ELM, SSELM, DAELM_S, DAELM_T, TELM-OWA, ELM-CDMA, and DELM-CDMA. In ELM-CDMA and DELM-CDMA, the tradeoff parameters $C$, $\lambda$, and $\mu$ are set separately on the Office + Caltech and Office-31 data sets, on the USPS + MNIST and MSRC + VOC2007 data sets, and on the Reuters-21578 data set. We evaluate DAELM_S, DAELM_T, and TELM-OWA in the unsupervised domain adaptation setting by selecting 0.5% labeled target samples on the USPS + MNIST, MSRC + VOC2007, and Reuters-21578 data sets and 1% labeled target samples on the Office + Caltech data set to train the models. For DKELM-CDMA, the balance coefficient is set separately on the USPS + MNIST, Reuters-21578, Office + Caltech, and Office-31 data sets and on the MSRC + VOC2007 data set, while the penalty coefficient and kernel parameter at the feature extraction stage, as well as the tradeoff parameter and kernel parameter at the classification stage, are set uniformly on the 5 data sets.

4.3. Results and Analysis

We present the accuracy of all algorithms in Tables 3–5 and Figures 6–10 and find the following:

(1) DKELM-CDMA achieves the best average result in Tables 3 and 4 with the help of CDMA and LDA. Especially in Table 3, its average result of 56.7% is much higher than the other methods, which indicates that CDMA can efficiently reduce the distribution discrepancy of the output-layer data from the source and target domains and significantly promote the knowledge transfer capability of ELM in unsupervised domain adaptation. DKELM-CDMA can further mine shared features across domains, so it achieves better accuracy than DELM-CDMA.

(2) DELM-CDMA and ELM-CDMA perform better than the other methods on most tasks in total average, which again shows that CDMA is efficient for unsupervised domain adaptation and extends ELM to a wider application scope. DELM-CDMA outperforms ELM-CDMA on most tasks with the help of LDA, showing that LDA also improves the discrimination of ELM for domain adaptation.

(3) Compared with DKELM-CDMA, DELM-CDMA, and ELM-CDMA, TELM-OWA and DAELM_T get poor results because they are supervised classifiers requiring a few labeled samples in the target domain. Moreover, the traditional classifiers, such as 1NN, SVM, and ELM, are unsuccessful because they cannot transfer knowledge. ELM gains slightly better performance than 1NN and SVM due to its good generalization, and SSELM performs better than ELM by mining the manifold structure of the data.

(4) Although ARRLS, CDELM-M(C), JUC-SDELM, KMM, and LMPROJ also gain better results with the help of statistical adaptation mechanisms like MMD, (D)ELM-CDMA is more successful because CDMA is more efficient than MMD. TCA1(2) and JDA1(2) have higher accuracy than 1NN and SVM thanks to cross-domain shared feature extraction.

(5) In Table 5 and Figure 10, we run 1NN, SVM, ELM, SSELM, DAELM_S, DAELM_T, ARRLS, TELM-OWA, ELM-CDMA, DELM-CDMA, and DKELM-CDMA on the Office-31 data set with ResNet-50 features and compare with the deep domain adaptation methods DAN, DANN, JAN, and CAN. Although DELM-CDMA is a shallow classification model that does not extract shared high-level features itself, it performs well on this more complex data set when combined with a deep feature extraction network. This clearly demonstrates that DELM-CDMA equipped with deep generic features can further reduce the cross-domain discrepancy and achieve the best adaptation performance, which shows the potential of our method. However, DKELM-CDMA does not work well on Office-31; a possible explanation is that its feature extraction stage hurts the quality of the features already extracted by ResNet-50.

We also carry out an experiment on the MNIST ⟶ USPS task to compare the running time of 1NN, SVM, ELM, SSELM, TCA1(2), JDA1(2), DAELM_S, DAELM_T, ARRLS, DELM-CDMA, and DKELM-CDMA, and the results are shown in Table 6. It can be seen that: (1) ELM has the least running time because of its fast learning speed, and the algorithms based on ELM also cost less time than those based on 1NN and SVM, such as TCA1(2) and JDA1(2). (2) DELM-CDMA spends more running time than DAELM_S and DAELM_T because of the additional calculation introduced by CDMA and LDA, but its time cost is lower than ARRLS, which needs to construct an MMD matrix and a graph regularization term. (3) TELM-OWA and DKELM-CDMA have similar time consumption, both higher than ELM, SSELM, DAELM_S, and DAELM_T, because they all need to solve the output weights twice during training. (4) SSELM needs more time than ELM because of manifold regularization. (5) JDA2 has the most running time because it needs to construct the marginal and conditional MMD matrices and SVM has a complex solving process.

The above analysis shows that, compared with other methods, DKELM-CDMA and DELM-CDMA achieve higher accuracy without requiring more time.

4.4. Parameter Sensitivity Analysis

To evaluate the sensitivity of DELM-CDMA to the parameters $\lambda$, $\mu$, and $C$ and the number of hidden-layer nodes $L$, as well as its convergence, we run experiments on the org vs people, MSRC vs VOC, MNIST vs USPS, A vs D, and amazon vs dslr tasks; the results are shown in Figures 11(a)–11(e). We can see that: (1) the accuracy first increases and then decreases slightly as $\lambda$, $\mu$, and $C$ grow, and DELM-CDMA achieves optimal results at intermediate values of each. Because $\lambda$, $\mu$, and $C$ adjust the CDMA, LDA, and output weight regularization terms, respectively, the results in Figures 11(a)–11(c) show that these terms, when adjusted within an appropriate range, can improve the knowledge transfer ability and discrimination of ELM and prevent the model from overfitting. (2) From Figure 11(d), we can see that the accuracy of DELM-CDMA first increases and then slightly decreases as $L$ grows on most data sets. A wider hidden layer improves the approximation ability of the network, but an excessively wide one weakens the domain adaptation performance of ELM by damaging the cross-domain metric quality of CDMA. (3) As shown in Figure 11(e), by observing the variation of accuracy with the number of iterations, we find that the accuracy converges after several iterations, which shows that DELM-CDMA is robust.

Similar to DELM-CDMA, we carry out experiments on the org vs people, MSRC vs VOC, MNIST vs USPS, A vs D, and amazon vs dslr tasks to observe the influence of the tradeoff parameters on the accuracy of DKELM-CDMA, its convergence, and the CDMA distance. The results are recorded in Figure 12, and we can see that: (1) in general, the sensitivity of the parameters differs across data sets, but DKELM-CDMA mainly achieves good performance within a moderate range of each, as shown in Figures 12(a)–12(c). (2) From Figure 12(d), the accuracy increases over the iterations and finally converges after several iterations, so DKELM-CDMA is robust. (3) We investigate the CDMA distance as the iterations grow, and the result is shown in Figure 12(f): as the number of iterations increases, the CDMA distance becomes smaller and the accuracy becomes higher, which indicates that DKELM-CDMA can extract an effective shared feature representation across domains by relying on CDMA to reduce the differences in the marginal and conditional distributions between domains, thus facilitating cross-domain knowledge transfer.

5. Conclusion

In this paper, we propose a novel ELM model called Discriminative Extreme Learning Machine with Cross-Domain Mean Approximation for unsupervised domain adaptation. Based on traditional ELM, it introduces Cross-Domain Mean Approximation to jointly minimize the marginal and conditional distribution discrepancies of the output-layer data between the source and target domains, which enhances the ability of ELM to transfer knowledge. We also add Linear Discriminant Analysis to further improve discrimination, which improves the accuracy of ELM. In addition, to overcome the sensitivity of DELM-CDMA to the initialization of the hidden-layer parameters, we further propose DKELM-CDMA as the kernel version of DELM-CDMA, which also better fits nonlinear data. Finally, we run DELM-CDMA and DKELM-CDMA in many experiments on unsupervised domain adaptation, and the results show that the proposed approaches can effectively enhance the efficiency of cross-domain knowledge transfer in ELM. Although DELM-CDMA and DKELM-CDMA work well in domain adaptation, they cannot fully utilize high-level semantic feature representations because of their shallow network structures. Therefore, we plan to design a deep extension of DKELM-CDMA for feature extraction in future work.

Data Availability

The data used to support the findings of this study are found in https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62073124 and 51805149, in part by the Key Scientific Research Projects of Universities in Henan Province under Grant 22A120005, the National Aviation Fund Projects under Grant 201701420002, and Henan Province Key Scientific and Technological Projects under Grant 222102210095.