Abstract

Cross-project defect prediction (CPDP) is an attractive research area in software testing. It identifies defects in projects with limited labeled data (target projects) by utilizing predictive models trained on data-rich projects (source projects). Existing CPDP methods based on transfer learning mainly rely on the assumption of a unimodal distribution, considering only the case where the feature distribution has one obvious peak. In practice, however, the feature distribution of project samples often exhibits multiple nonnegligible peaks, i.e., a multimodal distribution, which makes it challenging to align distributions between different projects. To address this issue, we propose a balanced adversarial tight-matching model for CPDP. Specifically, the method employs multilinear conditioning to obtain the cross-covariance of features and classifier predictions, capturing the multimodal distribution of the features. Reducing the captured multimodal distribution differences requires pseudo-labels, which are inherently uncertain. We therefore add an auxiliary classifier and generate pseudo-labels with a strategy that reduces this uncertainty. Finally, the feature generator and the two classifiers undergo adversarial training to align the multimodal distributions of different projects. Experiments on a benchmark dataset show that the proposed method outperforms state-of-the-art CPDP models.

1. Introduction

Software defect prediction (SDP) [1] aims to enhance the efficiency of resource allocation by predicting program modules that are likely to have defects before the software enters the testing phase. This allows for a targeted allocation of resources for code inspection based on the prediction outcomes. SDP commonly relies on historical project data to construct predictive models. However, in the initial stages of developing a new project, historical data are often scarce.

To tackle this challenge, researchers [2, 3] proposed cross-project defect prediction (CPDP) technology. This technology makes use of defect data from projects of different origins, enabling the application of SDP in the early phases of new projects. A prominent issue in CPDP is that projects from different sources exhibit distribution disparities. A classification boundary learned from the source project's probability distribution may therefore underperform on the target project: applying knowledge learned from the source distribution directly to the target project can significantly degrade prediction performance [4].

To mitigate this, researchers have suggested employing transfer learning methods, which aim to reduce the distribution disparity between different projects. For instance, Qiu et al. [5] utilized both semantic features and handcrafted features as joint features [6], and simultaneously incorporated the maximum mean discrepancy (MMD) [7] to minimize the feature distribution gap between the source and target projects. Similarly, Xing et al. [8] propounded an adversarial long short-term memory neural network (G-LSTM), incorporating joint features and employing adversarial training to mitigate distribution disparities. These methods take into account the reduction of differences in the feature distributions between source and target projects, which is essential to improving the effectiveness of CPDP. They [5, 8] attempt distribution alignment based on the assumption that project features follow a unimodal distribution. However, in practice, software projects are often codeveloped by multiple developers, each with their own coding style, and modules with different functions vary greatly in complexity. In particular, handcrafted features are typically extracted through software metrics, code analysis, and other methods, which have different statistical properties than features extracted by deep learning. As a result, the feature distribution of a project does not exhibit a single ideal peak but rather multiple distinct peaks; in other words, project features actually display a multimodal distribution. Each peak typically represents a pattern or cluster within the project, and different peaks indicate a higher concentration of feature frequencies or values in different regions [9]. Traditional methods can only align a portion of the peaks and fail to achieve complete alignment (as illustrated in Figure 1(a)). This still leaves significant distribution disparities, subsequently diminishing the accuracy of the cross-project predictor.

To address this challenge, we propose the balanced adversarial tight matching (BATM) method. For program features extracted from projects, we obtain the cross-covariance between features and classifier predictions using multilinear conditioning to capture the multimodal distribution of the features. Based on the captured distribution, we further need to quantify the distribution differences between projects so as to reduce them. Therefore, we measure the distribution differences between projects by extending the maximum density divergence (MDD). Calculating MDD requires label information, which the target project lacks, so we introduce pseudo-labels. Due to the distribution gap between projects, there is a certain deviation between the pseudo-labels and the real labels. To solve this problem, we additionally add an auxiliary classifier and correct the pseudo-labels by minimizing the prediction bias between the main classifier and the auxiliary classifier. The intuitive effect of the BATM method is shown in Figure 1(b): it continuously "pulls" unaligned peaks together during training and eventually achieves alignment of the multimodal distributions. Additionally, we conducted specific experimental verification in Section 6.3.

This paper’s primary contributions are outlined below:
(i) We propose an adversarial defect prediction framework based on multilinear conditioning. It achieves a closer alignment between the distributions of different projects by capturing the multimodal distribution of the data.
(ii) Pseudo-labels inevitably contain noise, which leads to uncertainty. We reduce pseudo-label uncertainty by minimizing the prediction bias between two classifiers. Specifically, the bias acts as a regularization term that corrects learning from noisy pseudo-labels, making the MDD measure of distribution differences more reliable and effective.
(iii) We conducted comparative experiments between our method and nine baseline methods on 11 common open-source projects. The experimental results show that our method achieves state-of-the-art results in F-measure, balanced accuracy, and AUC.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 details our proposed method. Sections 4 and 5 cover the experimental setup and present the experimental results. Section 6 discusses the experiments. Finally, Section 7 offers a summary of our contributions.

2. Related Work

In CPDP, several methods have been proposed to exploit program features. Nguyen and Nguyen [10] propounded a source code semantic model called abstract syntax trees (ASTs). Building on ASTs, Wang et al. [6] employed a deep belief network (DBN) model to acquire semantic features from ASTs, bridging the semantic gap between project features and defect prediction. Inspired by the ability of convolutional neural networks (CNNs) to effectively extract semantic features from text [11], Li et al. [12] investigated the application of CNNs to generating semantic features of programs. Recognizing the significance of code context semantics, Xing et al. [8] presented the G-LSTM learning model to autonomously extract program semantics and context-dependent features.

To enhance prediction across diverse projects, researchers have explored transfer learning methods to align feature distributions between source and target projects. Ma et al. [13] propounded the transfer naive Bayes (TNB) model to address the CPDP problem. Pan et al. [14] investigated the use of transfer component analysis (TCA), a transfer learning technique, in the context of CPDP. TCA maps the source dataset and the target dataset into a latent space, facilitating data migration by minimizing the distance between them. Subsequently, they proposed an enhanced transfer defect learning method called TCA+ [15], which extends TCA with customized normalization rules. The findings revealed that TCA+ significantly outperformed other comparative models on certain projects. Given that joint features can more effectively capture the distinctive characteristics of project defects, Zou et al. [16] devised a joint feature representation learning method and implemented a repeated pseudo-label strategy to narrow the feature distribution gap among different projects. To extract transferable features, Qiu et al. [5] propounded a transfer convolutional neural network (TCNN) model, adding a CNN-based matching layer for feature mapping and utilizing MMD [7] to mitigate the differences between projects. Additionally, Huang et al. [17] utilized serialized ASTs and handcrafted features to generate vectors, employing multikernel MMD to align the feature distributions of different projects. Given the substantial class imbalance within projects, Tang et al. [18] presented a transfer learning algorithm named TsboostDF, which specifically addresses both the knowledge transfer and class imbalance challenges inherent in CPDP. Recognizing the effectiveness of adversarial generation, Cheng et al. [19] applied kernel-based principal component analysis to transform instances, mapping them to a high-dimensional feature space, and then employed adversarial learning to acquire representative features for model construction. Song et al. [20] employed two classifiers to detect target samples that are distant from the source sample support vectors through adversarial training, and utilized a generator to minimize the discrepancy on the target samples, aligning the distributions.

However, the previously mentioned CPDP methods all operate under the assumption of a unimodal distribution, which may hinder performance improvement because they overlook multimodal data distributions. In this paper, we propose a novel BATM model. It not only considers the multimodal distribution but also enhances the effectiveness of MDD using a pseudo-labeling method with lower uncertainty, leading to a more significant improvement in performance.

3. Methodology

In this section, we initially provide a formal description of the CPDP problem. Then, the overall process of the BATM method is presented, and the specific modules of the method are reported in order.

3.1. Problem Definition

This paper defines the source project as the source domain $\mathcal{D}_s = \{\mathcal{X}_s, P_s(X_s)\}$, comprising the feature space $\mathcal{X}_s$ of the source project data along with its corresponding probability distribution $P_s(X_s)$. The sample set from the source domain is denoted as $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, where $y_i^s$ is the defect label of instance $x_i^s$. Similarly, the target project is considered the target domain $\mathcal{D}_t = \{\mathcal{X}_t, P_t(X_t)\}$, consisting of the feature space $\mathcal{X}_t$ of the target project data and its corresponding probability distribution $P_t(X_t)$. The sample set from the target domain is represented as $X_t = \{x_j^t\}_{j=1}^{n_t}$.

Typically, while two software projects from different sources may share some common metric elements, the distributions to which these metric elements adhere are generally distinct, i.e., $P_s(X_s) \neq P_t(X_t)$. The objective of CPDP is to train a classifier on the data from the source project and apply it to the target project. To achieve this cross-domain migration between two projects from different sources, it is often necessary to minimize the distribution disparity between the source project and the target project, a process known as domain adaptation. During model training, CPDP researchers also utilize some unlabeled target domain data as auxiliary information to construct the final prediction model, achieving more effective transfer results.

3.2. Overall Workflow

The overall workflow of our proposed CPDP method is depicted in Figure 2. Specifically, our framework consists of the following steps. We begin by parsing the project source files into ASTs and traversing the ASTs to acquire token vectors. Each token vector is then transformed into an integer vector through predefined mapping rules. Next, the integer vector is input into the transformer, which acts as a generator extracting semantic features. These semantic features are subsequently combined with handcrafted features to form joint features. After data preprocessing, we first train a main classifier and an auxiliary classifier on the source project and then use the main classifier to generate pseudo-labels for the target project. Next, multilinear conditioning is applied to the joint features and classifier predictions in order to capture the multimodal distribution of the data. Subsequently, we utilize MDD to train the generator and the two classifiers with an adversarial learning method. In this process, the pseudo-labels are continuously corrected, and the multimodal distribution difference between the two projects is reduced. Finally, the constructed model is deployed on the target project to predict instances where defects may occur. Below, we provide details of each part of our proposed method.

3.3. Generating Input Vectors

The AST is a semantic structure extracted from the source code, represented as a multibranch tree. It depicts the syntax structure of the source code. The AST offers a high degree of information compression and comprises a limited set of node types, and its structure aligns well with the semantic information of the program code. These advantages led ASTs to be adopted early in defect prediction [6].

In Figure 2(a), we employ the open-source compilation tool Javalang to process Java source files and create the corresponding ASTs. While an AST contains various types of nodes, a significant portion of them are redundant, with only a small portion highly related to code defects. Building on prior research [6], we select four categories of nodes: method invocation, declaration, control flow, and other essential node types. After filtering the nodes, the AST is traversed depth-first to obtain the token sequence. As these token sequences are string representations and cannot be directly fed into the transformer network, we establish a one-to-one mapping between nodes and positive integers. This mapping transforms the node token sequence into a corresponding integer vector, which can then serve as input for the transformer network.
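As a concrete illustration of this step, the sketch below parses a Java file with the javalang package, keeps a whitelist of node types, and maps the resulting tokens to positive integers. It is a minimal sketch: the exact node whitelist and mapping rules of BATM are not spelled out here, so `SELECTED_NODES` is an illustrative assumption.

```python
import javalang

# Node types assumed to be defect-related; the exact whitelist used by
# BATM is not published, so this selection is illustrative.
SELECTED_NODES = (
    javalang.tree.MethodInvocation,   # method invocations
    javalang.tree.MethodDeclaration,  # declarations
    javalang.tree.ClassDeclaration,
    javalang.tree.IfStatement,        # control flow
    javalang.tree.ForStatement,
    javalang.tree.WhileStatement,
)

def source_to_tokens(java_source):
    """Parse a Java file and traverse its AST depth-first, keeping only
    the selected node types as string tokens."""
    tree = javalang.parse.parse(java_source)
    return [type(node).__name__ for _, node in tree
            if isinstance(node, SELECTED_NODES)]

def tokens_to_ints(token_seqs):
    """Map each distinct token string to a positive integer (one-to-one)."""
    vocab = {}
    return [[vocab.setdefault(tok, len(vocab) + 1) for tok in seq]
            for seq in token_seqs]
```

In practice, the integer vectors would still need to be padded to a common length before batching.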

3.4. Generator

Considering that the source code of a software project is a special form of text, it follows certain grammatical rules, with keywords and identifiers holding specific semantic meanings. Therefore, CPDP tasks are closely related to natural language processing (NLP) tasks [21, 22]. The transformer network [23], known for its proficiency in learning long-term dependencies, is especially well-suited to NLP tasks. We argue that the transformer network, with its self-attention mechanism, is effective at extracting contextual and semantic features from software projects. Therefore, the BATM method utilizes this technique to extract semantic features from the program, serving as the feature generator of the entire model.

The structural diagram of the transformer is illustrated in Figure 3. In the initial step, position encoding is applied to each integer vector within the input sequence. In this manner, words at different positions are assigned varying importance as they pass through the self-attention mechanism, enabling the model to comprehend the order of words in the input sequence. Here, we utilize $PE_{(pos, 2i)}$ and $PE_{(pos, 2i+1)}$ to represent the position encoding at the even index $2i$ and odd index $2i+1$ of position $pos$ in the sequence, respectively. The corresponding formula is described as follows:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{1}$$

Among them, $d_{model}$ represents the embedding dimension of the model, that is, the dimension of the word embeddings. The constant 10,000 adjusts the periodicity of the sine function sin and the cosine function cos.
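The following PyTorch sketch implements the sinusoidal encoding of Formula (1); it is a standard implementation, assuming an even embedding dimension.

```python
import torch

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix: sin on even indices, cos on odd.
    Assumes an even d_model."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```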

Subsequently, the integer vector with added position encoding undergoes processing in the self-attention layer. Three learned weight matrices $W^Q$, $W^K$, and $W^V$ are multiplied by each element of the input sequence to obtain the query, key, and value vectors $Q$, $K$, and $V$. Their self-attention values are then calculated as shown in Formula (2):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

In the formula, $QK^{T}$ represents the inner product of the query vectors and the transposed key vectors. $\sqrt{d_k}$ is a normalization factor, the square root of the dimension of the key vectors. This normalization ensures that the magnitude of the attention scores is not affected by the input dimensionality. The softmax operation normalizes the attention scores into a probability distribution. Finally, multiplying the attention weights by $V$ yields the self-attention output. As depicted in Figure 3, $Z$ represents the resulting self-attention output.
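A minimal PyTorch sketch of Formula (2) for a single attention head (the full model uses multihead attention with 16 heads, per Section 4):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns Z = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # attention distribution
    return weights @ v                                 # self-attention output Z
```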

Following this, the output of the self-attention layer is standardized by the layer normalization layer to ensure model stability. In the feed-forward layer, features are mapped to a higher-dimensional space using an activation function. This increases the model's representational capacity and enables it to capture richer semantic information, further processing the normalized representation to extract higher-level semantic features. Finally, the add-and-normalize layer ensures a stable model output.

To enhance feature transferability and separability, we combine the semantic features acquired by the transformer with traditional handcrafted features to form joint features.

3.5. Adversarial Training

The adversarial training objective of the BATM model is to effectively capture the multimodal distribution of data through multilinear conditioning. This is achieved by integrating it with MDD to provide a more accurate measure of distribution disparities between different projects. Ultimately, this process aligns the distributions and enhances the prediction accuracy.

Initially, the model trains the generator $G$ and the classifiers $C_1$ and $C_2$ using the source projects. Due to the ample labeled samples available for source projects, the classifiers are able to classify them correctly after training.

In detail, BATM first trains the generator $G$ and classifiers $C_1$ and $C_2$ on source project samples, ensuring that the model can effectively classify them. This paper chooses the logistic regression (LR) classifier as the foundational classifier. Given that the CPDP problem entails a binary classification task, we employ binary cross-entropy loss in this step, as depicted below:

$$\mathcal{L}_{s} = -\sum_{k=1}^{2}\mathbb{E}_{(x^s, y^s) \sim \mathcal{D}_s}\left[y^s \log C_k(G(x^s)) + (1 - y^s)\log\left(1 - C_k(G(x^s))\right)\right] \tag{3}$$

Here, $C_1(G(x^s))$ and $C_2(G(x^s))$ denote the outputs of classifiers $C_1$ and $C_2$, respectively. Unless stated otherwise, the classifier output $C(G(x))$ refers to the average result produced by classifiers $C_1$ and $C_2$.

Subsequently, the trained classifier is employed to classify the target projects and assign corresponding pseudo-labels. Next, multilinear conditioning comes into play to grasp the multimodal distribution of sample features. Considering that uncertainty in discriminative information can introduce prediction noise, we aim to minimize this uncertainty by reducing the prediction bias between two classifiers.

In this paper, multilinear conditioning is achieved through the use of a multilinear map. A multilinear map is determined by the outer product of multiple random vectors [24]. For two random vectors $x$ and $y$, the joint distribution can be represented by the cross-covariance $\mathbb{E}_{xy}[\phi(x) \otimes \phi(y)]$, where $\phi$ is the feature map generated by a specific reproducing kernel [25, 26]. Suppose there are feature vectors whose joint feature vector is $f$ and whose label (prediction) vector is $g$; then the formal representation of the cross-covariance is as follows:

$$C_{fg} = \mathbb{E}_{fg}\left[\phi(f) \otimes \phi(g)\right] \tag{4}$$

Therefore, the multilinear map can capture the cross-covariance between feature representations and labels, thereby capturing the multimodal distribution of complex features. The representation of the multilinear map of features and predicted labels is as follows:

$$T_{\otimes}(f, g) = f \otimes g \tag{5}$$

Among the components of the multilinear map $T_{\otimes}(f, g)$, two are notable: $f$ represents the feature representation, which combines the joint features of the source project and the target project, while $g$ represents the average prediction of classifiers $C_1$ and $C_2$. However, one drawback of the multilinear map is dimension explosion. Let $d_f$ and $d_g$ denote the dimensions of vectors $f$ and $g$, respectively; the dimension of the multilinear map is then $d_f \times d_g$. If this dimension becomes too high, parameter explosion will occur when embedding it into a deep network. To tackle this problem, we use a random multilinear map to mitigate the dimension explosion. Since the inner product on $T_{\otimes}$ can be accurately approximated by the inner product on $T_{\odot}$ [27], in order to enhance computational efficiency, we use $T_{\odot}$ to represent the random multilinear map. Its formulaic description is as follows:

$$T_{\odot}(f, g) = \frac{1}{\sqrt{d}}\left(R_f f\right) \odot \left(R_g g\right) \tag{6}$$

where $\odot$ is the element-wise product, $d \ll d_f \times d_g$ is the output dimension, and $R_f$ and $R_g$ are random matrices sampled only once and fixed during training. Each element $R_{ij}$ follows a symmetric distribution with unit variance, that is, $\mathbb{E}[R_{ij}] = 0$ and $\mathbb{E}[R_{ij}^2] = 1$.
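A sketch of the randomized multilinear map in Formula (6); the output dimension `out_dim` is an illustrative choice, and `Rf` and `Rg` are sampled once and kept fixed, as stated above.

```python
import torch

class RandomizedMultilinearMap:
    """T_odot(f, g) = (R_f f) * (R_g g) / sqrt(d), with R_f and R_g sampled
    once and frozen (zero-mean, unit-variance entries)."""

    def __init__(self, feat_dim, num_classes, out_dim=1024):
        self.Rf = torch.randn(out_dim, feat_dim)
        self.Rg = torch.randn(out_dim, num_classes)
        self.out_dim = out_dim

    def __call__(self, f, g):
        """f: (batch, feat_dim) joint features; g: (batch, num_classes)
        averaged classifier predictions. Returns (batch, out_dim)."""
        return (f @ self.Rf.T) * (g @ self.Rg.T) / self.out_dim ** 0.5
```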

Furthermore, we require a metric to quantify this distribution disparity, which is addressed here by MDD. MDD [28] serves a dual purpose: it not only gauges the distribution difference between two distinct multimodal distributions but also diminishes the divergence between domains. This increases intraclass density, rendering the discriminative information of defective and nondefective samples on the target project more discernible and thereby benefiting the classifier's prediction of the target project. We denote the difference between the source and target distributions as $D_{mdd}$ and aim to harmonize them by minimizing $D_{mdd}$. With $h^s$ and $h^t$ denoting the conditioned features of the source and target domains, the formal description of MDD is as follows:

$$D_{mdd} = \mathbb{E}\left[\left\|h^s - h^t\right\|^2\right] + \mathbb{E}_{y^s = y^{s\prime}}\left[\left\|h^s - h^{s\prime}\right\|^2\right] + \mathbb{E}_{\hat{y}^t = \hat{y}^{t\prime}}\left[\left\|h^t - h^{t\prime}\right\|^2\right] \tag{7}$$

Among these, $h^{s\prime}$ and $h^{t\prime}$ represent independent and identically distributed copies of $h^s$ and $h^t$, respectively. In the first term of MDD, we utilize the squared Euclidean norm distance to minimize the distribution difference between $h^s$ and $h^t$. The subsequent two terms serve to maximize the within-project density of the source and target projects, respectively, since minimizing the distance between same-class samples increases class density. This method not only aligns the distribution differences between the source and target projects but also draws samples within each project closer together. In practice, we calculate the MDD loss by the following formula:

$$\mathcal{L}_{mdd} = \frac{1}{b^2}\sum_{i=1}^{b}\sum_{j=1}^{b}\left\|h_i^s - h_j^t\right\|^2 + \frac{1}{b\,n_s}\sum_{y_i^s = y_j^s}\left\|h_i^s - h_j^{s\prime}\right\|^2 + \frac{1}{b\,n_t}\sum_{\hat{y}_i^t = \hat{y}_j^t}\left\|h_i^t - h_j^{t\prime}\right\|^2 \tag{8}$$

In this context, $h$ represents the generated (conditioned) sample feature, and $b$ denotes half of the batch size. Additionally, $n_s$ and $n_t$, respectively, signify the number of categories in the source domain and target domain within the batch; these quantities can be determined by examining the label of each instance in the present batch. The constraint $y_i^s = y_j^s$ indicates that $h_i^s$ and $h_j^{s\prime}$ have the same label, where $h_i^s$ is the feature of a source project sample and $h_j^{s\prime}$ is a copy of $h_j^s$.
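The sketch below computes a loss of the form of Formula (8), assuming paired source/target half-batches of size $b$; for simplicity it normalizes the intra-class terms by the number of same-label pairs rather than by the exact factors above.

```python
import torch

def mdd_loss(z_s, z_t, y_s, y_t_pseudo):
    """z_s, z_t: (b, d) conditioned features; y_s: source labels;
    y_t_pseudo: target pseudo-labels."""
    b = z_s.size(0)
    # term 1: mean squared distance over all source-target pairs
    inter = torch.cdist(z_s, z_t).pow(2).sum() / (b * b)

    # terms 2-3: mean squared distance over same-label pairs within a
    # project; minimizing this maximizes intra-class density
    def intra(z, y):
        same = y.unsqueeze(0) == y.unsqueeze(1)        # (b, b) same-label mask
        n_pairs = same.sum().clamp(min=1)
        return (torch.cdist(z, z).pow(2) * same).sum() / n_pairs

    return inter + intra(z_s, y_s) + intra(z_t, y_t_pseudo)
```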

From the above description of MDD, we can observe that one of its objectives is to maximize intraclass density. To achieve this goal, the label information of the project samples must be utilized. For the target project, label information is unavailable, so pseudo-labels are employed instead. However, the pseudo-labels assigned to target project samples are uncertain, as they may contain prediction noise [29]. While most pseudo-labels are accurate, some incorrect labels might be present, and fine-tuning the model on noisy labels could transfer the error to the adapted model. This may worsen the adaptation process, especially for samples with uncertain predictions [30]. The auxiliary classifier $C_2$ added in this paper differs from the main classifier $C_1$ solely in its initialization; both are capable of classifying project samples. By minimizing the prediction deviation between the two classifiers, target projects can be labeled with more accurate pseudo-labels during training.

To account for label noise, we measure pseudo-label uncertainty by the prediction bias. In practice, we utilize the KL divergence between the predictions of classifiers $C_1$ and $C_2$ as an estimate of the bias:

$$\mathcal{L}_{kl} = D_{KL}\left(p_1 \,\|\, p_2\right) = \mathbb{E}_{x^t \sim \mathcal{D}_t}\left[\sum_{c} p_1(c \mid x^t) \log \frac{p_1(c \mid x^t)}{p_2(c \mid x^t)}\right] \tag{9}$$

In this context, $p_1$ signifies the output of the main classifier $C_1$, while $p_2$ denotes the output of the auxiliary classifier $C_2$. When these two classifiers offer divergent class predictions, the prediction deviation yields a substantial value, highlighting the model's uncertainty in its predictions. We attempt to reduce this prediction bias to achieve a "balance" between the classifiers, thereby reducing the uncertainty of the predictions.
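A sketch of Formula (9) using PyTorch's `F.kl_div`, which expects its first argument in log space:

```python
import torch.nn.functional as F

def prediction_bias(logits_main, logits_aux):
    """KL(p1 || p2) averaged over the batch, for logits of shape (batch, 2)."""
    p1 = F.softmax(logits_main, dim=1)
    log_p2 = F.log_softmax(logits_aux, dim=1)
    # F.kl_div(input, target) computes KL(target || input) with `input`
    # given in log space, so this returns KL(p1 || p2)
    return F.kl_div(log_p2, p1, reduction="batchmean")
```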

Additionally, it is important to assess the disparity between the predictions of the main classifier $C_1$ and the labels. For this purpose, the binary cross-entropy loss is employed and formalized as follows:

$$\mathcal{L}_{t} = -\mathbb{E}_{x^t \sim \mathcal{D}_t}\left[\hat{y}^t \log p_1 + (1 - \hat{y}^t)\log(1 - p_1)\right] \tag{10}$$

Given that the target project has no actual labels, $\hat{y}^t$ is a pseudo-label in this context. The paper incorporates a bias regularization term to enhance learning from potentially noisy labels. This regularization term can be formulated as follows:

$$\mathcal{L}_{reg} = \mathcal{L}_{t} + \mathcal{L}_{kl} \tag{11}$$

For the average output $\bar{p}$ of classifiers $C_1$ and $C_2$, we confuse the classifiers by maximizing the entropy of this output. Our expectation is that the classifiers cannot accurately distinguish the generated samples from the real samples, thus aligning the distributions. We calculate the entropy loss of the classifier output using the following formula:

$$\mathcal{L}_{ent} = -\mathbb{E}_{x}\left[\sum_{c} \bar{p}(c \mid x)\log \bar{p}(c \mid x)\right] \tag{12}$$
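A one-function sketch of the entropy term in Formula (12), applied to the averaged class probabilities:

```python
import torch

def entropy_loss(p_mean, eps=1e-8):
    """p_mean: (batch, 2) averaged class probabilities of C1 and C2."""
    return -(p_mean * (p_mean + eps).log()).sum(dim=1).mean()
```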

Based on the above, we adversarially train [31] the generator and classifiers by constructing a total loss function whose backpropagation updates the parameters. Specifically, the overall objective function of our BATM method can be formulated as follows:

$$\mathcal{L} = \mathcal{L}_{s} + \mathcal{L}_{reg} + \lambda \mathcal{L}_{mdd} - \mathcal{L}_{ent} \tag{13}$$

In the formula, $\lambda$ is a tradeoff weight whose optimal value is found through cross-validation. After a certain number of iterations, the multimodal distributions of the source and target projects are effectively aligned. Algorithm 1 describes the pseudocode of BATM. We repeat two steps during the training phase: we use the bias regularization term to correct learning from noisy labels (Step 1), and then adversarially train the classifiers and generator after pseudo-labeling the target projects (Step 2).

Input: Source project data $X_s$ with labels $Y_s$; target project data $X_t$; parameters $\theta_{c_1}$, $\theta_{c_2}$, and $\theta_g$ of classifiers $C_1$ and $C_2$ and generator $G$; learning rate $\eta$; pseudo-labels $\hat{Y}_t$; generator output $G(X_t)$ for the target project; maximum iteration $T$.
Output: $\theta_{c_1}$, $\theta_{c_2}$, and $\theta_g$.
1: procedure BATM($X_s$, $Y_s$, $X_t$)
2:   Initialize the parameters $\theta_{c_1}$, $\theta_{c_2}$, and $\theta_g$ randomly
3:   for $t = 1$ to $T$ do
4:    /* Step 1 */
5:    Compute the loss $\mathcal{L}_{s} + \mathcal{L}_{reg}$
6:    $\theta_{c_1} \leftarrow \theta_{c_1} - \eta\,\nabla_{\theta_{c_1}}(\mathcal{L}_{s} + \mathcal{L}_{reg})$
7:    $\theta_{c_2} \leftarrow \theta_{c_2} - \eta\,\nabla_{\theta_{c_2}}(\mathcal{L}_{s} + \mathcal{L}_{reg})$
8:    $\theta_g \leftarrow \theta_g - \eta\,\nabla_{\theta_g}(\mathcal{L}_{s} + \mathcal{L}_{reg})$
9:    /* Step 2 */
10:    /* Predicting pseudo-labels for target projects */
11:    $\hat{Y}_t \leftarrow \arg\max C(G(X_t))$
12:    Compute the loss $\mathcal{L}$ in Formula (13)
13:    $\theta_{c_1} \leftarrow \theta_{c_1} - \eta\,\nabla_{\theta_{c_1}}\mathcal{L}$
14:    $\theta_{c_2} \leftarrow \theta_{c_2} - \eta\,\nabla_{\theta_{c_2}}\mathcal{L}$
15:    $\theta_g \leftarrow \theta_g - \eta\,\nabla_{\theta_g}\mathcal{L}$
16:   end for
17: end procedure
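To make the procedure concrete, the following condensed PyTorch-style sketch merges the two alternating steps of Algorithm 1 into a single update for brevity. It reuses the `prediction_bias`, `mdd_loss`, and `entropy_loss` sketches and the `RandomizedMultilinearMap` above; the names `G`, `C1`, `C2`, `opt`, `rmm`, and `lam` (the MDD weight $\lambda$) are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F

def batm_step(G, C1, C2, opt, x_s, y_s, x_t, rmm, lam):
    f_s, f_t = G(x_s), G(x_t)                        # joint features (Section 3.4)
    p1_s, p2_s = C1(f_s), C2(f_s)                    # source logits
    l_src = F.cross_entropy(p1_s, y_s) + F.cross_entropy(p2_s, y_s)

    p1_t, p2_t = C1(f_t), C2(f_t)                    # target logits
    g_t = (F.softmax(p1_t, 1) + F.softmax(p2_t, 1)) / 2
    y_t = g_t.argmax(dim=1).detach()                 # pseudo-labels (line 11)
    l_reg = F.cross_entropy(p1_t, y_t) + prediction_bias(p1_t, p2_t)

    # multilinear conditioning, then MDD on the conditioned features
    g_s = (F.softmax(p1_s, 1) + F.softmax(p2_s, 1)) / 2
    l_mdd = mdd_loss(rmm(f_s, g_s), rmm(f_t, g_t), y_s, y_t)

    loss = l_src + l_reg + lam * l_mdd - entropy_loss(g_t)  # cf. Formula (13)
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)
```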

4. Experiment Settings

This paper employs the machine learning framework PyTorch to implement the BATM model. Specifically, the number of heads in the transformer network's multihead attention is set to 16. Throughout training, a batch size of 32 is used. Training uses stochastic gradient descent with the Adam optimizer [32] to update parameters, with Adam's default momentum parameter (0.9) and an initial learning rate of 0.001.

4.1. Benchmark Datasets

In recent years, researchers [5, 8] have shown a preference for utilizing open-source software projects in CPDP research. To evaluate the BATM model proposed in this paper, 11 Java projects from the PROMISE [33] benchmark dataset were selected. This particular open-source software repository is widely utilized in the CPDP field.

This paper carefully chooses 11 open-source software projects from the PROMISE dataset (refer to Table 1). In this dataset, researchers enumerated and computed 20 handcrafted features (see Table 2), with a predominant focus on program complexity [36]. Within the scope of the CPDP problem, defect prediction requires specifying both a source project and a target project. Thus, this paper first designates a project as the target and then uses each of the remaining projects in the dataset as a source, forming a CPDP task pair from source project to target project. Following this approach, 110 CPDP task pairs can be constructed from the PROMISE dataset, forming the basis for the experiments in this paper.

4.2. Evaluation Metrics

To assess the reliability and effectiveness of the BATM model, we employ three widely used evaluation metrics: F-measure, balanced accuracy, and area under curve (AUC) [38]. In CPDP, it is customary to use the confusion matrix (Table 3) to evaluate the classifier's performance.

The effectiveness of a classifier on positive classes can be assessed using sensitivity (or recall), while its effectiveness on negative classes can be evaluated using specificity. Precision is the metric used to assess the accuracy of positive predictions. The calculation formulas for these indicators are as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

The F-measure, being the harmonic mean of sensitivity and precision, offers a balanced evaluation of these two indicators. It aims to find the optimal balance between making accurate positive predictions and capturing all positive instances. The formal description of F-measure is as follows:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

In CPDP tasks, a classifier that predicts all samples as the majority class can still score well on accuracy (ACC). Therefore, this paper employs balanced accuracy (BA) to gauge the validity of the CPDP model. It provides a comprehensive evaluation considering the classifier's performance on both majority and minority classes. The calculation expression for balanced accuracy is as follows:

$$BA = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
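For reference, all three metrics can be computed from predictions with scikit-learn's standard implementations; this sketch assumes hard predictions `y_pred` and defect-class probabilities `y_score`:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_pred: hard labels; y_score: predicted probability of the defective class."""
    return {
        "F-measure": f1_score(y_true, y_pred),
        "Balanced accuracy": balanced_accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```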

To comprehensively compare the validity of BATM and the baseline methods, this paper employs the AUC evaluation index [38]. AUC stands for the area under the ROC curve, a 2D graph illustrating the relationship between the true positive rate (y-axis) and the false positive rate (x-axis). The ROC curve is generated by varying the classification threshold across all possible values, thereby distinguishing between instances with defects and those without. An effective predictor should yield an AUC value as close to 1 as possible. Additionally, ROC analysis is robust in situations with class imbalance and uneven classification costs. Hence, this metric is well-suited for CPDP contexts.

4.3. Baseline Methods

We benchmark the BATM method against the following nine baseline methods:
(i) LR: It relies on handcrafted features as input for the LR classifier in the prediction process.
(ii) NNFilter [39]: Only handcrafted features are utilized, and similar instances are aggregated to construct a training set.
(iii) TCA [14]: It maps data from different projects into a latent space. By minimizing the distance between them, data migration is accomplished while preserving the geometric structure and data variance.
(iv) TCA+ [15]: This is a transferable feature learning method that extends TCA with customized normalization rules, exclusively using handcrafted features.
(v) DBN [6]: This method employs a deep belief network to handle serialized ASTs and generate semantic features.
(vi) DPCNN [12]: This method utilizes a CNN to process serialized ASTs and extract semantic features. It also integrates handcrafted features into the process.
(vii) TCNN [5]: It adds a CNN-based matching layer for feature mapping and utilizes MMD to mitigate differences between projects.
(viii) MANN [17]: This method employs serialized ASTs and handcrafted features to generate vectors. It implements multikernel MMD to align the feature distributions across different projects.
(ix) ADA [20]: It is based on adversarial training and uses two classifiers to detect target project samples that are far from the support vectors of source project samples, obtaining the relationship between these samples and the classification boundary and performing distribution alignment.

This paper utilizes PyTorch to replicate all baseline models except TCNN, as TCNN's source code is publicly available. The paper then retrains these baseline models under approximately identical experimental conditions. Taking into account the class imbalance challenge in CPDP [40], ADA originally oversamples both source and target projects. However, we believe that oversampling the target project leaks the target project's label information, which is unreasonable. Therefore, we apply random oversampling to source projects only in all models, without any manipulation of the target projects, to ensure the utmost fairness of the comparative experiments.
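A sketch of this preprocessing choice using imbalanced-learn's `RandomOverSampler`; `X_src` and `y_src` are hypothetical arrays holding the source project's features and labels:

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_src_bal, y_src_bal = ros.fit_resample(X_src, y_src)  # source project only
# The target project is left untouched so that no target label
# information leaks into training.
```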

4.4. Statistical Analysis Methods

To ascertain the significance of the performance comparison results, this paper employs the Wilcoxon signed-rank test, a nonparametric statistical hypothesis test applied to the evaluation index. In all test cases, the null hypothesis posits that there is no statistical discrepancy in the performance results between the two compared methods. The chosen statistical significance level is set at 0.05. A p-value below 0.05 indicates a noticeable distinction, while a p-value above 0.05 suggests that the distinction is not statistically significant.
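A sketch of the test with SciPy, where `batm_scores` and `baseline_scores` are hypothetical arrays of paired per-task-pair results for the two compared methods:

```python
from scipy.stats import wilcoxon

# batm_scores and baseline_scores: paired per-task-pair results
stat, p_value = wilcoxon(batm_scores, baseline_scores)
significant = p_value < 0.05  # reject the null hypothesis of no difference
```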

To facilitate pairwise comparisons between methods, this paper employs a "Win/Tie/Loss" analysis. For a specific performance metric, this analysis between our method and a comparison method yields three counts, indicating the number of instances where our method outperforms, performs equivalently to, or underperforms the comparison method, respectively.

Additionally, this paper applies a variant of the Scott-Knott effect size difference (ESD) test [41] to quantify the effect size of the discrepancy between two methods. All compared models are ranked according to the evaluated measure. The Scott-Knott ESD test is a mean-comparison analysis that uses hierarchical clustering to partition methods with statistically noticeable distinctions into distinct clusters or groups.

4.5. Research Question

To evaluate the efficacy and performance of the proposed BATM method, we discuss the following two research questions (RQs).
RQ1: Has the BATM method achieved better predictive performance compared to the relevant CPDP methods?
Motivation: Comparison with CPDP benchmark models is an intuitive way to prove the effectiveness of BATM. We selected nine benchmark models for comparative experiments with BATM.
RQ2: How do BATM components affect its prediction performance?
Motivation: Transfer learning is an important technique for handling CPDP tasks. We must confirm whether the relevant components of the BATM model effectively help it transfer learned knowledge, thereby improving its predictive performance.

5. Experimental Results

This section designs and executes a substantial number of experiments to validate the accuracy and efficacy of the proposed model, and examines the prediction performance of BATM based on the obtained experimental results. For the experimental results in Tables 4–9, each row related to method performance contains a bold value, indicating the best-performing method for the same target project.

5.1. RQ1: Has the BATM Method Achieved Better Predictive Performance Compared to the Relevant CPDP Methods?

Tables 4–6 present the comparative experimental outcomes on the PROMISE dataset. It is evident that the BATM model proposed in this paper demonstrates noteworthy performance, achieving average scores of 0.533, 0.714, and 0.698 in F-measure, balanced accuracy, and AUC, respectively, placing it at the forefront among all compared models. In terms of Win/Tie/Loss (W/T/L) analysis, BATM outperforms the baseline methods in F-measure on at least 8 projects, in balanced accuracy on at least 9 projects, and in AUC on at least 10 projects. This demonstrates that our proposed method exhibits superior prediction performance compared to the baseline methods across most projects. To provide a clearer illustration of how our method surpasses the other baseline methods, we quantify the improvement in model performance.

In terms of F-measure, BATM achieves optimal performance on most projects and improves performance by 8.0%–19.8% on average over the benchmark models. Specifically, BATM improves on LR, NNFilter, TCA, and TCA+ (which use only handcrafted features) by 14.5% on average; on DBN (only semantic features) by 16.5%; on DPCNN (joint features) by 9.4%; and on TCNN, MANN, and ADA (joint features combined with handling of distribution differences) by 8.9%.

In terms of balanced accuracy, BATM achieves superior performance in most projects, with an average performance 10.1%–38.3% higher than the baseline models. It is worth noting that the balanced accuracy of the BATM model on the projects Forrest and Log4j only reaches 0.403 and 0.380, respectively. We believe this is because the defect rate of Forrest is only 6.3%, while that of Log4j is as high as 92.2%, and both are projects with relatively small sample sizes in the PROMISE dataset, with only 32 and 205 code instances, respectively. The class imbalance is therefore extremely severe. Even when random oversampling is applied to the source project, it merely repeats samples; a minority class lacking diversity leaves BATM unable to learn generalizable knowledge and prone to overfitting. In contrast, the defect rate of the Xalan project is as high as 98.8%, but it contains 909 data samples. This allows BATM to learn enough knowledge from the project, yielding a balanced accuracy of up to 0.796, at least 52.8% higher than the baseline models. This demonstrates that BATM can still perform well even with imbalanced data distributions, provided there are sufficient samples.

From an AUC perspective, BATM achieves optimal performance in all projects except Forrest, with an average performance 7.1%–29.5% higher than the baseline models. In comparison with models employing joint features, BATM improves on DPCNN, TCNN, MANN, and ADA by 17.8%, 21.9%, 16.8%, and 7.1%, respectively.

From the standpoint of statistical hypothesis testing, the p-values derived from the Wilcoxon signed-rank test for the three metric indicators of all comparison models are consistently below 0.05. This indicates that the differences between these models and the BATM method are statistically significant at the 95% confidence level.

To visually illustrate the differences between BATM and the comparative models, this paper employs a box plot based on the Scott-Knott ESD test. The methods are categorized and ranked based on the Scott-Knott ESD test results. In Figure 4, the orange line represents the median of the method, and the green triangle represents the mean. Methods toward the front are considered better, and there is not a significant difference in performance among several methods within the same color box. In terms of rankings, BATM secures the top position.

Compared to CPDP methods based on a unimodal distribution, BATM excels at capturing the multimodal distribution of code instances. Additionally, employing MDD not only effectively aligns the distributions but also facilitates better differentiation of defective instances. We posit that these are the primary factors contributing to BATM's superior predictive performance. Through a detailed and comprehensive observation of the experimental results, we can answer RQ1: Compared with the related CPDP methods, the BATM method achieves better prediction performance.

5.2. RQ2: How Do BATM Components Affect Its Prediction Performance?

To gain a deeper understanding of the impact of various components on the model, we performed ablation experiments to assess the performance effect of each component in BATM. These components include handcrafted features (HF), multilinear conditioning (MC), and the MDD method. Specifically, No-HF denotes a model trained without handcrafted features, No-MC denotes a model without multilinear conditioning, and No-MDD denotes a model without MDD.

As evident from Tables 7–9, the removal of any of these components results in a significant drop in the defect predictor's performance. In terms of F-measure, BATM attains an average performance of 0.533. In contrast, the models without handcrafted features (No-HF), without multilinear conditioning (No-MC), and without maximum density divergence (No-MDD) only achieve average performances of 0.489, 0.434, and 0.462, respectively, representing performance drops of 8.26%, 18.6%, and 13.3%. Concerning balanced accuracy, BATM achieves an average performance of 0.714, while No-HF, No-MC, and No-MDD achieve 0.677, 0.641, and 0.672, respectively. In terms of AUC, BATM attains an average performance of 0.698, whereas No-HF, No-MC, and No-MDD only reach 0.652, 0.611, and 0.648, respectively. Figure 5 illustrates the outstanding performance of BATM through a box plot.

The absence of handcrafted features leads to the loss of crucial feature information. This hinders the model’s ability to comprehensively learn reliable information, thereby impacting the prediction performance. Without considering the aligned multimodal distribution, regardless of the amount of information the model learns, it may only capture noise. Consequently, the No-MC model experiences the most significant performance decrease compared to other ablation models. The omission of the MDD method means that, even if the model captures the multimodal distribution of the data, it cannot effectively measure distribution differences. Consequently, the model lacks the ability to gradually align these differences, resulting in a significant decline in prediction performance.

Through these experiments, the necessity of combining handcrafted features and employing a multilinear conditioning method based on MDD is verified. Through the above discussion, we are able to answer RQ2: The components of BATM contribute to its good prediction performance.

6. Discussions

In this section, we explore diverse issues related to the BATM approach and discuss potential threats to the validity of this work.

6.1. How Do the Parameter $\lambda$ and the Number of Attention Heads Affect the Performance of BATM?

In this section, we design correlation experiments to explore the effect of the MDD weight parameter $\lambda$ and the number of transformer attention heads on BATM.

For the weight parameter $\lambda$ of the MDD, we vary its value within the range {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 1.5, 2} and report the corresponding experimental results. To achieve this, we perform 30 random runs and present the average F-measure, balanced accuracy, and AUC over the 110 CPDP task pairs. This paper selects the parameter value by comparing model performance under different values of $\lambda$. As depicted in Figure 6(a), our proposed BATM method achieves optimal results when $\lambda$ is set to 0.01. Consequently, we adopt $\lambda = 0.01$ in our experiments. It is noteworthy that 0.01 is a relatively small value, so the significance of MDD with such a small weight may be questioned. In light of this concern, we can examine cross-entropy and other entropy-related losses alongside the MDD loss. Notably, while entropy involves the logarithm of the output, the MDD loss is a sum of squared distances. This leads us to conjecture that the absolute value of the MDD loss may be considerably larger than the entropy-related losses. Consequently, assigning a smaller weight to the MDD loss rescales it to a magnitude comparable to the entropy-related losses.

For the number of heads of the transformer, we varied it within the range of {2, 4, 8, 16, 32, 64}. Then, we conducted experiments on the PROMISE dataset and reported the experimental results. As displayed in Figure 6(b), the model performance is optimal when the number of attention heads in the transformer is set to 16. If the number of attention heads is too small, each head’s capacity to learn information is limited, and it may not fully utilize the input sequence’s information. Conversely, if the number of attention heads is too large, especially with a small dataset like PROMISE, it may lead to overfitting the training data and decreased generalization ability. In this study, setting the number of attention heads to 16 allows for both comprehensive learning of the input sequence’s information and good generalization ability. This is exactly why we decided to set the number of heads of the transformer to 16.

6.2. Can the Model Learn Useful Knowledge?

In deep learning, the evolution of the loss value signals the model's ability to progressively capture essential features and patterns in the data. As illustrated in Figure 7, the line chart depicting the change in loss value reveals a gradual decrease until it plateaus. This observation indicates that our model effectively and gradually learns valuable knowledge from the data until the algorithm reaches convergence. Hence, this paper sets the default number of model iterations to 30. This choice strikes a balance between ensuring that the model acquires adequate knowledge and minimizing training time.

6.3. How Does BATM Work by Aligning Multimodal Distributions?

Addressing the issue raised concerning features with multimodal distributions, we employ t-SNE for visualizing distribution alignment [42]. t-SNE, a nonlinear dimensionality reduction technique, aids in visualizing the intricate structure of high-dimensional data. By observing the clustering patterns in the t-SNE dimensionality reduction space, we can gain insights into the approximate distribution of the data. In the presence of a multimodal distribution, data points are expected to be distributed across multiple clusters in the reduced dimension space.
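A sketch of this visualization with scikit-learn and matplotlib, where `feats_s` and `feats_t` are hypothetical arrays of learned source and target features:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([feats_s, feats_t]))          # project features to 2D
n_s = len(feats_s)
plt.scatter(emb[:n_s, 0], emb[:n_s, 1], s=5, label="source")
plt.scatter(emb[n_s:, 0], emb[n_s:, 1], s=5, label="target")
plt.legend()
plt.show()
```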

As depicted in Figure 8, the sample features of both the source and target projects form two distinct irregular strip clusters, with some less apparent clusters. This suggests that their probability distribution exhibits at least two clear peaks, indicating an overall appearance of a multimodal distribution. Across continuous iterations from the first epoch to the 10th and then to the 25th epoch, the clusters of source and target project features gradually converge and overlap. This reflects the progressive alignment of distributions between them.

By the 25th epoch, the feature clusters of the source and target projects significantly overlap, signifying that their distributions have been effectively aligned. This observation aligns with the conclusion drawn in Figure 7, indicating that the algorithm reaches convergence after the 25th epoch. In summary, BATM demonstrates reliability and effectiveness in aligning the multimodal distributions of features across different projects.

6.4. How Do Different Classifiers Affect the Performance of BATM?

In this subsection, we discuss why the LR classifier is chosen as the base classifier for BATM. LR is a commonly used base classifier in CPDP; it is stable and less likely to overfit when dealing with unbalanced and small datasets. To verify the appropriateness of using LR as the base classifier for BATM, we evaluate the impact of different classifiers on the prediction performance of the model. In addition to LR, we chose four other classical classifiers: random forest (RF), K-nearest neighbor (KNN), support vector machine (SVM), and naive Bayes (NB). For RF, KNN, and SVM, we utilized common defaults in CPDP, setting the number of decision trees for RF to 100, the value of k for KNN to 5, and using the Gaussian radial basis function as the kernel for SVM.

In Figure 9, we can observe that the choice of different base classifiers impacts the performance of BATM. When comparing the results of the model on three evaluation metrics using five different base classifiers, we find that LR achieves the best performance in terms of measure, balanced accuracy, and AUC. SVM comes closest to LR in terms of measure but significantly lags behind in terms of balanced accuracy and AUC. Additionally, RF and KNN produce comparable results across all three evaluation metrics, while NB is unable to compete with the remaining four base classifiers on any of the three evaluation metrics. To summarize, LR is more suitable as a base classifier for BATM.

6.5. Threats to Validity

While the BATM method proposed in this paper has shown commendable performance in the CPDP task, it still exhibits the following limitations:
(i) During the experiments, due to the impracticality of evaluating all possible parameter values, our exploration was limited to the impact of selected values of the MDD weight parameter $\lambda$ on model performance. It is conceivable that more optimal weight values exist that could further improve prediction performance.
(ii) For models lacking open-source code, we meticulously implemented them based on the information provided in the respective papers. To ensure a fair comparison, we employed uniform data preprocessing methods and an LR-based classifier, and we selected 11 open-source projects from the public benchmark dataset, retraining the models for consistency. However, our implementations may not comprehensively capture all details of the baseline methods.
(iii) To better reveal the multimodal distribution of the joint features formed by semantic and handcrafted features, we deliberately chose 11 projects from the PROMISE benchmark dataset for our experiments. However, these 11 open-source projects are exclusively coded in Java. Applying our proposed approach to commercial software projects or projects developed in other programming languages may lead to different outcomes. The efficacy of our proposed method requires validation on more diverse datasets in future investigations.

7. Conclusions

This paper proposes BATM, a novel CPDP model that effectively addresses the challenge of aligning multimodal distributions between different projects. Initially, the project source code is parsed into ASTs to obtain token vectors, which are then converted into integer vectors. Next, the transformer network is utilized to acquire semantic features, which are integrated with handcrafted features to form joint features. Subsequently, the generator and classifiers undergo adversarial training on the source projects. Afterward, the classifier is used to classify the target projects and assign pseudo-labels, and the uncertainty of the pseudo-labels is reduced to enhance their accuracy. In short, multilinear conditioning is employed to capture the multimodal distribution of sample features, and MDD is used to measure the distribution differences between projects. This process aligns the distributions and maximizes the density within the same category, ultimately improving prediction accuracy. Compared to nine other methods, BATM shows improvements of at least 6.6%, 26.2%, and 14.6% in F-measure, balanced accuracy, and AUC, respectively. BATM also outperforms other methods on most target projects, indicating its exceptional prediction accuracy and generalization capability.

While the method we propose has shown effective performance in the CPDP task, there is still potential for enhancement. Moving forward, we intend to augment the size and diversity of the dataset to mitigate its influence on model performance. Additionally, we plan to implement more suitable sampling techniques and data augmentation strategies, especially for source projects with class imbalance or small sample sizes, to further improve the model's predictive accuracy.

Data Availability

The PROMISE dataset we use can be accessed at https://doi.org/10.1145/1868328.1868342. The implementation of the proposed BATM model is available from the corresponding author upon a reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest in this work.