Abstract
In the era of precision medicine, biomarkers play a vital role in drug clinical trials. They help select the patients most likely to respond to a therapy and increase the probability of a successful trial. Model selection is critical in developing a biomarker detection algorithm. Traditional model selection metrics ignore two clinical utilities of the biomarker in drug clinical trials: its ability to distinguish positive and negative patients in terms of treatment effect, and the total cost of the biomarker-based drug clinical trial. We propose a new model selection metric that estimates these two clinical utilities of biomarker detection algorithms without the need for a real drug clinical trial. Through simulation, we compare the proposed metric with the widely used ROC-based metric in selecting the optimal cutoff value for the model and discuss which one to choose under various circumstances.
1. Introduction
In the era of precision medicine, biomarkers play a pivotal role in drug clinical trials. They help select the patients most likely to respond to a therapy and increase the probability of a successful trial. For example, EGFR mutations (19del, L858R, and T790M) are applied as biomarkers in clinical trials of Gefitinib, Erlotinib, and Osimertinib [1–3], and ALK rearrangement guides the medication of Crizotinib [4]. In addition to their application in oncology, biomarkers are also frequently utilized in the treatment of Alzheimer’s disease [5] and cardiovascular disease [6].
Machine learning and artificial intelligence methods are gaining popularity for training biomarker detection algorithms. Model selection is critical in the development of such algorithms. It includes selecting the optimal model from a set of candidates, selecting the predictive genes from the large number of genes in the whole genome, and selecting the optimal cutoff for a continuous biomarker. To conduct model selection, we usually apply a metric to evaluate the candidates’ performance.
There are various metrics available in the literature. The Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the deviance information criterion (DIC) measure the likelihood and complexity of a model [7]. The area under the ROC curve (AUC), the Youden index, the product of sensitivity and specificity, and the F1 score summarize model accuracy and share similar characteristics [8, 9]. Decision curve analysis (DCA) defines a utility function considering the risk and benefit of the model [10].
However, the above metrics ignore the clinical utility of biomarker detection algorithms in drug clinical trials. The utility is manifested in two aspects. Firstly, the algorithm should distinguish positive and negative patients in terms of the treatment effect. If no significant difference exists, there is no need to conduct a biomarker-based clinical trial [11]. Secondly, the algorithm should ensure a high treatment effect in the identified biomarker-positive patients so that the trial requires a relatively small sample size and low total cost.
The clinical utility of the algorithm is usually evaluated in a biomarker-based drug clinical trial [12]. There is a huge gap between the preclinical phase and the clinical phase in developing a biomarker detection tool. In the preclinical phase, bioinformaticians can hardly select the optimal model in light of its long-term impact on the biomarker-based drug clinical trial. In the clinical phase, pharmaceutical companies are forced to conduct a drug clinical trial with uncertainty about the biomarker. Thus, inaccurate selection of the biomarker or its cutoff value may cause trial failure and substantial financial loss. Existing solutions focus on optimizing the design of the biomarker-based drug clinical trial. Placing hypotheses on both the entire population and the biomarker-positive group can diversify risk [13]. The 2-in-1 adaptive design can determine whether to conduct a phase 2/3 seamless clinical trial based on the phase 2 result [14, 15]. Adaptive enrichment designs reallocate patients or adjust the cutoff of the biomarker after an interim analysis [16]. However, the ideal solution is to select the best model, or narrow the scope of the model candidates, in the preclinical phase before the phase 3 clinical trial.
In this paper, we propose a new model selection metric that estimates the above two clinical utilities of biomarker detection algorithms without the need for a real drug clinical trial. We assume that there is a gold-standard or reference method G for testing a certain kind of biomarker status, and that a novel method M is developed to replace G in some instances. For example, tissue biopsy is the gold-standard method for detecting various gene mutations in cancer patients; however, it is invasive and not well accepted by late-stage patients, so circulating-free tumor DNA (cfDNA) from plasma becomes an alternative. Our goal is to select the optimal model for M by estimating its clinical utilities if it is further applied in a biomarker-based drug clinical trial.
In the simulation, we will compare the proposed metric with the widely used ROC-based metric in selecting the optimal cutoff value for the model and discuss which one to choose under various circumstances.
2. Materials and Methods
2.1. Notations
Assume we would like to conduct model selection for a newly developed biomarker detection algorithm M. We also have an existing reference method G for comparison.
We then generate several indicators measuring the concordance between M and G, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV):
$$\mathrm{Sensitivity} = P(M+ \mid G+), \qquad \mathrm{Specificity} = P(M- \mid G-),$$
$$\mathrm{PPV} = P(G+ \mid M+) = \frac{\mathrm{Sensitivity} \cdot P_c}{\mathrm{Sensitivity} \cdot P_c + (1-\mathrm{Specificity})(1-P_c)},$$
$$\mathrm{NPV} = P(G- \mid M-) = \frac{\mathrm{Specificity} \cdot (1-P_c)}{\mathrm{Specificity} \cdot (1-P_c) + (1-\mathrm{Sensitivity}) \cdot P_c},$$
where +/− represents the patient’s biomarker status (e.g., M+ indicates biomarker-positive patients identified by M) and Pc is the prevalence of the biomarker in the real world.
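As an illustration, the sketch below computes these indicators in Python; it assumes the confusion-matrix counts of M against G are available, and the function and variable names are ours rather than the paper’s. Note that PPV and NPV are re-weighted to the real-world prevalence Pc via Bayes’ rule instead of using the study’s case mix:

```python
def concordance(tp, fn, tn, fp, prevalence):
    """Concordance of M against reference G; PPV/NPV are re-weighted
    to a real-world prevalence via Bayes' rule, not the study mix."""
    sens = tp / (tp + fn)          # P(M+ | G+)
    spec = tn / (tn + fp)          # P(M- | G-)
    ppv = sens * prevalence / (
        sens * prevalence + (1 - spec) * (1 - prevalence))   # P(G+ | M+)
    npv = spec * (1 - prevalence) / (
        spec * (1 - prevalence) + (1 - sens) * prevalence)   # P(G- | M-)
    return sens, spec, ppv, npv
```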
Next, we derive the estimated clinical outcome of M+ patients if M is further applied in the drug clinical trial by bridging the clinical outcome of the G-based clinical trial.
Figure 1 illustrates the G-based clinical trial design, in which patients are stratified by G and randomized to test (T) and reference (R) groups. Denote by δij the clinical outcome of each stratum, where i represents the intervention group (0: reference group; 1: treatment group) and j represents the biomarker status (0: negative; 1: positive). We assume δij is continuous and follows $N\big(E(\delta_{ij}), \sigma_{ij}^{2}\big)$. The expectation and variance of δij are known parameters that can be found in, or estimated from, the literature or prior clinical trials.
Next, we can derive the expected clinical outcome θij (Figure 2) of patients selected by M in the M-based drug clinical trial from PPV, NPV, and E(δij):
$$\theta_{i1} = \mathrm{PPV} \cdot E(\delta_{i1}) + (1-\mathrm{PPV}) \cdot E(\delta_{i0}),$$
$$\theta_{i0} = \mathrm{NPV} \cdot E(\delta_{i0}) + (1-\mathrm{NPV}) \cdot E(\delta_{i1}).$$
Here, θij can be perceived as E(δij) discounted by PPV and NPV: an M+ group is a mixture of true G+ patients (a fraction PPV) and misclassified G− patients (a fraction 1 − PPV), and analogously for the M− group.
2.2. Model Selection Metric
2.2.1. Predictability of the Biomarker Detection Algorithm in the M-Based Drug Clinical Trial
If M can significantly differentiate biomarker-positive and biomarker-negative patients in terms of treatment effect, it can be called a predictive biomarker. We first investigate M’s predictability. Define the treatment effect $d_{+} = \theta_{11} - \theta_{01}$ for the biomarker-positive group, the treatment effect $d_{-} = \theta_{10} - \theta_{00}$ for the biomarker-negative group, and the predictive ability $d = d_{+} - d_{-}$.
Substituting the expressions for θij gives
$$d = (\mathrm{PPV} + \mathrm{NPV} - 1)\Big[\big(E(\delta_{11}) - E(\delta_{01})\big) - \big(E(\delta_{10}) - E(\delta_{00})\big)\Big],$$
where $\big(E(\delta_{11}) - E(\delta_{01})\big) - \big(E(\delta_{10}) - E(\delta_{00})\big)$ is the predictive ability of G and PPV + NPV − 1 is the extent of the predictive ability that M preserves. We expect d to be as large as possible; in other words, PPV + NPV − 1 should be as large as possible, or equivalently 2 − PPV − NPV should be as small as possible. 2 − PPV − NPV can be regarded as the predictability loss, which we denote by L.
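A small numerical check of this identity, using the treatment effects from the paper’s simulation settings (0.25 for G+ and 0.10 for G− patients); the baseline reference-arm outcomes of 0 and the example PPV/NPV values are our illustrative assumptions:

```python
def predictability(e_delta, ppv, npv):
    """e_delta[i][j]: expected outcome, i = arm (0 ref, 1 trt), j = G status."""
    # Expected outcomes of M-selected groups as mixtures of true G strata.
    theta_pos = {i: ppv * e_delta[i][1] + (1 - ppv) * e_delta[i][0] for i in (0, 1)}
    theta_neg = {i: npv * e_delta[i][0] + (1 - npv) * e_delta[i][1] for i in (0, 1)}
    d_pos = theta_pos[1] - theta_pos[0]   # treatment effect in M+ patients
    d_neg = theta_neg[1] - theta_neg[0]   # treatment effect in M- patients
    d = d_pos - d_neg                     # predictive ability of M
    loss = 2 - ppv - npv                  # predictability loss L
    return d_pos, d_neg, d, loss

# Baselines set to 0 for illustration; treatment effects 0.10 (G-) and 0.25 (G+).
e_delta = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.10, 1: 0.25}}
d_pos, d_neg, d, loss = predictability(e_delta, ppv=0.9, npv=0.95)
d_G = (e_delta[1][1] - e_delta[0][1]) - (e_delta[1][0] - e_delta[0][0])  # 0.15
assert abs(d - (1 - loss) * d_G) < 1e-12   # d = (PPV + NPV - 1) * d_G
```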
2.2.2. Estimated Total Cost of the M-Based Drug Clinical Trial
We estimate the total cost of the drug clinical trial if M is further applied as the biomarker detection tool (Figure 2). In practice, only biomarker-positive patients would be enrolled, and biomarker-negative ones would be excluded [17].
We first derive the variance of d+. Because an M-selected group is a mixture of true G+ and G− patients, the outcome variance in arm i of the M+ population follows from the law of total variance [18]:
$$\sigma_{i+}^{2} = \mathrm{PPV}\,\sigma_{i1}^{2} + (1-\mathrm{PPV})\,\sigma_{i0}^{2} + \mathrm{PPV}(1-\mathrm{PPV})\big[E(\delta_{i1}) - E(\delta_{i0})\big]^{2},$$
and the standardized treatment effect for biomarker-positive patients is
$$\Delta_{+} = \frac{d_{+}}{\sigma_{+}}, \qquad \sigma_{+}^{2} = \frac{\sigma_{1+}^{2} + \sigma_{0+}^{2}}{2}.$$
Then, we can derive the key elements of the M-based drug clinical trial: the standardized treatment effect, the randomized sample size, and the screening sample size. The randomized sample size is the number of patients enrolled in the trial; with type I error α and power 1 − β, the per-arm size is
$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}}{\Delta_{+}^{2}}.$$
The screening sample size measures how many patients must be tested by M before enrollment. It is related to the randomized sample size, the prevalence Pc of the biomarker in the real world, and the sensitivity and specificity of the tool, since the proportion of screened patients who test M+ is Sensitivity · Pc + (1 − Specificity)(1 − Pc):
$$n_{s} = \frac{2n}{\mathrm{Sensitivity} \cdot P_c + (1-\mathrm{Specificity})(1-P_c)}.$$
The ultimate model selection metric F is the weighted log sum of L and the estimated total cost of the M-based clinical trial. The biomarker detection algorithm should be fine-tuned to reach a minimal F:
$$F = w \log(L) + (1-w) \log\big(c_{1} n_{s} + c_{2} \cdot 2n\big),$$
where c1 is the unit price for screening one patient, c2 is the total cost of one randomized patient completing the trial, and w ∈ [0, 1] is the weight controlling the priority given to L or to the total cost.
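Putting the pieces together, the sketch below computes F from the quantities above. It follows our reconstruction of the formulas (a two-arm comparison of normal outcomes with equal allocation, and screening driven by the fraction of patients who test M+), so treat it as an illustration rather than the paper’s exact implementation:

```python
from math import log, sqrt
from scipy.stats import norm

def f_score(d_pos, var_trt, var_ref, loss, sens, spec, prev,
            c1, c2, w, alpha=0.05, power=0.8):
    """Model selection metric F = w*log(L) + (1-w)*log(total cost).

    var_trt, var_ref: outcome variances of the M+ population in the
    treatment and reference arms (mixture variances, Section 2.2.2).
    """
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    sigma_pos = sqrt((var_trt + var_ref) / 2)        # pooled SD in M+ patients
    delta = d_pos / sigma_pos                        # standardized treatment effect
    n_rand = 2 * 2 * z**2 / delta**2                 # total randomized, both arms
    p_m_pos = sens * prev + (1 - spec) * (1 - prev)  # fraction screened testing M+
    n_screen = n_rand / p_m_pos                      # patients screened by M
    cost = c1 * n_screen + c2 * n_rand               # estimated total trial cost
    return w * log(loss) + (1 - w) * log(cost)
```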
3. Results
As there are a large number of machine learning algorithms available and various parameters to fine-tune for each model, for simplicity, we show how to select the optimal cutoff value for the model in a two-class classification problem.
Through simulation, we demonstrated how the prevalence of the biomarker Pc and the weight w influence the metric F and the optimal cutoff value selection. Also, we compared the proposed metric with the ROC-based metric and discussed which one to choose under various circumstances.
3.1. Simulation Settings
Suppose G is an existing reference method that tests EGFR mutation with tissue samples, and M is a newly developed algorithm that tests the same mutation with blood samples. We collected blood samples from 1000 biomarker-positive and 1000 biomarker-negative patients identified by G and acquired the model output Y from M. Our task is to find an optimal cutoff value for M that transforms the continuous Y into a binary status: if Y ≥ cutoff, the patient is M-positive; otherwise, the patient is M-negative.
Assume the model output Y follows N(15, 3) for G+ patients and N(3, 3) for G− patients. Other known parameters are listed in Table 1.
The above parameters indicate that, in a previous G-based drug clinical trial, the treatment effect for G+ patients is 0.25, the treatment effect for G− patients is 0.10, the predictability of G is 0.15, and all δij share the same variance of 0.25.
For comparison, we included a ROC-based metric, Gmeans, the geometric mean of sensitivity and specificity, which measures overall accuracy. The cutoff value is selected by maximizing Gmeans.
The simulation is conducted 5000 times. We summarize the expected sensitivity, specificity, PPV, NPV, standardized treatment effect, randomized sample size, screening sample size, predictability loss of M, estimated total cost of an M-based drug clinical trial, and F score under different prevalence and cutoff levels.
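One replicate of the cutoff search can be sketched as follows, assuming the second parameter of N(·, ·) is the standard deviation; the paper repeats the procedure 5000 times and summarizes. The same loop can feed sensitivity and specificity into the concordance and f_score sketches above to select the cutoff with minimal F instead:

```python
import numpy as np

rng = np.random.default_rng(0)
y_pos = rng.normal(15, 3, 1000)   # model output for G+ patients (sd assumed 3)
y_neg = rng.normal(3, 3, 1000)    # model output for G- patients

best_gmeans, best_cut = -1.0, None
for cut in range(3, 16):          # integer candidate cutoff values
    sens = np.mean(y_pos >= cut)  # fraction of G+ patients called M+
    spec = np.mean(y_neg < cut)   # fraction of G- patients called M-
    gmeans = np.sqrt(sens * spec)
    if gmeans > best_gmeans:
        best_gmeans, best_cut = gmeans, cut
print(best_cut, round(best_gmeans, 3))   # typically 9, with Gmeans near 0.977
```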
4. Simulation Results
4.1. Model Selection when Pc Varies
In this section, w is fixed at 0.5.
Figures 3 and 4 illustrate how the F score varies under different Pc and cutoff values. In Figure 3, Pc is relatively small, and the F score differences among the Pc levels are large when the cutoff value is between 7 and 10, while the gap narrows when the cutoff value ranges from 11 to 15. In Figure 4, the F score differences are stable regardless of the cutoff value. It is also noteworthy that if Pc is critically small, say below 0.1, a mistakenly selected cutoff value leads to a significant increase in F, indicating a huge predictability loss of M and an increased total cost in the M-based drug clinical trial.
Table 2 shows the selected cutoff values with minimal F. As Pc increases, the optimal cutoff value goes down in a stepwise manner.
As cutoff value 9 achieves the highest Gmeans (0.977), we list the minimal F score, the F score when Gmeans is maximal, and their difference in Table 3. The difference becomes smaller as Pc increases, and no difference exists when Pc is between 0.45 and 0.55. This suggests that the metric F shows significant superiority over Gmeans in predicting clinical utility when the biomarker is rare, while the F score and Gmeans both perform well when the prevalence is large.
We then investigate 2 scenarios in detail when Pc is 0.05 and 0.5.
Table 4 shows the simulation results when Pc is 0.05. F is minimal when the cutoff value is 13, with a corresponding predictability loss of 0.023 and a total cost of 46,947,320. If we mistakenly select 9 as the optimal cutoff because of its highest Gmeans (0.977), we would spend 11,070,676 more and suffer 20 times the predictability loss; in this case, M would not be a qualified biomarker detection tool and could not be applied in the drug clinical trial, for the patients’ sake. In addition, comparing cutoff 14 with cutoff 12, although the sensitivity is smaller by 0.21 and the specificity larger by a subtle 0.001, the F score is still smaller. This reminds us to sacrifice sensitivity and seek high specificity to achieve a relatively low F when the prevalence of the biomarker is rare.
Table 5 shows the simulation result when Pc is 0.5. Cutoff 9 can achieve the minimal L, minimal cost, minimal F, and maximum Gmeans at the same time. We also learn that cutoffs 8–10 share similar F and Gmeans, and cutoff 9 with a balanced sensitivity and specificity level can lead to a relatively smaller F.
From the above simulations, we conclude that when the biomarker’s prevalence is low, the F score shows significant superiority in selecting the optimal model with the highest predictability and lowest total cost in the drug clinical trial. The traditional ROC-based metric is misleading there and will not only cause substantial financial loss but also harm patients’ interests. As the prevalence increases, we have more flexibility in selecting the optimal cutoff value, and the ROC-based method can reflect the clinical utility as well as the F score does.
4.2. Model Selection when w Varies
The weight w balances the priority between predictability loss and total cost in the F score. In this section, we investigated the influence of w on the optimal cutoff value selection, including the two extreme cases w = 0 and w = 1.
Figure 5 presents the optimal cutoff value with minimal F when w varies from 0 to 1 in steps of 0.2. Whatever Pc is, the selected cutoff value is quite stable when w ranges from 0.2 to 1. However, the optimal value is higher if w = 0, suggesting a different decision mechanism when we consider only the total cost of the M-based drug clinical trial.
In Table 6, we list the optimal cutoff value when we take solely L (w = 1) or solely the total cost (w = 0) as the metric. We also calculated the difference in L and cost between the two scenarios. In general, we cannot achieve minimal L and minimal cost simultaneously and must make a compromise. If Pc = 0.05 and w = 1, we can decrease L by 0.013 at an extra cost of 2,754,448 compared with w = 0, which is apparently not cost-effective. However, as Pc increases, lowering L becomes much cheaper, and whether to seek the highest predictability of the algorithm or the lowest total cost of the further drug clinical trial depends on budget and priority. For example, when Pc = 0.6, w can be set to 1, as it is fair to improve predictability by nearly 0.03 for an extra 374,224.
In conclusion, model selection is generally robust to the choice of w. If Pc is critically small, we have to conduct a cost-effectiveness analysis to decide the weight between biomarker predictability and total cost.
5. Conclusion and Discussion
In this paper, we proposed a new metric for model selection. It estimates the clinical utility of biomarker detection algorithms and tools without conducting a real clinical trial. The utility involves two elements: the model’s ability to distinguish positive and negative patients in terms of treatment effect, and the total cost of the biomarker-based clinical trial if the algorithm is further applied to screen patients. Based on the metric, we can select a biomarker detection model that is highly predictive of the treatment effect while ensuring the lowest total cost in the drug clinical trial.
Through simulation, we learned the importance of the prevalence of the biomarker. If the prevalence is critically low, our method shows significant superiority over the ROC-based metric in selecting the optimal model with the highest predictability and lowest total cost in the biomarker-based drug clinical trial. As the prevalence increases, both our method and the ROC-based metric estimate the clinical utility of biomarker detection algorithms well. In addition, model selection is generally robust to the weight w in the metric. However, if the prevalence of the biomarker is small, we have to consider whether to seek the highest predictability (w = 1) or the minimal total cost (w = 0), based on the budget and a cost-effectiveness analysis.
It is noteworthy that multiple testing may occur during model selection, leading to inflation of the type I error. Thus, in calculating the sample size and the F score, several strategies can be applied to control this inflation. The Bonferroni correction is the most popular method when the multiple tests are assumed independent. Maximally selected chi-square statistics can be applied to select the optimal cutoff value adjusted for multiple testing [19]. Permutation-based methods relax the conditions on the cutoff candidates and perform better [20–22].
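As a minimal illustration of the Bonferroni route (the count of 13 cutoff candidates is our assumption, not from the paper), the correction simply divides α among the candidates before the sample-size step:

```python
from scipy.stats import norm

# Hypothetical: m candidate cutoffs are tested, so each test runs at alpha/m.
m, alpha, power = 13, 0.05, 0.8
alpha_adj = alpha / m
z_adj = norm.ppf(1 - alpha_adj / 2) + norm.ppf(power)  # replaces the unadjusted z
```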
The proposed metric is an excellent tool for bridging the preclinical phase and clinical phase in developing a biomarker detection tool. Bioinformaticians can, thus, select the optimal model considering its long-term impact on the biomarker-based drug clinical trial. Also, it can strengthen the cooperation between device manufacturers and pharmaceutical companies and provide useful information for decision-makers on both sides.
Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research was funded by the Chinese University of Hong Kong.