G Algorithms
EN-ALR:
Elastic Net is a regularization method that combines the LASSO (\(\ell_1\)) and Ridge (\(\ell_2\)) penalties, aiming to avoid overfitting at the cost of increased bias. It enables automatic variable selection (a feature of LASSO) while avoiding the limitations of LASSO regression (Zou and Hastie 2005a).
For logistic regression, the penalized objective minimizes the negative log-likelihood plus the elastic net penalty: \[ \min _{\left(\beta_{0}, \beta\right) \in \mathbb{R}^{p+1}}-\left[\frac{1}{N} \sum_{i=1}^{N} y_{i} \cdot\left(\beta_{0}+x_{i}^{T} \beta\right)-\log \left(1+e^{\left(\beta_{0}+x_{i}^{T} \beta\right)}\right)\right]+\lambda\left[(1-\alpha)\|\beta\|_{2}^{2} / 2+\alpha\|\beta\|_{1}\right] \]
where \(\beta_0\) and \(\beta\) are the coefficients of the generalized linear model, \(y_i\) is the binary outcome for the \(i\)th individual, \(x_i\) is the vector of covariates of the \(i\)th individual, \(\|\beta\|_{1}\) is the \(\ell_1\) penalty on the coefficients \(\beta\), and \(\|\beta\|_{2}^2\) is the \(\ell_2\) penalty on \(\beta\). The two hyperparameters \(\alpha\) and \(\lambda\) control the penalty: \(\alpha\) bridges the gap between LASSO (\(\alpha=1\)) and Ridge (\(\alpha=0\)), and \(\lambda\) controls the overall strength of the penalty.
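As a minimal sketch, an objective of this form can be fit with scikit-learn's elastic net logistic regression. This is an assumption about tooling, not the original implementation: scikit-learn uses `l1_ratio` in the role of \(\alpha\) and an inverse strength `C`, where \(C \approx 1/(N\lambda)\) roughly maps onto the glmnet-style objective above; `X` and `y` are hypothetical data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: N individuals, p covariates, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

alpha, lam = 0.2, 0.1  # example values from the tuning ranges reported below

# scikit-learn parameterizes regularization by 1/C rather than lambda;
# C ~ 1 / (N * lambda) is one common mapping to the objective above.
model = LogisticRegression(
    penalty="elasticnet",
    solver="saga",           # the sklearn solver that supports elastic net
    l1_ratio=alpha,          # alpha: 1 = pure LASSO, 0 = pure Ridge
    C=1.0 / (len(y) * lam),
    max_iter=5000,
)
model.fit(X, y)
risk = model.predict_proba(X)[:, 1]  # predicted risk for each individual
```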
XGBoost:
XGBoost stands for “Extreme Gradient Boosting”, a fast and scalable implementation of the gradient boosting framework (T. Chen and Guestrin 2016b). It has been used successfully in many applications and has been the winning solution for predictive performance in numerous competitions (Nielsen 2016). Hyperparameter tuning is key to achieving accurate prediction, but it comes at the cost of computation time. Therefore, to balance accuracy and efficiency, the three hyperparameters of top importance (max_depth, eta, and nrounds) were finely tuned, while default values were used for the other hyperparameters. max_depth is the maximum depth of a tree; increasing it yields a more complex model that is more likely to overfit. eta is the step-size shrinkage used in each update to prevent overfitting, and nrounds is the maximum number of boosting iterations.
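A minimal sketch of fitting such a model with the xgboost Python package follows; nrounds is the R package's argument name and corresponds to `num_boost_round` in the Python API, and `X` and `y` are hypothetical data.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",  # outputs a predicted risk in (0, 1)
    "max_depth": 6,   # maximum tree depth; larger = more complex, overfit-prone
    "eta": 0.3,       # step-size shrinkage applied at each boosting update
}
# nrounds in the R package corresponds to num_boost_round here.
booster = xgb.train(params, dtrain, num_boost_round=100)
risk = booster.predict(dtrain)
```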
Ensemble:
This method combines multiple algorithms to generate a predicted risk with better predictive performance. Typically, the predicted risks are combined using weights, which can themselves be tuned. In this project, the weights for the two algorithms (EN-ALR and XGBoost) were both set to 0.5.
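A minimal sketch of the equal-weight combination, assuming `risk_en` and `risk_xgb` hold the predicted risks from the two fitted models:

```python
import numpy as np

def ensemble_risk(risk_en, risk_xgb, w_en=0.5, w_xgb=0.5):
    """Weighted combination of the two algorithms' predicted risks."""
    return w_en * np.asarray(risk_en) + w_xgb * np.asarray(risk_xgb)
```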
Hyperparameter tuning:
For each of EN-ALR and XGBoost, 50 hyperparameter settings were randomly sampled from search spaces obtained from a coarse manual tuning (EN-ALR: \(\alpha\in[0.05,\ 0.3]\) and \(\lambda\in[0.05,\ 0.3]\); XGBoost: max_depth \(\in[5,\ 30]\), eta \(\in[0.1,\ 0.5]\), and nrounds \(\in[10,\ 150]\)). The optimal setting for each algorithm was then selected from the 50 candidates based on a weighted sum of the AUC, AP, and sBrS.
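The random sampling step might look like the following sketch; `evaluate` is a hypothetical scorer returning the weighted sum of scaled AUC, AP, and sBrS described next.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 random draws from each of the coarse-tuned search spaces above.
en_settings = [
    {"alpha": rng.uniform(0.05, 0.3), "lambda": rng.uniform(0.05, 0.3)}
    for _ in range(50)
]
xgb_settings = [
    {
        "max_depth": int(rng.integers(5, 31)),   # integers 5..30 inclusive
        "eta": rng.uniform(0.1, 0.5),
        "nrounds": int(rng.integers(10, 151)),   # integers 10..150 inclusive
    }
    for _ in range(50)
]

# `evaluate` is a hypothetical function that fits a model under one
# setting and returns the weighted sum of scaled AUC, AP, and sBrS:
# best = max(xgb_settings, key=evaluate)
```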
A weighted sum of AUC, AP, and sBrS:
AUC, AP, and sBrS are the metrics used to evaluate the models. To incorporate all three, an equal-weighted sum was used to find the optimal hyperparameters. In addition, to prevent any one metric from dominating the ranking of the weighted sum because of its magnitude across the 50 hyperparameter settings, AUC, AP, and sBrS were each scaled to [0, 1] before being weighted, i.e.
\[ \begin{align} AUC_{\text{scaled}} &= \frac{AUC - \min(AUC)}{\max(AUC) - \min(AUC)} \\ AP_{\text{scaled}} &= \frac{AP - \min(AP)}{\max(AP) - \min(AP)} \\ sBrS_{\text{scaled}} &= \frac{sBrS - \min(sBrS)}{\max(sBrS) - \min(sBrS)} \end{align} \]
Then the weighted sum of the three metrics can be expressed as:
\[ \frac{1}{3}\left(AUC_{scaled} + AP_{scaled} + sBrS_{scaled}\right) \]
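As a sketch, the scaling and equal-weight sum over the 50 candidate settings, assuming `auc`, `ap`, and `sbrs` are arrays of the raw metric values:

```python
import numpy as np

def minmax(x):
    """Scale a vector of metric values to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def weighted_score(auc, ap, sbrs):
    """Equal-weight sum of the three metrics, each scaled to [0, 1]
    across the 50 candidate hyperparameter settings."""
    return (minmax(auc) + minmax(ap) + minmax(sbrs)) / 3.0
```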
Modification of predicted risks:
It should be noted that the predicted risks are not guaranteed to increase monotonically with age, because the prediction models for different ages were developed separately. To avoid occasional decreases in the predicted risks, we force the predicted risk at age \(A\) to be equal to or greater than the maximum of the predicted risks at ages \(\le A\), i.e.
\[ \begin{align} Risk_A^{modified} &= \max(risk_A,\ risk_{A^-}) \\ risk_A &= \text{predicted risk by age } A \\ risk_{A^-} &= \text{maximum predicted risk at ages younger than } A \end{align} \]
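Because this correction takes a running maximum over increasing ages, it can be sketched as a cumulative maximum, assuming `risks_by_age` is ordered from youngest to oldest:

```python
import numpy as np

def enforce_monotone(risks_by_age):
    """Replace each age-specific risk with the maximum predicted
    risk at that age or any younger age (a running maximum)."""
    return np.maximum.accumulate(np.asarray(risks_by_age))

# e.g. [0.10, 0.08, 0.15] -> [0.10, 0.10, 0.15]
```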
Predictors:
Table G.1 lists the predictors used in modeling. For EN-ALR, chemotherapy agents that were rarely used in the study sample (busulfan, CCNU, chlorambucil, melphalan, thiotepa, idarubicin, and mitoxantrone) were coded as binary Yes/No. In contrast to regression methods, XGBoost, as a tree-based machine learning algorithm, “prefers” continuous variables over categorical ones because it can split them at any point that minimizes the loss function. Therefore, the doses of these chemotherapy agents were used in developing the XGBoost model; a coding sketch follows the table.
Variable Description | EN-ALR | XGBoost |
---|---|---|
Race (3 levels) | Categorical | Categorical |
Age at Cancer Diagnosis | Continuous | Continuous |
BMT Indicator | Binary | Binary |
Cancer Diagnosis Type (8 levels) | Categorical | Categorical |
Minimum Ovarian Radiation Dose | Continuous | Continuous |
Radiation Dose to Pituitary | Continuous | Continuous |
Total Body Irradiation Dose | Continuous | Continuous |
CED | Continuous | Continuous |
BCNU | Continuous | Continuous |
Busulfan | Binary | Continuous |
CCNU | Binary | Continuous |
Chlorambucil | Binary | Continuous |
Cyclophosphamide | Continuous | Continuous |
Ifosfamide | Continuous | Continuous |
Melphalan | Binary | Continuous |
Nitrogen Mustard | Continuous | Continuous |
Procarbazine | Continuous | Continuous |
Thiotepa | Binary | Continuous |
Carboplatin | Continuous | Continuous |
Cis_Platinum | Continuous | Continuous |
Bleomycin | Continuous | Continuous |
Daunorubicin | Continuous | Continuous |
Doxorubicin | Continuous | Continuous |
Idarubicin | Binary | Continuous |
Methotrexate | Continuous | Continuous |
Mitoxantrone | Binary | Continuous |
VM 26 | Continuous | Continuous |
VP 16 | Continuous | Continuous |
Interaction: Age at Cancer Diagnosis and BMT | Continuous | NA |
Interaction: Age at Cancer Diagnosis and Minimum Ovarian RT Dose | Continuous | NA |
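A minimal sketch of the two codings, assuming hypothetical dose columns named after the rarely used agents in Table G.1 (the actual variable names may differ): for EN-ALR the doses are dichotomized to Yes/No, while XGBoost uses the continuous dose columns as-is.

```python
import pandas as pd

# Hypothetical dose columns for the rarely used agents in Table G.1.
rare_agents = ["busulfan", "ccnu", "chlorambucil", "melphalan",
               "thiotepa", "idarubicin", "mitoxantrone"]

def code_predictors_en_alr(df: pd.DataFrame) -> pd.DataFrame:
    """EN-ALR coding: dichotomize rare agents to 1 (any dose) / 0 (none).
    For XGBoost, the original continuous dose columns are used unchanged."""
    out = df.copy()
    out[rare_agents] = (out[rare_agents] > 0).astype(int)
    return out
```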