slickml.classification#

Package Contents#

Classes#

GLMNetCVClassifier

GLMNet CV Classifier.

XGBoostCVClassifier

XGBoost CV Classifier.

XGBoostClassifier

XGBoost Classifier.

class slickml.classification.GLMNetCVClassifier[source]#

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

GLMNet CV Classifier.

This is a wrapper using GLM-Net [glmnet-api] to train a regularized linear model via logistic regression and find the optimal penalty values through N-fold cross-validation. In principle, GLMNet (also known as ElasticNet) can also be used for feature selection and dimensionality reduction using the LASSO (Least Absolute Shrinkage and Selection Operator) regression part of the algorithm while reaching a solid solution using the Ridge regression part of the algorithm.
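To make the LASSO/Ridge mix concrete, the sketch below evaluates the elastic-net penalty in the parameterization commonly used by glmnet, lam * (alpha * L1 + (1 - alpha) / 2 * L2^2). This is an illustration only: the function name `elastic_net_penalty` is hypothetical and the formula is assumed from the usual glmnet convention, not taken from this page or from SlickML's code.

```python
from typing import Sequence


def elastic_net_penalty(coeffs: Sequence[float], lam: float, alpha: float) -> float:
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).

    Hypothetical helper; assumes the conventional glmnet parameterization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    l1 = sum(abs(b) for b in coeffs)          # LASSO part
    l2_squared = sum(b * b for b in coeffs)   # Ridge part
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


coeffs = [0.5, -2.0, 0.0]
lasso_penalty = elastic_net_penalty(coeffs, lam=0.1, alpha=1.0)  # pure L1
ridge_penalty = elastic_net_penalty(coeffs, lam=0.1, alpha=0.0)  # pure L2
mixed_penalty = elastic_net_penalty(coeffs, lam=0.1, alpha=0.5)  # equal mix
```

With alpha=1.0 only the absolute-value term remains, which is what drives coefficients to exactly zero; with alpha=0.0 only the squared term remains.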

Parameters:
  • alpha (float, optional) – The stability parameter with possible values of 0 <= alpha <= 1, where alpha=0.0 and alpha=1.0 lead to classic Ridge and LASSO regression models, respectively, by default 0.5

  • n_lambda (int, optional) – Maximum number of penalty values to compute, by default 100

  • n_splits (int, optional) – Number of cross validation folds for computing performance metrics and determining lambda_best_ and lambda_max_. If non-zero, must be at least 3, by default 3

  • metric (str, optional) – Metric used for model selection during cross validation. Valid options are "accuracy", "roc_auc" (alias = "auc"), "average_precision", "precision", and "recall". The metric affects the selection of lambda_best_ and lambda_max_. Thus, fitting the same data with different metric methods will result in the selection of different models, by default “auc”

  • scale (bool, optional) – Whether to standardize the input features to have a mean value of 0.0 and standard deviation of 1 prior to fitting. The final coefficients will be on the scale of the original data regardless of this step. Therefore, there is no need to pre-process the data when using scale=True, by default True

  • sparse_matrix (bool, optional) – Whether to convert the input features to sparse matrix with csr format or not. This would increase the speed of feature selection for relatively large sparse datasets. Additionally, this parameter cannot be used along with scale=True where standardizing the feature matrix to have a mean value of zero would turn the feature matrix into a dense matrix, by default False

  • fit_intercept (bool, optional) – Include an intercept term in the model, by default True

  • cut_point (float, optional) – The cut point to use for selecting lambda_best_. Based on this value, the distance between lambda_max_ and lambda_best_ would be cut_point * standard_error(lambda_best_), i.e. lambda_best_ is arg_max(lambda) for cv_score(lambda) >= cv_score(lambda_max_) - cut_point * standard_error(lambda_max_), by default 1.0

  • min_lambda_ratio (float, optional) – In combination with n_lambda, the ratio of the smallest and largest values of lambda computed (min_lambda/max_lambda >= min_lambda_ratio), by default 1e-4

  • tolerance (float, optional) – Convergence criteria tolerance, by default 1e-7

  • max_iter (int, optional) – Maximum passes over the data, by default 100000

  • random_state (int, optional) – Seed for the random number generator. The glmnet solver is not deterministic; this seed is used for determining the cv folds, by default 1367

  • lambda_path (Union[List[float], np.ndarray, pd.Series], optional) – In place of supplying n_lambda, provide an array of specific values to compute. The specified values must be in decreasing order. When None, the path of lambda values will be determined automatically. A maximum of n_lambda values will be computed, by default None

  • max_features (int, optional) – Optional maximum number of features with nonzero coefficients after regularization. If not set, defaults to the number of features (X_train.shape[1]) during fit. Note, this will be ignored if the user specifies lambda_path, by default None
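The cut_point rule in the parameter list can be read as a variant of the one-standard-error rule. The following is a minimal sketch of that reading, assuming a higher CV score is better; `select_lambdas` and the sample numbers are hypothetical and do not reproduce the glmnet implementation.

```python
from typing import List, Tuple


def select_lambdas(
    lambda_path: List[float],
    cv_mean: List[float],
    cv_se: List[float],
    cut_point: float = 1.0,
) -> Tuple[float, float]:
    """Return (lambda_max, lambda_best) for a higher-is-better CV score.

    lambda_max:  lambda with the highest mean CV score.
    lambda_best: largest lambda whose mean CV score is still within
                 cut_point standard errors of the score at lambda_max.
    """
    idx_max = max(range(len(cv_mean)), key=lambda i: cv_mean[i])
    floor = cv_mean[idx_max] - cut_point * cv_se[idx_max]
    candidates = [lam for lam, score in zip(lambda_path, cv_mean) if score >= floor]
    return lambda_path[idx_max], max(candidates)


# Made-up CV results along a decreasing lambda path.
lambda_path = [1.0, 0.5, 0.1, 0.05, 0.01]
cv_mean = [0.60, 0.78, 0.80, 0.81, 0.70]
cv_se = [0.03, 0.02, 0.02, 0.02, 0.02]

lambda_max, lambda_best = select_lambdas(lambda_path, cv_mean, cv_se, cut_point=1.0)
strict_max, strict_best = select_lambdas(lambda_path, cv_mean, cv_se, cut_point=0.0)
```

A larger cut_point tolerates a bigger drop in CV score and therefore picks a larger lambda, i.e. a more heavily regularized and sparser model; cut_point=0.0 makes lambda_best_ coincide with lambda_max_.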

fit(X_train, y_train)[source]#

Fits a glmnet.LogitNet to input training data. The proper X_train matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix and scale

predict_proba(X_test, y_test)[source]#

Returns prediction probabilities for the positive class. predict_proba() only reports the probability of the positive class, whereas the sklearn API returns probabilities for both classes, and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference)

predict(X_test, y_test, threshold=0.5)[source]#

Returns prediction classes based on the threshold. The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under sensitivity-specificity curve, and maximizing the area under precision-recall curve, by using BinaryClassificationMetrics

plot_coeff_path():

Visualizes the coefficients’ paths

plot_cv_results()[source]#

Visualizes the cross-validation results

plot_shap_summary()[source]#

Visualizes Shapley values summary plot

plot_shap_waterfall()[source]#

Visualizes Shapley values waterfall plot

get_shap_explainer()[source]#

Returns the fitted shap.LinearExplainer object

get_params():

Returns parameters

get_intercept():

Returns model’s intercept

get_coeffs():

Returns non-zero coefficients

get_cv_results():

Returns cross-validation results

get_results():

Returns model’s total results

X_train#

Returns training data set

Type:

pd.DataFrame

X_test#

Returns transformed testing data set

Type:

pd.DataFrame

y_train#

Returns the list of training ground truth binary values [0, 1]

Type:

np.ndarray

y_test#

Returns the list of testing ground truth binary values [0, 1]

Type:

np.ndarray

coeff_#

Return the model’s non-zero coefficients

Type:

pd.DataFrame

intercept_#

Return the model’s intercept

Type:

float

cv_results_#

Returns the cross-validation results

Type:

pd.DataFrame

results_#

Returns the model’s total results

Type:

Dict[str, Any]

params_#

Returns model’s fitting parameters

Type:

Dict[str, Any]

shap_values_train_#

Shapley values from LinearExplainer using X_train

Type:

np.ndarray

shap_values_test_#

Shapley values from LinearExplainer using X_test

Type:

np.ndarray

shap_explainer_#

Shap LinearExplainer with independent masker using X_test

Type:

shap.LinearExplainer

model_#

Returns fitted glmnet.LogitNet model

Type:

glmnet.LogitNet

References

alpha :Optional[float] = 0.5#
cut_point :Optional[float] = 1.0#
fit_intercept :Optional[bool] = True#
lambda_path :Optional[Union[List[float], numpy.ndarray, pandas.Series]]#
max_features :Optional[int]#
max_iter :Optional[int] = 100000#
metric :Optional[str] = auc#
min_lambda_ratio :Optional[float] = 0.0001#
n_lambda :Optional[int] = 100#
n_splits :Optional[int] = 3#
random_state :Optional[int] = 1367#
scale :Optional[bool] = True#
sparse_matrix :Optional[bool] = False#
tolerance :Optional[float] = 1e-07#
__getstate__()#
__post_init__() None[source]#

Post instantiation validations and assignments.

__repr__(N_CHAR_MAX=700)#

Return repr(self).

__setstate__(state)#
fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None[source]#

Fits a glmnet.LogitNet to input training data.

Notes

For the cases that sparse_matrix=True, a CSR format of the input will be used via df_to_csr() function.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)

  • y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)

Returns:

None
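For readers unfamiliar with the CSR layout mentioned in the note above, here is a small stdlib-only sketch of how a dense matrix maps to the three CSR arrays. It does not show how df_to_csr() is implemented; `dense_to_csr` is a hypothetical helper used purely to illustrate the format.

```python
from typing import List, Tuple


def dense_to_csr(rows: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense row-major matrix to CSR arrays (data, indices, indptr).

    data:    non-zero values, row by row
    indices: column index of each stored value
    indptr:  indptr[i]:indptr[i + 1] is the slice of data belonging to row i
    """
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


dense = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 3.0],
]
data, indices, indptr = dense_to_csr(dense)
```

Only three of the nine entries are stored, which is where the speed and memory benefit for large sparse inputs comes from.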

get_coeffs(output: Optional[str] = 'dataframe') Union[Dict[str, float], pandas.DataFrame][source]#

Returns model’s coefficients in different formats.

Parameters:

output (str, optional) – Output format with possible values of “dataframe” and “dict”, by default “dataframe”

Returns:

Union[Dict[str, float], pd.DataFrame]

get_cv_results() pandas.DataFrame[source]#

Returns model’s cross-validation results.

See also

get_results()

Returns:

pd.DataFrame

get_intercept() float[source]#

Returns the model’s intercept.

Returns:

float

get_params() Dict[str, Any][source]#

Returns model’s parameters.

Returns:

Dict[str, Any]

get_results() Dict[str, Any][source]#

Returns model’s total results.

See also

get_cv_results()

Returns:

Dict[str, Any]

get_shap_explainer() shap.LinearExplainer[source]#

Returns shap.LinearExplainer object.

Returns:

shap.LinearExplainer

plot_coeff_path(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), linestyle: Optional[str] = '-', fontsize: Optional[Union[int, float]] = 12, grid: Optional[bool] = True, legend: Optional[bool] = True, legendloc: Optional[Union[int, str]] = 'center', xlabel: Optional[str] = None, ylabel: Optional[str] = 'Coefficients', title: Optional[str] = None, bbox_to_anchor: Tuple[float, float] = (1.1, 0.5), yscale: Optional[str] = 'linear', save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the GLMNet coefficients’ paths.

Parameters:
  • figsize (tuple, optional) – Figure size, by default (8, 5)

  • linestyle (str, optional) – Linestyle of paths, by default “-”

  • fontsize (Union[int, float], optional) – Fontsize of the title. The fontsizes of xlabel, ylabel, tick_params, and legend are resized with 0.85, 0.85, 0.75, and 0.85 fraction of title fontsize, respectively, by default 12

  • grid (bool, optional) – Whether to show (x,y) grid on the plot or not, by default True

  • legend (bool, optional) – Whether to show legend on the plot or not, by default True

  • legendloc (Union[int, str], optional) – Location of legend, by default “center”

  • xlabel (str, optional) – Xlabel of the plot, by default “-Log(Lambda)”

  • ylabel (str, optional) – Ylabel of the plot, by default “Coefficients”

  • title (str, optional) – Title of the plot, by default “Best {lambda_best} with {n} Features”

  • yscale (str, optional) – Scale for y-axis (coefficients). Possible options are "linear", "log", "symlog", "logit" [yscale], by default “linear”

  • bbox_to_anchor (Tuple[float, float], optional) – Relative coordinates for legend location outside of the plot, by default (1.1, 0.5)

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

  • **kwargs (Dict[str, Any]) – Key-value pairs of results. results_ attribute can be used

Returns:

Figure, optional
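The fontsize description above implies that the other text elements are derived from the title size by fixed fractions. A minimal sketch of that arithmetic, assuming the fractions are applied multiplicatively (the helper name `derived_font_sizes` is hypothetical):

```python
from typing import Dict, Union


def derived_font_sizes(title_fontsize: Union[int, float] = 12) -> Dict[str, float]:
    """Font sizes derived from the title size using the documented fractions."""
    return {
        "title": float(title_fontsize),
        "xlabel": 0.85 * title_fontsize,
        "ylabel": 0.85 * title_fontsize,
        "tick_params": 0.75 * title_fontsize,
        "legend": 0.85 * title_fontsize,
    }


sizes = derived_font_sizes(12)
```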

plot_cv_results(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 5, color: Optional[str] = 'red', errorbarcolor: Optional[str] = 'black', maxlambdacolor: Optional[str] = 'purple', bestlambdacolor: Optional[str] = 'navy', linestyle: Optional[str] = '--', fontsize: Optional[Union[int, float]] = 12, grid: Optional[bool] = True, legend: Optional[bool] = True, legendloc: Optional[Union[int, str]] = 'best', xlabel: Optional[str] = None, ylabel: Optional[str] = None, title: Optional[str] = None, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the GLMNet cross-validation results.

Notes

This plotting function can be used along with the results_ attribute of either the GLMNetCVClassifier or GLMNetCVRegressor class as kwargs.

Parameters:
  • figsize (tuple, optional) – Figure size, by default (8, 5)

  • marker (str, optional) – Marker style of the metric to distinguish the error bars. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 5

  • color (str, optional) – Line and marker color, by default “red”

  • errorbarcolor (str, optional) – Error bar color, by default “black”

  • maxlambdacolor (str, optional) – Color of vertical line for lambda_max_, by default “purple”

  • bestlambdacolor (str, optional) – Color of vertical line for lambda_best_, by default “navy”

  • linestyle (str, optional) – Linestyle of vertical lambda lines, by default “--”

  • fontsize (Union[int, float], optional) – Fontsize of the title. The fontsizes of xlabel, ylabel, tick_params, and legend are resized with 0.85, 0.85, 0.75, and 0.85 fraction of title fontsize, respectively, by default 12

  • grid (bool, optional) – Whether to show (x,y) grid on the plot or not, by default True

  • legend (bool, optional) – Whether to show legend on the plot or not, by default True

  • legendloc (Union[int, str], optional) – Location of legend, by default “best”

  • xlabel (str, optional) – Xlabel of the plot, by default “-Log(Lambda)”

  • ylabel (str, optional) – Ylabel of the plot, by default “{n_splits}-Folds CV Mean {metric}”

  • title (str, optional) – Title of the plot, by default “Best {lambda_best} with {n} Features”

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

  • **kwargs (Dict[str, Any]) – Key-value pairs of results. results_ attribute can be used

Returns:

Figure, optional

plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None[source]#

Visualizes shap beeswarm plot as summary of shapley values.

Notes

This is a helper function to plot the shap summary plot based on all types of shap.Explainer, including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure X_test is instantiated, or with validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True

  • plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”

  • figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”

  • color (str, optional) – Color of plots when plot_type="violin" and plot_type="layered_violin" are “RdBl” color-map while color of the horizontal lines when plot_type="bar" is “#D0AAF3”, by default None

  • cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" and plot_type="layered_violin", by default “RdBl”

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None

  • layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10

  • title (str, optional) – Title of the plot, by default None

  • sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True

  • color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True

  • class_names (List[str], optional) – List of class names for multi-output problems, by default None

  • class_inds (List[int], optional) – List of class indices for multi-output problems, by default None

  • color_bar_label (str, optional) – Label for color bar, by default “Feature Value”

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

Returns:

None

plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the Shapley values as a waterfall plot.

Notes

Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it can be easily seen how much explainability is achieved with each feature. Note that this function should be run after predict_proba() to make sure X_test is instantiated, or with validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True

  • figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)

  • bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”

  • bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5

  • line_color (str, optional) – Color of the line plot, by default “purple”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 7

  • markeredgecolor (str, optional) – Marker edge color, by default “purple”

  • markerfacecolor (str, optional) – Marker face color, by default “purple”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • title (str, optional) – Title of the plot, by default None

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional
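One plausible reading of "cumulative ratios of shap values per feature" is sketched below: each feature's share of the total mean absolute Shapley value, ranked and accumulated. The use of mean absolute values as the per-feature score is an assumption, and `waterfall_ratios` is a hypothetical helper; the plotting code itself is not reproduced here.

```python
from typing import List, Sequence, Tuple


def waterfall_ratios(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    max_display: int = 20,
) -> List[Tuple[str, float, float]]:
    """Return (feature, ratio, cumulative_ratio), ranked by ratio descending.

    Assumes mean |SHAP| per feature as the importance score.
    """
    n_samples = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    ranked = sorted(zip(feature_names, mean_abs), key=lambda p: p[1], reverse=True)
    result = []
    running = 0.0
    for name, value in ranked[:max_display]:
        ratio = value / total if total else 0.0
        running += ratio
        result.append((name, ratio, running))
    return result


# Made-up Shapley values: 2 samples x 3 features.
shap_values = [[0.2, -0.1, 0.0], [-0.4, 0.1, 0.1]]
waterfall = waterfall_ratios(shap_values, ["f0", "f1", "f2"])
```

Reading the cumulative column top to bottom shows how many features are needed to cover a given share of the total attribution.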

predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5, lamb: Optional[numpy.ndarray] = None) numpy.ndarray[source]#

Returns the prediction classes based on the threshold.

Notes

The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under sensitivity-specificity curve, and maximizing the area under precision-recall curve, by using BinaryClassificationMetrics.

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

  • threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_, where any value that satisfies y_pred_proba_ >= threshold will be set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5

  • lamb (np.ndarray, optional) – Values with shape (n_lambda,) of lambda from lambda_path_ from which to make predictions. If no values are provided (None), the returned predictions will be those corresponding to lambda_best_. The values of lamb must also be in the range of lambda_path_; values greater than max(lambda_path_) or less than min(lambda_path_) will be clipped

Returns:

np.ndarray
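The inclusive threshold and the Youden Index mentioned in the notes can be illustrated without any dependency. The sketch below uses the textbook definition J = sensitivity + specificity - 1 and scans the observed probabilities as candidate thresholds; it is not the BinaryClassificationMetrics implementation, and both function names are hypothetical.

```python
from typing import List, Sequence


def binarize(y_pred_proba: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Inclusive binarization: proba >= threshold maps to the positive class."""
    return [1 if p >= threshold else 0 for p in y_pred_proba]


def youden_threshold(y_true: Sequence[int], y_pred_proba: Sequence[float]) -> float:
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    best_threshold, best_j = 0.5, float("-inf")
    for t in sorted(set(y_pred_proba)):
        y_pred = binarize(y_pred_proba, t)
        tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
        tn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)
        fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold


# Made-up labels and positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_pred_proba = [0.1, 0.4, 0.35, 0.8, 0.5, 0.7]

default_classes = binarize(y_pred_proba)             # a score of exactly 0.5 becomes class 1
best_t = youden_threshold(y_true, y_pred_proba)
tuned_classes = binarize(y_pred_proba, best_t)
```

In this toy data the default threshold produces one false positive (the sample scored exactly 0.5), while the Youden-selected threshold removes it.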

predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, lamb: Optional[numpy.ndarray] = None) numpy.ndarray[source]#

Returns the prediction probabilities for the positive class.

Notes

predict_proba() only reports the probability of the positive class, whereas the sklearn API returns probabilities for both classes, and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference).

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

  • lamb (np.ndarray, optional) – Values with shape (n_lambda,) of lambda from lambda_path_ from which to make predictions. If no values are provided (None), the returned predictions will be those corresponding to lambda_best_. The values of lamb must also be in the range of lambda_path_; values greater than max(lambda_path_) or less than min(lambda_path_) will be clipped

Returns:

np.ndarray
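Two behaviours described above are easy to show in isolation: clipping requested lambda values to the fitted path, and reducing a two-column (sklearn-style) probability matrix to the positive-class column. This is a stdlib-only sketch with hypothetical helper names, not the library's code path.

```python
from typing import List, Sequence


def clip_lambdas(lamb: Sequence[float], lambda_path: Sequence[float]) -> List[float]:
    """Clip each requested lambda into [min(lambda_path), max(lambda_path)]."""
    low, high = min(lambda_path), max(lambda_path)
    return [min(max(value, low), high) for value in lamb]


def positive_class(proba_2d: Sequence[Sequence[float]]) -> List[float]:
    """Equivalent of pred_proba[:, 1] for a list of [p_negative, p_positive] rows."""
    return [row[1] for row in proba_2d]


lambda_path = [1.0, 0.5, 0.1, 0.01]            # decreasing, as required for lambda_path
clipped = clip_lambdas([5.0, 0.2, 0.0001], lambda_path)

sklearn_style = [[0.9, 0.1], [0.3, 0.7]]
positive_only = positive_class(sklearn_style)
```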

score(X, y, sample_weight=None)#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score (float) – Mean accuracy of self.predict(X) wrt. y.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

class slickml.classification.XGBoostCVClassifier[source]#

Bases: slickml.classification._xgboost.XGBoostClassifier

XGBoost CV Classifier.

This is a wrapper using XGBoostClassifier to train an XGBoost [xgboost-api] model using the optimum number of boosting rounds from the inputs. It uses xgboost.cv() with n-fold cross-validation and trains the model based on the best number of boosting rounds to avoid over-fitting.

Parameters:
  • num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200

  • n_splits (int, optional) – Number of folds for cross-validation, by default 4

  • metrics (str, optional) – Metrics to be tracked at cross-validation fitting time with possible values of “auc”, “aucpr”, “error”, “logloss”. Note this is different than eval_metric that needs to be passed to params dict, by default “auc”

  • early_stopping_rounds (int, optional) – The criterion to early abort the xgboost.cv() phase if the test metric is not improved, by default 20

  • random_state (int, optional) – Random seed number, by default 1367

  • stratified (bool, optional) – Whether to use stratification of the targets when running xgboost.cv() to find the best number of boosting rounds at each fold of each iteration, by default True

  • shuffle (bool, optional) – Whether to shuffle data to have the ability of building stratified folds in xgboost.cv(), by default True

  • sparse_matrix (bool, optional) – Whether to convert the input features to sparse matrix with csr format or not. This would increase the speed of feature selection for relatively large/sparse datasets. Conversely, it would act as an unoptimized solution for a dense feature matrix. Additionally, this feature cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn the feature matrix into a dense matrix. Therefore, our API bans this combination, by default False

  • scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean must be False when using sparse_matrix=True, since centering the feature matrix would decrease the sparsity, and in practice using the sparse matrix method would then make no sense and would make things worse. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”

  • params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster, by default {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1}

  • verbose (bool, optional) – Whether to log the final results of xgboost.cv(), by default True

  • callbacks (bool, optional) – Whether to log the standard deviation of metrics on train data and track the early stopping criterion, by default False
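The restriction on combining sparse_matrix=True with scale_mean=True follows from simple arithmetic: subtracting a non-zero column mean turns every zero entry into a non-zero one. A small stdlib-only demonstration (helper names are hypothetical, and this is not how StandardScaler is invoked internally):

```python
from typing import List


def center_columns(rows: List[List[float]]) -> List[List[float]]:
    """Subtract each column's mean from its entries (mean-centering)."""
    n_rows = len(rows)
    means = [sum(column) / n_rows for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def count_zeros(rows: List[List[float]]) -> int:
    """Number of exactly-zero entries, i.e. entries a sparse format can skip."""
    return sum(1 for row in rows for value in row if value == 0)


sparse_like = [
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
centered = center_columns(sparse_like)

zeros_before = count_zeros(sparse_like)
zeros_after = count_zeros(centered)
```

Before centering, 9 of the 12 entries are zero; afterwards none are, so a CSR representation would store every entry plus its index overhead.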

fit(X_train, y_train)[source]#

Fits a XGBoost.Booster to input training data. The proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std

predict_proba(X_test, y_test)#

Returns prediction probabilities for the positive class. predict_proba() only reports the probability of the positive class, whereas the sklearn API returns probabilities for both classes, and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference)

predict(X_test, y_test, threshold=0.5)#

Returns prediction classes based on the threshold. The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under sensitivity-specificity curve, and maximizing the area under precision-recall curve, by using BinaryClassificationMetrics

get_cv_results()[source]#

Returns the mean value of the metrics in n_splits cross-validation for each boosting round

get_params()#

Returns final set of train parameters. The default set of parameters will be updated with the new ones that are passed to params

get_default_params()#

Returns the default set of train parameters. The default set of parameters will be used when params=None

get_feature_importance()#

Returns the feature importance of the trained booster based on the given importance_type

get_shap_explainer()#

Returns the shap.TreeExplainer

plot_cv_results()[source]#

Visualizes cross-validation results

plot_shap_summary()#

Visualizes Shapley values summary plot

plot_shap_waterfall()#

Visualizes Shapley values waterfall plot

cv_results_#

The mean value of the metrics in n_splits cross-validation for each boosting round

Type:

pd.DataFrame

feature_importance_#

Features importance based on the given importance_type

Type:

pd.DataFrame

scaler_#

Standardization object when scale_mean=True or scale_std=True unless it is None

Type:

StandardScaler, optional

X_train_#

Fitted and transformed features when scale_mean=True or scale_std=True. Otherwise, it will be the same as the passed X_train features

Type:

pd.DataFrame

X_test_#

Transformed features when scale_mean=True or scale_std=True using clf.scaler_ that has been fitted on X_train and y_train data. Otherwise, it will be the same as the passed X_test features

Type:

pd.DataFrame

dtrain_#

Training data matrix via xgboost.DMatrix(clf.X_train_, clf.y_train)

Type:

xgb.DMatrix

dtest_#

Testing data matrix via xgboost.DMatrix(clf.X_test_, clf.y_test) or xgboost.DMatrix(clf.X_test_, None) when y_test is not available in inference

Type:

xgb.DMatrix

shap_values_train_#

Shapley values from TreeExplainer using X_train_

Type:

np.ndarray

shap_values_test_#

Shapley values from TreeExplainer using X_test_

Type:

np.ndarray

shap_explainer_#

Shap TreeExplainer object

Type:

shap.TreeExplainer

model_#

XGBoost Booster object

Type:

xgboost.Booster

References

__slots__ = []#
callbacks :Optional[bool] = False#
early_stopping_rounds :Optional[int] = 20#
importance_type :Optional[str] = total_gain#
metrics :Optional[str] = auc#
n_splits :Optional[int] = 4#
num_boost_round :Optional[int] = 200#
params :Optional[Dict[str, Union[str, float, int]]]#
random_state :Optional[int] = 1367#
scale_mean :Optional[bool] = False#
scale_std :Optional[bool] = False#
shuffle :Optional[bool] = True#
sparse_matrix :Optional[bool] = False#
stratified :Optional[bool] = True#
verbose :Optional[bool] = True#
__getstate__()#
__post_init__() None[source]#

Post instantiation validations and assignments.

__repr__(N_CHAR_MAX=700)#

Return repr(self).

__setstate__(state)#
fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None[source]#

Fits a XGBoost.Booster to input training data based on the best number of boosting rounds.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)

  • y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)

See also

xgboost.cv() xgboost.train()

Returns:

None
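The idea of choosing "the best number of boosting rounds" with an early-stopping window can be sketched as follows. This is a simplified illustration over a list of per-round mean test metrics, assuming a higher-is-better metric such as "auc"; it does not call xgboost.cv(), and `best_boosting_round` is a hypothetical helper.

```python
from typing import Sequence


def best_boosting_round(
    test_metric_means: Sequence[float],
    early_stopping_rounds: int = 20,
    maximize: bool = True,
) -> int:
    """Return the number of rounds to train (1-based) under early stopping.

    Scanning stops once the metric has not improved for
    early_stopping_rounds consecutive rounds.
    """
    best_index, best_score = 0, None
    for index, score in enumerate(test_metric_means):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_index, best_score = index, score
        elif index - best_index >= early_stopping_rounds:
            break
    return best_index + 1


# Made-up mean test AUC per boosting round.
cv_test_auc = [0.70, 0.74, 0.77, 0.78, 0.78, 0.775, 0.77, 0.79]

rounds_short_patience = best_boosting_round(cv_test_auc, early_stopping_rounds=3)
rounds_long_patience = best_boosting_round(cv_test_auc, early_stopping_rounds=20)
```

With a patience of 3 the scan stops before it sees the late improvement at round 8, which shows the trade-off behind early_stopping_rounds: a small window saves computation but can stop short of a later gain.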

get_cv_results() pandas.DataFrame[source]#

Returns cross-validation results.

Returns:

pd.DataFrame

get_default_params() Dict[str, Union[str, float, int]]#

Returns the default set of train parameters.

The default set of parameters will be used when params=None.

See also

get_params()

Returns:

Dict[str, Union[str, float, int]]

get_feature_importance() pandas.DataFrame#

Returns the feature importance of the trained booster based on the given importance_type.

Returns:

pd.DataFrame

get_params() Optional[Dict[str, Union[str, float, int]]]#

Returns the final set of train parameters.

The default set of parameters is updated with the new values passed to params.

Returns:

Dict[str, Union[str, float, int]]

get_shap_explainer() shap.TreeExplainer#

Returns the shap.TreeExplainer object.

Returns:

shap.TreeExplainer

plot_cv_results(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), linestyle: Optional[str] = '--', train_label: Optional[str] = 'Train', test_label: Optional[str] = 'Test', train_color: Optional[str] = 'navy', train_std_color: Optional[str] = '#B3C3F3', test_color: Optional[str] = 'purple', test_std_color: Optional[str] = '#D0AAF3', save_path: Optional[str] = None, display_plot: Optional[bool] = False, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the cross-validation results and evolution of metrics through number of boosting rounds.

Parameters:
  • cv_results (pd.DataFrame) – Cross-validation results

  • figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)

  • linestyle (str, optional) – Style of lines [linestyles-api], by default “--”

  • train_label (str, optional) – Label in the figure legend for the train line, by default “Train”

  • test_label (str, optional) – Label in the figure legend for the test line, by default “Test”

  • train_color (str, optional) – Color of the training line, by default “navy”

  • train_std_color (str, optional) – Edge color of the training std bars, by default “#B3C3F3”

  • test_color (str, optional) – Color of the testing line, by default “purple”

  • test_std_color (str, optional) – Edge color of the testing std bars, by default “#D0AAF3”

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default False

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional

plot_feature_importance(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), color: Optional[str] = '#87CEEB', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 10, markeredgecolor: Optional[str] = '#1F77B4', markerfacecolor: Optional[str] = '#1F77B4', markeredgewidth: Optional[Union[int, float]] = 1, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure]#

Visualizes the XGBoost feature importance as a bar chart.

Parameters:
  • feature importance (pd.DataFrame) – Feature importance (feature_importance_ attribute)

  • figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)

  • color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 10

  • markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”

  • markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional

plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None#

Visualizes the shap beeswarm plot as a summary of Shapley values.

Notes

This is a helper function to plot the shap summary plot based on all types of shap.Explainer including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure X_test is instantiated; otherwise, set validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True

  • plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”

  • figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”

  • color (str, optional) – Color of the plots. When plot_type="violin" or plot_type="layered_violin", the “RdBl” color-map is used, while the color of the horizontal lines when plot_type="bar" is “#D0AAF3”, by default None

  • cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" or plot_type="layered_violin", by default “RdBl”

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None

  • layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10

  • title (str, optional) – Title of the plot, by default None

  • sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True

  • color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True

  • class_names (List[str], optional) – List of class names for multi-output problems, by default None

  • class_inds (List[int], optional) – List of class indices for multi-output problems, by default None

  • color_bar_label (str, optional) – Label for color bar, by default “Feature Value”

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

Returns:

None

plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure]#

Visualizes the Shapley values as a waterfall plot.

Notes

Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it shows how much explainability is achieved as each feature is added. Note that this function should be run after predict_proba() to make sure X_test is instantiated; otherwise, set validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True

  • figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)

  • bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”

  • bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5

  • line_color (str, optional) – Color of the line plot, by default “purple”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 7

  • markeredgecolor (str, optional) – Marker edge color, by default “purple”

  • markerfacecolor (str, optional) – Marker face color, by default “purple”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • title (str, optional) – Title of the plot, by default None

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional
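The cumulative ratios described in the Notes above can be sketched with NumPy. This is a minimal illustration and not necessarily how SlickML computes them: it assumes that the per-feature contribution is the mean absolute Shapley value, and shap_values and feature_names are made-up toy inputs.

```python
import numpy as np

# Toy Shapley values: 4 samples x 3 features (hypothetical numbers).
shap_values = np.array(
    [
        [0.4, -0.1, 0.0],
        [-0.6, 0.2, 0.1],
        [0.5, -0.3, 0.0],
        [-0.5, 0.2, -0.1],
    ]
)
feature_names = ["f0", "f1", "f2"]

# Assumed contribution per feature: mean absolute Shapley value.
contribution = np.abs(shap_values).mean(axis=0)

# Sort features from most to least important, then accumulate their share.
order = np.argsort(contribution)[::-1]
ratios = contribution[order] / contribution.sum()
cumulative = np.cumsum(ratios)

for idx, share, total in zip(order, ratios, cumulative):
    print(f"{feature_names[idx]}: share={share:.2f}, cumulative={total:.2f}")
```

The last cumulative value is always 1.0, so the curve shows how many features are needed to reach a given share of the total.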

predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5) numpy.ndarray#

Returns the prediction classes based on the threshold.

Notes

The default threshold=0.5 might not give the best results. Optimum thresholds can be found with BinaryClassificationMetrics based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve.

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

  • threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_ where any value that satisfies y_pred_proba_ >= threshold will be set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5

Returns:

np.ndarray

predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None) numpy.ndarray#

Returns the prediction probabilities for the positive class.

Notes

predict_proba() only reports the probability of the positive class, whereas the sklearn API returns the probabilities of both classes and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference).

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

Returns:

np.ndarray
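To make the contrast with the scikit-learn convention concrete, the sketch below uses a hand-written two-column array in place of a real predict_proba output. The name sklearn_style_proba is hypothetical and the numbers are made up.

```python
import numpy as np

# scikit-learn classifiers return one column per class: [P(class=0), P(class=1)].
sklearn_style_proba = np.array(
    [
        [0.9, 0.1],
        [0.3, 0.7],
        [0.5, 0.5],
    ]
)

# Slicing the second column keeps only the positive-class probabilities,
# which is what predict_proba() in this package is documented to report.
positive_proba = sklearn_style_proba[:, 1]

print(positive_proba.shape)  # (3,)
print(positive_proba)
```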

score(X, y, sample_weight=None)#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score (float) – Mean accuracy of self.predict(X) wrt. y.
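As a rough illustration of the quantity that score() reports, mean accuracy can be computed by hand with NumPy. The labels and weights below are made-up toy values.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Mean accuracy: fraction of samples whose predicted label matches the truth.
matches = y_true == y_pred
accuracy = np.mean(matches)
print(accuracy)  # 0.6

# With sample weights, each match counts in proportion to its weight.
sample_weight = np.array([1, 1, 2, 1, 1])
weighted_accuracy = np.average(matches, weights=sample_weight)
print(weighted_accuracy)  # 0.5
```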

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

class slickml.classification.XGBoostClassifier[source]#

Bases: slickml.base.BaseXGBoostEstimator, sklearn.base.ClassifierMixin

XGBoost Classifier.

This is a wrapper using the XGBoost classifier to train an XGBoost [xgboost-api] model using the number of boosting rounds from the inputs. This is also the base class for XGBoostCVClassifier.

Parameters:
  • num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200

  • sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix in CSR format. This would increase the speed of feature selection for relatively large/sparse datasets, but it would be a suboptimal choice for a dense feature matrix. Additionally, this parameter cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn it into a dense matrix. Therefore, the API does not allow this combination, by default False

  • scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean must be False when using sparse_matrix=True, since centering the feature matrix would decrease the sparsity and defeat the purpose of using a sparse matrix. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”

  • params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster, by default {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1}
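The restriction on combining sparse_matrix=True with scale_mean=True can be seen in a small NumPy sketch. The toy matrix below is made up. The point is only that centering replaces zeros with non-zero values, while scaling by the standard deviation leaves zeros as zeros.

```python
import numpy as np

# Toy feature matrix in which most entries are zero.
X = np.array(
    [
        [0.0, 2.0, 0.0],
        [0.0, 0.0, 3.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
)

# Centering (what scale_mean=True asks for) subtracts the column means.
X_centered = X - X.mean(axis=0)

# Scaling to unit variance without centering (scale_std=True only).
X_scaled = X / X.std(axis=0)

print(np.count_nonzero(X))           # 3
print(np.count_nonzero(X_centered))  # 12
print(np.count_nonzero(X_scaled))    # 3
```

After centering, every entry is non-zero, so a sparse storage format no longer saves anything.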

fit(X_train, y_train)[source]#

Fits an XGBoost.Booster to the input training data. The proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std

predict_proba(X_test, y_test)[source]#

Returns prediction probabilities for the positive class. predict_proba() only reports the probability of the positive class, whereas the sklearn API returns the probabilities of both classes and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference)

predict(X_test, y_test, threshold=0.5)[source]#

Returns prediction classes based on the threshold. The default threshold=0.5 might not give the best results. Optimum thresholds can be found with BinaryClassificationMetrics based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve

get_params()[source]#

Returns the final set of train parameters. The default set of parameters is updated with the new values passed to params

get_default_params()[source]#

Returns the default set of train parameters. The default set of parameters will be used when params=None

get_feature_importance()[source]#

Returns the feature importance of the trained booster based on the given importance_type

get_shap_explainer()[source]#

Returns the shap.TreeExplainer

plot_shap_summary()[source]#

Visualizes Shapley values summary plot

plot_shap_waterfall()[source]#

Visualizes Shapley values waterfall plot

feature_importance_#

Feature importance based on the given importance_type

Type:

pd.DataFrame

scaler_#

Standardization object when scale_mean=True or scale_std=True; otherwise it is None

Type:

StandardScaler, optional

X_train_#

Fitted and transformed features when scale_mean=True or scale_std=True. Otherwise, it will be the same as the passed X_train features

Type:

pd.DataFrame

X_test_#

Transformed features when scale_mean=True or scale_std=True using clf.scaler_ that has been fitted on X_train and y_train data. Otherwise, it will be the same as the passed X_test features

Type:

pd.DataFrame

dtrain_#

Training data matrix via xgboost.DMatrix(clf.X_train_, clf.y_train)

Type:

xgb.DMatrix

dtest_#

Testing data matrix via xgboost.DMatrix(clf.X_test_, clf.y_test) or xgboost.DMatrix(clf.X_test_, None) when y_test is not available in inference

Type:

xgb.DMatrix

shap_values_train_#

Shapley values from TreeExplainer using X_train_

Type:

np.ndarray

shap_values_test_#

Shapley values from TreeExplainer using X_test_

Type:

np.ndarray

shap_explainer_#

Shap TreeExplainer object

Type:

shap.TreeExplainer

model_#

XGBoost Booster object

Type:

xgboost.Booster

References

__slots__ = []#
importance_type :Optional[str] = total_gain#
num_boost_round :Optional[int] = 200#
params :Optional[Dict[str, Union[str, float, int]]]#
scale_mean :Optional[bool] = False#
scale_std :Optional[bool] = False#
sparse_matrix :Optional[bool] = False#
__getstate__()#
__post_init__() None[source]#

Post instantiation validations and assignments.

__repr__(N_CHAR_MAX=700)#

Return repr(self).

__setstate__(state)#
fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None[source]#

Fits an XGBoost.Booster to the input training data.

Notes

The proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)

  • y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)

See also

xgboost.train()

Returns:

None

get_default_params() Dict[str, Union[str, float, int]][source]#

Returns the default set of train parameters.

The default set of parameters will be used when params=None.

See also

get_params()

Returns:

Dict[str, Union[str, float, int]]

get_feature_importance() pandas.DataFrame[source]#

Returns the feature importance of the trained booster based on the given importance_type.

Returns:

pd.DataFrame

get_params() Optional[Dict[str, Union[str, float, int]]][source]#

Returns the final set of train parameters.

The default set of parameters is updated with the new values passed to params.

Returns:

Dict[str, Union[str, float, int]]
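The relationship between get_default_params() and get_params() can be pictured as a dictionary overlay. The sketch below is an assumption about the behavior described above, not the library's source: merge_params is a hypothetical helper, and DEFAULT_PARAMS copies the default values listed for the params argument.

```python
from typing import Dict, Optional, Union

ParamsType = Dict[str, Union[str, float, int]]

# Default values as listed for the params argument in this reference.
DEFAULT_PARAMS: ParamsType = {
    "eval_metric": "auc",
    "tree_method": "hist",
    "objective": "binary:logistic",
    "learning_rate": 0.05,
    "max_depth": 2,
    "min_child_weight": 1,
    "gamma": 0.0,
    "reg_alpha": 0.0,
    "reg_lambda": 1.0,
    "subsample": 0.9,
    "max_delta_step": 1,
    "verbosity": 0,
    "nthread": 4,
    "scale_pos_weight": 1,
}


def merge_params(params: Optional[ParamsType] = None) -> ParamsType:
    """Hypothetical helper: start from the defaults and overlay user values."""
    merged = dict(DEFAULT_PARAMS)  # copy so the defaults stay untouched
    if params is not None:
        merged.update(params)
    return merged


final_params = merge_params({"max_depth": 4, "learning_rate": 0.1})
print(final_params["max_depth"])    # 4 (overridden)
print(final_params["eval_metric"])  # auc (kept from the defaults)
```

Passing params=None in this sketch returns a copy of the defaults unchanged.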

get_shap_explainer() shap.TreeExplainer[source]#

Returns the shap.TreeExplainer object.

Returns:

shap.TreeExplainer

plot_feature_importance(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), color: Optional[str] = '#87CEEB', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 10, markeredgecolor: Optional[str] = '#1F77B4', markerfacecolor: Optional[str] = '#1F77B4', markeredgewidth: Optional[Union[int, float]] = 1, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the XGBoost feature importance as a bar chart.

Parameters:
  • feature importance (pd.DataFrame) – Feature importance (feature_importance_ attribute)

  • figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)

  • color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 10

  • markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”

  • markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional

plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None[source]#

Visualizes the shap beeswarm plot as a summary of Shapley values.

Notes

This is a helper function to plot the shap summary plot based on all types of shap.Explainer including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure X_test is instantiated; otherwise, set validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True

  • plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”

  • figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”

  • color (str, optional) – Color of the plots. When plot_type="violin" or plot_type="layered_violin", the “RdBl” color-map is used, while the color of the horizontal lines when plot_type="bar" is “#D0AAF3”, by default None

  • cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" or plot_type="layered_violin", by default “RdBl”

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None

  • layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10

  • title (str, optional) – Title of the plot, by default None

  • sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True

  • color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True

  • class_names (List[str], optional) – List of class names for multi-output problems, by default None

  • class_inds (List[int], optional) – List of class indices for multi-output problems, by default None

  • color_bar_label (str, optional) – Label for color bar, by default “Feature Value”

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

Returns:

None

plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the Shapley values as a waterfall plot.

Notes

Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it shows how much explainability is achieved as each feature is added. Note that this function should be run after predict_proba() to make sure X_test is instantiated; otherwise, set validation=False.

Parameters:
  • validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True

  • figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)

  • bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”

  • bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5

  • line_color (str, optional) – Color of the line plot, by default “purple”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 7

  • markeredgecolor (str, optional) – Marker edge color, by default “purple”

  • markerfacecolor (str, optional) – Marker face color, by default “purple”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • max_display (int, optional) – Limit to show the number of features in the plot, by default 20

  • title (str, optional) – Title of the plot, by default None

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional

predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5) numpy.ndarray[source]#

Returns the prediction classes based on the threshold.

Notes

The default threshold=0.5 might not give the best results. Optimum thresholds can be found with BinaryClassificationMetrics based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve.

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

  • threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_ where any value that satisfies y_pred_proba_ >= threshold will be set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5

Returns:

np.ndarray
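The inclusive thresholding described for the threshold parameter can be sketched in a few lines of NumPy. The probabilities below are made-up values standing in for the output of predict_proba().

```python
import numpy as np

# Hypothetical positive-class probabilities.
y_pred_proba = np.array([0.2, 0.5, 0.49, 0.8])
threshold = 0.5

# Inclusive comparison: a probability equal to the threshold maps to class 1.
y_pred = (y_pred_proba >= threshold).astype(int)
print(y_pred)  # [0 1 0 1]
```

With a strict ">" comparison, the second sample (exactly 0.5) would fall into class 0 instead.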

predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None) numpy.ndarray[source]#

Returns the prediction probabilities for the positive class.

Notes

predict_proba() only reports the probability of the positive class, whereas the sklearn API returns the probabilities of both classes and slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference).

Parameters:
  • X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)

  • y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)

Returns:

np.ndarray

score(X, y, sample_weight=None)#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score (float) – Mean accuracy of self.predict(X) wrt. y.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.
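The <component>__<parameter> form is easiest to see on a nested object. The sketch below uses scikit-learn's own Pipeline, StandardScaler, and LogisticRegression rather than a SlickML estimator; the step names "scaler" and "clf" are arbitrary choices for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

# <component>__<parameter>: update C on the step named "clf" and
# with_mean on the step named "scaler" in a single call.
pipeline.set_params(clf__C=0.1, scaler__with_mean=False)

print(pipeline.named_steps["clf"].C)             # 0.1
print(pipeline.named_steps["scaler"].with_mean)  # False
```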