slickml.classification
#
Package Contents#
Classes#
GLMNetCVClassifier | GLMNet CV Classifier.
XGBoostCVClassifier | XGBoost CV Classifier.
XGBoostClassifier | XGBoost Classifier.
- class slickml.classification.GLMNetCVClassifier[source]#
Bases:
sklearn.base.BaseEstimator
,sklearn.base.ClassifierMixin
GLMNet CV Classifier.
This is a wrapper using GLM-Net [glmnet-api] to train a regularized linear model via logistic regression and find the optimal penalty values through N-fold cross-validation. In principle, GLMNet (also known as ElasticNet) can also be used for feature selection and dimensionality reduction using the LASSO (Least Absolute Shrinkage and Selection Operator) regression part of the algorithm, while reaching a stable solution using the Ridge regression part of the algorithm.
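As a rough illustration of how alpha blends the two penalties, the sketch below evaluates the elastic-net penalty in the parameterization commonly used by glmnet, lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2). The function name elastic_net_penalty and the sample coefficients are hypothetical and are not part of slickml.

```python
from typing import Sequence


def elastic_net_penalty(coeffs: Sequence[float], lamb: float, alpha: float) -> float:
    """Elastic-net penalty: lamb * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    l1 = sum(abs(b) for b in coeffs)
    l2_squared = sum(b * b for b in coeffs)
    return lamb * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficients, for illustration only
coeffs = [0.5, -2.0, 0.0]
ridge_penalty = elastic_net_penalty(coeffs, lamb=0.1, alpha=0.0)  # only the L2 term
lasso_penalty = elastic_net_penalty(coeffs, lamb=0.1, alpha=1.0)  # only the L1 term
mixed_penalty = elastic_net_penalty(coeffs, lamb=0.1, alpha=0.5)  # equal blend
```

With alpha=0.0 only the squared (Ridge) term survives, and with alpha=1.0 only the absolute-value (LASSO) term survives, which is why the two extremes recover the classic models.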
- Parameters:
alpha (float, optional) – The stability parameter with possible values of 0 <= alpha <= 1, where alpha=0.0 and alpha=1.0 lead to classic Ridge and LASSO regression models, respectively, by default 0.5
n_lambda (int, optional) – Maximum number of penalty values to compute, by default 100
n_splits (int, optional) – Number of cross-validation folds for computing performance metrics and determining lambda_best_ and lambda_max_. If non-zero, must be at least 3, by default 3
metric (str, optional) – Metric used for model selection during cross-validation. Valid options are "accuracy", "roc_auc" (alias = "auc"), "average_precision", "precision", and "recall". The metric affects the selection of lambda_best_ and lambda_max_. Thus, fitting the same data with different metrics will result in the selection of different models, by default “auc”
scale (bool, optional) – Whether to standardize the input features to have a mean value of 0.0 and standard deviation of 1 prior to fitting. The final coefficients will be on the scale of the original data regardless of this step. Therefore, there is no need to pre-process the data when using scale=True, by default True
sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix in CSR format. This would increase the speed of feature selection for relatively large sparse datasets. This parameter cannot be used along with scale=True, since standardizing the feature matrix to have a mean value of zero would turn it into a dense matrix, by default False
fit_intercept (bool, optional) – Include an intercept term in the model, by default True
cut_point (float, optional) – The cut point to use for selecting lambda_best_. Based on this value, lambda_best_ is the largest lambda whose cross-validation score stays within cut_point standard errors of the score at lambda_max_: arg_max(lambda) for cv_score(lambda) >= cv_score(lambda_max_) - cut_point * standard_error(lambda_max_), by default 1.0
min_lambda_ratio (float, optional) – In combination with n_lambda, the ratio of the smallest and largest values of lambda computed (min_lambda/max_lambda >= min_lambda_ratio), by default 1e-4
tolerance (float, optional) – Convergence criteria tolerance, by default 1e-7
max_iter (int, optional) – Maximum passes over the data, by default 100000
random_state (int, optional) – Seed for the random number generator. The glmnet solver is not deterministic; this seed is used for determining the CV folds, by default 1367
lambda_path (Union[List[float], np.ndarray, pd.Series], optional) – In place of supplying n_lambda, provide an array of specific values to compute. The specified values must be in decreasing order. When None, the path of lambda values will be determined automatically and a maximum of n_lambda values will be computed, by default None
max_features (int, optional) – Optional maximum number of features with nonzero coefficients after regularization. If not set, defaults to the number of features (X_train.shape[1]) during fit. Note that this will be ignored if the user specifies lambda_path, by default None
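The selection rule quoted for cut_point can be sketched in a few lines. This is a minimal illustration of the arg_max rule, assuming a higher-is-better metric; it is not glmnet's implementation, and select_lambdas together with the sample scores are hypothetical.

```python
from typing import List, Tuple


def select_lambdas(
    lambda_path: List[float],
    cv_mean: List[float],
    cv_se: List[float],
    cut_point: float = 1.0,
) -> Tuple[float, float]:
    """Returns (lambda_max, lambda_best) for a higher-is-better CV score."""
    # lambda_max: the penalty with the best mean cross-validation score
    idx_max = max(range(len(cv_mean)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[idx_max] - cut_point * cv_se[idx_max]
    # lambda_best: the largest penalty whose score still clears the cutoff
    candidates = [lam for lam, score in zip(lambda_path, cv_mean) if score >= cutoff]
    return lambda_path[idx_max], max(candidates)


# Hypothetical CV summary along a decreasing lambda path
lambda_path = [1.0, 0.5, 0.1, 0.05, 0.01]
cv_mean = [0.60, 0.78, 0.80, 0.81, 0.79]
cv_se = [0.02, 0.02, 0.02, 0.02, 0.02]

lambda_max, lambda_best = select_lambdas(lambda_path, cv_mean, cv_se, cut_point=1.0)
lambda_max_0, lambda_best_0 = select_lambdas(lambda_path, cv_mean, cv_se, cut_point=0.0)
```

A larger cut_point tolerates a bigger drop in score and therefore favors a stronger penalty (a sparser model), while cut_point=0.0 makes lambda_best_ coincide with lambda_max_.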
- fit(X_train, y_train)[source]#
Fits a glmnet.LogitNet to the input training data. The proper X_train matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix and scale
- predict_proba(X_test, y_test)[source]#
Returns prediction probabilities for the positive class.
predict_proba() only reports the probability of the positive class, while the sklearn API returns probabilities for both classes, so that slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available at validation (inference) time
- predict(X_test, y_test, threshold=0.5)[source]#
Returns prediction classes based on the threshold. The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics
- plot_coeff_path():
Visualizes the coefficients’ paths
- get_params():
Returns parameters
- get_intercept():
Returns model’s intercept
- get_coeffs():
Returns non-zero coefficients
- get_cv_results():
Returns cross-validation results
- get_results():
Returns model’s total results
- X_train#
Returns training data set
- Type:
pd.DataFrame
- X_test#
Returns transformed testing data set
- Type:
pd.DataFrame
- y_train#
Returns the list of training ground truth binary values [0, 1]
- Type:
np.ndarray
- y_test#
Returns the list of testing ground truth binary values [0, 1]
- Type:
np.ndarray
- coeff_#
Returns the model’s non-zero coefficients
- Type:
pd.DataFrame
- cv_results_#
Returns the cross-validation results
- Type:
pd.DataFrame
- shap_values_train_#
Shapley values from LinearExplainer using X_train
- Type:
np.ndarray
- shap_values_test_#
Shapley values from LinearExplainer using X_test
- Type:
np.ndarray
- shap_explainer_#
Shap LinearExplainer with independent masker using X_test
- Type:
shap.LinearExplainer
- model_#
Returns fitted glmnet.LogitNet model
- Type:
glmnet.LogitNet
References
[markers-api]
- alpha :Optional[float] = 0.5#
- cut_point :Optional[float] = 1.0#
- fit_intercept :Optional[bool] = True#
- lambda_path :Optional[Union[List[float], numpy.ndarray, pandas.Series]]#
- max_features :Optional[int]#
- max_iter :Optional[int] = 100000#
- metric :Optional[str] = auc#
- min_lambda_ratio :Optional[float] = 0.0001#
- n_lambda :Optional[int] = 100#
- n_splits :Optional[int] = 3#
- random_state :Optional[int] = 1367#
- scale :Optional[bool] = True#
- sparse_matrix :Optional[bool] = False#
- tolerance :Optional[float] = 1e-07#
- __getstate__()#
- __repr__(N_CHAR_MAX=700)#
Return repr(self).
- __setstate__(state)#
- fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None [source]#
Fits a glmnet.LogitNet to the input training data.
Notes
When sparse_matrix=True, a CSR format of the input will be used via the df_to_csr() function.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)
y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)
- Returns:
None
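To make the CSR note concrete, the sketch below builds the three CSR arrays by hand and shows why mean-centering (as done by scale=True) defeats sparsity. It is a pure-Python illustration under stated assumptions: dense_to_csr and center_columns are hypothetical helpers and do not reproduce slickml's df_to_csr().

```python
from typing import List, Tuple


def dense_to_csr(rows: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Builds the three CSR arrays (data, indices, indptr) from dense rows."""
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0.0:
                data.append(value)   # stored non-zero value
                indices.append(col)  # its column index
        indptr.append(len(data))     # offset where the next row starts
    return data, indices, indptr


def center_columns(rows: List[List[float]]) -> List[List[float]]:
    """Subtracts each column mean, as mean-centering would."""
    n_rows = len(rows)
    means = [sum(col) / n_rows for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


# Hypothetical feature matrix with only two non-zero entries
X = [
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],
    [6.0, 0.0, 0.0],
]
data, indices, indptr = dense_to_csr(X)
centered_data, _, _ = dense_to_csr(center_columns(X))
n_stored_before = len(data)
n_stored_after = len(centered_data)
```

After centering, every entry of a column with a non-zero mean becomes non-zero, so the number of stored values grows and the sparse representation no longer saves anything.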
- get_coeffs(output: Optional[str] = 'dataframe') Union[Dict[str, float], pandas.DataFrame] [source]#
Returns model’s coefficients in different formats.
- Parameters:
output (str, optional) – Output format with possible values of “dataframe” and “dict”, by default “dataframe”
- Returns:
Union[Dict[str, float], pd.DataFrame]
- get_cv_results() pandas.DataFrame [source]#
Returns model’s cross-validation results.
See also
- Returns:
pd.DataFrame
- get_results() Dict[str, Any] [source]#
Returns model’s total results.
See also
- Returns:
Dict[str, Any]
- get_shap_explainer() shap.LinearExplainer [source]#
Returns the shap.LinearExplainer object.
- Returns:
shap.LinearExplainer
- plot_coeff_path(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), linestyle: Optional[str] = '-', fontsize: Optional[Union[int, float]] = 12, grid: Optional[bool] = True, legend: Optional[bool] = True, legendloc: Optional[Union[int, str]] = 'center', xlabel: Optional[str] = None, ylabel: Optional[str] = 'Coefficients', title: Optional[str] = None, bbox_to_anchor: Tuple[float, float] = (1.1, 0.5), yscale: Optional[str] = 'linear', save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the GLMNet coefficients’ paths.
- Parameters:
figsize (tuple, optional) – Figure size, by default (8, 5)
linestyle (str, optional) – Linestyle of paths, by default “-”
fontsize (Union[int, float], optional) – Fontsize of the title. The fontsizes of xlabel, ylabel, tick_params, and legend are resized with 0.85, 0.85, 0.75, and 0.85 fraction of title fontsize, respectively, by default 12
grid (bool, optional) – Whether to show (x,y) grid on the plot or not, by default True
legend (bool, optional) – Whether to show legend on the plot or not, by default True
legendloc (Union[int, str], optional) – Location of legend, by default “center”
xlabel (str, optional) – Xlabel of the plot, by default “-Log(Lambda)”
ylabel (str, optional) – Ylabel of the plot, by default “Coefficients”
title (str, optional) – Title of the plot, by default “Best {lambda_best} with {n} Features”
yscale (str, optional) – Scale for y-axis (coefficients). Possible options are "linear", "log", "symlog", "logit" [yscale], by default “linear”
bbox_to_anchor (Tuple[float, float], optional) – Relative coordinates for legend location outside of the plot, by default (1.1, 0.5)
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
**kwargs (Dict[str, Any]) – Key-value pairs of results. The results_ attribute can be used
- Returns:
Figure, optional
- plot_cv_results(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 5, color: Optional[str] = 'red', errorbarcolor: Optional[str] = 'black', maxlambdacolor: Optional[str] = 'purple', bestlambdacolor: Optional[str] = 'navy', linestyle: Optional[str] = '--', fontsize: Optional[Union[int, float]] = 12, grid: Optional[bool] = True, legend: Optional[bool] = True, legendloc: Optional[Union[int, str]] = 'best', xlabel: Optional[str] = None, ylabel: Optional[str] = None, title: Optional[str] = None, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the GLMNet cross-validation results.
Notes
This plotting function can be used along with the results_ attribute of either the GLMNetCVClassifier or GLMNetCVRegressor class as kwargs.
- Parameters:
figsize (tuple, optional) – Figure size, by default (8, 5)
marker (str, optional) – Marker style of the metric to distinguish the error bars. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 5
color (str, optional) – Line and marker color, by default “red”
errorbarcolor (str, optional) – Error bar color, by default “black”
maxlambdacolor (str, optional) – Color of vertical line for lambda_max_, by default “purple”
bestlambdacolor (str, optional) – Color of vertical line for lambda_best_, by default “navy”
linestyle (str, optional) – Linestyle of vertical lambda lines, by default “--”
fontsize (Union[int, float], optional) – Fontsize of the title. The fontsizes of xlabel, ylabel, tick_params, and legend are resized with 0.85, 0.85, 0.75, and 0.85 fraction of title fontsize, respectively, by default 12
grid (bool, optional) – Whether to show (x,y) grid on the plot or not, by default True
legend (bool, optional) – Whether to show legend on the plot or not, by default True
legendloc (Union[int, str], optional) – Location of legend, by default “best”
xlabel (str, optional) – Xlabel of the plot, by default “-Log(Lambda)”
ylabel (str, optional) – Ylabel of the plot, by default “{n_splits}-Folds CV Mean {metric}”
title (str, optional) – Title of the plot, by default “Best {lambda_best} with {n} Features”
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
**kwargs (Dict[str, Any]) – Key-value pairs of results. The results_ attribute can be used
- Returns:
Figure, optional
- plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None [source]#
Visualizes shap beeswarm plot as summary of shapley values.
Notes
This is a helper function to plot the shap summary plot based on all types of shap.Explainer, including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure X_test has been instantiated, or with validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True
plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”
figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”
color (str, optional) – Color of the plot. With plot_type="violin" or plot_type="layered_violin" the “RdBl” color-map is used, while with plot_type="bar" the horizontal lines use “#D0AAF3”, by default None
cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" or plot_type="layered_violin", by default “RdBl”
max_display (int, optional) – Limit to show the number of features in the plot, by default 20
feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None
layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10
title (str, optional) – Title of the plot, by default None
sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True
color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True
class_names (List[str], optional) – List of class names for multi-output problems, by default None
class_inds (List[int], optional) – List of class indices for multi-output problems, by default None
color_bar_label (str, optional) – Label for color bar, by default “Feature Value”
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
- Returns:
None
- plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the Shapley values as a waterfall plot.
Notes
Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it can be easily seen how much explainability is achieved with each additional feature. Note that this function should be run after predict_proba() to make sure X_test has been instantiated, or with validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test. When validation=False, Shap values are calculated using X_train, by default True
figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)
bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”
bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5
line_color (str, optional) – Color of the line plot, by default “purple”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 7
markeredgecolor (str, optional) – Marker edge color, by default “purple”
markerfacecolor (str, optional) – Marker face color, by default “purple”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
max_display (int, optional) – Limit to show the number of features in the plot, by default 20
title (str, optional) – Title of the plot, by default None
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
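The cumulative ratios behind the waterfall plot can be sketched as follows. This is an illustration only: it assumes that a feature's contribution is its mean absolute Shapley value, which the text above does not state, and cumulative_shap_ratios with its sample data is hypothetical rather than slickml's implementation.

```python
from typing import List, Tuple


def cumulative_shap_ratios(
    shap_values: List[List[float]], feature_names: List[str]
) -> List[Tuple[str, float]]:
    """Returns (feature, cumulative ratio) pairs, most important feature first."""
    n_rows = len(shap_values)
    # Assumption: a feature's importance is its mean absolute Shapley value
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    running = 0.0
    result: List[Tuple[str, float]] = []
    for name, value in ranked:
        running += value
        result.append((name, running / total))  # share explained so far
    return result


# Hypothetical Shapley values: 2 samples x 3 features
shap_values = [[0.5, -0.25, 0.0], [-1.5, 0.25, 0.5]]
ratios = cumulative_shap_ratios(shap_values, ["age", "income", "tenure"])
```

Reading the result from left to right shows how quickly the cumulative share approaches 1.0, which is the curve the waterfall plot draws on top of the per-feature bars.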
- predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5, lamb: Optional[numpy.ndarray] = None) numpy.ndarray [source]#
Returns the prediction classes based on the threshold.
Notes
The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics.
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_, where any value that satisfies y_pred_proba_ >= threshold is set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5
lamb (np.ndarray, optional) – Values with shape (n_lambda,) of lambda from lambda_path_ from which to make predictions. If no values are provided (None), the returned predictions will be those corresponding to lambda_best_. The values of lamb must also be in the range of lambda_path_; values greater than max(lambda_path_) or less than min(lambda_path_) will be clipped
- Returns:
np.ndarray
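The inclusive thresholding rule, and one of the threshold-tuning ideas mentioned in the notes, can be sketched in plain Python. The helpers binarize and youden_threshold are hypothetical; they illustrate Youden's J statistic (TPR - FPR) and are not the BinaryClassificationMetrics implementation.

```python
from typing import List


def binarize(y_pred_proba: List[float], threshold: float = 0.5) -> List[int]:
    """Assigns class 1 when probability >= threshold (inclusive)."""
    return [1 if p >= threshold else 0 for p in y_pred_proba]


def youden_threshold(y_true: List[int], y_pred_proba: List[float]) -> float:
    """Returns the candidate threshold maximizing J = TPR - FPR.

    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_threshold = -1.0, 0.5
    # Only the observed probabilities need to be tried as candidates
    for threshold in sorted(set(y_pred_proba)):
        y_pred = binarize(y_pred_proba, threshold)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_j, best_threshold = j, threshold
    return best_threshold


# Hypothetical labels and positive-class probabilities
y_true = [0, 0, 1, 1, 1]
y_pred_proba = [0.1, 0.3, 0.35, 0.5, 0.8]

default_classes = binarize(y_pred_proba)  # 0.5 itself maps to class 1
tuned_threshold = youden_threshold(y_true, y_pred_proba)
tuned_classes = binarize(y_pred_proba, tuned_threshold)
```

In this toy data the default threshold misses the positive sample scored 0.35, while the tuned threshold recovers it without adding false positives.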
- predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, lamb: Optional[numpy.ndarray] = None) numpy.ndarray [source]#
Returns the prediction probabilities for the positive class.
Notes
predict_proba() only reports the probability of the positive class, while the sklearn API returns probabilities for both classes, so that slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available at validation (inference) time.
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
lamb (np.ndarray, optional) – Values with shape (n_lambda,) of lambda from lambda_path_ from which to make predictions. If no values are provided (None), the returned predictions will be those corresponding to lambda_best_. The values of lamb must also be in the range of lambda_path_; values greater than max(lambda_path_) or less than min(lambda_path_) will be clipped
- Returns:
np.ndarray
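Two details from the description above can be sketched without any dependency: taking the positive-class column from sklearn-style output, and clipping requested lamb values into the range of the fitted path. The helper clip_to_path and all sample values are hypothetical.

```python
from typing import List


def clip_to_path(lamb: List[float], lambda_path: List[float]) -> List[float]:
    """Clips requested lambda values into [min(lambda_path), max(lambda_path)]."""
    low, high = min(lambda_path), max(lambda_path)
    return [min(max(value, low), high) for value in lamb]


# sklearn-style output has one column per class: [P(class 0), P(class 1)]
sklearn_style_proba = [[0.9, 0.1], [0.2, 0.8]]
# Plain-list equivalent of pred_proba[:, 1]
positive_only = [row[1] for row in sklearn_style_proba]

# Hypothetical fitted path (decreasing) and requested values
lambda_path = [1.0, 0.5, 0.1, 0.01]
clipped = clip_to_path([5.0, 0.3, 0.001], lambda_path)
```

Values inside the range pass through unchanged, while out-of-range requests fall back to the nearest end of the path.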
- score(X, y, sample_weight=None)#
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score (float) – Mean accuracy of
self.predict(X)
wrt. y.
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
- class slickml.classification.XGBoostCVClassifier[source]#
Bases:
slickml.classification._xgboost.XGBoostClassifier
XGBoost CV Classifier.
This is a wrapper using XGBoostClassifier to train an XGBoost [xgboost-api] model using the optimum number of boosting rounds from the inputs. It uses xgboost.cv() with n-fold cross-validation and trains the model based on the best number of boosting rounds to avoid over-fitting.
- Parameters:
num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200
n_splits (int, optional) – Number of folds for cross-validation, by default 4
metrics (str, optional) – Metrics to be tracked at cross-validation fitting time with possible values of “auc”, “aucpr”, “error”, “logloss”. Note this is different than eval_metric that needs to be passed to params dict, by default “auc”
early_stopping_rounds (int, optional) – The criterion to abort the xgboost.cv() phase early if the test metric is not improved, by default 20
random_state (int, optional) – Random seed number, by default 1367
stratified (bool, optional) – Whether to use stratification of the targets to run xgboost.cv() to find the best number of boosting rounds at each fold of each iteration, by default True
shuffle (bool, optional) – Whether to shuffle data to have the ability of building stratified folds in xgboost.cv(), by default True
sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix in CSR format. This would increase the speed of feature selection for relatively large/sparse datasets, but would act as an un-optimized solution for a dense feature matrix. Additionally, this feature cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn it into a dense matrix. Therefore, our API bans this combination, by default False
scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean=False is required when using sparse_matrix=True, since centering the feature matrix would decrease the sparsity and defeat the purpose of the sparse matrix method. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False
scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False
importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”
params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster, by default {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1}
verbose (bool, optional) – Whether to log the final results of xgboost.cv(), by default True
callbacks (bool, optional) – Whether to log the standard deviation of metrics on train data and track the early stopping criterion, by default False
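The effect of scale_mean and scale_std can be sketched on a single feature column. The sketch assumes the usual StandardScaler convention of a population standard deviation, and it learns the statistics from training data only before applying them to test data; fit_column_scaler and transform_column are hypothetical helpers, not slickml functions.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_column_scaler(
    train_column: List[float], scale_mean: bool, scale_std: bool
) -> Tuple[float, float]:
    """Learns (shift, scale) from a training column only."""
    shift = fmean(train_column) if scale_mean else 0.0
    # Population standard deviation (ddof=0); constant columns keep scale 1.0
    scale = pstdev(train_column) if scale_std else 1.0
    if scale == 0.0:
        scale = 1.0
    return shift, scale


def transform_column(column: List[float], shift: float, scale: float) -> List[float]:
    """Applies previously learned statistics to any column."""
    return [(value - shift) / scale for value in column]


# Hypothetical train/test values of one feature
train_column = [1.0, 2.0, 3.0]
test_column = [2.0, 4.0]

shift, scale = fit_column_scaler(train_column, scale_mean=True, scale_std=True)
train_scaled = transform_column(train_column, shift, scale)
test_scaled = transform_column(test_column, shift, scale)  # reuses train statistics

# Centering only: values are shifted but keep their original spread
shift_only, unit_scale = fit_column_scaler(train_column, scale_mean=True, scale_std=False)
test_centered = transform_column(test_column, shift_only, unit_scale)
```

Reusing the training statistics on the test column is what keeps the two sets on the same scale and avoids leaking test information into the fit.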
- fit(X_train, y_train)[source]#
Fits an xgboost.Booster to the input training data. The proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std
- predict_proba(X_test, y_test)#
Returns prediction probabilities for the positive class.
predict_proba() only reports the probability of the positive class, while the sklearn API returns probabilities for both classes, so that slicing like pred_proba[:, 1] is needed for positive class predictions. Additionally, y_test is optional since the targets might not be available at validation (inference) time
- predict(X_test, y_test, threshold=0.5)#
Returns prediction classes based on the threshold. The default threshold=0.5 might not give you the best results; you can find the optimum thresholds based on different algorithms, including Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics
- get_cv_results()[source]#
Returns the mean value of the metrics in
n_splits
cross-validation for each boosting round
- get_params()#
Returns final set of train parameters. The default set of parameters will be updated with the new ones that passed to
params
- get_default_params()#
Returns the default set of train parameters. The default set of parameters will be used when
params=None
- get_feature_importance()#
Returns the feature importance of the trained booster based on the given
importance_type
- get_shap_explainer()#
Returns the
shap.TreeExplainer
- plot_shap_summary()#
Visualizes Shapley values summary plot
- plot_shap_waterfall()#
Visualizes Shapley values waterfall plot
- cv_results_#
The mean value of the metrics in n_splits cross-validation for each boosting round
- Type:
pd.DataFrame
- feature_importance_#
Features importance based on the given
importance_type
- Type:
pd.DataFrame
- scaler_#
Standardization object when scale_mean=True or scale_std=True; otherwise None
- Type:
StandardScaler, optional
- X_train_#
Fitted and transformed features when scale_mean=True or scale_std=True. Otherwise, it will be the same as the passed X_train features
- Type:
pd.DataFrame
- X_test_#
Transformed features when scale_mean=True or scale_std=True using clf.scaler_ that has been fitted on X_train and y_train data. Otherwise, it will be the same as the passed X_test features
- Type:
pd.DataFrame
- dtrain_#
Training data matrix via
xgboost.DMatrix(clf.X_train_, clf.y_train)
- Type:
xgb.DMatrix
- dtest_#
Testing data matrix via xgboost.DMatrix(clf.X_test_, clf.y_test) or xgboost.DMatrix(clf.X_test_, None) when y_test is not available in inference
- Type:
xgb.DMatrix
- shap_values_train_#
Shapley values from TreeExplainer using X_train_
- Type:
np.ndarray
- shap_values_test_#
Shapley values from TreeExplainer using X_test_
- Type:
np.ndarray
- shap_explainer_#
Shap TreeExplainer object
- Type:
shap.TreeExplainer
- model_#
XGBoost Booster object
- Type:
xgboost.Booster
References
- __slots__ = []#
- callbacks :Optional[bool] = False#
- early_stopping_rounds :Optional[int] = 20#
- importance_type :Optional[str] = total_gain#
- metrics :Optional[str] = auc#
- n_splits :Optional[int] = 4#
- num_boost_round :Optional[int] = 200#
- params :Optional[Dict[str, Union[str, float, int]]]#
- random_state :Optional[int] = 1367#
- scale_mean :Optional[bool] = False#
- scale_std :Optional[bool] = False#
- shuffle :Optional[bool] = True#
- sparse_matrix :Optional[bool] = False#
- stratified :Optional[bool] = True#
- verbose :Optional[bool] = True#
- __getstate__()#
- __repr__(N_CHAR_MAX=700)#
Return repr(self).
- __setstate__(state)#
- fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None [source]#
Fits an xgboost.Booster to the input training data based on the best number of boosting rounds.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)
y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)
See also
xgboost.cv()
xgboost.train()
- Returns:
None
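The idea of training on the best number of boosting rounds can be sketched as an early-stopping scan over the per-round cross-validation metric. This is a simplified illustration for a higher-is-better metric such as "auc" (for "error" or "logloss" the comparison would be reversed); best_boosting_round and the sample scores are hypothetical and do not reproduce xgboost.cv() internals.

```python
from typing import List


def best_boosting_round(test_metric: List[float], early_stopping_rounds: int) -> int:
    """Returns the 1-based round with the best metric seen before stopping.

    The scan stops once the metric has not improved for
    `early_stopping_rounds` consecutive rounds.
    """
    best_round, best_score = 0, float("-inf")
    for current_round, score in enumerate(test_metric, start=1):
        if score > best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= early_stopping_rounds:
            break  # patience exhausted, later rounds are never evaluated
    return best_round


# Hypothetical mean test AUC per boosting round
cv_test_auc = [0.70, 0.74, 0.77, 0.78, 0.778, 0.779, 0.777, 0.790]

n_rounds_patient = best_boosting_round(cv_test_auc, early_stopping_rounds=10)
n_rounds_strict = best_boosting_round(cv_test_auc, early_stopping_rounds=3)
```

A small early_stopping_rounds value stops sooner and may miss a late improvement, which is the trade-off behind the default of 20.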
- get_cv_results() pandas.DataFrame [source]#
Returns cross-validation results.
- Returns:
pd.DataFrame
- get_default_params() Dict[str, Union[str, float, int]] #
Returns the default set of train parameters.
The default set of parameters will be used when params=None.
See also
- Returns:
Dict[str, Union[str, float, int]]
- get_feature_importance() pandas.DataFrame #
Returns the feature importance of the trained booster based on the given importance_type.
- Returns:
pd.DataFrame
- get_params() Optional[Dict[str, Union[str, float, int]]] #
Returns the final set of train parameters.
The default set of parameters will be updated with the new ones passed to params.
See also
- Returns:
Dict[str, Union[str, float, int]]
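The update behavior described for get_params() and get_default_params() can be sketched as a dictionary merge. The default values below are copied from the params entry in the class parameters; resolve_params is a hypothetical helper that illustrates the described behavior and is not slickml's code.

```python
from typing import Dict, Optional, Union

ParamValue = Union[str, float, int]

# Defaults as listed for the `params` argument of the class
DEFAULT_PARAMS: Dict[str, ParamValue] = {
    "eval_metric": "auc",
    "tree_method": "hist",
    "objective": "binary:logistic",
    "learning_rate": 0.05,
    "max_depth": 2,
    "min_child_weight": 1,
    "gamma": 0.0,
    "reg_alpha": 0.0,
    "reg_lambda": 1.0,
    "subsample": 0.9,
    "max_delta_step": 1,
    "verbosity": 0,
    "nthread": 4,
    "scale_pos_weight": 1,
}


def resolve_params(params: Optional[Dict[str, ParamValue]] = None) -> Dict[str, ParamValue]:
    """Starts from the defaults and overrides them with user-supplied values."""
    resolved = dict(DEFAULT_PARAMS)  # copy so the defaults stay untouched
    if params is not None:
        resolved.update(params)
    return resolved


final_params = resolve_params({"max_depth": 4, "learning_rate": 0.1})
untouched = resolve_params(None)
```

Only the keys passed by the user change; every other key keeps its default, so a partial params dict is enough to tweak a single setting.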
- get_shap_explainer() shap.TreeExplainer #
Returns the shap.TreeExplainer object.
- Returns:
shap.TreeExplainer
- plot_cv_results(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), linestyle: Optional[str] = '--', train_label: Optional[str] = 'Train', test_label: Optional[str] = 'Test', train_color: Optional[str] = 'navy', train_std_color: Optional[str] = '#B3C3F3', test_color: Optional[str] = 'purple', test_std_color: Optional[str] = '#D0AAF3', save_path: Optional[str] = None, display_plot: Optional[bool] = False, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the cross-validation results and evolution of metrics through number of boosting rounds.
- Parameters:
cv_results (pd.DataFrame) – Cross-validation results
figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)
linestyle (str, optional) – Style of lines [linestyles-api], by default “--”
train_label (str, optional) – Label in the figure legend for the train line, by default “Train”
test_label (str, optional) – Label in the figure legend for the test line, by default “Test”
train_color (str, optional) – Color of the training line, by default “navy”
train_std_color (str, optional) – Edge color of the training std bars, by default “#B3C3F3”
test_color (str, optional) – Color of the testing line, by default “purple”
test_std_color (str, optional) – Edge color of the testing std bars, by default “#D0AAF3”
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default False
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
- plot_feature_importance(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), color: Optional[str] = '#87CEEB', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 10, markeredgecolor: Optional[str] = '#1F77B4', markerfacecolor: Optional[str] = '#1F77B4', markeredgewidth: Optional[Union[int, float]] = 1, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] #
Visualizes the XGBoost feature importance as bar chart.
- Parameters:
feature importance (pd.DataFrame) – Feature importance (feature_importance_ attribute)
figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)
color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 10
markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”
markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
- plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None #
Visualizes the shap beeswarm plot as a summary of Shapley values.
Notes
This is a helper function to plot the shap summary plot based on all types of shap.Explainer, including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure that X_test has been instantiated; otherwise, set validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True
plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”
figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”
color (str, optional) – Color of the plot. The “RdBl” color map is used when plot_type="violin" or plot_type="layered_violin", while the color of the horizontal lines when plot_type="bar" is “#D0AAF3”, by default None
cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" or plot_type="layered_violin", by default “RdBl”
max_display (int, optional) – Maximum number of features to show in the plot, by default 20
feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None
layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10
title (str, optional) – Title of the plot, by default None
sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True
color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True
class_names (List[str], optional) – List of class names for multi-output problems, by default None
class_inds (List[int], optional) – List of class indices for multi-output problems, by default None
color_bar_label (str, optional) – Label for color bar, by default “Feature Value”
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
- Returns:
None
- plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] #
Visualizes the Shapley values as a waterfall plot.
Notes
Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it can be easily seen how much explainability each feature adds. Note that this function should be run after predict_proba() to make sure that X_test has been instantiated; otherwise, set validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True
figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)
bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”
bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5
line_color (str, optional) – Color of the line plot, by default “purple”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 7
markeredgecolor (str, optional) – Marker edge color, by default “purple”
markerfacecolor (str, optional) – Marker face color, by default “purple”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
max_display (int, optional) – Limit to show the number of features in the plot, by default 20
title (str, optional) – Title of the plot, by default None
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
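The notion of "cumulative ratios of shap values per feature" can be made concrete with a small calculation. The sketch below assumes that each feature is summarized by its mean absolute Shapley value and that features are ranked from largest to smallest; both are assumptions made for illustration, and cumulative_shap_ratios is a hypothetical helper rather than part of the library.

```python
from itertools import accumulate
from typing import Dict, List, Tuple


def cumulative_shap_ratios(mean_abs_shap: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (feature, cumulative share) pairs, largest contribution first."""
    ranked = sorted(mean_abs_shap.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    if total == 0:
        raise ValueError("at least one Shapley value must be non-zero")
    running = accumulate(value for _, value in ranked)
    return [(name, cum / total) for (name, _), cum in zip(ranked, running)]


# Made-up mean absolute Shapley values, for illustration only.
ratios = cumulative_shap_ratios({"age": 0.25, "income": 0.5, "tenure": 0.25})
```

The last entry always reaches a cumulative share of 1.0, so the curve shows how many features are needed to cover a given fraction of the total attribution.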
- predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5) numpy.ndarray #
Returns the prediction classes based on the threshold.
Notes
The default threshold=0.5 might not give you the best results. You can find the optimum thresholds based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics.
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_, where any value that satisfies y_pred_proba_ >= threshold will be set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5
- Returns:
np.ndarray
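The inclusive thresholding rule can be illustrated in a few lines of plain Python. This is only a sketch of the ">=" comparison described for the threshold parameter; binarize is a hypothetical helper, and the probabilities are made up.

```python
from typing import List, Sequence


def binarize(y_pred_proba: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map probabilities to classes; a value equal to the threshold is positive."""
    return [1 if p >= threshold else 0 for p in y_pred_proba]


y_pred = binarize([0.2, 0.5, 0.8])  # 0.5 meets the default threshold
y_pred_strict = binarize([0.2, 0.5, 0.8], threshold=0.6)
```

Because the comparison is inclusive, a probability exactly equal to the threshold is assigned to the positive class.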
- predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None) numpy.ndarray #
Returns the prediction probabilities for the positive class.
Notes
predict_proba() only reports the probability of the positive class, while the sklearn API returns the probabilities of both classes, so slicing like pred_proba[:, 1] is needed there for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference).
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
- Returns:
np.ndarray
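The difference between a two-column probability output and a positive-class-only output can be shown with a small example. The sketch below uses plain lists with made-up values instead of NumPy arrays, so the list comprehension stands in for the pred_proba[:, 1] slice.

```python
from typing import List

# Shape (n_samples, 2), as a scikit-learn style predict_proba would return:
# column 0 is P(class=0), column 1 is P(class=1). Values are made up.
sklearn_style: List[List[float]] = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]

# Equivalent of pred_proba[:, 1] on a NumPy array, written for plain lists.
positive_only: List[float] = [row[1] for row in sklearn_style]
```

The result has one value per sample, which is the shape this method reports directly.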
- score(X, y, sample_weight=None)#
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score (float) – Mean accuracy of self.predict(X) wrt. y.
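Mean accuracy is the (optionally weighted) fraction of predictions that match the true labels. The following is a minimal plain-Python sketch of that definition for the single-label case; mean_accuracy is a hypothetical helper and not the scikit-learn implementation.

```python
from typing import Optional, Sequence


def mean_accuracy(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    sample_weight: Optional[Sequence[float]] = None,
) -> float:
    """Weighted fraction of predictions that match the true labels."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    weights = [1.0] * len(y_true) if sample_weight is None else list(sample_weight)
    if len(weights) != len(y_true):
        raise ValueError("sample_weight must have one entry per sample")
    # Sum the weights of the correctly predicted samples.
    hits = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return hits / sum(weights)


accuracy = mean_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
weighted = mean_accuracy([1, 0, 1, 1], [1, 0, 0, 1], sample_weight=[1.0, 1.0, 2.0, 1.0])
```

In the weighted call the misclassified third sample counts double, which lowers the score relative to the unweighted one.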
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
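The <component>__<parameter> naming convention can be illustrated by splitting each name at the first double underscore. This is a sketch of the convention only, not scikit-learn's code; both helpers and the example component names ("scaler", "clf") are hypothetical.

```python
from typing import Dict, Tuple


def split_nested_param(name: str) -> Tuple[str, str]:
    """Split 'component__parameter' at the first double underscore."""
    component, _, parameter = name.partition("__")
    return component, parameter


def group_by_component(params: Dict[str, object]) -> Dict[str, Dict[str, object]]:
    """Group nested parameter names by the component they target."""
    grouped: Dict[str, Dict[str, object]] = {}
    for name, value in params.items():
        component, parameter = split_nested_param(name)
        if not parameter:
            raise ValueError(f"{name!r} is not of the form component__parameter")
        grouped.setdefault(component, {})[parameter] = value
    return grouped


grouped = group_by_component({"scaler__with_mean": False, "clf__num_boost_round": 300})
```

Splitting only at the first separator leaves any deeper nesting (for example "clf__params__x") intact in the parameter part, so it can be resolved one level at a time.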
- class slickml.classification.XGBoostClassifier[source]#
Bases: slickml.base.BaseXGBoostEstimator, sklearn.base.ClassifierMixin
XGBoost Classifier.
This is a wrapper using the XGBoost classifier to train an XGBoost [xgboost-api] model using the number of boosting rounds from the inputs. This is also the base class for XGBoostCVClassifier.
- Parameters:
num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200
sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix in CSR format or not. This would increase the speed of feature selection for relatively large/sparse datasets, but it is a suboptimal choice for a dense feature matrix. Additionally, this parameter cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn it into a dense matrix. Therefore, the API does not allow this combination, by default False
scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean must be False when using sparse_matrix=True, since centering the feature matrix would decrease the sparsity and defeat the purpose of using a sparse matrix. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used unless it is None, by default False
scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used unless it is None, by default False
importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”
params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster, by default {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1}
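The reason sparse_matrix and scale_mean=True do not combine well can be seen on a single column: subtracting the mean turns the zero entries into non-zero ones. The sketch below uses plain Python lists and made-up values; count_zeros and center are hypothetical helpers used only for illustration.

```python
from statistics import mean
from typing import List


def count_zeros(column: List[float]) -> int:
    """Number of exactly-zero entries, i.e. entries a sparse format can skip."""
    return sum(1 for value in column if value == 0.0)


def center(column: List[float]) -> List[float]:
    """Subtract the column mean from every entry."""
    mu = mean(column)
    return [value - mu for value in column]


sparse_column = [0.0, 0.0, 0.0, 4.0]  # mostly zeros, mean is 1.0
centered_column = center(sparse_column)

zeros_before = count_zeros(sparse_column)
zeros_after = count_zeros(centered_column)
```

After centering, no zero entries remain in this column, so a sparse representation would store every value and lose its advantage.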
- fit(X_train, y_train)[source]#
Fits an XGBoost.Booster to input training data. A proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std
- predict_proba(X_test, y_test)[source]#
Returns prediction probabilities for the positive class. predict_proba() only reports the probability of the positive class, while the sklearn API returns the probabilities of both classes, so slicing like pred_proba[:, 1] is needed there for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference)
- predict(X_test, y_test, threshold=0.5)[source]#
Returns prediction classes based on the threshold. The default threshold=0.5 might not give you the best results. You can find the optimum thresholds based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics
- get_params()[source]#
Returns final set of train parameters. The default set of parameters is updated with any new values passed to params
- get_default_params()[source]#
Returns the default set of train parameters. The default set of parameters will be used when params=None
- get_feature_importance()[source]#
Returns the feature importance of the trained booster based on the given importance_type
- feature_importance_#
Feature importance based on the given importance_type
- Type:
pd.DataFrame
- scaler_#
Standardization object when scale_mean=True or scale_std=True unless it is None
- Type:
StandardScaler, optional
- X_train_#
Fitted and transformed features when scale_mean=True or scale_std=True. Otherwise, it will be the same as the passed X_train features
- Type:
pd.DataFrame
- X_test_#
Transformed features when scale_mean=True or scale_std=True using clf.scaler_ that has been fitted on X_train and y_train data. Otherwise, it will be the same as the passed X_test features
- Type:
pd.DataFrame
- dtrain_#
Training data matrix via xgboost.DMatrix(clf.X_train_, clf.y_train)
- Type:
xgb.DMatrix
- dtest_#
Testing data matrix via xgboost.DMatrix(clf.X_test_, clf.y_test) or xgboost.DMatrix(clf.X_test_, None) when y_test is not available in inference
- Type:
xgb.DMatrix
- shap_values_train_#
Shapley values from TreeExplainer using X_train_
- Type:
np.ndarray
- shap_values_test_#
Shapley values from TreeExplainer using X_test_
- Type:
np.ndarray
- shap_explainer_#
Shap TreeExplainer object
- Type:
shap.TreeExplainer
- model_#
XGBoost Booster object
- Type:
xgboost.Booster
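The scaler_, X_train_, and X_test_ attributes follow the usual fit-on-train, transform-on-test pattern: the scaling statistics are learned from the training features only and then reused for the test features. The sketch below shows that pattern for one column in plain Python. It is an illustration under stated assumptions, not the library's code: fit_scaler and transform are hypothetical helpers, the data are made up, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(train_column: List[float]) -> Tuple[float, float]:
    """Learn the mean and population std from the training data only."""
    return mean(train_column), pstdev(train_column)


def transform(column: List[float], mu: float, sigma: float) -> List[float]:
    """Apply previously learned statistics to any column."""
    return [(value - mu) / sigma for value in column]


train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
test_column = [5.0, 9.0]

mu, sigma = fit_scaler(train_column)
train_scaled = transform(train_column, mu, sigma)
test_scaled = transform(test_column, mu, sigma)  # reuses the training statistics
```

The test column is never used to compute the statistics, which keeps information from the test data out of the fitted model.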
References
[markers-api]
- __slots__ = []#
- importance_type :Optional[str] = total_gain#
- num_boost_round :Optional[int] = 200#
- params :Optional[Dict[str, Union[str, float, int]]]#
- scale_mean :Optional[bool] = False#
- scale_std :Optional[bool] = False#
- sparse_matrix :Optional[bool] = False#
- __getstate__()#
- __repr__(N_CHAR_MAX=700)#
Return repr(self).
- __setstate__(state)#
- fit(X_train: Union[pandas.DataFrame, numpy.ndarray], y_train: Union[List[float], numpy.ndarray, pandas.Series]) None [source]#
Fits an XGBoost.Booster to input training data.
Notes
A proper dtrain_ matrix is created from the passed X_train and y_train based on the chosen options, i.e. sparse_matrix, scale_mean, and scale_std.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)
y_train (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)
See also
xgboost.train()
- Returns:
None
- get_default_params() Dict[str, Union[str, float, int]] [source]#
Returns the default set of train parameters.
The default set of parameters will be used when params=None.
- Returns:
Dict[str, Union[str, float, int]]
- get_feature_importance() pandas.DataFrame [source]#
Returns the feature importance of the trained booster based on the given importance_type.
- Returns:
pd.DataFrame
- get_params() Optional[Dict[str, Union[str, float, int]]] [source]#
Returns the final set of train parameters.
The default set of parameters is updated with any new values passed to params.
- Returns:
Dict[str, Union[str, float, int]]
- get_shap_explainer() shap.TreeExplainer [source]#
Returns the shap.TreeExplainer object.
- Returns:
shap.TreeExplainer
- plot_feature_importance(figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 5), color: Optional[str] = '#87CEEB', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 10, markeredgecolor: Optional[str] = '#1F77B4', markerfacecolor: Optional[str] = '#1F77B4', markeredgewidth: Optional[Union[int, float]] = 1, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the XGBoost feature importance as bar chart.
- Parameters:
feature importance (pd.DataFrame) – Feature importance (feature_importance_ attribute)
figsize (Tuple[Union[int, float], Union[int, float]], optional) – Figure size, by default (8, 5)
color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 10
markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”
markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
- plot_shap_summary(validation: Optional[bool] = True, plot_type: Optional[str] = 'dot', figsize: Optional[Union[str, Tuple[float, float]]] = 'auto', color: Optional[str] = None, cmap: Optional[matplotlib.colors.LinearSegmentedColormap] = None, max_display: Optional[int] = 20, feature_names: Optional[List[str]] = None, layered_violin_max_num_bins: Optional[int] = 10, title: Optional[str] = None, sort: Optional[bool] = True, color_bar: Optional[bool] = True, class_names: Optional[List[str]] = None, class_inds: Optional[List[int]] = None, color_bar_label: Optional[str] = 'Feature Value', save_path: Optional[str] = None, display_plot: Optional[bool] = True) None [source]#
Visualizes the shap beeswarm plot as a summary of Shapley values.
Notes
This is a helper function to plot the shap summary plot based on all types of shap.Explainer, including shap.LinearExplainer for linear models, shap.TreeExplainer for tree-based models, and shap.DeepExplainer for deep neural network models. More details are available at [shap-api]. Note that this function should be run after predict_proba() to make sure that X_test has been instantiated; otherwise, set validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True
plot_type (str, optional) – The type of summary plot where possible options are “bar”, “dot”, “violin”, “layered_violin”, and “compact_dot”. Recommendations are “dot” for single-output such as binary classifications, “bar” for multi-output problems, “compact_dot” for Shap interactions, by default “dot”
figsize (tuple, optional) – Figure size where “auto” is auto-scaled figure size based on the number of features that are being displayed. Passing a single float will cause each row to be that many inches high. Passing a pair of floats will scale the plot by that number of inches. If None is passed then the size of the current figure will be left unchanged, by default “auto”
color (str, optional) – Color of the plot. The “RdBl” color map is used when plot_type="violin" or plot_type="layered_violin", while the color of the horizontal lines when plot_type="bar" is “#D0AAF3”, by default None
cmap (LinearSegmentedColormap, optional) – Color map when plot_type="violin" or plot_type="layered_violin", by default “RdBl”
max_display (int, optional) – Maximum number of features to show in the plot, by default 20
feature_names (List[str], optional) – List of feature names to pass. It should follow the order of features, by default None
layered_violin_max_num_bins (int, optional) – The number of bins for calculating the violin plots ranges and outliers, by default 10
title (str, optional) – Title of the plot, by default None
sort (bool, optional) – Flag to plot sorted shap values in descending order, by default True
color_bar (bool, optional) – Flag to show a color bar when plot_type="dot" or plot_type="violin", by default True
class_names (List[str], optional) – List of class names for multi-output problems, by default None
class_inds (List[int], optional) – List of class indices for multi-output problems, by default None
color_bar_label (str, optional) – Label for color bar, by default “Feature Value”
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
- Returns:
None
- plot_shap_waterfall(validation: Optional[bool] = True, figsize: Optional[Tuple[float, float]] = (8, 5), bar_color: Optional[str] = '#B3C3F3', bar_thickness: Optional[Union[float, int]] = 0.5, line_color: Optional[str] = 'purple', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 7, markeredgecolor: Optional[str] = 'purple', markerfacecolor: Optional[str] = 'purple', markeredgewidth: Optional[Union[int, float]] = 1, max_display: Optional[int] = 20, title: Optional[str] = None, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure] [source]#
Visualizes the Shapley values as a waterfall plot.
Notes
Waterfall is defined as the cumulative/composite ratios of shap values per feature. Therefore, it can be easily seen how much explainability each feature adds. Note that this function should be run after predict_proba() to make sure that X_test has been instantiated; otherwise, set validation=False.
- Parameters:
validation (bool, optional) – Whether to calculate Shap values using the validation data X_test or not. When validation=False, Shap values are calculated using X_train, by default True
figsize (Tuple[float, float], optional) – Figure size, by default (8, 5)
bar_color (str, optional) – Color of the horizontal bar lines, by default “#B3C3F3”
bar_thickness (Union[float, int], optional) – Thickness (height) of the horizontal bar lines, by default 0.5
line_color (str, optional) – Color of the line plot, by default “purple”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 7
markeredgecolor (str, optional) – Marker edge color, by default “purple”
markerfacecolor (str, optional) – Marker face color, by default “purple”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
max_display (int, optional) – Limit to show the number of features in the plot, by default 20
title (str, optional) – Title of the plot, by default None
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
- predict(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None, threshold: Optional[float] = 0.5) numpy.ndarray [source]#
Returns the prediction classes based on the threshold.
Notes
The default threshold=0.5 might not give you the best results. You can find the optimum thresholds based on different algorithms, including the Youden Index, maximizing the area under the sensitivity-specificity curve, and maximizing the area under the precision-recall curve, by using BinaryClassificationMetrics.
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
threshold (float, optional) – Inclusive threshold value to binarize y_pred_proba_ to y_pred_, where any value that satisfies y_pred_proba_ >= threshold will be set to class=1 (positive class). Note that ">=" is used instead of ">", by default 0.5
- Returns:
np.ndarray
- predict_proba(X_test: Union[pandas.DataFrame, numpy.ndarray], y_test: Optional[Union[List[float], numpy.ndarray, pandas.Series]] = None) numpy.ndarray [source]#
Returns the prediction probabilities for the positive class.
Notes
predict_proba() only reports the probability of the positive class, while the sklearn API returns the probabilities of both classes, so slicing like pred_proba[:, 1] is needed there for positive class predictions. Additionally, y_test is optional since the targets might not be available in validation (inference).
- Parameters:
X_test (Union[pd.DataFrame, np.ndarray]) – Input data for testing (features)
y_test (Union[List[float], np.ndarray, pd.Series], optional) – Input ground truth for testing (targets)
- Returns:
np.ndarray
- score(X, y, sample_weight=None)#
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score (float) – Mean accuracy of self.predict(X) wrt. y.
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.