Multivariate analysis (protopipe.mva)

Introduction

protopipe.mva contains utilities to build regression and classification models. It is based on the machine learning methods available in scikit-learn. Internally, tables are handled with the pandas Python module.

A separate regressor/classifier should be trained for each camera type.

For both types of model, the image estimates are averaged during the Data training and/or Production of DL2 data steps to determine a global output for the event (energy or score/gammaness).
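The combination of per-image estimates into one value per event can be sketched with pandas as follows. This is only an illustration, not the protopipe implementation: the column names (`obs_id`, `event_id`, `tel_id`, `reco_energy_tel`, `reco_energy`) are hypothetical, and an unweighted mean is assumed here because the text above does not say whether weights are applied.

```python
import pandas as pd

# Hypothetical per-image table: one row per telescope image.
# All column names are illustrative and not taken from protopipe.
images = pd.DataFrame(
    {
        "obs_id": [1, 1, 1, 1, 1],
        "event_id": [10, 10, 10, 11, 11],
        "tel_id": [1, 2, 3, 1, 4],
        "reco_energy_tel": [1.0, 2.0, 3.0, 0.5, 1.5],
    }
)

# One row per event: plain (unweighted) mean of the per-image estimates.
events = (
    images.groupby(["obs_id", "event_id"], as_index=False)["reco_energy_tel"]
    .mean()
    .rename(columns={"reco_energy_tel": "reco_energy"})
)

print(events)
```

The same grouping would apply to a classifier, with the per-image score taking the place of the per-image energy.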

Details

Data is split into training and test subsamples at the level of single-telescope images.
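A minimal sketch of an image-level split, using scikit-learn's `train_test_split` on a toy table rather than protopipe's own `split_train_test`. The column names and the 80/20 fraction are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy image table: each row is one telescope image (hypothetical columns).
rng = np.random.default_rng(0)
n_images = 100
images = pd.DataFrame(
    {
        "event_id": rng.integers(0, 40, size=n_images),
        "tel_id": rng.integers(1, 5, size=n_images),
        "hillas_intensity": rng.uniform(50, 5000, size=n_images),
    }
)

# Rows (images) are shuffled and split; events are not kept together.
train, test = train_test_split(images, train_size=0.8, random_state=0)

# Events that contribute images to both subsamples.
shared_events = set(train["event_id"]) & set(test["event_id"])
print(len(train), len(test), len(shared_events))
```

Because the split unit is the image, images from the same event can end up in both subsamples; `shared_events` makes that visible.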

The class `TrainModel` uses a training sample composed of

  • signal for a regression model,

  • signal and background for a classifier.

In the default analysis workflow, the signal consists of gamma rays and the background of protons.
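The two kinds of training sample can be illustrated with a small pandas sketch. The feature name `hillas_width`, the label column, and the convention of 1 for signal and 0 for background are assumptions for the example, not necessarily what `TrainModel` uses internally.

```python
import pandas as pd

# Hypothetical image tables for the two particle types.
gammas = pd.DataFrame({"hillas_width": [0.05, 0.06, 0.04]})
protons = pd.DataFrame({"hillas_width": [0.12, 0.15]})

# Regression: the training sample contains signal only.
regression_sample = gammas.copy()

# Classification: signal and background are stacked and labelled
# (assumed convention: 1 = signal, 0 = background).
classification_sample = pd.concat(
    [gammas.assign(label=1), protons.assign(label=0)],
    ignore_index=True,
)

print(classification_sample)
```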

A model can also be trained via scikit-learn's GridSearchCV, which searches a grid of hyper-parameter values for the best-performing combination.
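As a rough illustration of what such a search does, the sketch below runs `GridSearchCV` directly on a synthetic data set. The parameter grid, the `roc_auc` scoring and the 3-fold cross-validation are arbitrary choices for the example and do not reflect protopipe's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data standing in for signal/background images.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Small illustrative grid; real grids are usually larger.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

# Best combination found, and class-1 probabilities from the refitted model.
print(search.best_params_)
scores = search.best_estimator_.predict_proba(X)[:, 1]
```

By default `GridSearchCV` refits the best combination on the full input, so `best_estimator_` is ready to use once the search completes.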

Currently tested models:

  • sklearn.ensemble.RandomForestClassifier

  • sklearn.ensemble.RandomForestRegressor

  • sklearn.ensemble.AdaBoostRegressor
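For orientation, the two regressors listed above can be fitted on toy data as follows. The data, the hyper-parameter values and the decision-tree base learner for AdaBoost are assumptions for this sketch only.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: the target depends on the first two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.05, size=300)

models = {
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=50, random_state=0
    ),
    # The base learner is passed positionally as the first argument.
    "AdaBoostRegressor": AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=4), n_estimators=50, random_state=0
    ),
}

# Fit on the first 200 rows and report R^2 on the remaining 100.
r2 = {}
for name, model in models.items():
    model.fit(X[:200], y[:200])
    r2[name] = model.score(X[200:], y[200:])

print(r2)
```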

For details about the generation of each model type, please refer to Building the models.

Reference/API

protopipe.mva Package

Classes to build models based on machine learning methods.

Functions

get_evt_model_output(data_dict[, ...])

Returns DataStore with reco energy + score/target columns of the model at the event level.

get_evt_subarray_model_output(data[, ...])

Returns DataStore with keepcols + score/target columns of the model at the subarray-event level.

initialize_script_arguments()

Initialize the parser of protopipe.scripts.build_model.

make_cut_list(cuts)

plot_distributions(feature_list, data_list)

Plot feature distributions for several data sets.

plot_hist(ax, data, nbin, limit[, norm, ...])

Utility function to plot histogram

plot_profile(ax, data, xcol, ycol, nbin, limit)

Plot profile of a distribution

plot_roc_curve(ax, model_output, y, **kwargs)

Plot ROC curve for a given set of model outputs and labels

prepare_data(ds, derived_features, cuts[, ...])

Add custom variables to the input data and optionally select it.

save_output(models, cam_id, factory, ...)

Save model and data used to produce it per camera-type.

split_train_test(survived_images, ...)

Split the data selected for cuts in train and test samples.

Classes

TrainModel(case, feature_name_list[, ...])

Train a classification or regression model.