Building the models
With protopipe it is possible to build models for both particle energy
estimation and gamma/hadron classification.
The base classes are defined in the protopipe.mva
module (see Multivariate analysis (protopipe.mva)).
In both cases, whether building a regressor or a classifier, the script
protopipe.scripts.build_model.py
is used.
The following help output shows the required arguments and options.
```
usage: protopipe-MODEL [-h] --config_file CONFIG_FILE [--max_events MAX_EVENTS] [--wave | --tail]
                       (--cameras_from_config | --cameras_from_file | --cam_id_list CAM_ID_LIST)
                       [--indir_signal INDIR_SIGNAL] [--indir_background INDIR_BACKGROUND]
                       [--infile_signal INFILE_SIGNAL] [--infile_background INFILE_BACKGROUND]
                       [-o OUTDIR]

Build model for regression/classification

optional arguments:
  -h, --help            show this help message and exit
  --config_file CONFIG_FILE
  --max_events MAX_EVENTS
                        maximum number of events to use
  --wave                if set, use wavelet cleaning
  --tail                if set, use tail cleaning (default), otherwise wavelets
  --cameras_from_config
                        Get cameras from configuration file (Priority 1)
  --cameras_from_file   Get cameras from input file (Priority 2)
  --cam_id_list CAM_ID_LIST
                        Select cameras like 'LSTCam CHEC' (Priority 3)
  --indir_signal INDIR_SIGNAL
                        Directory containing the required SIGNAL input file(s) (default: read from config file)
  --indir_background INDIR_BACKGROUND
                        Directory containing the required BACKGROUND input file(s) (default: read from config file)
  --infile_signal INFILE_SIGNAL
                        SIGNAL file (default: read from config file)
  --infile_background INFILE_BACKGROUND
                        BACKGROUND file (default: read from config file)
  -o OUTDIR, --outdir OUTDIR
```
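For example, a typical invocation that reads the camera list from the configuration file itself could look like protopipe-MODEL --config_file AdaBoostRegressor.yaml --cameras_from_config (the configuration file name here is the stock example; adapt it and the input/output options to your analysis).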
Among its arguments, the script takes a configuration file whose content depends on the type of model to be built.
The available choices can be found under protopipe.aux.example_config_files:

- AdaBoostRegressor.yaml is used to train an energy regressor,
- RandomForestRegressor.yaml is likewise used to train an energy regressor,
- RandomForestClassifier.yaml is used to train a gamma/hadron classifier.
Energy regressor
To build an energy regressor you need at least one table containing the true energy (the target) and a set of event characteristics (the features) from which to reconstruct it. This table is created in the Data training step.
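Under the hood the models are standard scikit-learn estimators. As a rough orientation, here is a minimal, self-contained sketch (not protopipe code; the toy table, its values, and the feature choice are invented for illustration) of the kind of estimator the configuration below describes: an AdaBoostRegressor over decision trees, trained on log10 of the true energy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000

# Toy training table: invented values standing in for the TRAINING columns.
table = pd.DataFrame({
    "h_max": rng.uniform(5e3, 2e4, n),
    "impact_dist": rng.uniform(10.0, 600.0, n),
    "hillas_width": rng.uniform(0.01, 0.3, n),
    "hillas_length": rng.uniform(0.05, 0.6, n),
    "true_energy": rng.uniform(0.0125, 125.0, n),  # the target (assumed TeV)
})

features = ["h_max", "impact_dist", "hillas_width", "hillas_length"]
X = table[features]
y = np.log10(table["true_energy"])  # log_10_target: True -> regress log10(E)

# Split: train_fraction: 0.8
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Method: AdaBoostRegressor over a DecisionTreeRegressor base estimator
model = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=None, random_state=0),
    n_estimators=50,
    learning_rate=1.0,
    loss="linear",
    random_state=0,
)
model.fit(X_train, y_train)

reco_energy = 10 ** model.predict(X_test)  # invert the log10 transformation
```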
The following is a commented example of the required configuration file
AdaBoostRegressor.yaml, with options similar to those of RandomForestRegressor.yaml:
```yaml
General:
  # [...] = analysis full path (on the host if you are using a container)
  data_dir_signal: "ANALYSES_DIRECTORY/ANALYSIS_NAME/data/TRAINING/for_energy_estimation/gamma"
  data_sig_file: "TRAINING_energy_tail_gamma_merged.h5"
  outdir: "ANALYSES_DIRECTORY/ANALYSIS_NAME/estimators/energy_regressor"
  # List of cameras to use (protopipe-MODEL help output for other options)
  cam_id_list: []

# If train_fraction is 1, all the TRAINING dataset will be used to train the
# model and benchmarking can only be done from the benchmarking notebook
# TRAINING/benchmarks_DL2_to_classification.ipynb
Split:
  train_fraction: 0.8
  use_same_number_of_sig_and_bkg_for_training: False  # Lowest statistics will drive the split

# Optimize the hyper-parameters of the estimator with a grid search
# If True parameters should be provided as lists
# If False the model used will be the one based on the chosen single-valued hyper-parameters
GridSearchCV:
  use: False  # True or False
  # if False the following two variables are irrelevant
  scoring: "explained_variance"
  cv: 2  # cross-validation splitting strategy
  refit: True  # Refit the estimator using the best found parameters
  verbose: 1  # 1,2,3,4
  njobs: -1  # int or -1 (all processors)

Method:
  name: "sklearn.ensemble.AdaBoostRegressor"
  target_name: "true_energy"
  log_10_target: True  # this makes the model use log10(target_name)
  # Please, see scikit-learn's API for what each parameter means
  # NOTE: null == None
  base_estimator:
    name: "sklearn.tree.DecisionTreeRegressor"
    parameters:
      # NOTE: here we set the parameters relevant for sklearn.tree.DecisionTreeRegressor
      criterion: "mse"  # "mse", "friedman_mse", "mae" or "poisson"
      splitter: "best"  # "best" or "random"
      max_depth: null  # null or integer
      min_samples_split: 2  # integer or float
      min_samples_leaf: 1  # int or float
      min_weight_fraction_leaf: 0.0  # float
      max_features: null  # null, "auto", "sqrt", "log2", int or float
      max_leaf_nodes: null  # null or integer
      min_impurity_decrease: 0.0  # float
      random_state: 0  # null or integer or RandomState
      ccp_alpha: 0.0  # non-negative float
  tuned_parameters:
    n_estimators: 50
    learning_rate: 1
    loss: "linear"  # 'linear', 'square' or 'exponential'
    random_state: 0  # int, RandomState instance or None

# List of the features to use to train the model
# You can:
# - comment/uncomment the ones you see here,
# - add new ones here if they can be evaluated with pandas.DataFrame.eval
# - if not you can propose modifications to protopipe.mva.utils.prepare_data
FeatureList:
  Basic:  # single-named, they need to correspond to input data columns
    - "h_max"  # Height of shower maximum from stereoscopic reconstruction
    - "impact_dist"  # Impact parameter from stereoscopic reconstruction
    - "hillas_width"  # Image Width
    - "hillas_length"  # Image Length
    - "concentration_pixel"  # Percentage of photo-electrons in the brightest pixel
    - "leakage_intensity_width_1"  # fraction of total Intensity which is contained in the outermost pixels of the camera
  Derived:  # custom evaluations of basic features that will be added to the data
    # column name : expression to evaluate using basic column names
    log10_WLS: log10(hillas_width*hillas_length/hillas_intensity)
    log10_intensity: log10(hillas_intensity)
    r_origin: (sqrt((hillas_x - az)**2 + (hillas_y - alt)**2))**2
    phi_origin: arctan2(hillas_y - alt, hillas_x - az)

# These cuts select the input data BEFORE training
SigFiducialCuts:
  - "good_image == 1"
  - "is_valid == True"
  - "hillas_intensity > 0"

Diagnostic:
  # Energy binning (used for reco and true energy)
  energy:
    nbins: 15
    min: 0.0125
    max: 125
```
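The Derived entries are plain expressions evaluated on the training table. As the configuration comments note, anything that pandas.DataFrame.eval can compute is accepted (the actual machinery lives in protopipe.mva.utils.prepare_data). The following sketch, with an invented toy table, shows the general mechanism assumed here:

```python
import pandas as pd

# Toy table with a few "Basic" columns (values are invented).
df = pd.DataFrame({
    "hillas_width": [0.05, 0.12],
    "hillas_length": [0.20, 0.45],
    "hillas_intensity": [250.0, 8000.0],
})

# "Derived" entries map a new column name to an expression over basic columns.
derived = {
    "log10_WLS": "log10(hillas_width * hillas_length / hillas_intensity)",
    "log10_intensity": "log10(hillas_intensity)",
}

for name, expression in derived.items():
    # Functions such as log10, sqrt and arctan2 are supported by the
    # numexpr-backed engine of pandas.eval (requires the numexpr package).
    df[name] = df.eval(expression)

print(df)
```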
g/h classifier
To build a gamma/hadron classifier you need gamma-ray and proton tables with features that discriminate between gammas and hadrons (electrons are handled later as a contamination).
Note
An alternative approach, yet to be studied, could be to train a classifier with gammas against a background sample composed of weighted hadrons and weighted electrons.
The following is the example provided by the configuration file RandomForestClassifier.yaml:
```yaml
General:
  # [...] = your analysis full path (on the host if you are using a container)
  data_dir_signal: "ANALYSES_DIRECTORY/ANALYSIS_NAME/data/TRAINING/for_particle_classification/gamma"
  data_dir_background: "ANALYSES_DIRECTORY/ANALYSIS_NAME/data/TRAINING/for_particle_classification/proton"
  data_sig_file: "TRAINING_classification_tail_gamma_merged.h5"
  data_bkg_file: "TRAINING_classification_tail_proton_merged.h5"
  outdir: "ANALYSES_DIRECTORY/ANALYSIS_NAME/estimators/gamma_hadron_classifier"
  # List of cameras to use (protopipe-MODEL help output for other options)
  cam_id_list: []

# If train_fraction is 1.0, all the TRAINING dataset will be used to train the
# model and benchmarking can only be done from the benchmarking notebook
# TRAINING/benchmarks_DL2_to_classification.ipynb
Split:
  train_fraction: 0.8
  use_same_number_of_sig_and_bkg_for_training: False  # Lowest statistics will drive the split

# Optimize the hyper-parameters of the estimator with a grid search
# If True parameters should be provided as lists (for None use [null])
# If False the model used will be the one based on the chosen single-valued hyper-parameters
GridSearchCV:
  use: False  # True or False
  # if False the following two variables are irrelevant
  scoring: "roc_auc"
  cv: 2  # cross-validation splitting strategy
  refit: True  # Refit the estimator using the best found parameters
  verbose: 1  # 1,2,3,4
  njobs: -1  # int or -1 (all processors)

# Definition of the algorithm/method used and its hyper-parameters
Method:
  name: "sklearn.ensemble.RandomForestClassifier"  # DO NOT CHANGE
  target_name: "label"  # defined between 0 and 1 (DO NOT CHANGE)
  tuned_parameters:
    # Please, see scikit-learn's API for what each parameter means
    # WARNING: null (not a string) == None
    n_estimators: 100  # integer
    criterion: "gini"  # 'gini' or 'entropy'
    max_depth: 20  # null or integer
    min_samples_split: 2  # integer or float
    min_samples_leaf: 1  # integer or float
    min_weight_fraction_leaf: 0.0  # float
    max_features: 3  # 'auto', 'sqrt', 'log2', integer or float
    max_leaf_nodes: null  # null or integer
    min_impurity_decrease: 0.0  # float
    bootstrap: False  # True or False
    oob_score: False  # True or False
    n_jobs: null  # null or integer
    random_state: 0  # null or integer or RandomState
    verbose: 0  # integer
    warm_start: False  # True or False
    class_weight: null  # 'balanced', 'balanced_subsample', null, dict or list of dicts
    ccp_alpha: 0.0  # non-negative float
    max_samples: null  # null, integer or float
  use_proba: True  # If True output is 'gammaness', else 'score'
  calibrate_output: False  # If True calibrate model on test data

# List of the features to use to train the model
# You can:
# - comment/uncomment the ones you see here,
# - add new ones here if they can be evaluated with pandas.DataFrame.eval
# - if not you can propose modifications to protopipe.mva.utils.prepare_data
FeatureList:
  Basic:  # single-named, they need to correspond to input data columns
    - "h_max"  # Height of shower maximum from stereoscopic reconstruction
    - "impact_dist"  # Impact parameter from stereoscopic reconstruction
    - "hillas_width"  # Image Width
    - "hillas_length"  # Image Length
    - "concentration_pixel"  # Percentage of photo-electrons in the brightest pixel
  Derived:  # custom evaluations of basic features that will be added to the data
    # column name : expression to evaluate using basic column names
    log10_intensity: log10(hillas_intensity)
    log10_reco_energy: log10(reco_energy)  # Average estimated energy of the shower
    log10_reco_energy_tel: log10(reco_energy_tel)  # Estimated energy of the shower per telescope

# These cuts select the input data BEFORE training
SigFiducialCuts:
  - "good_image == 1"
  - "is_valid == True"
  - "hillas_intensity > 0"
BkgFiducialCuts:
  - "good_image == 1"
  - "is_valid == True"
  - "hillas_intensity > 0"

Diagnostic:
  # Energy binning (used for reco and true energy)
  energy:
    nbins: 4
    min: 0.0125
    max: 200
```
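With use_proba set to True, the classifier output is the estimated signal-class probability ("gammaness") rather than a raw score. A minimal sketch of that idea in plain scikit-learn (not protopipe code; the toy table, its values, and the label convention of 1 = gamma are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Toy feature table: invented values standing in for the TRAINING columns.
X = pd.DataFrame({
    "h_max": rng.uniform(5e3, 2e4, n),
    "impact_dist": rng.uniform(10.0, 600.0, n),
    "hillas_width": rng.uniform(0.01, 0.3, n),
    "hillas_length": rng.uniform(0.05, 0.6, n),
    "log10_intensity": rng.uniform(1.7, 5.0, n),
})
y = rng.integers(0, 2, n)  # "label": 1 = gamma (signal), 0 = proton (background)

# Split: train_fraction: 0.8
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Method: hyper-parameters mirroring the configuration above
clf = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",
    max_depth=20,
    max_features=3,
    bootstrap=False,
    random_state=0,
)
clf.fit(X_train, y_train)

# use_proba: True -> output the probability of the signal class in [0, 1]
gammaness = clf.predict_proba(X_test)[:, 1]
```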
Warning
The default settings used are not yet optimised for every case.
They have been tuned to get reasonable performance and a good agreement between the training and test samples.
A first optimisation was reached through a step-by-step comparison against the historical pipeline CTAMARS.
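When better-tuned settings are needed, the GridSearchCV block of either configuration can be enabled. Conceptually it corresponds to the following scikit-learn usage (a sketch; the parameter grid is invented and deliberately tiny):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# With "use: True", hyper-parameters are given as lists and scanned on a grid.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],  # null in the YAML becomes None in Python
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # "explained_variance" for the energy regressor
    cv=2,               # cross-validation splitting strategy
    refit=True,         # refit on the full training set with the best parameters
    verbose=1,
    n_jobs=-1,          # the "njobs" key in the configuration
)
# search.fit(X_train, y_train); search.best_params_ then holds the winner.
```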