Data training
protopipe.scripts.data_training builds tables of reconstructed image
and shower geometry parameters (in protopipe, this generally constitutes
the TRAINING data format).
This type of data can be used to train energy and particle classification
estimators for each camera type.
Note
In the default analysis workflow, the particle classification model
uses the estimate of the particle's energy as one of its parameters.
When producing the training data for that model, you need to set the boolean
estimate_energy parameter and to specify the directory where the energy
model is saved via the regressor_dir option.

By invoking the help argument, you can get help about how the script works:
usage: protopipe-TRAINING [-h] --config_file CONFIG_FILE -o OUTFILE [-m MAX_EVENTS] -i INDIR -f [INFILE_LIST [INFILE_LIST ...]]
                          [--cam_ids [CAM_IDS [CAM_IDS ...]]] [--wave_dir WAVE_DIR] [--wave_temp_dir WAVE_TEMP_DIR] [--wave | --tail]
                          [--debug] [--show_progress_bar] [--save_images] [--estimate_energy ESTIMATE_ENERGY]
                          [--regressor_dir REGRESSOR_DIR] [--regressor_config REGRESSOR_CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  --config_file CONFIG_FILE
  -o OUTFILE, --outfile OUTFILE
  -m MAX_EVENTS, --max_events MAX_EVENTS
                        maximum number of events considered per file
  -i INDIR, --indir INDIR
                        Input folder
  -f [INFILE_LIST [INFILE_LIST ...]], --infile_list [INFILE_LIST [INFILE_LIST ...]]
                        give a specific list of files to run on
  --cam_ids [CAM_IDS [CAM_IDS ...]]
                        give the specific list of camera types to run on
  --wave_dir WAVE_DIR   directory where to find mr_filter. if not set look in $PATH
  --wave_temp_dir WAVE_TEMP_DIR
                        directory where mr_filter to store the temporary fits files
  --wave                if set, use wavelet cleaning -- default
  --tail                if set, use tail cleaning, otherwise wavelets
  --debug               Print debugging information
  --show_progress_bar   Show information about execution progress
  --save_images         Save also all images
  --estimate_energy ESTIMATE_ENERGY
                        Estimate the events' energy with a regressor from protopipe.scripts.build_model
  --regressor_dir REGRESSOR_DIR
                        regressors directory
  --regressor_config REGRESSOR_CONFIG
                        Configuration file used to produce regressor model
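
When many files have to be processed, it can be convenient to assemble the command line programmatically. The following is a minimal sketch that only uses the flags shown in the help output above; the helper name build_training_command and all file and directory names are hypothetical, and passing the literal string "True" to --estimate_energy is an assumption about how the script parses that value.

```python
import shlex
from typing import List, Optional, Sequence


def build_training_command(
    config_file: str,
    outfile: str,
    indir: str,
    infile_list: Sequence[str],
    cleaning: str = "wave",
    max_events: Optional[int] = None,
    cam_ids: Optional[Sequence[str]] = None,
    regressor_dir: Optional[str] = None,
    regressor_config: Optional[str] = None,
) -> List[str]:
    """Build the argument list for protopipe-TRAINING (hypothetical helper)."""
    # --wave and --tail are mutually exclusive in the help output
    if cleaning not in ("wave", "tail"):
        raise ValueError("cleaning must be 'wave' or 'tail'")

    cmd = [
        "protopipe-TRAINING",
        "--config_file", config_file,
        "-o", outfile,
        "-i", indir,
        "-f", *infile_list,
        f"--{cleaning}",
    ]
    if max_events is not None:
        cmd += ["-m", str(max_events)]
    if cam_ids:
        cmd += ["--cam_ids", *cam_ids]
    if regressor_dir is not None:
        # Needed when producing training data for the particle classifier.
        # ASSUMPTION: the boolean is passed as the string "True".
        cmd += ["--estimate_energy", "True", "--regressor_dir", regressor_dir]
        if regressor_config is not None:
            cmd += ["--regressor_config", regressor_config]
    return cmd


# Example with hypothetical file names; the command is printed, not executed
command = build_training_command(
    "analysis.yaml", "training_gamma.h5", "data", ["gamma_1.simtel.gz"],
    cleaning="tail", max_events=100,
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run once the paths point to real files.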
The configuration file used by this script is analysis.yaml:
# General information
# WARNING: the settings recorded here are unstable and used here only to give an example
General:
  config_name: "ANALYSIS_NAME" # filled by the GRID interface
  production: "" # 'Prod3b' or 'Prod5N'
  site: "" # 'north' or 'south'
  # 'array' can be either
  # - a custom list of telescope IDs
  # - 'full_array', 'prod5N_alpha_north', 'prod5N_alpha_south', 'prod5N_alpha_south_NectarCam'
  # If a string, you can select a telescope-type subarray by
  # adding e.g. '_LST_LST_LSTCam'
  # NOTE: the 'full_array' is the total original array without any selection
  array: ""
  cam_id_list: [] # to upload the required models

Calibration:
  apply_integration_correction: true # for CTAMARS-like analysis use false
  apply_peak_time_shift: false
  apply_waveform_time_shift: false
  # factor to transform the integrated charges (in ADC counts) into number of
  # photoelectrons (on top of the DC-to-PHE factor)
  # the pixel-wise one calculated by simtelarray is 0.92 for CTAMARS
  calib_scale: 1.0

ImageCleaning: # NOTE: these are EXAMPLE values

  # Use only the biggest cluster of surviving pixels
  biggest:
    tail: # Cleaning based on the "tailcut" technique
      thresholds: # picture, boundary
        - LSTCam: [4.0, 2.0]
        - NectarCam: [4.0, 2.0]
        - FlashCam: [4, 2] # dummy values for reliable unit-testing
        - ASTRICam: [4, 2] # dummy values for reliable unit-testing
        - DigiCam: [0, 0] # values left unset for future studies
        - CHEC: [0, 0] # values left unset for future studies
        - SCTCam: [0, 0] # values left unset for future studies
      keep_isolated_pixels: False
      min_number_picture_neighbors: 1

    wave: # Cleaning based on the "wavelets" technique
      # Directory to write temporary files
      #tmp_files_directory: '/dev/shm/'
      tmp_files_directory: "./"
      options:
        LSTCam:
          type_of_filtering: "hard_filtering"
          filter_thresholds: [3, 0.2]
          last_scale_treatment: "drop"
          kill_isolated_pixels: True
          detect_only_positive_structures: False
          clusters_threshold: 0
        NectarCam: # TBC
          type_of_filtering: "hard_filtering"
          filter_thresholds: [3, 0.2]
          last_scale_treatment: "drop"
          kill_isolated_pixels: True
          detect_only_positive_structures: False
          clusters_threshold: 0

  # Use all clusters of surviving pixels
  extended:
    tail: # Cleaning based on the "tailcut" technique
      thresholds: # picture, boundary
        - LSTCam: [4.0, 2.0]
        - NectarCam: [4.0, 2.0]
        - FlashCam: [4, 2] # dummy values for reliable unit-testing
        - ASTRICam: [4, 2] # dummy values for reliable unit-testing
        - DigiCam: [0, 0] # values left unset for future studies
        - CHEC: [0, 0] # values left unset for future studies
        - SCTCam: [0, 0] # values left unset for future studies
      keep_isolated_pixels: False
      min_number_picture_neighbors: 1

    wave: # Cleaning based on the "wavelets" technique
      # Directory to write temporary files
      #tmp_files_directory: '/dev/shm/'
      tmp_files_directory: "./"
      options:
        LSTCam:
          type_of_filtering: "hard_filtering"
          filter_thresholds: [3, 0.2]
          last_scale_treatment: "posmask"
          kill_isolated_pixels: True
          detect_only_positive_structures: False
          clusters_threshold: 0
        NectarCam: # TBC
          type_of_filtering: "hard_filtering"
          filter_thresholds: [3, 0.2]
          last_scale_treatment: "posmask"
          kill_isolated_pixels: True
          detect_only_positive_structures: False
          clusters_threshold: 0

# Image selection cuts
# NOTE: these are EXAMPLE values
ImageSelection:
  source: "extended" # biggest or extended
  charge: [50., 1e10]
  pixel: [3, 1e10]
  ellipticity: [0.1, 0.6]
  nominal_distance: [0., 0.8] # in camera radius

# Minimal number of telescopes to consider events
Reconstruction:
  # for events with <2 LST images the single-LST image is removed
  # before shower geometry
  LST_stereo: True
  # after this we check if the remaining images satisfy the min_tel condition
  min_tel: 2 # any tel_type

# Parameters for energy estimation
EnergyRegressor:
  # Name of the regression method (e.g. AdaBoostRegressor, etc.)
  method_name: "RandomForestRegressor"
  estimation_weight: "CTAMARS" # CTAMARS == 1/RMS^2 (RMS from the RF trees)

# Parameters for g/h separation
GammaHadronClassifier:
  # Name of the classification method (e.g. AdaBoostClassifier, etc.)
  method_name: "RandomForestClassifier"
  # Use probability output or score
  use_proba: True
  estimation_weight: "hillas_intensity**0.54" # empirical value from CTAMARS
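
To illustrate how two parts of this configuration are structured, the sketch below works on a small Python dictionary that mirrors a fragment of the example values above, as they would look once the YAML file is parsed. It is not protopipe's own implementation: the helper names thresholds_by_camera and passes_image_selection, the keys of the image dictionary, and the treatment of each ImageSelection entry as an inclusive [min, max] range are assumptions made for illustration.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

# Fragment mirroring the example values of analysis.yaml after parsing
config = {
    "ImageCleaning": {
        "extended": {
            "tail": {
                # a list of single-key mappings, one per camera type
                "thresholds": [
                    {"LSTCam": [4.0, 2.0]},
                    {"NectarCam": [4.0, 2.0]},
                ],
                "keep_isolated_pixels": False,
                "min_number_picture_neighbors": 1,
            }
        }
    },
    "ImageSelection": {
        "source": "extended",
        "charge": [50.0, 1e10],
        "pixel": [3, 1e10],
        "ellipticity": [0.1, 0.6],
        "nominal_distance": [0.0, 0.8],  # in units of the camera radius
    },
}


def thresholds_by_camera(
    entries: List[Mapping[str, Sequence[float]]],
) -> Dict[str, Tuple[float, float]]:
    """Flatten the list of mappings into {camera: (picture, boundary)}."""
    flat = {}
    for entry in entries:
        for cam_id, (picture, boundary) in entry.items():
            flat[cam_id] = (picture, boundary)
    return flat


def passes_image_selection(image: Mapping[str, float], cuts: Mapping) -> bool:
    """Return True if every cut quantity lies within its [min, max] range.

    ASSUMPTION: both boundaries are treated as inclusive.
    """
    for name, bounds in cuts.items():
        if name == "source":  # selects the cleaning branch, not a numeric cut
            continue
        low, high = bounds
        if not low <= image[name] <= high:
            return False
    return True


tail = config["ImageCleaning"]["extended"]["tail"]
picture, boundary = thresholds_by_camera(tail["thresholds"])["LSTCam"]

# Hypothetical parameters of one cleaned image
image = {"charge": 120.0, "pixel": 8, "ellipticity": 0.3, "nominal_distance": 0.5}
keep = passes_image_selection(image, config["ImageSelection"])
```

Flattening the thresholds into a dictionary gives a direct lookup per camera type, which is convenient because the script is run per camera via the cam_ids option.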