Interface to the DIRAC grid#
This part of the documentation covers the set of commands provided by the protopipe-grid-interface package.
Note
This package is currently stand-alone. It is planned to merge it into protopipe as a new module protopipe.dirac.
The most important commands provided by this interface are listed here.
There are also secondary scripts that are mainly used for debugging. Please refer to their help (command_name -h) in case you need to use them.
Create a new analysis#
protopipe-CREATE_ANALYSIS
assumes that you work from a common parent directory named shared_folder (required), which will host both
the analyses directory and a productions directory.
Note
The name of the parent directory may change in the future: it was chosen because previous versions of the interface required a containerized solution, and the naming was clearer in that context.
usage: protopipe-CREATE_ANALYSIS [-h] --analysis_name ANALYSIS_NAME [--analysis_directory_tree ANALYSIS_DIRECTORY_TREE] [--log_file LOG_FILE] [--output_path OUTPUT_PATH] [--GRID-is-DIRAC]
[--GRID-home GRID_HOME] [--GRID-path-from-home GRID_PATH_FROM_HOME] [--overwrite-analysis]
Create a directory structure for a CTA data analysis using the protopipe prototype pipeline.
WARNING: check that the version of protopipe is the one you intend to use!
optional arguments:
-h, --help show this help message and exit
--analysis_name ANALYSIS_NAME
Name of the analysis
--analysis_directory_tree ANALYSIS_DIRECTORY_TREE
Analysis workflow YAML file (default: see protopipe_grid_interface.aux)
--log_file LOG_FILE Override log file path
(default: analysis.log in analysis folder)
--output_path OUTPUT_PATH
Full path where the 'shared_folder' should be created (or where it already exists)
--GRID-is-DIRAC The grid on which to run the analysis is the DIRAC grid.
--GRID-home GRID_HOME
Path of the user's home on grid (if DIRAC, /vo.cta.in2p3.fr/user/x/xxx).
--GRID-path-from-home GRID_PATH_FROM_HOME
Path of the analysis on the DIRAC grid. Defaults to an empty string (user's home)
--overwrite-analysis Overwrite analysis folder (WARNING: you could lose data!)
Important
protopipe-CREATE_ANALYSIS
creates an analysis_metadata.yaml
file with the provided input information.
Many commands benefit from using this file to retrieve such information at runtime so you don’t have to
input it again.
It can be also useful to retrieve data and configuration files from the DIRAC file catalog if you want to
work on someone’s analysis.
It will also create an analysis.log
file which will be filled anytime you use the main commands,
unless another location is specified.
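As an illustration, such a metadata file could look like the following sketch. The field names shown here are hypothetical, chosen only to mirror the command-line options above; check the analysis_metadata.yaml produced for your own analysis for the real schema.

```yaml
# Hypothetical sketch of an analysis_metadata.yaml; field names are
# illustrative (derived from the CLI options), not the actual schema.
analysis_name: my_analysis
local_path: /home/user/shared_folder
GRID:
  home: /vo.cta.in2p3.fr/user/x/xxx
  path_from_home: ""
```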
Split datasets#
After you have created a new analysis, head over to the list of available simulation datasets (either from the CTA wiki or using CTADIRAC tools) and retrieve the original lists of simtel files per particle type.
With protopipe-SPLIT_DATASET
you can split these lists into sub-datasets to be stored under the
data/simtel folder within your analysis.
usage: protopipe-SPLIT_DATASET [-h] [--metadata METADATA | --output_path OUTPUT_PATH] [--input_gammas INPUT_GAMMAS] [--input_protons INPUT_PROTONS] [--input_electrons INPUT_ELECTRONS]
[--split_gammas SPLIT_GAMMAS [SPLIT_GAMMAS ...]] [--split_protons SPLIT_PROTONS [SPLIT_PROTONS ...]] [--split_electrons SPLIT_ELECTRONS [SPLIT_ELECTRONS ...]]
[--log_file LOG_FILE]
Split a simulation dataset.
Requirement:
- list files should have the *.list extension.
Default analysis workflow (see protopipe/aux/standard_analysis_workflow.yaml):
- a training sample for energy made of gammas,
- a training sample for particle classification made of gammas and protons,
- a performance sample made of gammas, protons, and electrons.
optional arguments:
-h, --help show this help message and exit
--metadata METADATA Analysis metadata file produced at creation
(recommended).
--output_path OUTPUT_PATH
Specify an output directory
--input_gammas INPUT_GAMMAS
Full path of the original list of gammas.
--input_protons INPUT_PROTONS
Full path of the original list of protons.
--input_electrons INPUT_ELECTRONS
Full path of the original list of electrons.
--split_gammas SPLIT_GAMMAS [SPLIT_GAMMAS ...]
List of percentage values in which to split the gammas. Default is [10,10,80]
--split_protons SPLIT_PROTONS [SPLIT_PROTONS ...]
List of percentage values in which to split the protons. Default is [40,60]
--split_electrons SPLIT_ELECTRONS [SPLIT_ELECTRONS ...]
List of percentage values in which to split the electrons. Default is [100]
--log_file LOG_FILE Override log file path
(ignored when using metadata)
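The percentage splits above amount to a simple partitioning of each particle's file list. A minimal stdlib-only sketch of that logic (not the actual implementation, which also writes each sub-list under data/simtel):

```python
def split_dataset(files, percentages):
    """Partition a list of file names according to percentage splits.

    A sketch of the splitting step performed by protopipe-SPLIT_DATASET;
    percentages must sum to 100, as in the defaults [10, 10, 80].
    """
    if sum(percentages) != 100:
        raise ValueError("split percentages must sum to 100")
    # Cumulative boundaries, rounded to whole files
    boundaries = [0]
    cumulative = 0
    for p in percentages:
        cumulative += p
        boundaries.append(round(len(files) * cumulative / 100))
    return [files[boundaries[i]:boundaries[i + 1]]
            for i in range(len(percentages))]

# Example: 10 gamma files split with the default [10, 10, 80]
gammas = [f"gamma_run{i:03d}.simtel.gz" for i in range(10)]
parts = split_dataset(gammas, [10, 10, 80])
print([len(p) for p in parts])  # → [1, 1, 8]
```

With the default workflow, the three gamma chunks would feed the energy-training, classification-training, and performance samples respectively.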
Submit a set of jobs#
protopipe-SUBMIT-JOBS
always needs a valid grid.yaml configuration file.
Usage:
protopipe-SUBMIT-JOBS [options] ...
General options:
-o --option <value> : Option=value to add
-s --section <value> : Set base section for relative parsed options
-c --cert <value> : Use server certificate to connect to Core Services
-d --debug : Set debug mode (-ddd is extra debug)
- --cfg= : Load additional config file
- --autoreload : Automatically restart if there's any change in the module
- --license : Show DIRAC's LICENSE
-h --help : Shows this help
Options:
- --analysis_path= : Full path to the analysis folder
- --output_type= : Output data type (TRAINING or DL2)
- --max_events= : Max number of events to be processed (optional, int)
- --upload_analysis_cfg= : If True (default), upload analysis configuration file
- --dry= : If True do not submit job (default: False)
- --test= : If True submit only one job (default: False)
- --save_images= : If True save images together with parameters (default: False)
- --debug_script= : If True save debug information during execution of the script (default: False)
- --DataReprocessing= : If True reprocess data from one site to another (default: False)
- --tag= : Used only if DataReprocessing is True; only sites tagged with tag will be considered (default: None)
- --log_file= : Override log file path (default: analysis.log in analysis folder)
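Putting the options above together, a TRAINING submission could be assembled as in the following sketch. All paths and values are placeholders; the command is only echoed here so it can be inspected before anything is actually submitted.

```shell
# Hypothetical invocation; the analysis path is a placeholder.
# --dry=True means no job is actually submitted.
CMD="protopipe-SUBMIT-JOBS --analysis_path=/path/to/shared_folder/analyses/my_analysis --output_type=TRAINING --dry=True"
echo "$CMD"
# When satisfied, drop --dry=True and execute it, e.g. with: eval "$CMD"
```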
Download data and merge it#
protopipe-DOWNLOAD_AND_MERGE
allows you to download data serially, but also in rsync-style
(the second mode is in fact used automatically as a backup crosscheck, to overcome any
network malfunction), and to merge it into a single HDF5 file.
Warning
Currently the merging is done locally, so both the single files and the merged file will be stored locally, resulting in twice the disk-space usage. This is not ideal: data should be merged on the grid.
usage: protopipe-DOWNLOAD_AND_MERGE [-h] [--metadata METADATA] [--disable_download] [--disable_sync] [--disable_merge] [--indir INDIR] [--outdir OUTDIR] --data_type
{TRAINING/for_energy_estimation,TRAINING/for_particle_classification,DL2} --particle_types [{gamma,proton,electron} [{gamma,proton,electron} ...]] [--n_jobs N_JOBS]
[--cleaning_mode CLEANING_MODE] [--GRID-home GRID_HOME] [--GRID-path-from-home GRID_PATH_FROM_HOME] [--analysis_name ANALYSIS_NAME] [--local_path LOCAL_PATH]
[--log_file LOG_FILE]
Download and merge data from the DIRAC grid.
The default behaviour calls an rsync-like command after the normal download as
an additional check.
This script can be used separately, or in association with an analysis workflow.
In the second case the recommended usage is via the metadata file produced at creation.
optional arguments:
-h, --help show this help message and exit
--metadata METADATA Path to the metadata file produced at analysis creation
(recommended - if None, specify necessary information).
--disable_download Do not download files serially
--disable_sync Do not synchronize folders after serial download
--disable_merge Do not merge files at the end
--indir INDIR Override input directory
--outdir OUTDIR Override output directory
--data_type {TRAINING/for_energy_estimation,TRAINING/for_particle_classification,DL2}
Type of data to download and merge
--particle_types [{gamma,proton,electron} [{gamma,proton,electron} ...]]
One or more particle types to download and merge
--n_jobs N_JOBS Number of parallel jobs for directory syncing (default: 4)
--cleaning_mode CLEANING_MODE
Deprecated argument
--GRID-home GRID_HOME
Path of the user's home on DIRAC grid (/vo.cta.in2p3.fr/user/x/xxx)
(recommended: use metadata file)
--GRID-path-from-home GRID_PATH_FROM_HOME
optional additional path from user's home in DIRAC (recommended: use metadata file)
--analysis_name ANALYSIS_NAME
Name of the analysis (recommended: use metadata file)
--local_path LOCAL_PATH
Path where shared_folder is located (recommended: use metadata file)
--log_file LOG_FILE Override log file path
(default: analysis.log in analysis folder)
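The rsync-style crosscheck boils down to comparing the remote file list against what has already arrived locally and reporting anything still missing. A stdlib-only sketch of that idea (the real tool queries the DIRAC file catalog rather than a plain directory listing, and the function name here is illustrative):

```python
from pathlib import Path

def missing_files(remote_files, local_dir):
    """Return the remote entries that are not yet present locally.

    Sketch of the crosscheck run after the serial download: anything
    reported here would be fetched again, covering transient network
    malfunctions. Matching is done on file names only.
    """
    local_names = {p.name for p in Path(local_dir).iterdir()}
    return [f for f in remote_files if Path(f).name not in local_names]
```

For example, if the catalog lists two files and only one is on disk, the other is returned for a retry.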
Upload models#
protopipe-UPLOAD_MODELS
allows you to upload both model files and their configuration files
to the DIRAC file catalog.
It lets you define a list of storage elements (SEs) where to store this data.
It will always try to upload to CC-IN2P3 first.
usage: protopipe-UPLOAD_MODELS [-h] [--metadata METADATA] --cameras CAMERAS [CAMERAS ...] --model_type {regressor,classifier} --model_name {RandomForestRegressor,AdaBoostRegressor,RandomForestClassifier}
[--cleaning_mode CLEANING_MODE] [--GRID-home GRID_HOME] [--GRID-path-from-home GRID_PATH_FROM_HOME] [--list-of-SEs [LIST_OF_SES [LIST_OF_SES ...]]] [--analysis_name ANALYSIS_NAME]
[--local_path LOCAL_PATH] [--log_file LOG_FILE]
Upload models produced with protopipe to the Dirac grid.
Files will be uploaded at least on CC-IN2P3-USER.
You can use `cta-prod-show-dataset YOUR_DATASET_NAME --SEUsage` to know
on which Dirac Storage Elements to replicate your models.
Note: you will see *-Disk entries, but you need to replicate using *-USER entries.
The default behaviour is to replicate them to "DESY-ZN-USER", "CNAF-USER", "CEA-USER".
Replication is optional.
optional arguments:
-h, --help show this help message and exit
--metadata METADATA Path to the metadata file produced at analysis creation
(recommended - if None, specify necessary information).
--cameras CAMERAS [CAMERAS ...]
List of cameras to consider
--model_type {regressor,classifier}
Type of model to upload
--model_name {RandomForestRegressor,AdaBoostRegressor,RandomForestClassifier}
Name of the model to upload
--cleaning_mode CLEANING_MODE
Deprecated argument
--GRID-home GRID_HOME
Path of the user's home on grid: /vo.cta.in2p3.fr/user/x/xxx (recommended: use metadata file).
--GRID-path-from-home GRID_PATH_FROM_HOME
Analysis path on DIRAC grid (defaults to user's home; recommended: use metadata file)
--list-of-SEs [LIST_OF_SES [LIST_OF_SES ...]]
List of DIRAC Storage Elements which will host the uploaded models
--analysis_name ANALYSIS_NAME
Name of the analysis (recommended: use metadata file)
--local_path LOCAL_PATH
Path where shared_folder is located (recommended: use metadata file)
--log_file LOG_FILE Override log file path
(default: analysis.log in analysis folder)