Execute

Once a resource is present in your inventory (see Managing resources), ReproMan provides a few ways to execute command(s) on the resource. The first is to request an interactive shell for a resource with reproman login. Another is to use reproman execute, which is suitable for running one-off commands on the resource (though, as its manpage indicates, it’s capable of a bit more). To some degree, you can think of login and execute as analogous to ssh HOST and ssh HOST COMMAND, respectively, where the ReproMan variants provide a common interface across resource types.
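
For instance, with a resource named “foo” already in the inventory (the name, as well as the exact option syntax shown for execute, is only illustrative; see the manpages of both commands for details), the two calls would look something like this:

$ reproman login foo                    # interactive shell on the resource
$ reproman execute --resource foo ls    # run a one-off command on the resource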

The final way to execute a command is reproman run.

Run

reproman run is concerned with three high-level tasks:

  1. Starting from a call on the local machine, prepare the remote resource for command execution (e.g., copying input files to the remote).
  2. Execute the command on the remote resource, typically through a batch system.
  3. Fetch the results to the local machine. The results include command output as well as information about the execution (e.g., batch system submit files).

Reference example

Let’s first establish a simple example that we can reference as we cover some of the details. In a terminal, we’re visiting a DataLad dataset where the working tree looks like this:

.
|-- clean.py
`-- data
    |-- f0.csv -> ../.git/annex/objects/[...]
    `-- f1.csv -> ../.git/annex/objects/[...]

The clean.py script takes two positional arguments (e.g., ./clean.py data/f0.csv cleaned/f0.csv), where the first is a data file to process and the second is a path to write the output (creating directories if necessary).
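
If you would like a similar setup to experiment with, a dataset along these lines could be assembled roughly as follows (a sketch: the dataset name and commit message are arbitrary, and you may want to configure the dataset, e.g. with a text2git-style procedure, so that clean.py ends up tracked by Git rather than git-annex):

$ datalad create clean-demo && cd clean-demo
$ mkdir data
$ # ... write clean.py and place f0.csv and f1.csv under data/ ...
$ datalad save -m "Add cleaning script and input data"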

Note

Although DataLad is not a strict requirement, having it installed on at least the local machine is strongly recommended, and without it only a limited set of functionality is available. If you are new to DataLad, consider reading the DataLad handbook.

Choosing an orchestrator

Before running a command, we need to decide on an orchestrator. The orchestrator is responsible for the first and third tasks above, preparing the remote and fetching the results. The complete set of orchestrators, accompanied by descriptions, can be seen by calling reproman run --list=orchestrators.

The main orchestrator choices are datalad-pair, datalad-pair-run, and datalad-local-run. If the remote has DataLad available, you should go with one of the datalad-pair* orchestrators. These will sync your local dataset with a dataset on the remote machine (using datalad push), creating one if it doesn’t already exist (using datalad create-sibling).

datalad-pair differs from the datalad-*-run orchestrators in the way it captures results. After execution has completed, datalad-pair commits the result on the remote via DataLad. On fetch, it will pull that commit down with datalad update. Outputs (specified via --outputs or as a job parameter) are retrieved with datalad get.
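
Conceptually, the fetch step of datalad-pair amounts to something like the following manual DataLad calls in the local dataset (a rough sketch for our reference example; ReproMan handles the sibling configuration and exact paths internally):

$ datalad update --merge        # pull the commit created on the remote
$ datalad get cleaned/f0.csv    # retrieve an annexed output declared via --outputs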

datalad-pair-run and datalad-local-run, on the other hand, determine a list of output files based on modification times and package these files into a tarball. (This approach is inspired by datalad-htcondor.) On fetch, this tarball is downloaded locally and used to create a datalad run commit in the local repository.

There is one more orchestrator, datalad-no-remote, that is designed to work only with a local shell resource. It is similar to datalad-pair, except that the command is executed in the same directory from which reproman run is invoked.
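
For example, with a shell resource named “local-shell” in the inventory (the name is only illustrative), the reference example could be executed in place like this:

$ reproman run --resource local-shell \
  --orc datalad-no-remote --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv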

Revisiting our concrete example and assuming we have an SSH resource named “foo” in our inventory, here’s how we could specify that the datalad-pair-run orchestrator should be used:

$ reproman run --resource foo \
  --orc datalad-pair-run --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv

Notice that in addition to the orchestrator, we specify the input file that needs to be available on the remote. This is only necessary for files that are tracked by git-annex. Files tracked by Git do not need to be declared as inputs because the same revision of the dataset is checked out on the remote.

Warning

The orchestration with DataLad datasets is a work in progress, with some rough edges. You might end up in a state that ReproMan doesn’t know how to sync. Please report any issues you encounter on the issue tracker.

Choosing a submitter

Another, easier decision is which submitter to use. This comes down to which, if any, batch system your remote resource supports. The currently available options are pbs, condor, and local. With local, the job is executed directly through sh rather than submitted to a batch system.

Our last example invocation could be extended to use Condor like so:

$ reproman run --resource foo \
  --sub condor \
  --orc datalad-pair-run --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv

Note that which batch systems are supported is mostly a matter of which systems the ReproMan developers currently have at their disposal. If you would like to add support for your system (or have experience with a more general approach like DRMAA), we’d welcome help in this area.

Detached jobs

By default, when a run command is executed, it submits the job, registers it locally, and exits. The registered jobs can be viewed and managed with reproman jobs. To list all jobs, run reproman jobs without any arguments. To fetch a completed job back into the local dataset, call reproman jobs NAME, where NAME is a substring of the job ID that uniquely identifies the job.
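
For example (the job ID substring below is hypothetical):

$ reproman jobs            # list all registered jobs
$ reproman jobs 173716     # fetch the job whose ID uniquely contains this substring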

In cases where you prefer run to stay attached and fetch the job when it is finished, pass the --follow argument to reproman run.
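
For example, the Condor invocation above could be kept in the foreground until its results are fetched:

$ reproman run --follow --resource foo \
  --sub condor \
  --orc datalad-pair-run --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv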

Concurrent subjobs

If you’re submitting a job to a batch system, it’s likely that you want to submit concurrent subjobs. To continue with the toy example from above, you’d want to have two jobs, each one running clean.py on a different input file.

reproman run has two options for specifying subjobs: --batch-parameter and --batch-spec. The first can work for simple cases, like our example:

$ reproman run --resource foo --sub condor --orc datalad-pair-run \
  --batch-parameter name=f0,f1 \
  --input 'data/{p[name]}.csv' \
  ./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv

A subjob will be created for each name value, with any {p[name]} field in the input, output, and command strings formatted with the value. In this case, the two commands executed on the remote would be:

./clean.py data/f0.csv cleaned/f0.csv
./clean.py data/f1.csv cleaned/f1.csv

The --batch-spec option is the more cumbersome but more flexible counterpart to --batch-parameter. Its value should point to a YAML file that defines a series of records, each one with all of the parameters for a single subjob command. The equivalent of --batch-parameter name=f0,f1 would be a YAML file with the following content:

- name: f0
- name: f1
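
Assuming this content is saved to a file named params.yaml (the name is arbitrary), the earlier invocation could be written as:

$ reproman run --resource foo --sub condor --orc datalad-pair-run \
  --batch-spec params.yaml \
  --input 'data/{p[name]}.csv' \
  ./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv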

Warning

When there is more than one subjob, *-run orchestrators do not create a valid run commit. Specifically, datalad rerun cannot be used to rerun the commit on the local machine because the recorded inputs, outputs, and command still contain placeholder fields rather than the concrete values of each subjob. This is an unresolved issue, but at this point the commit should be considered a way to capture information about the remote command execution, one that certainly provides more information than logging into the remote and running condor_submit yourself.

Job parameters

To define a job, ReproMan builds up a “job spec” from job parameters. Call reproman run --list=parameters to see a list of available parameters. The parameters can be specified within a file passed to the --job-spec option, as a key-value pair specified via the --job-parameter option, or through a dedicated command-line option.

The last option is only available for a subset of parameters, with the intention of giving these parameters more exposure and making them slightly more convenient to use. In the examples so far, we’ve seen job parameters only in the form of dedicated command-line arguments, such as --orc datalad-pair-run. Alternatively, this could be expressed more verbosely through --job-parameter as --job-parameter orchestrator=datalad-pair-run. Or it could be contained as a top-level key-value pair in a YAML file passed to --job-spec.
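
For instance, assuming a spec file named my-spec.yaml that contains the single line “orchestrator: datalad-pair-run” (the file name is arbitrary), the following three invocations select the same orchestrator, with the rest of the arguments (inputs, command, and so on) elided here:

$ reproman run --resource foo --orc datalad-pair-run ...
$ reproman run --resource foo --job-parameter orchestrator=datalad-pair-run ...
$ reproman run --resource foo --job-spec my-spec.yaml ...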

When a value is specified in multiple sources, the order of precedence is the dedicated option, then the value specified via --job-parameter, and finally the value contained in a --job-spec YAML file. When multiple --job-spec arguments are given and define a conflicting key, the value from the last specified file wins.

Captured job information

When using any DataLad-based orchestrator, the run will ultimately be captured as a commit in the dataset. In addition to working tree changes that the command caused (e.g., files it generated), the commit will include new files under a .reproman/jobs/<resource name>/<job ID>/ directory. Of the files from that directory, the ones described below are likely to be of the most interest to callers.

submit
The batch system submit file (e.g., when the submitter is condor, the file passed to condor_submit).
runscript
The wrapper script called by the submit file. It runs the subjob command indicated by its sole command-line argument, an integer that represents the subjob.
std{out,err}.N
The standard output and standard error for each subjob command. If subjob N fails, stderr.N is where you should look first for more information.
spec.yaml
The “job spec” mentioned in the last section. Any key that does not start with an underscore is a job parameter that can be specified by the caller.

In addition to recording information about the submitted job, this spec can provide a starting point for future reproman run calls. You can copy it to a new file, tweak it as desired, and feed it in via --job-spec. Or, instead of copying the file, you can give the original file to --job-spec and then override the values as needed with command-line arguments or later --job-spec values.
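
For example, reusing the spec of a completed job on resource “foo” might look like this (the copied file name and the --sub override are only illustrative):

$ cp .reproman/jobs/foo/<job ID>/spec.yaml tuned-spec.yaml
$ # ... edit tuned-spec.yaml as desired ...
$ reproman run --resource foo --job-spec tuned-spec.yaml --sub condor \
  ./clean.py data/f1.csv cleaned/f1.csv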