Execute
Once a resource is present in your inventory (see Managing resources), ReproMan provides a few ways to execute command(s) on the resource. The first is to request an interactive shell for a resource with reproman login. Another is to use reproman execute, which is suitable for running one-off commands on the resource (though, as its manpage indicates, it’s capable of a bit more). To some degree, you can think of login and execute as analogous to ssh HOST and ssh HOST COMMAND, respectively, where the ReproMan variants provide a common interface across resource types.
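For example, assuming a resource named “foo” is in the inventory, the two calls might look like the sketch below. The exact option spelling is an assumption here (your version may accept the resource differently), so check reproman login --help and reproman execute --help:
# Open an interactive shell on the resource.
$ reproman login foo
# Run a one-off command (here, a hypothetical "ls") on the resource.
$ reproman execute --resource foo ls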
The final way to execute a command is reproman run.
Run
reproman run is concerned with three high-level tasks:
- Starting from a call on the local machine, prepare the remote resource for command execution (e.g., copying input files to the remote).
- Execute the command on the remote resource, typically through a batch system.
- Fetch the results to the local machine. The results include command output as well as information about the execution (e.g., batch system submit files).
Reference example
Let’s first establish a simple example that we can reference as we cover some of the details. In a terminal, we’re visiting a DataLad dataset where the working tree looks like this:
.
|-- clean.py
`-- data
|-- f0.csv -> ../.git/annex/objects/[...]
`-- f1.csv -> ../.git/annex/objects/[...]
The clean.py script takes two positional arguments (e.g., ./clean.py data/f0.csv cleaned/f0.csv), where the first is a data file to process and the second is a path to write the output (creating directories if necessary).
Note
Although DataLad is not a strict requirement, having it installed on at least the local machine is strongly recommended, and without it only a limited set of functionality is available. If you are new to DataLad, consider reading the DataLad handbook.
Choosing an orchestrator
Before running a command, we need to decide on an orchestrator. The orchestrator is responsible for the first and third tasks above: preparing the remote and fetching the results. The complete set of orchestrators, accompanied by descriptions, can be seen by calling reproman run --list=orchestrators.
The main orchestrator choices are datalad-pair, datalad-pair-run, and datalad-local-run. If the remote has DataLad available, you should go with one of the datalad-pair* orchestrators. These will sync your local dataset with a dataset on the remote machine (using datalad push), creating one if it doesn’t already exist (using datalad create-sibling).
datalad-pair differs from the datalad-*-run orchestrators in the way it captures results. After execution has completed, datalad-pair commits the result on the remote via DataLad. On fetch, it will pull that commit down with datalad update. Outputs (specified via --outputs or as a job parameter) are retrieved with datalad get.
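In rough terms, the syncing that datalad-pair performs corresponds to a manual DataLad workflow like the following sketch. The sibling name, host, and paths are placeholders, and the orchestrator manages these details itself:
# Create a dataset sibling on the remote and push the local state.
$ datalad create-sibling -s foo user@host:/path/to/remote/dataset
$ datalad push --to foo
# ... the command runs and its results are committed on the remote ...
# Fetch the remote commit and retrieve annexed outputs.
$ datalad update -s foo --merge
$ datalad get cleaned/f0.csv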
datalad-pair-run and datalad-local-run, on the other hand, determine a list of output files based on modification times and package these files in a tarball. (This approach is inspired by datalad-htcondor.) On fetch, this tarball is downloaded locally and used to create a datalad run commit in the local repository.
There is one more orchestrator, datalad-no-remote, that is designed to work only with a local shell resource. It is similar to datalad-pair, except that the command is executed in the same directory from which reproman run is invoked.
Revisiting our concrete example and assuming we have an SSH resource named “foo” in our inventory, here’s how we could specify that the datalad-pair-run orchestrator should be used:
$ reproman run --resource foo \
--orc datalad-pair-run --input data/f0.csv \
./clean.py data/f0.csv cleaned/f0.csv
Notice that in addition to the orchestrator, we specify the input file that needs to be available on the remote. This is only necessary for files that are tracked by git-annex. Files tracked by Git do not need to be declared as inputs because the same revision of the dataset is checked out on the remote.
Warning
The orchestration with DataLad datasets is work in progress, with some rough edges. You might end up in a state that ReproMan doesn’t know how to sync. Please report any issues you encounter on the issue tracker.
Choosing a submitter
Another, easier decision is which submitter to use. This comes down to which, if any, batch system your remote resource supports. The currently available options are pbs, condor, and local. With local, the job is executed directly through sh rather than submitted to a batch system.
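For instance, the earlier example could be run without a batch system by choosing the local submitter:
$ reproman run --resource foo \
    --sub local \
    --orc datalad-pair-run --input data/f0.csv \
    ./clean.py data/f0.csv cleaned/f0.csv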
Our last example invocation could be extended to use Condor like so:
$ reproman run --resource foo \
--sub condor \
--orc datalad-pair-run --input data/f0.csv \
./clean.py data/f0.csv cleaned/f0.csv
Note that which batch systems are currently supported is mostly a matter of which systems the ReproMan developers have at their disposal. If you would like to add support for your system (or have experience with a more general approach like DRMAA), we’d welcome help in this area.
Detached jobs
By default, when a run command is executed, it submits the job, registers it locally, and exits. The registered jobs can be viewed and managed with reproman jobs. To list all jobs, run reproman jobs without any arguments. To fetch a completed job back into the local dataset, call reproman jobs NAME, where NAME is a substring of the job ID that uniquely identifies the job.
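For example (the job ID substring below is hypothetical):
# List all registered jobs.
$ reproman jobs
# Fetch the completed job whose ID contains "20231107".
$ reproman jobs 20231107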
In cases where you prefer run to stay attached and fetch the job when it is finished, pass the --follow argument to reproman run.
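Continuing with the Condor example above, that would look like this:
$ reproman run --follow --resource foo \
    --sub condor \
    --orc datalad-pair-run --input data/f0.csv \
    ./clean.py data/f0.csv cleaned/f0.csv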
Concurrent subjobs
If you’re submitting a job to a batch system, it’s likely that you want to submit concurrent subjobs. To continue with the toy example from above, you’d want to have two jobs, each one running clean.py on a different input file.
reproman run has two options for specifying subjobs: --batch-parameter and --batch-spec. The first can work for simple cases, like our example:
$ reproman run --resource foo --sub condor --orc datalad-pair-run \
--batch-parameter name=f0,f1 \
--input 'data/{p[name]}.csv' \
./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv
A subjob will be created for each name value, with any {p[name]} field in the input, output, and command strings formatted with the value. In this case, the two commands executed on the remote would be
./clean.py data/f0.csv cleaned/f0.csv
./clean.py data/f1.csv cleaned/f1.csv
The --batch-spec option is the more cumbersome but more flexible counterpart to --batch-parameter. Its value should point to a YAML file that defines a series of records, each one with all of the parameters for a single subjob command. The equivalent of --batch-parameter name=f0,f1 would be a YAML file with the following content:
- name: f0
- name: f1
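Assuming that content is saved as params.yaml (the file name is arbitrary), the earlier invocation becomes:
$ reproman run --resource foo --sub condor --orc datalad-pair-run \
    --batch-spec params.yaml \
    --input 'data/{p[name]}.csv' \
    ./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv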
Warning
When there is more than one subjob, *-run orchestrators do not create a valid run commit. Specifically, datalad rerun could not be used to rerun the commit on the local machine because the values for the inputs, outputs, and command do not correspond to concrete values. This is an unresolved issue, but at this point the commit should be considered as a way to capture the information about the remote command execution, one that certainly provides more information than logging into the remote and running condor_submit yourself.
Job parameters
To define a job, ReproMan builds up a “job spec” from job parameters. Call reproman run --list=parameters to see a list of available parameters. The parameters can be specified within a file passed to the --job-spec option, as a key-value pair specified via the --job-parameter option, or through a dedicated command-line option.
The last option is only available for a subset of parameters, with the intention of giving these parameters more exposure and making them slightly more convenient to use. In the examples so far, we’ve only seen job parameters in the form of a dedicated command-line argument, things like --orc datalad-pair-run. Alternatively, this could be expressed more verbosely through --job-parameter as --job-parameter orchestrator=datalad-pair-run. Or it could be contained as a top-level key-value pair in a YAML file passed to --job-spec.
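For instance, the following three invocations all select the datalad-pair-run orchestrator (other options such as the input and submitter are omitted for brevity, and job.yaml is a hypothetical spec file containing the single line "orchestrator: datalad-pair-run"):
# Dedicated command-line option:
$ reproman run --resource foo --orc datalad-pair-run \
    ./clean.py data/f0.csv cleaned/f0.csv
# Generic key-value option:
$ reproman run --resource foo --job-parameter orchestrator=datalad-pair-run \
    ./clean.py data/f0.csv cleaned/f0.csv
# Job spec file:
$ reproman run --resource foo --job-spec job.yaml \
    ./clean.py data/f0.csv cleaned/f0.csv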
Captured job information
When using any DataLad-based orchestrator, the run will ultimately be
captured as a commit in the dataset. In addition to working tree changes
that the command caused (e.g., files it generated), the commit will
include new files under a .reproman/jobs/<resource name>/<job ID>/
directory. Of the files from that directory, the ones described below
are likely to be of the most interest to callers.
- submit: The batch system submit file (e.g., when the submitter is condor, the file passed to condor_submit).
- runscript: The wrapper script called by the submit file. It runs the subjob command indicated by its sole command-line argument, an integer that represents the subjob.
- std{out,err}.N: The standard output and standard error for each subjob command. If subjob N fails, stderr.N is where you should look first for more information.
- spec.yaml: The “job spec” mentioned in the last section. Any key that does not start with an underscore is a job parameter that can be specified by the caller. In addition to recording information about the submitted job, this spec can provide a starting point for future reproman run calls. You can copy it to a new file, tweak it as desired, and feed it in via --job-spec. Or, instead of copying the file, you can give the original file to --job-spec and then override the values as needed with command-line arguments or later --job-spec values.
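As a sketch (the paths are placeholders, and <job ID> stands for the actual job directory name):
# Copy the captured spec, tweak it, and reuse it for a new run.
$ cp .reproman/jobs/foo/<job ID>/spec.yaml new-spec.yaml
$ reproman run --job-spec new-spec.yaml ./clean.py data/f1.csv cleaned/f1.csv
# Or point --job-spec at the original file and override values on the command line.
$ reproman run --job-spec .reproman/jobs/foo/<job ID>/spec.yaml \
    --sub local ./clean.py data/f1.csv cleaned/f1.csv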