Execution on Resources
Once a resource is present in your inventory (see Managing
resources), ReproMan provides a few ways to execute command(s)
on the resource. The first is to request an interactive shell for a
resource with reproman login. Another is to
use reproman execute, which is suitable
for running one-off commands on the resource (though, as its manpage
indicates, it’s capable of a bit more). To some degree, you can think of
login and execute as analogous to ssh HOST and ssh HOST
COMMAND, respectively, where the ReproMan variants provide a common
interface across resource types.
The final way to execute a command is reproman run.
Run
reproman run is concerned with running “jobs” on remote resources. It is executed from the local machine and handles three high-level tasks:
Prepare: Move data to the remote resource and set up the execution environment.
Execute Job: Submit the job to a batch system or run the command directly.
Collect: Retrieve outputs, metadata, and logs from the remote resource and bring them together locally to complete the workflow.
Reference example
Let’s first establish a simple example that we can reference as we cover some of the details. In a terminal, we’re visiting a DataLad dataset where the working tree looks like this:
.
|-- clean.py
`-- data
    |-- f0.csv -> ../.git/annex/objects/[...]
    `-- f1.csv -> ../.git/annex/objects/[...]
The clean.py script takes two positional arguments (e.g., ./clean.py
data/f0.csv cleaned/f0.csv), where the first is a data file to process
and the second is a path to write the output (creating directories if
necessary).
Choosing an orchestrator
Orchestrators are responsible for preparing the remote and collecting the results.
The complete set of orchestrators, accompanied by descriptions, can be seen by
calling reproman run --list=orchestrators.
Note
Although DataLad is not a strict requirement, having it installed on at least the local machine is strongly recommended, and without it only a limited set of functionality is available. If you are new to DataLad, consider reading the DataLad handbook.
Choose the orchestrator based on your setup and needs:
For remote resources with DataLad (recommended):
datalad-pair - Best for persistent remote datasets
Creates and maintains DataLad datasets on the remote
Commits results directly on the remote with full provenance
Retrieves results using datalad update and datalad get
Marks completed jobs with git refs (refs/reproman/JOBID)
datalad-pair-run - Best for capturing runs in local dataset
Prepares remote dataset like datalad-pair
Packages results in tarball based on file modification times
Creates a datalad run commit in your local repository
Marks local commit with git ref (refs/reproman/JOBID)
For remote resources without DataLad:
datalad-local-run - Remote execution, local DataLad integration
Uses plain remote directory (no DataLad on remote required)
Captures results as datalad run commit locally
Good when remote lacks DataLad but you want local provenance
plain - Simple remote execution
Basic file transfer using session.put() and session.get()
No DataLad integration or provenance tracking
Creates working directory named with job ID
Sufficient for simple tasks but DataLad orchestrators recommended
For local execution:
datalad-no-remote - Local dataset execution
Executes in current local dataset directory
Behaves like datalad-pair but stays local
Available for local shell resources only
Good for testing workflows locally
Revisiting our concrete example and assuming we have
an SSH resource named “foo” in our inventory, here’s how we could
specify that the datalad-pair-run orchestrator should be used:
$ reproman run --resource foo \
--orc datalad-pair-run --input data/f0.csv \
./clean.py data/f0.csv cleaned/f0.csv
Notice that in addition to the orchestrator, we specify the input file that needs to be available on the remote. This is only necessary for files that are tracked by git-annex. Files tracked by Git do not need to be declared as inputs because the same revision of the dataset is checked out on the remote.
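As a rough aid for deciding which files to declare, the sketch below picks out paths that look annexed. It assumes annexed files show up as symlinks pointing into .git/annex/objects, as in the listing of the example dataset. Unlocked annexed files would not be detected. The helper names are made up for illustration and are not part of ReproMan.

```python
import os
from pathlib import Path


def annexed_inputs(paths):
    """Return the paths that look like git-annex symlinks.

    Assumption: an annexed file is a symlink whose target passes through
    ".git/annex/objects". Unlocked annexed files are not detected.
    """
    found = []
    for path in map(Path, paths):
        if path.is_symlink():
            target = os.readlink(path).replace(os.sep, "/")
            if ".git/annex/objects" in target:
                found.append(str(path))
    return found


def input_args(paths):
    """Build --input arguments for `reproman run` from the annexed paths."""
    args = []
    for path in annexed_inputs(paths):
        args.extend(["--input", path])
    return args
```

Called from the dataset root as input_args(["clean.py", "data/f0.csv"]), this would keep only data/f0.csv in the example dataset, since clean.py is tracked by Git directly.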
Warning
The orchestration with DataLad datasets is a work in progress, with some rough edges. You might end up in a state that ReproMan doesn’t know how to sync. Please report any issues you encounter on the issue tracker.
Choosing a submitter
Another, easier decision is which submitter to use. This comes down to
which, if any, batch system your remote resource supports. The currently
available options are pbs, condor, or local. With local,
the job is executed directly through sh rather than submitted to a
batch system.
Our last example invocation could be extended to use Condor like so:
$ reproman run --resource foo \
--sub condor \
--orc datalad-pair-run --input data/f0.csv \
./clean.py data/f0.csv cleaned/f0.csv
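If you launch runs from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not a ReproMan API: build_run_argv is a hypothetical helper, and it uses only flags that appear in this section (--resource, --sub, --orc, --input, and --follow, which is described under Detached jobs).

```python
import shlex


def build_run_argv(command, resource, orchestrator=None, submitter=None,
                   inputs=(), follow=False):
    """Assemble an argument list for `reproman run` (hypothetical helper)."""
    argv = ["reproman", "run", "--resource", resource]
    if submitter:
        argv += ["--sub", submitter]
    if orchestrator:
        argv += ["--orc", orchestrator]
    for path in inputs:
        argv += ["--input", path]
    if follow:
        argv.append("--follow")
    # The command to execute on the resource comes last.
    return argv + list(command)


argv = build_run_argv(
    ["./clean.py", "data/f0.csv", "cleaned/f0.csv"],
    resource="foo",
    orchestrator="datalad-pair-run",
    submitter="condor",
    inputs=["data/f0.csv"],
)
# Show the equivalent shell command line.
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) to launch the job.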
Note that the set of supported batch systems mostly reflects which systems ReproMan developers currently have at their disposal. If you would like to add support for your system (or have experience with a more general approach like DRMAA), we’d welcome help in this area.
Detached jobs
By default, when a run command is executed, it submits the job,
registers it locally, and exits. The registered jobs can be viewed and
managed with reproman jobs. To list all jobs,
run reproman jobs without any arguments. To fetch a completed job
back into the local dataset, call reproman jobs NAME, where NAME
is a substring of the job ID that uniquely identifies the job.
In cases where you prefer run to stay attached and fetch the job
when it is finished, pass the --follow argument to reproman run.
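The rule for NAME can be pictured with a short sketch. This only illustrates "a substring that uniquely identifies the job" and is not ReproMan's own lookup code. The job IDs are invented for the example.

```python
def resolve_job(name, job_ids):
    """Return the single job ID containing `name`, or raise ValueError."""
    matches = [job_id for job_id in job_ids if name in job_id]
    if not matches:
        raise ValueError(f"no job matches {name!r}")
    if len(matches) > 1:
        raise ValueError(f"{name!r} is ambiguous: {', '.join(matches)}")
    return matches[0]


# Hypothetical job IDs, made up for illustration.
jobs = ["20240101-120000-ab12", "20240101-130000-cd34"]
print(resolve_job("cd34", jobs))
```

Here "cd34" selects the second job, while "20240101" would be rejected because it matches both.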
Concurrent subjobs
If you’re submitting a job to a batch system, it’s likely that you want
to submit concurrent subjobs. To continue with the toy example from above, you’d want to have two jobs, each one running
clean.py on a different input file.
reproman run has two options for specifying subjobs:
--batch-parameter and --batch-spec. The first can work for
simple cases, like our example:
$ reproman run --resource foo --sub condor --orc datalad-pair-run \
--batch-parameter name=f0,f1 \
--input 'data/{p[name]}.csv' \
./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv
A subjob will be created for each name value, with any {p[name]}
field in the input, output, and command strings formatted with the
value. In this case, the two commands executed on the remote would be
./clean.py data/f0.csv cleaned/f0.csv
./clean.py data/f1.csv cleaned/f1.csv
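The {p[name]} placeholder has the same shape as a Python format field. Assuming the expansion behaves like str.format with the parameters bound to p (an assumption about the mechanism, not a statement about ReproMan's internals), the two commands can be reproduced like this:

```python
def expand_subjobs(command_template, values, key="name"):
    """Format the command template once per parameter value."""
    return [command_template.format(p={key: value}) for value in values]


template = "./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv"
for command in expand_subjobs(template, ["f0", "f1"]):
    print(command)
```

Running a sketch like this before submitting is a cheap way to check that a template expands to the commands you expect.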
The --batch-spec option is the more cumbersome but more flexible
counterpart to --batch-parameter. Its value should point to a YAML
file that defines a series of records, each one with all of the
parameters for a single subjob command. The equivalent of
--batch-parameter name=f0,f1 would be a YAML file with the following
content:
- name: f0
- name: f1
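When the records come from a script rather than being typed by hand, the file can be generated. The sketch below formats records as a YAML list without a YAML library. It assumes keys and values are simple strings that need no YAML quoting, so for anything more involved a YAML library is the safer choice. format_batch_spec is a hypothetical helper.

```python
def format_batch_spec(records):
    """Render a list of flat dicts as a YAML list of mappings.

    Assumption: keys and values are plain strings that need no quoting.
    """
    lines = []
    for record in records:
        if not record:
            raise ValueError("each record needs at least one parameter")
        for index, (key, value) in enumerate(record.items()):
            # The first key of a record starts a new list item.
            prefix = "- " if index == 0 else "  "
            lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines) + "\n"


records = [{"name": stem} for stem in ("f0", "f1")]
print(format_batch_spec(records), end="")
```

The text can then be written to a file and that file's path passed to --batch-spec.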
Warning
When there is more than one subjob, *-run orchestrators do not
create a valid run commit. Specifically, datalad rerun cannot
be used to rerun the commit on the local machine because the values
for the inputs, outputs, and command do not correspond to concrete
values. This is an unresolved issue. For now, the commit
should be considered a way to capture information about the
remote command execution, one that certainly provides more
information than logging into the remote and running
condor_submit yourself.
Job parameters
To define a job, ReproMan builds up a “job spec” from job parameters.
Call reproman run --list=parameters to see a list of available
parameters. The parameters can be specified within a file passed to the
--job-spec option, as a key-value pair specified via the
--job-parameter option, or through a dedicated command-line option.
The last option is only available for a subset of parameters, with the
intention of giving these parameters more exposure and making them
slightly more convenient to use. In the examples so far, we’ve only seen
job parameters in the form of a dedicated command-line argument, things
like --orc datalad-pair-run. Alternatively this could be expressed
more verbosely through --job-parameter as --job-parameter
orchestrator=datalad-pair-run. Or it could be contained as a top-level
key-value pair in a YAML file passed to --job-spec.
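One way to picture how these sources combine is as layered mappings, where later --job-spec files override earlier ones and command-line values override both. The sketch below is only a mental model and assumes a flat merge of top-level keys. The orchestrator key comes from the example above, but the submitter key name is an assumption, so check reproman run --list=parameters for the real names.

```python
def merge_job_parameters(spec_files, cli_parameters):
    """Combine parameter mappings; later sources override earlier ones."""
    merged = {}
    for spec in spec_files:
        merged.update(spec)
    # Command-line values are applied last, so they take precedence.
    merged.update(cli_parameters)
    return merged


# "submitter" is an assumed key name, used here for illustration only.
base_spec = {"orchestrator": "datalad-pair", "submitter": "local"}
override_spec = {"submitter": "condor"}
cli = {"orchestrator": "datalad-pair-run"}
merged = merge_job_parameters([base_spec, override_spec], cli)
print(merged)
```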
Captured job information
When using any DataLad-based orchestrator, the run will ultimately be
captured as a commit in the dataset. In addition to working tree changes
that the command caused (e.g., files it generated), the commit will
include new files under a .reproman/jobs/<resource name>/<job ID>/
directory. Of the files from that directory, the ones described below
are likely to be of the most interest to callers.
- submit
The batch system submit file (e.g., when the submitter is condor, the file passed to condor_submit).
- runscript
The wrapper script called by the submit file. It runs the subjob command indicated by its sole command-line argument, an integer that represents the subjob.
- std{out,err}.N
The standard output and standard error for each subjob command. If subjob N fails, stderr.N is where you should look first for more information.
- spec.yaml
The “job spec” mentioned in the last section. Any key that does not start with an underscore is a job parameter that can be specified by the caller.
In addition to recording information about the submitted job, this spec can provide a starting point for future reproman run calls. You can copy it to a new file, tweak it as desired, and feed it in via --job-spec. Or, instead of copying the file, you can give the original file to --job-spec and then override the values as needed with command-line arguments or later --job-spec values.
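To check the per-subjob error output after a job has been fetched, a small helper can walk the job directory. This is a sketch under two assumptions: the files follow the .reproman/jobs/<resource name>/<job ID>/stderr.N layout described above, and their content is present locally. nonempty_stderr is a hypothetical name. Output on stderr does not by itself mean that a subjob failed.

```python
from pathlib import Path


def nonempty_stderr(dataset, resource, job_id):
    """Return {subjob number: stderr text} for subjobs that wrote to stderr."""
    job_dir = Path(dataset) / ".reproman" / "jobs" / resource / job_id
    result = {}
    for path in sorted(job_dir.glob("stderr.*")):
        suffix = path.name.split(".", 1)[1]
        if not suffix.isdigit():
            continue  # skip anything that is not stderr.N
        text = path.read_text(errors="replace")
        if text.strip():
            result[int(suffix)] = text
    return result
```

For example, nonempty_stderr(".", "foo", JOB_ID) run from the dataset root would return a mapping from subjob numbers to the captured stderr text, where JOB_ID stands for the full job ID.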