Execution on Resources

Once a resource is present in your inventory (see Managing resources), ReproMan provides a few ways to execute command(s) on the resource. The first is to request an interactive shell for a resource with reproman login. Another is to use reproman execute, which is suitable for running one-off commands on the resource (though, as its manpage indicates, it’s capable of a bit more). To some degree, you can think of login and execute as analogous to ssh HOST and ssh HOST COMMAND, respectively, where the ReproMan variants provide a common interface across resource types.

The final way to execute a command is reproman run.

Run

reproman run is concerned with running “jobs” on remote resources. It is executed from the local machine and handles three high-level tasks:

Prepare: Move data to the remote resource and set up the execution environment.

Execute Job: Submit the job to a batch system or run the command directly.

Collect: Retrieve the outputs, along with job metadata and logs, and bring them together on the local machine.

Reference example

Let’s first establish a simple example that we can reference as we cover some of the details. In a terminal, we’re visiting a DataLad dataset where the working tree looks like this:

.
|-- clean.py
`-- data
    |-- f0.csv -> ../.git/annex/objects/[...]
    `-- f1.csv -> ../.git/annex/objects/[...]

The clean.py script takes two positional arguments (e.g., ./clean.py data/f0.csv cleaned/f0.csv), where the first is a data file to process and the second is a path to write the output (creating directories if necessary).

Choosing an orchestrator

Orchestrators are responsible for preparing the remote and collecting the results. The complete set of orchestrators, accompanied by descriptions, can be seen by calling reproman run --list=orchestrators.

Note

Although DataLad is not a strict requirement, having it installed on at least the local machine is strongly recommended, and without it only a limited set of functionality is available. If you are new to DataLad, consider reading the DataLad handbook.

Choose the orchestrator based on your setup and needs:

For remote resources with DataLad (recommended):

``datalad-pair`` - Best for persistent remote datasets
- Creates and maintains DataLad datasets on the remote
- Commits results directly on the remote with full provenance
- Retrieves results using datalad update and datalad get
- Marks completed jobs with git refs (refs/reproman/JOBID)

``datalad-pair-run`` - Best for capturing runs in the local dataset
- Prepares remote dataset like datalad-pair
- Packages results in tarball based on file modification times
- Creates a datalad run commit in your local repository
- Marks local commit with git ref (refs/reproman/JOBID)

For remote resources without DataLad:

``datalad-local-run`` - Remote execution, local DataLad integration
- Uses plain remote directory (no DataLad on remote required)
- Captures results as datalad run commit locally
- Good when remote lacks DataLad but you want local provenance

``plain`` - Simple remote execution
- Basic file transfer using session.put() and session.get()
- No DataLad integration or provenance tracking
- Creates working directory named with job ID
- Sufficient for simple tasks but DataLad orchestrators recommended

For local execution:

``datalad-no-remote`` - Local dataset execution
- Executes in current local dataset directory
- Behaves like datalad-pair but stays local
- Available for local shell resources only
- Good for testing workflows locally

Revisiting our concrete example and assuming we have an SSH resource named “foo” in our inventory, here’s how we could specify that the datalad-pair-run orchestrator should be used:

$ reproman run --resource foo \
  --orc datalad-pair-run --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv

Notice that in addition to the orchestrator, we specify the input file that needs to be available on the remote. This is only necessary for files that are tracked by git-annex. Files tracked by Git do not need to be declared as inputs because the same revision of the dataset is checked out on the remote.
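To decide which paths need an --input declaration, one rough check is whether a file is a symlink into .git/annex/objects, as in the working tree shown earlier. The Python sketch below assumes that layout; it will not detect annexed files that are unlocked or that live on an adjusted branch. The helper name annexed_files is hypothetical and not part of ReproMan or DataLad.

```python
import os


def annexed_files(paths):
    """Return the paths that look like git-annex symlinks.

    Heuristic only: assumes annexed files are symlinks whose target
    passes through .git/annex/objects, as in the example tree.
    """
    found = []
    for path in paths:
        # islink() is also true for dangling links, which is what an
        # annexed file looks like before its content has been fetched.
        if os.path.islink(path):
            target = os.readlink(path).replace(os.sep, "/")
            if ".git/annex/objects/" in target:
                found.append(path)
    return found
```

Any path this returns is a candidate for --input; files it skips are either tracked directly by Git or not recognizable by this heuristic.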

Warning

The orchestration with DataLad datasets is a work in progress, with some rough edges. You might end up in a state that ReproMan doesn’t know how to sync. Please report any issues you encounter on the issue tracker.

Choosing a submitter

Another, easier decision is which submitter to use. This comes down to which, if any, batch system your remote resource supports. The currently available options are pbs, condor, or local. With local, the job is executed directly through sh rather than submitted to a batch system.

Our last example invocation could be extended to use Condor like so:

$ reproman run --resource foo \
  --sub condor \
  --orc datalad-pair-run --input data/f0.csv \
  ./clean.py data/f0.csv cleaned/f0.csv

The set of supported batch systems mostly reflects which systems ReproMan developers currently have at their disposal. If you would like to add support for your system (or have experience with a more general approach like DRMAA), we’d welcome help in this area.

Detached jobs

By default, when a run command is executed, it submits the job, registers it locally, and exits. The registered jobs can be viewed and managed with reproman jobs. To list all jobs, run reproman jobs without any arguments. To fetch a completed job back into the local dataset, call reproman jobs NAME, where NAME is a substring of the job ID that uniquely identifies the job.

In cases where you prefer run to stay attached and fetch the job when it is finished, pass the --follow argument to reproman run.
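The "substring that uniquely identifies the job" rule can be illustrated with a small Python sketch. This is not ReproMan's implementation; the function name resolve_job and the job IDs are hypothetical and only show what "unique" means here.

```python
def resolve_job(name, job_ids):
    """Return the single job ID containing `name`, else raise ValueError."""
    matches = [job_id for job_id in job_ids if name in job_id]
    if len(matches) != 1:
        raise ValueError(
            "{!r} matches {} jobs; expected exactly 1".format(
                name, len(matches)))
    return matches[0]


# Hypothetical job IDs, for illustration only.
jobs = ["20240101-120000-ab12", "20240102-093000-cd34"]
```

With these IDs, "ab12" selects the first job, while "2024" is ambiguous because it occurs in both.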

Concurrent subjobs

If you’re submitting a job to a batch system, it’s likely that you want to submit concurrent subjobs. To continue with the toy example from above, you’d want two subjobs, each one running clean.py on a different input file.

reproman run has two options for specifying subjobs: --batch-parameter and --batch-spec. The first can work for simple cases, like our example:

$ reproman run --resource foo --sub condor --orc datalad-pair-run \
  --batch-parameter name=f0,f1 \
  --input 'data/{p[name]}.csv' \
  ./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv

A subjob will be created for each name value, with any {p[name]} field in the input, output, and command strings formatted with the value. In this case, the two commands executed on the remote would be

./clean.py data/f0.csv cleaned/f0.csv
./clean.py data/f1.csv cleaned/f1.csv
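The {p[name]} placeholders follow Python's str.format field syntax, so the expansion can be reproduced with a short sketch. The helper names are hypothetical, and taking the product of values when several keys are given is an assumption made for this sketch rather than something stated above.

```python
from itertools import product


def expand_batch_parameters(specs):
    """Turn ["name=f0,f1", ...] into one dict per subjob.

    Assumption: with several keys, subjobs are the product of the values.
    """
    keys, value_lists = [], []
    for spec in specs:
        key, _, values = spec.partition("=")
        keys.append(key)
        value_lists.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


def render(template, records):
    """Fill the {p[KEY]} fields in `template` once per record."""
    return [template.format(p=record) for record in records]


records = expand_batch_parameters(["name=f0,f1"])
commands = render(
    "./clean.py data/{p[name]}.csv cleaned/{p[name]}.csv", records)
```

Here commands holds the same two command strings listed above, one per value of name.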

The --batch-spec option is the more cumbersome but more flexible counterpart to --batch-parameter. Its value should point to a YAML file that defines a series of records, each one with all of the parameters for a single subjob command. The equivalent of --batch-parameter name=f0,f1 would be a YAML file with the following content:

- name: f0
- name: f1
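When the list of subjobs is long, the file can be generated rather than typed. The sketch below writes flat records in the layout shown above without a YAML library. The function name write_batch_spec is hypothetical, and values are written unquoted, so it is only suitable for simple strings such as f0; anything that needs YAML quoting calls for a real YAML writer.

```python
def write_batch_spec(records, path):
    """Write flat records as a YAML sequence of mappings.

    Sketch only: values are written unquoted, so this is safe only for
    simple strings that need no YAML quoting or escaping.
    """
    lines = []
    for record in records:
        for index, (key, value) in enumerate(record.items()):
            # The first key of a record opens a new list item.
            prefix = "- " if index == 0 else "  "
            lines.append("{}{}: {}".format(prefix, key, value))
    text = "\n".join(lines) + "\n"
    with open(path, "w") as stream:
        stream.write(text)
    return text
```

The resulting file is what you would pass to --batch-spec.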

Warning

When there is more than one subjob, *-run orchestrators do not create a valid run commit. Specifically, datalad rerun cannot be used to rerun the commit on the local machine because the values for the inputs, outputs, and command do not correspond to concrete values. This is an unresolved issue, but for now the commit should be treated as a record of the remote command execution, one that certainly provides more information than logging into the remote and running condor_submit yourself.

Job parameters

To define a job, ReproMan builds up a “job spec” from job parameters. Call reproman run --list=parameters to see a list of available parameters. The parameters can be specified within a file passed to the --job-spec option, as a key-value pair specified via the --job-parameter option, or through a dedicated command-line option.

The last option is only available for a subset of parameters, with the intention of giving these parameters more exposure and making them slightly more convenient to use. In the examples so far, we’ve only seen job parameters in the form of a dedicated command-line argument, such as --orc datalad-pair-run. Alternatively, this could be expressed more verbosely through --job-parameter as --job-parameter orchestrator=datalad-pair-run. Or it could be given as a top-level key-value pair in a YAML file passed to --job-spec.

When a value is specified in multiple sources, the order of precedence is the dedicated option, then the value specified via --job-parameter, and finally the value contained in a --job-spec YAML file. When multiple --job-spec arguments are given and define a conflicting key, the value from the last specified file wins.
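The precedence rules amount to merging dictionaries from lowest to highest priority. The following Python sketch illustrates that ordering; it is not ReproMan's code, the function name is hypothetical, and the parameter key submitter used in the usage note is assumed by analogy with orchestrator.

```python
def resolve_job_spec(spec_files, job_parameters, dedicated_options):
    """Merge parameter sources from lowest to highest precedence.

    spec_files: list of dicts, one per --job-spec file, in the order
        given on the command line.
    job_parameters: dict built from --job-parameter KEY=VALUE pairs.
    dedicated_options: dict of values from dedicated options.
    """
    merged = {}
    for spec in spec_files:
        merged.update(spec)  # a later file overrides an earlier one
    merged.update(job_parameters)  # overrides any --job-spec file
    merged.update(dedicated_options)  # highest precedence
    return merged
```

For example, if two spec files set submitter to local and pbs, --job-parameter sets it to condor, and a dedicated option sets orchestrator, the result uses condor and the orchestrator from the dedicated option.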

Captured job information

When using any DataLad-based orchestrator, the run will ultimately be captured as a commit in the dataset. In addition to working tree changes that the command caused (e.g., files it generated), the commit will include new files under a .reproman/jobs/<resource name>/<job ID>/ directory. Of the files from that directory, the ones described below are likely to be of the most interest to callers.

submit

The batch system submit file (e.g., when the submitter is condor, the file passed to condor_submit).

runscript

The wrapper script called by the submit file. It runs the subjob command indicated by its sole command-line argument, an integer that represents the subjob.

std{out,err}.N

The standard output and standard error for each subjob command. If subjob N fails, stderr.N is where you should look first for more information.

spec.yaml

The “job spec” mentioned in the last section. Any key that does not start with an underscore is a job parameter that can be specified by the caller.

In addition to recording information about the submitted job, this spec can provide a starting point for future reproman run calls. You can copy it to a new file, tweak it as desired, and feed it in via --job-spec. Or, instead of copying the file, you can give the original file to --job-spec and then override the values as needed with command-line arguments or later --job-spec values.
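As a rough illustration of working with these files, the Python sketch below builds the dataset-relative path of a captured file and filters a loaded spec down to the keys a caller can set. Both helper names and the example spec contents are hypothetical; loading spec.yaml into a dictionary is left to whichever YAML library you use.

```python
from pathlib import PurePosixPath


def job_file(resource_name, job_id, filename):
    """Build the dataset-relative path of a captured job file."""
    return PurePosixPath(
        ".reproman", "jobs", resource_name, job_id, filename)


def caller_parameters(spec):
    """Keep only keys that do not start with an underscore."""
    return {key: value for key, value in spec.items()
            if not key.startswith("_")}


# Hypothetical spec contents; a real spec.yaml will contain other keys.
example_spec = {
    "orchestrator": "datalad-pair-run",
    "_example_internal_key": "set by ReproMan, not by the caller",
}
```

Applied to example_spec, caller_parameters keeps only the orchestrator entry, which is the kind of value you might tweak before passing a copy of the file back through --job-spec.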