Example Workflow Structures

Scientific workflows are commonly used in research and other data-intensive applications to manage complex computational processes. However, complexity is not a requirement: even simple structures can benefit from being implemented as a workflow. Here are some basic examples.

Single Job

This is the simplest type of scientific workflow, where a single job is executed to perform a specific task. For example, running a single simulation or data analysis on a set of input files.

wf = Workflow(name="single-job")

in_file = File("input.data")
out_file = File("output.data")

job = Job(my_process)\
        .add_args("example")\
        .add_inputs(in_file)\
        .add_outputs(out_file)
wf.add_jobs(job)

wf.write()

Set of Independent Jobs

This type of workflow consists of multiple jobs executed in parallel, with each job performing a distinct task. Because the jobs are independent of one another, they can run simultaneously on different computing resources, so the whole set finishes sooner than if the same tasks ran one after another. For example, processing a set of images or running multiple simulations with varying input parameters.

wf = Workflow(name="independent-jobs")

for i in range(1, n + 1):
    in_file = File(f"input-{i}.data")
    out_file = File(f"output-{i}.data")

    job = Job(my_process)\
            .add_args(str(i))\
            .add_inputs(in_file)\
            .add_outputs(out_file)
    wf.add_jobs(job)

wf.write()

Simple Split/Merge Workflow

This workflow splits a single input into multiple independent jobs, each of which performs a distinct task. The outputs of these jobs are then merged back together into a single output. This pattern parallelizes a workflow across multiple computing resources, allowing the middle-stage jobs to execute concurrently. For example, processing a large dataset in chunks or running multiple simulations in parallel before combining their results.

wf = Workflow(name="split-merge")

common_file = File("common.data")
common_job = Job(my_process)\
        .add_outputs(common_file)
wf.add_jobs(common_job)

# define the merge job first, so we can reference
# it easily in the loop
output_file = File("output.data")
merge_job = Job(my_process)\
        .add_outputs(output_file)
wf.add_jobs(merge_job)

for i in range(1, n + 1):
    proc_out_file = File(f"file-{i}.data")

    job = Job(my_process)\
            .add_args(str(i))\
            .add_inputs(common_file)\
            .add_outputs(proc_out_file)
    wf.add_jobs(job)

    # make sure the merge job gets the correct data
    merge_job.add_inputs(proc_out_file)

wf.write()
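Note that the example above never declares job ordering explicitly: workflow systems of this kind typically infer the dependency graph from the files each job produces and consumes, which is why `merge_job.add_inputs(proc_out_file)` is enough to place the merge job after every processing job. The following is a minimal plain-Python sketch of that inference, assuming nothing about the actual workflow API; the `infer_dependencies` function and the job table are purely illustrative.

```python
# Illustrative sketch (not the workflow API above): derive job ordering
# from declared inputs/outputs, the way the split/merge graph is built.

def infer_dependencies(jobs):
    """jobs maps job name -> (input files, output files).
    Returns the set of (parent, child) edges: a job that reads a file
    depends on the job that writes it."""
    producer = {}
    for name, (_, outputs) in jobs.items():
        for f in outputs:
            producer[f] = name

    edges = set()
    for name, (inputs, _) in jobs.items():
        for f in inputs:
            if f in producer and producer[f] != name:
                edges.add((producer[f], name))
    return edges

# The split/merge structure from the example, with n = 3:
n = 3
jobs = {"common": ([], ["common.data"])}
for i in range(1, n + 1):
    jobs[f"proc-{i}"] = (["common.data"], [f"file-{i}.data"])
jobs["merge"] = ([f"file-{i}.data" for i in range(1, n + 1)], ["output.data"])

edges = infer_dependencies(jobs)
# Every proc job depends on "common"; "merge" depends on every proc job.
```

Running the sketch yields six edges: three from `common` to the processing jobs, and three from the processing jobs to `merge`, matching the split/merge shape described above.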