Example Workflow Structures
Scientific workflows are widely used in research and other data-intensive applications to manage complex computational processes. Complexity is not a requirement, however: even simple structures can benefit from being expressed as a workflow. The basic examples below illustrate a few common structures.
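All of the examples refer to a transformation called my_process and are written against a workflow API in the style of the Pegasus WMS Python API. As a minimal sketch of the setup they assume (the import, executable path, and site name below are illustrative placeholders, not part of the examples themselves):
from Pegasus.api import *

# hypothetical executable that every job in these examples runs;
# the path and site are placeholders
my_process = Transformation(
    "my_process",
    site="local",
    pfn="/usr/bin/my_process",
    is_stageable=False,
)
tc = TransformationCatalog()
tc.add_transformations(my_process)
# the catalog would then be attached to a workflow,
# e.g. wf.add_transformation_catalog(tc)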
Single Job
This is the simplest type of scientific workflow, where a single job is executed to perform a specific task. For example, running a single simulation or data analysis on a set of input files.
wf = Workflow(name="single-job")
in_file = File("input.data")
out_file = File("output.data")
job = Job(my_process)\
.add_args("example")\
.add_inputs(in_file)\
.add_outputs(out_file)
wf.add_jobs(job)
wf.write()
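If the Pegasus Python API is being used, wf.write() serializes the abstract workflow to a file (typically workflow.yml), which can then be planned and submitted for execution.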
Set of Independent Jobs
This type of workflow consists of multiple jobs, each performing a distinct task. Because the jobs do not depend on one another, they can run simultaneously on different computing resources, so the whole set finishes sooner than if the tasks were run one after another. For example, processing a set of images or running multiple simulations with varying input parameters.
wf = Workflow(name="independent-jobs")
for i in range(1, n + 1):
in_file = File(f"input-{i}.data")
out_file = File(f"output-{i}.data")
job = Job(my_process)\
.add_args(str(i))\
.add_inputs(in_file)\
.add_outputs(out_file)
wf.add_jobs(job)
wf.write()
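Because the jobs share no input or output files, no dependencies are created between them; with an API such as Pegasus's, which infers dependencies from shared files, all n jobs are free to run concurrently.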
Simple Split/Merge Workflow
This workflow splits the work on a single input across multiple independent jobs, each performing a distinct task, and then merges their outputs back together into a single result. It is a common way to parallelize a workflow across multiple computing resources, for example when processing a large dataset or running many simulations at once.
wf = Workflow(name="split-merge")
common_file = File("common.data")
common_job = Job(my_process)\
.add_outputs(common_file)
wf.add_jobs(common_job)
# define merge jobs first, so we can reference
# it easily in the loop
output_file = File("output.data")
merge_job = Job(my_process)\
.add_outputs(output_file)
wf.add_jobs(merge_job)
for i in range(1, n + 1):
proc_out_file = File(f"file-{i}.data")
job = Job(my_process)\
.add_args(str(i))\
.add_inputs(common_file)\
.add_outputs(proc_out_file)
wf.add_jobs(job)
# make sure merge jobs get the correct data
merge_job.add_inputs(proc_out_file)
wf.write()
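The dependencies here form a fan-out/fan-in pattern: every processing job reads common_file and therefore runs after common_job, and merge_job reads every file-{i}.data and therefore runs after all of the processing jobs. With an API such as Pegasus's, which infers dependencies from shared File objects, these edges follow directly from the data flow and do not need to be declared explicitly.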