-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
port parallelizing pipelines from qiime2/dev-docs (#30)
- Loading branch information
1 parent
4193287
commit ae3c1f3
Showing
2 changed files
with
325 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,324 @@ | ||
(parallel-configuration)= | ||
# Parallel configuration and usage in QIIME 2 | ||
|
||
```{note} | ||
This is more of an advanced user or system administrator usage document. | ||
[This is slated to move](https://github.com/caporaso-lab/developing-with-qiime2/issues/29) to the new general-purpose user documentation. | ||
``` | ||
|
||
QIIME 2 supports parallelization of pipelines through [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html>). | ||
This allows for faster execution of QIIME 2 pipelines by ensuring that pipeline steps that can run simultaneously do run simultaneously assuming the compute resources are available. | ||
|
||
A [Parsl configuration](https://parsl.readthedocs.io/en/stable/userguide/configuring.html) is required to use Parsl. | ||
This configuration tells Parsl what resources are available to it, and how to use them. | ||
How to create and use a Parsl configuration through QIIME 2 depends on which interface you're using and will be detailed on a per-interface basis below. | ||
|
||
For basic usage, we have supplied a vendored configuration that we load from a [`.toml`](https://toml.io/en/) file that will be used by default if you instruct QIIME 2 to execute in parallel without a particular configuration. | ||
This configuration file is shown below. | ||
We write this file the first time you attempt to use it, and the `max(psutil.cpu_count() - 1, 1)` is evaluated and written to the file at that time. | ||
An actual number is required there for all user made config files. | ||
|
||
```toml | ||
[parsl] | ||
strategy = "None" | ||
|
||
[[parsl.executors]] | ||
class = "ThreadPoolExecutor" | ||
label = "default" | ||
max_threads = max(psutil.cpu_count() - 1, 1) | ||
|
||
[[parsl.executors]] | ||
class = "HighThroughputExecutor" | ||
label = "htex" | ||
max_workers = max(psutil.cpu_count() - 1, 1) | ||
|
||
[parsl.executors.provider] | ||
class = "LocalProvider" | ||
|
||
[parsl.executor_mapping] | ||
some_action = "htex" | ||
``` | ||
|
||
An actual `parsel.Config` object in Python looks like the following: | ||
|
||
```python | ||
config = Config( | ||
executors=[ | ||
ThreadPoolExecutor( | ||
label='default', | ||
max_threads=max(psutil.cpu_count() - 1, 1) | ||
), | ||
HighThroughputExecutor( | ||
label='htex', | ||
max_workers=max(psutil.cpu_count() - 1, 1), | ||
provider=LocalProvider() | ||
) | ||
], | ||
# AdHoc Clusters should not be setup with scaling strategy. | ||
strategy=None | ||
) | ||
|
||
# This bit is not part of the actual parsel.Config, but rather is used to tell | ||
# QIIME 2 which non-default executors (if any) you want specific actions to run | ||
# on | ||
mapping = {'some_action': 'htex'} | ||
``` | ||
|
||
The [Parsl documentation](https://parsl.readthedocs.io/en/stable/) provides full detail. | ||
Briefly, we create a [`ThreadPoolExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.ThreadPoolExecutor.html?highlight=Threadpoolexecutor) that parallelizes jobs across multiple threads in a process. | ||
We also create a [`HighThroughputExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.HighThroughputExecutor.html?highlight=HighThroughputExecutor) that parallelizes jobs across multiple processes. | ||
|
||
```{note} | ||
Your config MUST contain an executor with the label default. | ||
This is the executor that QIIME 2 will dispatch your jobs to if you do not specify an executor to use. | ||
The default executor in the default config is the ThreadPoolExecutor meaning that unless you specify otherwise all jobs that use the default config will run on the ThreadPoolExecutor. | ||
``` | ||
|
||
## The Config File | ||
|
||
Let's break down that config file further by constructing it from the ground up using 7 as our max threads/workers. | ||
|
||
```toml | ||
[parsl] | ||
strategy = "None" | ||
``` | ||
|
||
This very first part of the file indicates that this is the parsl section of our config. | ||
That will be the only section we define at the moment, but in the future we expect to expand on this to provide additional QIIME 2 configuration options through this file. | ||
`strategy = 'None'` is a top level Parsl configuration parameter that you can read more about in the Parsl documentation. | ||
This may need to be set differently depending on your system. | ||
If you were to load this into Python using tomlkit you would get the following dictionary: | ||
|
||
```python | ||
{ | ||
'parsl': { | ||
'strategy': 'None' | ||
} | ||
} | ||
``` | ||
|
||
Next, let's add an executor: | ||
|
||
```python | ||
[[parsl.executors]] | ||
class = "ThreadPoolExecutor" | ||
label = "default" | ||
max_threads = 7 | ||
``` | ||
|
||
The `[[ ]]` indicates that this is a list and the `parsl.executors` in the middle indicates that this list is called `executors` and belongs under parsl. | ||
Now our dictionary looks like the following: | ||
|
||
```python | ||
{ | ||
'parsl': { | ||
'strategy': 'None' | ||
'executors': [ | ||
{'class': 'ThreadPoolExecutor', | ||
'label': 'default', | ||
'max_threads': 7} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
To add another executor, we simply add another list element. | ||
Notice that we also have `parsl.executors.provider` for this one. | ||
Some classes of parsl executor require additional classes to fully configure them. | ||
These classes must be specified beneath the executor they belong to. | ||
|
||
```python | ||
[[parsl.executors]] | ||
class = "HighThroughputExecutor" | ||
label = "htex" | ||
max_workers = 7 | ||
|
||
[parsl.executors.provider] | ||
class = "LocalProvider" | ||
``` | ||
|
||
Now our dictionary looks like the following: | ||
|
||
```python | ||
{ | ||
'parsl': { | ||
'strategy': 'None' | ||
'executors': [ | ||
{'class': 'ThreadPoolExecutor', | ||
'label': 'default', | ||
'max_threads': 7}, | ||
{'class': 'HighThroughputExecutor', | ||
'label': 'htex', | ||
'max_workers': 7, | ||
'provider': {'class': 'LocalProvider'}}] | ||
} | ||
} | ||
``` | ||
|
||
Finally, we have the executor_mapping, where you can define which actions, if any, you would like to run on which executors. | ||
If an action is unmapped, it will run on the default executor. | ||
|
||
```python | ||
[parsl.executor_mapping] | ||
some_action = "htex" | ||
``` | ||
|
||
Our final result looks like the following. | ||
The `executor_mapping` internally to tell Parsl where you want you actions to run, while the rest of the information is used to instantiate the `parsl.Config` object shown above. | ||
|
||
```python | ||
{ | ||
'parsl': { | ||
'strategy': 'None' | ||
'executors': [ | ||
{'class': 'ThreadPoolExecutor', | ||
'label': 'default', | ||
'max_threads': 7}, | ||
{'class': 'HighThroughputExecutor', | ||
'label': 'htex', | ||
'max_workers': 7, | ||
'provider': {'class': 'LocalProvider'}}], | ||
'executor_mapping': {'some_action': 'htex'} | ||
} | ||
} | ||
``` | ||
|
||
## Using QIIME 2 in parallel through the command line interface (CLI) | ||
|
||
There are two flags that allow you to parallelize a pipeline through the CLI. | ||
The first is the `--parallel` flag. | ||
This flag will use the following priority order to load a Parsl configuration. | ||
|
||
1. Check the environment variable `$QIIME2_CONFIG` for a filepath to a configuration file. | ||
2. Check the path `<user_config_dir>/qiime2/qiime2_config.toml` | ||
3. Check the path `<site_config_dir>/qiime2/qiime2_config.toml` | ||
4. Check the path `$CONDA_PREFIX/etc/qiime2_config.toml` | ||
5. Write a default configuration to the path in step 4 and use that. | ||
|
||
This implies that after your first time running QIIME 2 in parallel without a config in at least one of the first 3 locations, the path referenced in step 4 will exist and contain the default config (unless you remove the file or switch to a different conda environment). | ||
|
||
The second flag related to parallelization through the command line interface is the `--parallel-config` flag, which is used to provide path to a configuration file. | ||
This allows you to easily create and use your own custom configuration based on your system, and a value provided using this parameter overrides the above priority order. | ||
|
||
````{admonition} user_config_dir | ||
:class: note | ||
On Linux, `user_config_dir` will usually be `$HOME/.config/qiime2/`. | ||
On macOS, it will usually be `$HOME/Library/Application Support/qiime2/`. | ||
You can get find the directory used on your system by running the following command: | ||
```bash | ||
python -c "import appdirs; print(appdirs.user_config_dir('qiime2'))" | ||
``` | ||
```` | ||
|
||
````{admonition} site_config_dir | ||
:class: note | ||
On Linux `site_config_dir` will usually be something like `/etc/xdg/qiime2/`, but it may vary based on Linux distribution. | ||
On macOS it will usually be `/Library/Application Support/qiime2/`. | ||
You can get find the directory used on your system by running the following command: | ||
```bash | ||
python -c "import appdirs; print(appdirs.site_config_dir('qiime2'))" | ||
``` | ||
```` | ||
|
||
|
||
## Using QIIME 2 in parallel through the Python 3 API | ||
|
||
|
||
Parallelization through the Python API is done using `parsl.Config` objects as context managers. | ||
These objects take a `parsl.Config` object and a dictionary mapping action names to executor names. | ||
If no config is provided your default config will be used (found using the same priority order as described for the `--parallel` flag above). | ||
|
||
A `parsl.Config` object itself can be created in several different ways. | ||
|
||
First, you can just create it using Parsl directly. | ||
|
||
```python | ||
import psutil | ||
|
||
from parsl.config import Config | ||
from parsl.providers import LocalProvider | ||
from parsl.executors.threads import ThreadPoolExecutor | ||
from parsl.executors import HighThroughputExecutor | ||
|
||
|
||
config = Config( | ||
executors=[ | ||
ThreadPoolExecutor( | ||
label='default', | ||
max_threads=max(psutil.cpu_count() - 1, 1) | ||
), | ||
HighThroughputExecutor( | ||
label='htex', | ||
max_workers=max(psutil.cpu_count() - 1, 1), | ||
provider=LocalProvider() | ||
) | ||
], | ||
# AdHoc Clusters should not be setup with scaling strategy. | ||
strategy=None | ||
) | ||
``` | ||
|
||
Alternatively, you can create it from a QIIME 2 config file. | ||
|
||
```python | ||
from qiime2.sdk.parallel_config import get_config_from_file | ||
|
||
config, mapping = get_config_from_file('path to config') | ||
|
||
# Or if you have no mapping | ||
config, _ = get_config_from_file('path to config') | ||
|
||
# Or if you only have a mapping and are getting the config from elsewhere | ||
_, mapping = get_config_from_file('path_to_config') | ||
``` | ||
|
||
Once you have your config and/or your mapping, you can use it as follows: | ||
|
||
```python | ||
from qiime2.sdk.parallel_config import ParallelConfig | ||
|
||
|
||
# Note that the mapping can also be a dictionary literal | ||
with ParallelConfig(parallel_config=config, action_executor_mapping=mapping): | ||
future = # <your_qiime2_action>.parallel(args) | ||
# Make sure to call _result inside of the context manager | ||
result = future._result() | ||
``` | ||
|
||
````{admonition} Note for parallel Pipeline developers | ||
:class: warning | ||
If you have something like this in a pipeline: | ||
```python | ||
try: | ||
result1, result2 = some_action(*args) | ||
except SomeException: | ||
do.something() | ||
``` | ||
You must call `_result()` on the return value from `some_action` in the try/except block. | ||
This is necessary to allow users to run your pipeline in parallel. | ||
If you do not do this, and a user attempts to run your pipeline in parallel, it will most likely fail. | ||
```python | ||
try: | ||
results = some_action(*args) | ||
result1, result2 = results._result() | ||
except SomeException: | ||
do.something() | ||
``` | ||
The reason this needs to be done is a bit technical. | ||
Basically, if the pipeline is being executed in parallel, the return value from the action will be a future that will eventually resolve into your results when the parallel thread returns. | ||
Calling `._result()` blocks the main thread and waits for results before proceeding. | ||
If you do not call `_result()` in the try/except, the future will most likely resolve into results after the main Python thread has exited the try/except block. | ||
This will lead to the exception not being caught because it is now actually being raised outside of the try/except. | ||
This is a bit confusing, as parallelism often is. | ||
We tried hard to ensure that developers wouldn't need to change anything about their pipelines to parallelize them, but we did need to make this one concession. | ||
```` |