port parallelizing pipelines from qiime2/dev-docs (#30)

caporaso-lab · Mar 7, 2024 · ae3c1f3 · ae3c1f3
1 parent 4193287
commit ae3c1f3
Show file tree

Hide file tree

Showing 2 changed files with 325 additions and 0 deletions.
diff --git a/book/_toc.yml b/book/_toc.yml
@@ -60,6 +60,7 @@ parts:
         title: Introduction
         sections:
           - file: framework/explanations/architecture
+          - file: framework/how-to-guides/parallel-configuration
 
   - caption: "CI Internals"
     chapters:

diff --git a/book/framework/how-to-guides/parallel-configuration.md b/book/framework/how-to-guides/parallel-configuration.md
@@ -0,0 +1,324 @@
+(parallel-configuration)=
+# Parallel configuration and usage in QIIME 2
+
+```{note}
+This is more of an advanced user or system administrator usage document.
+[This is slated to move](https://github.com/caporaso-lab/developing-with-qiime2/issues/29) to the new general-purpose user documentation. 
+```
+
+QIIME 2 supports parallelization of pipelines through [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html>).
+This allows for faster execution of QIIME 2 pipelines by ensuring that pipeline steps that can run simultaneously do run simultaneously assuming the compute resources are available.
+
+A [Parsl configuration](https://parsl.readthedocs.io/en/stable/userguide/configuring.html) is required to use Parsl.
+This configuration tells Parsl what resources are available to it, and how to use them.
+How to create and use a Parsl configuration through QIIME 2 depends on which interface you're using and will be detailed on a per-interface basis below.
+
+For basic usage, we have supplied a vendored configuration that we load from a [`.toml`](https://toml.io/en/) file that will be used by default if you instruct QIIME 2 to execute in parallel without a particular configuration.
+This configuration file is shown below.
+We write this file the first time you attempt to use it, and the `max(psutil.cpu_count() - 1, 1)` is evaluated and written to the file at that time.
+An actual number is required there for all user made config files.
+
+```toml
+[parsl]
+strategy = "None"
+
+[[parsl.executors]]
+class = "ThreadPoolExecutor"
+label = "default"
+max_threads = max(psutil.cpu_count() - 1, 1)
+
+[[parsl.executors]]
+class = "HighThroughputExecutor"
+label = "htex"
+max_workers = max(psutil.cpu_count() - 1, 1)
+
+[parsl.executors.provider]
+class = "LocalProvider"
+
+[parsl.executor_mapping]
+some_action = "htex"
+```
+
+An actual `parsel.Config` object in Python looks like the following:
+
+```python
+config = Config(
+    executors=[
+        ThreadPoolExecutor(
+            label='default',
+            max_threads=max(psutil.cpu_count() - 1, 1)
+        ),
+        HighThroughputExecutor(
+            label='htex',
+            max_workers=max(psutil.cpu_count() - 1, 1),
+            provider=LocalProvider()
+        )
+    ],
+    # AdHoc Clusters should not be setup with scaling strategy.
+    strategy=None
+)
+
+# This bit is not part of the actual parsel.Config, but rather is used to tell
+# QIIME 2 which non-default executors (if any) you want specific actions to run
+# on
+mapping = {'some_action': 'htex'}
+```
+
+The [Parsl documentation](https://parsl.readthedocs.io/en/stable/) provides full detail.
+Briefly, we create a [`ThreadPoolExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.ThreadPoolExecutor.html?highlight=Threadpoolexecutor) that parallelizes jobs across multiple threads in a process.
+We also create a [`HighThroughputExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.HighThroughputExecutor.html?highlight=HighThroughputExecutor) that parallelizes jobs across multiple processes.
+
+```{note} 
+Your config MUST contain an executor with the label default.
+This is the executor that QIIME 2 will dispatch your jobs to if you do not specify an executor to use.
+The default executor in the default config is the ThreadPoolExecutor meaning that unless you specify otherwise all jobs that use the default config will run on the ThreadPoolExecutor.
+```
+
+## The Config File
+
+Let's break down that config file further by constructing it from the ground up using 7 as our max threads/workers.
+
+```toml
+[parsl]
+strategy = "None"
+```
+
+This very first part of the file indicates that this is the parsl section of our config.
+That will be the only section we define at the moment, but in the future we expect to expand on this to provide additional QIIME 2 configuration options through this file.
+`strategy = 'None'` is a top level Parsl configuration parameter that you can read more about in the Parsl documentation.
+This may need to be set differently depending on your system.
+If you were to load this into Python using tomlkit you would get the following dictionary:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None'
+        }
+}
+```
+
+Next, let's add an executor:
+
+```python
+[[parsl.executors]]
+class = "ThreadPoolExecutor"
+label = "default"
+max_threads = 7
+```
+
+The `[[ ]]` indicates that this is a list and the `parsl.executors` in the middle indicates that this list is called `executors` and belongs under parsl.
+Now our dictionary looks like the following:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None'
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+            'label': 'default',
+            'max_threads': 7}
+            ]
+        }
+}
+```
+
+To add another executor, we simply add another list element.
+Notice that we also have `parsl.executors.provider` for this one.
+Some classes of parsl executor require additional classes to fully configure them.
+These classes must be specified beneath the executor they belong to.
+
+```python
+[[parsl.executors]]
+class = "HighThroughputExecutor"
+label = "htex"
+max_workers = 7
+
+[parsl.executors.provider]
+class = "LocalProvider"
+```
+
+Now our dictionary looks like the following:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None'
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+                'label': 'default',
+                'max_threads': 7},
+            {'class': 'HighThroughputExecutor',
+                'label': 'htex',
+                'max_workers': 7,
+                'provider': {'class': 'LocalProvider'}}]
+        }
+}
+```
+
+Finally, we have the executor_mapping, where you can define which actions, if any, you would like to run on which executors.
+If an action is unmapped, it will run on the default executor.
+
+```python
+[parsl.executor_mapping]
+some_action = "htex"
+```
+
+Our final result looks like the following.
+The `executor_mapping` internally to tell Parsl where you want you actions to run, while the rest of the information is used to instantiate the `parsl.Config` object shown above.
+
+```python
+{
+    'parsl': {
+        'strategy': 'None'
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+                'label': 'default',
+                'max_threads': 7},
+            {'class': 'HighThroughputExecutor',
+                'label': 'htex',
+                'max_workers': 7,
+                'provider': {'class': 'LocalProvider'}}],
+        'executor_mapping': {'some_action': 'htex'}
+        }
+}
+```
+
+## Using QIIME 2 in parallel through the command line interface (CLI)
+
+There are two flags that allow you to parallelize a pipeline through the CLI.
+The first is the `--parallel` flag.
+This flag will use the following priority order to load a Parsl configuration.
+
+1. Check the environment variable `$QIIME2_CONFIG` for a filepath to a configuration file.
+2. Check the path `<user_config_dir>/qiime2/qiime2_config.toml`
+3. Check the path `<site_config_dir>/qiime2/qiime2_config.toml`
+4. Check the path `$CONDA_PREFIX/etc/qiime2_config.toml`
+5. Write a default configuration to the path in step 4 and use that.
+
+This implies that after your first time running QIIME 2 in parallel without a config in at least one of the first 3 locations, the path referenced in step 4 will exist and contain the default config (unless you remove the file or switch to a different conda environment).
+
+The second flag related to parallelization through the command line interface is the `--parallel-config` flag, which is used to provide path to a configuration file.
+This allows you to easily create and use your own custom configuration based on your system, and a value provided using this parameter overrides the above priority order.
+
+````{admonition} user_config_dir
+:class: note
+On Linux, `user_config_dir` will usually be `$HOME/.config/qiime2/`.
+On macOS, it will usually be `$HOME/Library/Application Support/qiime2/`.
+
+You can get find the directory used on your system by running the following command:
+
+```bash
+python -c "import appdirs; print(appdirs.user_config_dir('qiime2'))"
+```
+````
+
+````{admonition} site_config_dir
+:class: note
+On Linux `site_config_dir` will usually be something like `/etc/xdg/qiime2/`, but it may vary based on Linux distribution.
+On macOS it will usually be `/Library/Application Support/qiime2/`.
+
+You can get find the directory used on your system by running the following command:
+
+```bash
+python -c "import appdirs; print(appdirs.site_config_dir('qiime2'))"
+```
+````
+
+
+## Using QIIME 2 in parallel through the Python 3 API
+
+
+Parallelization through the Python API is done using `parsl.Config` objects as context managers.
+These objects take a `parsl.Config` object and a dictionary mapping action names to executor names.
+If no config is provided your default config will be used (found using the same priority order as described for the `--parallel` flag above).
+
+A `parsl.Config` object itself can be created in several different ways.
+
+First, you can just create it using Parsl directly.
+
+```python
+import psutil
+
+from parsl.config import Config
+from parsl.providers import LocalProvider
+from parsl.executors.threads import ThreadPoolExecutor
+from parsl.executors import HighThroughputExecutor
+
+
+config = Config(
+    executors=[
+        ThreadPoolExecutor(
+            label='default',
+            max_threads=max(psutil.cpu_count() - 1, 1)
+        ),
+        HighThroughputExecutor(
+            label='htex',
+            max_workers=max(psutil.cpu_count() - 1, 1),
+            provider=LocalProvider()
+        )
+    ],
+    # AdHoc Clusters should not be setup with scaling strategy.
+    strategy=None
+)
+```
+
+Alternatively, you can create it from a QIIME 2 config file.
+
+```python
+from qiime2.sdk.parallel_config import get_config_from_file
+
+config, mapping = get_config_from_file('path to config')
+
+# Or if you have no mapping
+config, _ = get_config_from_file('path to config')
+
+# Or if you only have a mapping and are getting the config from elsewhere
+_, mapping = get_config_from_file('path_to_config')
+```
+
+Once you have your config and/or your mapping, you can use it as follows:
+
+```python
+from qiime2.sdk.parallel_config import ParallelConfig
+
+
+# Note that the mapping can also be a dictionary literal
+with ParallelConfig(parallel_config=config, action_executor_mapping=mapping):
+    future = # <your_qiime2_action>.parallel(args)
+    # Make sure to call _result inside of the context manager
+    result = future._result()
+```
+
+````{admonition} Note for parallel Pipeline developers
+:class: warning
+If you have something like this in a pipeline:
+
+```python
+try:
+    result1, result2 = some_action(*args)
+except SomeException:
+    do.something()
+```
+
+You must call `_result()` on the return value from `some_action` in the try/except block.
+This is necessary to allow users to run your pipeline in parallel.
+If you do not do this, and a user attempts to run your pipeline in parallel, it will most likely fail.
+
+```python
+try:
+    results = some_action(*args)
+    result1, result2 = results._result()
+except SomeException:
+    do.something()
+```
+
+The reason this needs to be done is a bit technical.
+Basically, if the pipeline is being executed in parallel, the return value from the action will be a future that will eventually resolve into your results when the parallel thread returns.
+Calling `._result()` blocks the main thread and waits for results before proceeding.
+
+If you do not call `_result()` in the try/except, the future will most likely resolve into results after the main Python thread has exited the try/except block.
+This will lead to the exception not being caught because it is now actually being raised outside of the try/except.
+
+This is a bit confusing, as parallelism often is.
+We tried hard to ensure that developers wouldn't need to change anything about their pipelines to parallelize them, but we did need to make this one concession.
+````