add plugin tutorial section integrating metadata (#122)
fixes #120
1 parent
a40502e
commit f730ade
Showing 4 changed files with 157 additions and 2 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
(integrate-metadata)= | ||
# Integrate metadata in Actions | ||
|
||
We've done a lot so far in q2-dwq2, but we've left out one thing that is common in almost all *real* plugins: the integration of metadata in Actions. | ||
In this relatively brief section, we'll enable users to optionally provide metadata associated with reference sequences to the tables generated by the `tabulate-las-results` Visualizer and the `search-and-summarize` Pipeline. | ||
This metadata could be just about anything. | ||
For example, reference metadata could be taxonomic information about the reference sequences, enabling viewers of the resulting visualization to infer the taxonomic origin of their query sequences from the taxonomy associated with the closest matching reference sequences. | ||
(That example is also used in the *Sequence Homology Searching* chapter of [An Introduction to Applied Bioinformatics](https://readIAB.org) {cite}`iab-2`, so you can refer there if the idea isn't familiar.) | ||
|
||
Let's get started. | ||
|
||
(integrate-metadata-commits)= | ||
```{admonition} tl;dr | ||
:class: tip | ||
The code that I wrote for this section can be found here: {{ dwq2_integrate_metadata_commit_url }}. | ||
``` | ||
|
||
## The input metadata | ||
|
||
In this example, the reference metadata will be a QIIME 2 metadata file that has reference sequence ids as its identifiers, and some number of additional columns containing the actual metadata. | ||
For example, the tab-separated text (`.tsv`) metadata file could look like: | ||
|
||
```text | ||
id domain phylum genus species | ||
ref-seq1 Bacteria Bacillota Bacillus_A paranthracis | ||
ref-seq2 Bacteria Pseudomonadota Photobacterium kishitanii | ||
ref-seq3 Bacteria Actinomycetota Cryptosporangium arvum | ||
``` | ||
|
||
This metadata file will be passed into `tabulate_las_results` along with the `LocalAlignmentSearchResults` artifact that it already takes. | ||
|
||
```{note} | ||
By convention in QIIME 2, all relevant identifiers (in our case, those that could be present in the `LocalAlignmentSearchResults` artifact) should be represented in the metadata, or an error should be raised. | ||
However, "extra" identifiers in the metadata that do not show up in the corresponding artifacts (again, the `LocalAlignmentSearchResults` in this example) are allowed. | ||
The goal of this convention is to let users maintain a single metadata file. | ||
If an upstream artifact - in this case the sequences in the `FeatureData[Sequence]` artifact that is passed into `local-alignment-search` as `reference_seqs` - is ever filtered to remove some sequences, the user shouldn't have to create a filtered version of their metadata file. | ||
The reason for this is two-fold. | ||
First of all, that would generally be an extra, unnecessary step, so it complicates users' workflows. | ||
But, more importantly, that could result in a proliferation of metadata files that need to be kept in sync. | ||
This is another application of the {term}`DRY` principle: don't encourage your users to duplicate information represented in their metadata file. | ||
``` | ||
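As a rough illustration of this convention, here is a sketch using plain Python sets rather than QIIME 2 objects. The function name `find_missing_ids` and the ids are made up for illustration. The check runs in one direction only: ids used by the artifact must appear in the metadata, while extra metadata ids are ignored.

```python
def find_missing_ids(artifact_ids, metadata_ids):
    """Return the ids used by the artifact that the metadata doesn't describe."""
    return set(artifact_ids) - set(metadata_ids)


# Hypothetical ids, for illustration only.
metadata_ids = ['ref-seq1', 'ref-seq2', 'ref-seq3']

# The metadata has an "extra" id (ref-seq3) that the artifact doesn't use.
# Nothing is missing, so this case is allowed.
extra_ok = find_missing_ids(['ref-seq1', 'ref-seq2'], metadata_ids)

# The artifact references an id that the metadata lacks (ref-seq9).
# A non-empty result like this is what should trigger an error.
missing = find_missing_ids(['ref-seq1', 'ref-seq9'], metadata_ids)
```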
|
||
## Add an optional input to `tabulate_las_results` | ||
|
||
Now that we know what the metadata will look like, let's add a new optional input to our `tabulate_las_results` function. | ||
The new input we add will be called `reference_metadata`, and it will be received as a `qiime2.Metadata` object. | ||
Our updated function signature in `_visualizers.py` will look like this: | ||
|
||
```python | ||
def tabulate_las_results(output_dir: str, | ||
                         hits: pd.DataFrame, | ||
                         title: str = _tabulate_las_defaults['title'], | ||
                         reference_metadata: qiime2.Metadata = None) \ | ||
        -> None: | ||
``` | ||
|
||
If `reference_metadata` is provided, we'll take a few steps to integrate it into `hits` before we write out our HTML file. | ||
Otherwise, `tabulate_las_results` will behave exactly as it did before. | ||
I added the following `if` block: | ||
|
||
```python | ||
if reference_metadata is not None: | ||
    reference_metadata = reference_metadata.to_dataframe() | ||
|
||
    hits.reset_index(inplace=True) | ||
|
||
    metadata_index = reference_metadata.index.name | ||
    metadata_columns = reference_metadata.columns.to_list() | ||
    reference_metadata.reset_index(inplace=True) | ||
|
||
    missing_ids = \ | ||
        set(hits['reference id']) - set(reference_metadata[metadata_index]) | ||
    if len(missing_ids) > 0: | ||
        raise KeyError( | ||
            f"The following {len(missing_ids)} IDs are missing from " | ||
            f"reference metadata: {missing_ids}.") | ||
|
||
    hits = pd.merge(hits, reference_metadata, | ||
                    left_on='reference id', | ||
                    right_on=metadata_index, | ||
                    how='inner') | ||
|
||
    hits.set_index(['query id', 'reference id'], inplace=True) | ||
    column_order = \ | ||
        ['percent similarity', 'alignment length', 'score'] + \ | ||
        metadata_columns + ['aligned query', 'aligned reference'] | ||
    hits = hits.reindex(columns=column_order) | ||
``` | ||
|
||
This looks like a lot, but it's really a few simple actions. | ||
|
||
1. First, convert the `qiime2.Metadata` object to a `pd.DataFrame` using the built-in method on `qiime2.Metadata`. | ||
1. Then, remove the index from `hits` to prepare it for a `pd.merge` operation. | ||
1. Next, cache the metadata's index name because - importantly - we don't know exactly what this index name will be. | ||
The QIIME 2 metadata format allows for a few available options, including `id`, `sample-id`, `feature-id`, and several others, and we don't want to restrict a user to providing any specific one of these. | ||
We also cache the list of metadata columns, and remove the index to prepare it for the `pd.merge` operation with `hits`. | ||
Remember that we also don't know what metadata columns the user will provide, and in this example we're not putting any restrictions on this. | ||
1. Then, we confirm that all ids that are represented in `hits` are present in `reference_metadata`, and we raise an informative error if any are missing. | ||
Our error message should help a user identify what's wrong, so I chose to indicate how many ids were missing, and provide a list of them. | ||
1. Next, we merge our hits and the reference metadata, matching the `reference id` values from the index of `hits` against the identifiers from the index of our metadata. | ||
Unlike for the metadata, we *do* know what the index name of `hits` will be because we were explicit about this when we defined our `LocalAlignmentSearchResultsFormat`, so we can refer to it directly. | ||
This is an important distinction that differentiates metadata from {term}`File Formats <File Format>` we define: if we need the flexibility to allow for arbitrary column names, we're generally working with metadata. On the other hand, if our column names are predefined, we should generally be working with a File Format. | ||
1. Then, we prepare the `hits` DataFrame for use downstream. | ||
To do this, we first set the MultiIndex. | ||
I also ordered the columns such that all of the metadata columns come between the *percent similarity*, *alignment length*, and *score* columns and the *aligned query* and *aligned reference* columns of the original `hits` DataFrame. | ||
When developing, I found this column order to be useful for reviewing the results. | ||
|
||
After this, our action proceeds the same as if no metadata was provided. | ||
|
||
```python | ||
with open(os.path.join(output_dir, "index.html"), "w") as fh: | ||
    fh.write(_html_template % (title, hits.to_html())) | ||
``` | ||
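To see the merge steps from this section in isolation, the following sketch runs them on toy data. It rests on several assumptions: the `integrate_metadata` helper is hypothetical (in the plugin this logic lives inside `tabulate_las_results`), all values are made up, and a plain `pd.DataFrame` stands in for the result of `reference_metadata.to_dataframe()`. The metadata index is deliberately named `feature-id` rather than `id` to show that the code doesn't depend on a specific name.

```python
import pandas as pd

# Toy stand-in for the `hits` DataFrame (all values are made up).
hits = pd.DataFrame(
    {'query id': ['q1', 'q1', 'q2'],
     'reference id': ['ref-seq1', 'ref-seq2', 'ref-seq1'],
     'percent similarity': [99.0, 87.5, 92.1],
     'alignment length': [100, 98, 101],
     'score': [195.0, 150.0, 170.0],
     'aligned query': ['ACGT', 'ACGA', 'ACGT'],
     'aligned reference': ['ACGT', 'ACGT', 'ACCT']}
).set_index(['query id', 'reference id'])

# Toy stand-in for `reference_metadata.to_dataframe()`.
# ref-seq3 is an "extra" id that has no hits, which is allowed.
reference_metadata = pd.DataFrame(
    {'phylum': ['Bacillota', 'Pseudomonadota', 'Actinomycetota'],
     'genus': ['Bacillus_A', 'Photobacterium', 'Cryptosporangium']},
    index=pd.Index(['ref-seq1', 'ref-seq2', 'ref-seq3'], name='feature-id'))


def integrate_metadata(hits, reference_metadata):
    """Hypothetical helper mirroring the `if` block, without in-place edits."""
    hits = hits.reset_index()

    # Cache the index name and columns, since neither is known in advance.
    metadata_index = reference_metadata.index.name
    metadata_columns = reference_metadata.columns.to_list()
    reference_metadata = reference_metadata.reset_index()

    # Every reference id in hits must be described by the metadata.
    missing_ids = \
        set(hits['reference id']) - set(reference_metadata[metadata_index])
    if len(missing_ids) > 0:
        raise KeyError(
            f"The following {len(missing_ids)} IDs are missing from "
            f"reference metadata: {missing_ids}.")

    hits = pd.merge(hits, reference_metadata,
                    left_on='reference id',
                    right_on=metadata_index,
                    how='inner')

    # Restore the MultiIndex and place metadata columns before alignments.
    hits = hits.set_index(['query id', 'reference id'])
    column_order = \
        ['percent similarity', 'alignment length', 'score'] + \
        metadata_columns + ['aligned query', 'aligned reference']
    return hits.reindex(columns=column_order)


merged = integrate_metadata(hits, reference_metadata)
```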
|
||
## Update `search-and-summarize` | ||
|
||
We'll also want this option to be available to users of `search-and-summarize`. | ||
Since `tabulate-las-results` is already part of the `search-and-summarize` Pipeline, all we need to do in our `_pipelines.py` file is add the new optional parameter (`reference_metadata`), and pass it through in our call to `tabulate_las_results_action`. | ||
Try to do that yourself, and refer to my code ({{ dwq2_integrate_metadata_commit_url }}) as needed. | ||
|
||
## Update `plugin_setup.py` | ||
|
||
Next, we'll need to make the plugin aware of this new parameter when we register the `tabulate-las-results` and `search-and-summarize` actions. | ||
Metadata files are provided as {term}`Parameters <Parameter>` of type `qiime2.plugin.Metadata` on action registration. | ||
|
||
In `plugin_setup.py`, add `Metadata` to the list of imports from `qiime2.plugin`. | ||
Then, add the new parameter and description to the dictionaries we created to house these for `tabulate-las-results`. | ||
|
||
```python | ||
_tabulate_las_parameters = {'title': Str, | ||
                            'reference_metadata': Metadata} | ||
_tabulate_las_parameter_descriptions = { | ||
    'title': 'Title to use inside visualization.', | ||
    'reference_metadata': 'Reference metadata to be integrated in output.' | ||
} | ||
``` | ||
|
||
Recall that we reuse these dictionaries when we register `search-and-summarize`, so we only need to add this parameter and description in this one place. | ||
|
||
At this point, you should be ready to use this new functionality with your plugin. | ||
Open a terminal in an environment where your implementation of `q2-dwq2` is installed, run `qiime dev refresh-cache`, and you should see the new parameter in calls to `qiime dwq2 tabulate-las-results --help` and `qiime dwq2 search-and-summarize --help`. | ||
|
||
## Add unit tests and update the `search-and-summarize` usage example | ||
|
||
Finally, wrap this up by adding new unit tests for the functionality. | ||
Do this on your own, and then refer to my code ({{ dwq2_integrate_metadata_commit_url }}) to see how I did it. | ||
|
||
In my code for this section, you'll also find a metadata file that corresponds to the example reference sequences provided earlier. | ||
Use that file to update the `search-and-summarize` usage example, and then test your code on the command line with the new usage example. | ||
|
||
|