add plugin tutorial section on adding new artifact class (#57)
also starts updating text throughout to differentiate *semantic types* and *artifact classes* - to be continued with #59
gregcaporaso authored Apr 6, 2024
1 parent 4f5777a commit 51ef4af
Showing 22 changed files with 730 additions and 92 deletions.
1 change: 1 addition & 0 deletions book/_toc.yml
@@ -14,6 +14,7 @@ parts:
- file: plugins/tutorials/create-from-template
- file: plugins/tutorials/add-nw-align-method
- file: plugins/tutorials/add-alignment-visualizer
- file: plugins/tutorials/add-artifact-class
- file: plugins/tutorials/add-usage-examples
- file: plugins/how-to-guides/intro
sections:
16 changes: 13 additions & 3 deletions book/back-matter/glossary.md
@@ -12,6 +12,10 @@ Archive
Artifact
A QIIME 2 {term}`Result` that contains data to operate on.
Artifact class
A kind of {term}`Artifact` that can exist.
This is defined by a plugin developer by associating a {term}`semantic type` with a {term}`directory format` when registering an artifact class.
Artifact API
See {term}`Python 3 API`.
@@ -20,15 +24,20 @@ Deployment
The collection of interfaces and plugins in a deployment can be defined by a {term}`distribution` of QIIME 2.
Directory Format
A string that represents a particular layout of files and/or directories as well as how their contents will be structured.
An object that is a subclass of `qiime2.plugin.DirectoryFormat`.
A Directory Format represents a particular layout of a directory that contains files and/or arbitrarily nested sub-directories, and defines how the contents must be structured.
Distribution
A collection of QIIME 2 plugins that are designed to be installed together.
These are generally grouped by a theme. For example, the Amplicon Distribution provides a collection of plugins for analysis of microbiome amplicon data, while the Shotgun Distribution provides a collection of plugins for analysis of microbiome shotgun metagenomics data.
When a distribution is installed, that particular installation of QIIME 2 is an example of a {term}`deployment`.
File Format
An object which subclasses either `qiime2.plugin.TextFileFormat` or `qiime2.plugin.BinaryFileFormat`.
File formats define the particular format of a file, and define a process for validating the format.
Format
A string that represents a particular file format.
See {term}`file format` and {term}`directory format`.
Framework
The engine of orchestration that enables QIIME 2 to function together as a cohesive unit.
@@ -95,7 +104,8 @@ Result
A generic term for either a {term}`Visualization` or an {term}`Artifact`.
Semantic Type
A {term}`type` that is used to classify {term}`artifacts<Artifact>` and how they can be used.
An identifier that is used to describe what some data are intended to represent, and when and where they can be used.
When associated with a {term}`directory format`, the combination defines an {term}`artifact class`.
These types may be extended by {term}`plugins<Plugin>`.
tl;dr
22 changes: 14 additions & 8 deletions book/framework/explanations/data-storage.md
@@ -1,6 +1,12 @@
(data-storage)=
# How Data is Stored

```{warning}
Some of the discussion in this section uses *semantic type* as a synonym for {term}`artifact class`.
While closely related, we're working on clarifying our language to distinguish these ideas.
We haven't reviewed/updated this chapter for this yet.
```

In any software project, data needs to be stored (or persisted).
The way that this is accomplished can impact every facet of the software's design.
To better demonstrate *why* certain aspects of QIIME 2 exist as they are, we will highlight the goals and constraints QIIME 2 has, and then demonstrate how QIIME 2 achieves each goal.
@@ -14,18 +20,18 @@ Data should be stored in a way that:
- makes it possible to determine when data is or is not valid as input for a specific analysis
- eases interoperability between tools
- can be extended by plugins
- and includes details on its origin and history (provenance), facilitating trust and reproducibility.

These goals directly impact many of the design decisions made in QIIME 2.
The solution described in this section seeks to address each of these goals.
There are many other ways to solve these problems when one or more of these constraints are lifted; however, QIIME 2 chooses these constraints because we believe they are *useful* to the scientist, will allow *composable* software to be developed and reused to advance the state of the art, and support [FAIR data principles](https://www.go-fair.org/fair-principles/).

### Accessibility and Transferability
QIIME 2 stores all data as a directory structure inside of a ZIP file.
There is a {term}`payload` directory named `/data/` where data is stored in a common format. This permits additional *metadata* to be stored alongside the data (in non-`/data/` directories or files).

A directory-based archive is used to store data in a way that is accessible.
When a common format exists for a data type (e.g., FASTA format for sequence data) it should be used to be as accessible as possible.
When such a format does not exist, the data should be stored in a plain-text structure that is as self-descriptive as possible.
The goal is that a person in 20 years might be able to glance at it and roughly understand what the purpose of a given document is (assuming file-systems and text-based encodings still make sense in the future).
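
As a minimal sketch of the layout described above, the following builds a toy ZIP archive in memory and separates the payload from everything else using only Python's standard `zipfile` module. The root directory name `example-archive`, the file names, and the helper functions are illustrative assumptions, not names taken from a real QIIME 2 archive.

```python
import io
import zipfile

# Hypothetical root directory name; a real archive chooses its own.
ROOT = "example-archive"


def build_toy_archive() -> bytes:
    """Create a small in-memory ZIP that imitates the layout described above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        # Metadata lives outside of data/ so it cannot collide with the payload.
        zf.writestr(f"{ROOT}/metadata.yaml", "type: ExampleType\n")
        # The payload is stored in a common format (FASTA here).
        zf.writestr(f"{ROOT}/data/sequences.fasta", ">seq1\nACGT\n")
    return buffer.getvalue()


def split_members(archive_bytes: bytes):
    """Return (payload, other) member names, split on the data/ directory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        names = zf.namelist()
    payload = [name for name in names if "/data/" in name]
    other = [name for name in names if "/data/" not in name]
    return payload, other


payload, other = split_members(build_toy_archive())
print("payload:", payload)
print("other:", other)
```

Because the archive is an ordinary ZIP file, nothing in this sketch depends on QIIME 2 being installed.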

@@ -35,7 +41,7 @@ Alternatively a new format could be invented with new rules (though this would m
For these reasons, QIIME 2 stores data as a directory structure.
In particular, data such as FASTA or Newick files are considered the {term}`Payload`, which is what gets delivered to a tool.

````{margin}
````{margin}
```{note}
QIIME 2's `.qza` and `.qzv` files are ZIP files with a specific internal structure.
You can open these with any standard unzip utility (e.g., WinZip, 7zip, `unzip`), even on systems that don't have QIIME 2 installed.
@@ -66,7 +72,7 @@ To accomplish this, we need data about the data, or *metadata* (in the general s
If the {term}`payload` is placed in a *subdirectory* then we can store additional files which can contain this *metadata* without needing to worry about filename conflicts with the payload itself.
Now QIIME 2 is able to record a type and anything else that may enable the computer (or user) to make a more informed decision about the use of a given piece of data.

See [](types-explanation) to get more information on how types are used and defined in QIIME 2.
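
To illustrate how a recorded type can support that kind of decision, here is a rough sketch that reads a flat `key: value` metadata file and checks its `type` entry against the types an analysis accepts. The keys `uuid`, `type`, and `format`, the example values, and both helper functions are assumptions made for illustration; QIIME 2's actual type checking is richer than a simple set membership test.

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse 'key: value' lines; enough for this flat sketch, not general YAML."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        result[key.strip()] = value.strip()
    return result


def is_valid_input(metadata: dict, accepted_types: set) -> bool:
    """Return True when the recorded type is one the analysis accepts."""
    return metadata.get("type") in accepted_types


# Assumed example content for a metadata file stored beside the payload.
EXAMPLE_METADATA = """\
uuid: 00000000-0000-0000-0000-000000000000
type: FeatureTable[Frequency]
format: BIOMV210DirFmt
"""

metadata = parse_flat_yaml(EXAMPLE_METADATA)
print(is_valid_input(metadata, {"FeatureTable[Frequency]"}))
print(is_valid_input(metadata, {"Phylogeny[Rooted]"}))
```

The point of the sketch is only that the decision can be made from the metadata alone, without opening the payload.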

### Interoperability and Extension
QIIME 2 stores a string called a {term}`Directory Format` in `metadata.yaml` which instructs the computer what the specific layout of `/data/` is.
@@ -89,10 +95,10 @@ See [](formats-explanation) to learn more about QIIME 2's formats and directory
### Provenance Metadata

Inside of each Archive, QIIME 2 stores metadata about how that archive was generated.
We call this "provenance".
Notably, each Archive contains provenance information about *every* prior QIIME 2 {term}`Action` involved in its creation, from `import` to the most recent step in the analysis.

This provenance information includes type and format information, system and environment details, the Actions performed and all parameters passed to them, and all registered citations.

See [](provenance-explanation) to learn more about data provenance storage in QIIME 2.

14 changes: 7 additions & 7 deletions book/framework/explanations/formats.md
@@ -1,8 +1,8 @@
(formats-explanation)=
# File Formats and Directory Formats

{term}`Formats <Format>` in QIIME 2 are on-disk representations of {term}`semantic types <Semantic Type>`.
They are stored in or read from the `data/` directory when reading and writing {term}`artifacts <Artifact>`.
{term}`Formats <Format>` in QIIME 2 are on-disk representations of data.
When associated with an {term}`artifact class`, they define how data are stored in or read from the `data/` directory of {term}`artifacts <Artifact>`.

QIIME 2 doesn't have much of an opinion on how data should be represented or stored, so long as it *can* be represented or stored in an {term}`archive`.

@@ -27,9 +27,9 @@ Here we provide an example of a `TextFileFormat` definition, with a focus on the
class IntSequenceFormat(TextFileFormat):
"""
A sequence of integers stored on new lines in a file.
To make validation more interesting, values in the list can be any integer as long
as that integer is not equal to the previous value plus 3
(i.e., `line[i] != (line[i-1]) + 3`).
"""
def _validate_n_ints(self, n):
with self.open() as fh:
@@ -128,7 +128,7 @@ class CasavaOneEightSingleLanePerSampleDirFmt(model.DirectoryFormat):
```
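
Stepping back to the `IntSequenceFormat` docstring earlier in this file, its rule (every line is an integer, and no value equals the previous value plus 3) can be sketched in plain Python without the framework. The function name `find_rule_violations` and its `max_lines` parameter are hypothetical; `max_lines` loosely mirrors the `n` argument of `_validate_n_ints`, whose full body is not shown in this diff.

```python
def find_rule_violations(lines, max_lines=None):
    """Return (line_number, message) pairs for lines that break the rule.

    Rule: every line is an integer, and line[i] != line[i - 1] + 3.
    If max_lines is given, only the first max_lines lines are checked.
    """
    problems = []
    previous = None
    for number, raw in enumerate(lines, start=1):
        if max_lines is not None and number > max_lines:
            break
        text = raw.strip()
        try:
            value = int(text)
        except ValueError:
            problems.append((number, f"{text!r} is not an integer"))
            previous = None  # Nothing valid to compare the next line against.
            continue
        if previous is not None and value == previous + 3:
            problems.append((number, f"{value} equals the previous value plus 3"))
        previous = value
    return problems


print(find_rule_violations(["1", "2", "5", "x", "8"]))
```

A real format class would raise a validation error rather than return a list; collecting the problems here just makes the rule easy to inspect.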

## Single File Directory Formats
Currently QIIME 2 requires that all formats registered to a {term}`Semantic Type` be a directory format.
Currently QIIME 2 requires that all formats registered to an {term}`artifact class` be a directory format.
For those cases, there exists a factory for quickly constructing directory layouts that contain *only a single file*.
This requirement might be removed in the future, but for now it is a necessary evil (and also isn't too much extra work for format developers).

@@ -140,6 +140,6 @@ DNASequencesDirectoryFormat = model.SingleFileDirectoryFormat(
## Associating Formats with a Type

Formats on their own aren't of much use.
It is only once they are registered with a {term}`Semantic Type` to define an *artifact class* that things become interesting.
It is only once they are associated with a {term}`semantic type` to define an *artifact class* that things become interesting.
Artifact classes define the data that can be provided as input or generated as output by QIIME 2 `Actions`.
An example of this can be seen in the [registration of the `SampleData[PairedEndSequencesWithQuality]` artifact class](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_type.py#L66).
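
The relationship described here can be modeled with a small, framework-free sketch: an artifact class is the pairing of one semantic type with one directory format. Everything below (`ArtifactClass`, `ToyRegistry`, and the format name `ExamplePairedEndDirFmt`) is a hypothetical stand-in for illustration and is not QIIME 2's plugin API; the linked q2-types source shows a real registration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactClass:
    """Conceptual pairing of a semantic type with a directory format."""
    semantic_type: str
    directory_format: str
    description: str = ""


class ToyRegistry:
    """Hypothetical registry; not part of QIIME 2."""

    def __init__(self):
        self._classes = {}

    def register_artifact_class(self, semantic_type, directory_format,
                                description=""):
        # Each semantic type is tied to exactly one directory format here.
        if semantic_type in self._classes:
            raise ValueError(f"{semantic_type} is already registered")
        self._classes[semantic_type] = ArtifactClass(
            semantic_type, directory_format, description)

    def format_for(self, semantic_type):
        """Look up the directory format associated with a semantic type."""
        return self._classes[semantic_type].directory_format


registry = ToyRegistry()
registry.register_artifact_class(
    "SampleData[PairedEndSequencesWithQuality]",
    "ExamplePairedEndDirFmt",  # Illustrative format name, not a real one.
    description="Paired-end reads with quality scores, one set per sample.",
)
print(registry.format_for("SampleData[PairedEndSequencesWithQuality]"))
```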
12 changes: 6 additions & 6 deletions book/framework/explanations/provenance.md
@@ -10,10 +10,10 @@ If that Result is saved as a `.qza` or `.qzv`, the captured provenance data pers
[](provenance-goes-in-provenance) contains a detailed discussion of the file structure which holds provenance metadata.

This is a **decentralized** approach to provenance capture: every QIIME 2 Result's provenance is packaged with the Result itself.
This prevents disassociation of a Result with its provenance, for example when a simpler analysis outcome (e.g., a `.png` file containing a graph) is emailed to a colleague.
It also prevents accidental mis-association of a Result with the wrong provenance, or with inaccurate or outdated provenance records.
For example, with *centralized approaches to provenance tracking*, like saving scripts or computational notebooks, those records may be updated.
If a result was generated prior to a script update, and the result is inadvertently not updated, a subsequent user or viewer of that result may have inaccurate information about how it was created.
In contrast, the provenance captured within a QIIME 2 Archive will always describe the way that Archive was actually created.

QIIME 2's approach is also a form of **retrospective** provenance capture.
@@ -86,7 +86,7 @@ That directory contains:
- `metadata.yaml` (see [](metadata-yaml))
- `citations.bib`: All citations related to the run Action, in [bibtex format](https://www.bibtex.com/g/bibtex-format/).
(This includes "passthrough" citations like those registered to transformers, regardless of the plugin where they are registered.)
- `action/action.yaml`: A YAML description of the Action and its environment (i.e., the stuff we care most about).
- `action/metadata.tsv` or other data files (optional): Data captured to provide additional Action context.
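
The entries listed above can be checked mechanically. The following sketch takes a list of archive member names and reports which of the non-optional entries named in this list are missing from a given provenance directory. The archive root `example-archive`, the `provenance/` path, and the helper name are assumptions for illustration, and the list of required files covers only the entries visible here.

```python
# Non-optional entries named in the list above (other entries may exist).
REQUIRED_FILES = ("metadata.yaml", "citations.bib", "action/action.yaml")


def missing_provenance_files(member_names, provenance_dir):
    """Return the required files that are absent from provenance_dir."""
    prefix = provenance_dir.rstrip("/") + "/"
    present = {
        name[len(prefix):] for name in member_names if name.startswith(prefix)
    }
    return [name for name in REQUIRED_FILES if name not in present]


# Hypothetical member names, e.g. as returned by zipfile.ZipFile.namelist().
members = [
    "example-archive/provenance/metadata.yaml",
    "example-archive/provenance/citations.bib",
    "example-archive/data/table.tsv",
]
missing = missing_provenance_files(members, "example-archive/provenance")
print(missing)
```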

### The `action.yaml` file
@@ -98,7 +98,7 @@ The `action.yaml` files are broken into three top-level sections, in this order:
- `environment`: a description of the system and the Python environment where this action was executed

The sample Visualization we're referring to can be viewed [here](https://view.qiime2.org/provenance/?src=https%3A%2F%2Fdocs.qiime2.org%2F2021.4%2Fdata%2Ftutorials%2Fmoving-pictures%2Fcore-metrics-results%2Funweighted_unifrac_emperor.qzv).
That link will open directly to the *Provenance* tab where you can click on the bottom square in that provenance graph (not the circle within the square!) to cross-reference the information provided here.

#### The `execution` block

@@ -174,7 +174,7 @@ action:
- `parameters` lists registered parameter names, and the user-passed (or selected default) values.
- `output-name` is the name assigned to this Action's output *at registration*, which can be useful when determining which of an Action's multiple outputs a file represents.
(This does not capture the user-passed filename, because file names are interface specific and may not always be relevant - for example, when using the Python API, files may not exist for inputs at the time of action execution.)
- `alias-of` is an optional field, present if the Action is the terminal result of a QIIME 2 {term}`Pipeline`.
This value is the UUID of the "inner" result which this pipeline result aliases.
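
As a small sketch of how these fields might be consumed, the function below takes an already-parsed `action` block (represented as a plain dictionary) and reports whether the output is a pipeline alias. The dictionary shape, the example values, and the function name are assumptions for illustration; only the keys `output-name` and `alias-of` come from the description above.

```python
def summarize_action(action_block: dict) -> str:
    """Describe an output, noting the inner result it aliases, if any."""
    output_name = action_block.get("output-name", "<unknown>")
    alias_of = action_block.get("alias-of")  # Optional field; may be absent.
    if alias_of is None:
        return f"output '{output_name}' (not an alias)"
    return f"output '{output_name}' aliases inner result {alias_of}"


# Hypothetical parsed blocks; the UUID is a placeholder, not real data.
direct = {"output-name": "visualization"}
terminal = {
    "output-name": "visualization",
    "alias-of": "00000000-0000-0000-0000-000000000000",
}
print(summarize_action(direct))
print(summarize_action(terminal))
```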

#### The `environment` block
@@ -239,7 +239,7 @@ The user sees only five of the fifteen captured Results, each of which they ran
Because the bottom two are pipelines, this view simply but completely represents the provenance of the Archive being viewed.

This is possible because QIIME 2 captures redundant "terminal" pipeline outputs that alias the "real" pipeline outputs nested in provenance.
These terminal outputs have the same {term}`Semantic Type` as the Results they alias, but capture provenance details at the scope of the Pipeline, rather than at the scope of the Method or Visualizer they alias.
These terminal outputs are of the same {term}`artifact class` as the Results they alias, but capture provenance details at the scope of the Pipeline, rather than at the scope of the Method or Visualizer they alias.

#### Pipeline provenance example

8 changes: 7 additions & 1 deletion book/framework/explanations/types.md
@@ -1,7 +1,13 @@
(types-explanation)=
# Semantic Types, Primitives, and Visualizations

```{warning}
Some of the discussion in this section uses *semantic type* as a synonym for {term}`artifact class`.
While closely related, we're working on clarifying our language to distinguish these ideas.
We haven't reviewed/updated this chapter for this yet.
```

If you're unfamiliar with the various ways that the word *type* is used in the context of QIIME 2, we recommend reading [](types-of-types) before this document.
This document provides a deep dive into semantic types, primitives, and visualizations in QIIME 2: in other words, descriptors of the various inputs and/or outputs to QIIME 2 {term}`Actions <Action>`.

```{eval-rst}