File Definitions

Timothy Tickle edited this page Feb 16, 2017 · 62 revisions

Several files are associated with different aspects of studies. Each section here describes a file, indicates its use, and provides an example. Depending on the wishes of the study owner, these files can be supplied and downloaded as part of a study.

Primary sequencing files

Fastq.gz or Fastq.tar.gz

Purpose: Fastq files contain sequence and sequence quality information. Making these files available allows analysis to be repeated and explored at the sequence level. Fastq files must be gzip compressed before uploading; compression makes uploading and downloading faster for users of the portal. We encourage you to load primary fastq.gz files of non-human samples. Sequence files derived from human samples should be placed in a biological sequence archive and linked to your study. Both loading a non-human fastq.gz file and linking to a human fastq file in an external archive are supported.

If you are loading large files or many files, we are available to help guide you through the process. Such uploads are best done using methods other than the wizard. Learn more here

Format: To learn more about fastq files and their format try this Wikipedia entry.
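
As a minimal sketch of the compression step, assuming you prefer to do it from Python rather than with the gzip command line tool, the standard library can compress a fastq file in a streaming fashion. The function name gzip_fastq and the file name in the usage note are hypothetical and are not part of the portal.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(fastq_path):
    """Compress a fastq file to <name>.gz next to the original.

    Hypothetical helper for illustration; returns the path of the new file.
    """
    src = Path(fastq_path)
    dst = src.with_name(src.name + ".gz")
    # Copy in chunks so a large file is never held fully in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling gzip_fastq("sample.fastq") would write sample.fastq.gz beside the input and leave the original file untouched.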

Note: Not required for study visualization.

Analysis and Visualization files

Coordinates files

Purpose: These files create the different ordinations/plots of cells in the study. You are welcome to create as many of these plots as you would like. Each plot is created by loading a different file. These plots can be 2D or 3D and can contain metadata specific to just the plot (and not shared with other plots). These files/plots do not need to contain all your cells, giving you freedom to plot subsets of your study for targeted visualization.

Format: This is a tab delimited file with 3 required columns (with the option of more) and 2 required header rows.

Columns: Columns 1 - 3 are required and are as follows: cell name ("NAME"), X coordinate ("X"), and Y coordinate ("Y"). Additional columns can be included. If a fourth column with the Z coordinate ("Z") is given, the plot will be 3D, using the Z column as the third dimension. Any other columns included after the required columns ("NAME", "X", and "Y") will be treated as metadata that can be plotted on this plot but not on other plots (global metadata should be in the metadata file). When defining cell names, please use alphanumeric characters; underscores are also valid. These cell names should match cell names in your other documents. Not all cells in the expression matrix must be in this file; you are welcome to plot subsets of your cells.

Rows: Rows are tab delimited. The first two rows are required. The first row should start with "NAME", "X", "Y" and optionally "Z". Additional entries are the names of the optional plot-specific metadata you want to include in just this plot. These are the names the viewer will see when selecting metadata in the plot. The second row indicates what type of data is given in each column; "numeric" for continuous data or "group" for categorical labels. This row starts with "TYPE", "numeric", "numeric", and optionally "numeric" for the Z coordinate column. Additional entries will be either "numeric" or "group" indicating the data type of any additional columns. Starting at the third row, each row of this file is a point/cell in the plot with entries for each column (defining cell names, coordinates, and other associated metadata).
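
The layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name coordinates_text, its parameters, and the example column "batch" are hypothetical, and the sketch is not the portal's own tooling.

```python
import csv
import io


def coordinates_text(rows, has_z=False, extras=()):
    """Build coordinates-file text (hypothetical helper).

    rows:   list of dicts keyed by column name (NAME, X, Y, optional Z, extras)
    extras: (column_name, type) pairs, where type is "numeric" or "group"
    """
    header = ["NAME", "X", "Y"] + (["Z"] if has_z else [])
    # The coordinate columns are always typed "numeric".
    types = ["TYPE"] + ["numeric"] * (len(header) - 1)
    for name, kind in extras:
        if kind not in ("numeric", "group"):
            raise ValueError(f"unknown column type: {kind}")
        header.append(name)
        types.append(kind)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)  # row 1: column names
    writer.writerow(types)   # row 2: data type of each column
    for row in rows:         # row 3 onward: one cell per row
        writer.writerow([row[col] for col in header])
    return buf.getvalue()
```

For example, coordinates_text([{"NAME": "cell_1", "X": 1.5, "Y": -2.0, "batch": "A"}], extras=[("batch", "group")]) yields a header row, a TYPE row, and one cell row, with "batch" available as plot-specific metadata.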

Example Coordinates File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --coordinates_file coordinates_example.txt

Cluster assignments

Purpose: This file provides metadata describing cells in the study. The metadata provided in this file are interpreted as either "group" (categorical/factor) or "numeric" (continuous) data. These metadata are available throughout the visualization portal and can be used to paint cell plots. Categorical metadata paint the cells as discrete groups, each with its own color. Continuous metadata paint the cells with a color gradient. These metadata not only determine color on plots of cells; they are also used when viewing genes across cells. If categorical metadata are being viewed when a gene is searched, the gene will be shown as boxplots, one boxplot per group in the metadata. If continuous metadata are being viewed, the gene will be plotted against the metadata as a scatter plot. There is no restriction on the number of metadata to make available. These metadata will be globally available to all plots; if you would like metadata to be restricted to a specific plot of cells, include them in the coordinates file used to make that specific plot.

Format: This is a tab delimited file with one required column and two required rows.

Columns: The first column is required and contains the cell names; it should include all cells given in the expression file. Additional columns are the different metadata to be viewed.

Rows: The first of the two required rows starts with the entry "NAME", followed by the name of the metadata contained in each column. This is the name users will see and select in the portal. The second row starts with "TYPE" and then contains the value "group" or "numeric" for each column of metadata. Each additional row describes a cell, giving first the cell name and then its metadata entries. The cell names should match cell names in other files. Please try to use descriptive metadata, naming groups in ways others will understand as they view them. Please use alphanumeric characters; underscores are also valid.
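
The rules above can be checked with a few lines of Python before uploading. The sketch below is a hypothetical checker, not the verify_portal_file.py script, and it makes one loud assumption: it treats the request for alphanumeric and underscore cell names as a hard rule, which may be stricter than the portal itself.

```python
import csv
import io
import re

# Assumption: cell names may contain only letters, digits, and underscores.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_metadata_text(text):
    """Return a list of problems found in metadata-file text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if len(rows) < 2:
        return ["file needs a NAME row and a TYPE row"]
    header, types = rows[0], rows[1]
    problems = []
    if not header or header[0] != "NAME":
        problems.append('first row must start with "NAME"')
    if not types or types[0] != "TYPE":
        problems.append('second row must start with "TYPE"')
    if len(types) != len(header):
        problems.append("NAME row and TYPE row differ in length")
    for kind in types[1:]:
        if kind not in ("group", "numeric"):
            problems.append(f'unknown type "{kind}"')
    # Data rows start at line 3 of the file.
    for line_no, row in enumerate(rows[2:], start=3):
        if len(row) != len(header):
            problems.append(f"row {line_no}: expected {len(header)} entries")
            continue
        if not VALID_NAME.fullmatch(row[0]):
            problems.append(f'row {line_no}: cell name "{row[0]}" has invalid characters')
        for value, kind in zip(row[1:], types[1:]):
            if kind == "numeric":
                try:
                    float(value)
                except ValueError:
                    problems.append(f'row {line_no}: "{value}" is not numeric')
    return problems
```

An empty list means the text passed these checks only; the portal may apply further checks that this sketch does not cover.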

Example Metadata File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --metadata_file metadata_example.txt

Expression matrices

Purpose: This file contains the measurements of a study; in RNA-Seq, this would be the expression of genes in your cells. The values are used throughout the study in many visualizations. Although the form of the expression data is ultimately up to the author of the study, we recommend some variant of log2(TPM + 1). You will be able to indicate what format you are uploading in the upload wizard so that, when viewing expression, the axes are correctly labeled. If you also want to upload a raw matrix of measurements (like a count matrix), this can be done as a miscellaneous file in the wizard.

Format: This is a tab or comma delimited file with columns as cells and rows as genes. The first column should be gene names and the first row should be cell names. When defining names, please use alphanumeric characters; underscores are also valid.
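
As a rough sketch of the recommended log2(TPM + 1) transform and the genes-by-cells layout, the following Python writes a tab delimited matrix from raw TPM values. The function name, the rounding to four decimals, and the "GENE" label in the top-left corner are assumptions made for this illustration; the text above does not specify what the first entry of the header row should be.

```python
import csv
import io
import math


def log2_tpm_matrix_text(genes, cells, tpm):
    """Return matrix text with log2(TPM + 1) values (hypothetical helper).

    tpm[i][j] is the TPM of genes[i] in cells[j].
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # First row: cell names. "GENE" as the corner label is an assumption.
    writer.writerow(["GENE"] + list(cells))
    for gene, values in zip(genes, tpm):
        # First column: gene name, then one transformed value per cell.
        writer.writerow([gene] + [round(math.log2(v + 1), 4) for v in values])
    return buf.getvalue()
```

Adding 1 before taking the logarithm keeps zero TPM values defined, since log2(0 + 1) is 0.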

Example Expression Matrices File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --expression expression_example.txt

Gene lists

Purpose: Panels of genes are often important results of an analysis. These gene lists can be derived by many methods (e.g. differential expression, enrichment analysis, or expert knowledge). Multiple gene lists can be uploaded, which allows others to explore the genes in each list. The portal will visualize these genes within the metadata groupings provided in this file. For each group, the measurement (expression) of each gene is under the control of the author and is supplied in this file. Different authors may wish to use different methods to summarize genes over a group of cells (like an average, trimmed average, or median expression of the gene in cell groups); supplying those summary measurements gives you complete control over the method used.

Format: This is a tab delimited file of at least 2 columns.

Columns: The first column contains the gene names; these names should match gene names in other documents in your study. You may include an arbitrary number of genes and need not include all the genes in your study. Different gene lists may also partially overlap as needed. The remaining columns hold the measurement (like an average expression) of the genes in different categorical metadata groupings of interest. There can be many columns representing different categorical metadata groups.

Rows: The first row starts with the entry "GENE NAMES", followed by the names of the categorical groupings from the metadata for which you would like to provide summary measurements (like averages within that group). Each row after the first starts with a gene name, followed by the summary measurement for that gene in the cells that belong to the metadata grouping for that column.
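
One way to produce such a file is sketched below, assuming the plain mean as the summary method; a median or trimmed average would be equally valid choices, as noted above. The function name and its input structures (a nested dict of expression values and a dict mapping cells to groups) are hypothetical.

```python
import csv
import io
from statistics import mean


def gene_list_text(genes, expression, cell_groups):
    """Return gene-list text with the mean expression per group.

    expression:  {gene: {cell: value}}
    cell_groups: {cell: group name}
    """
    groups = sorted(set(cell_groups.values()))
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # First row: "GENE NAMES" followed by one column per group.
    writer.writerow(["GENE NAMES"] + groups)
    for gene in genes:
        row = [gene]
        for group in groups:
            # Collect this gene's values over the cells in the group.
            values = [expression[gene][c] for c, g in cell_groups.items() if g == group]
            row.append(round(mean(values), 4))
        writer.writerow(row)
    return buf.getvalue()
```

Swapping mean for statistics.median changes the summary method without affecting the file layout.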

Example Marker Gene Lists File

Note: Not required for study visualization.

Miscellaneous

Purpose: There is an option to upload a "Miscellaneous" type file. Please use this to upload files you find important to communicate or document the findings of your study that are not currently supported. Please make sure the description of the file is clear so other scientists can interpret the file correctly.

Note: Not required for study visualization.