File Definitions

Timothy Tickle edited this page Feb 16, 2017 · 62 revisions

Several files are associated with different aspects of a study. Each section here describes a file, indicates its use, and provides an example. Depending on the wishes of the study owner, these files can be supplied and downloaded as part of a study.

Primary sequencing files

Fastq.gz or Fastq.tar.gz

Purpose: Fastq files contain sequence and sequence quality information. Making these files available allows the analysis to be repeated and explored at the sequence level. Fastq files must be gzip compressed before uploading; compression allows faster uploading and downloading for users of the portal. We encourage you to load primary fastq.gz files of non-human samples. Sequence files derived from human samples should be placed in a biological sequence archive and linked to your study. Both loading a non-human fastq.gz file and linking to a human fastq file in an external archive are supported.
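The compression step can be done with any gzip tool. As a minimal sketch using only the Python standard library, the helper below (gzip_fastq is a hypothetical name, not part of the portal) writes a .gz copy next to the original file:

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(fastq_path):
    """Compress a fastq file to <name>.gz beside the original and return the new path.

    Hypothetical helper for illustration; it is not part of the portal tooling.
    """
    src = Path(fastq_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the copy so large files are never held in memory all at once.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling gzip_fastq("sample.fastq") would leave sample.fastq in place and create sample.fastq.gz for upload.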

If you are loading large or many files, we are available to help guide you through the process. Large or numerous files are best uploaded using other methods rather than the wizard. Learn more here

Format: To learn more about fastq files and their format try this Wikipedia entry.

Note: Not required for study visualization.

Analysis description files

Coordinates files

Purpose: These files create the different ordinations/plots of cells in the study. You are welcome to create as many of these plots as you would like. Each plot is created by loading a different file. These plots can be 2D or 3D and can contain metadata specific to just the plot (and not shared with other plots). These files/plots do not need to contain all your cells, giving you freedom to plot subsets of your study for targeted visualization.

Format: This is a tab delimited file with 3 required columns (with the option of more) and 2 required header rows. Columns 1 - 3 are required and are as follows: cell name (NAME), X coordinate (X), and Y coordinate (Y). Additional columns can be included. If a fourth column, Z coordinate (Z), is given, the plot will be 3D, using the Z column as the third dimension. Any other columns included after the required columns (NAME, X, and Y) are treated as metadata that can be plotted on this plot but not on other plots (global metadata should be in the metadata file). Each row of this file is a point/cell in the plot. When defining cell names, please use alphanumeric characters; underscores are also valid. These cell names should match the cell names in your other files. Not all cells in the expression matrix must be in this file; you are welcome to plot subsets of your cells.
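As a rough pre-check before running verify_portal_file.py, the rules above can be sketched in a few lines of Python. This is not the portal's validator: check_coordinates is a hypothetical function, and because this page does not describe the contents of the second header row, the sketch skips that row without inspecting it.

```python
import csv
import io
import re

REQUIRED = ["NAME", "X", "Y"]
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")  # alphanumeric characters and underscores


def check_coordinates(handle):
    """Return a list of problems found in a tab delimited coordinates file."""
    rows = list(csv.reader(handle, delimiter="\t"))
    if len(rows) < 2:
        return ["expected 2 header rows"]
    problems = []
    header = rows[0]
    if header[:3] != REQUIRED:
        problems.append("first three columns must be NAME, X, Y")
    # A fourth column named Z makes the plot 3D; later columns are plot metadata.
    n_coord = 4 if len(header) > 3 and header[3] == "Z" else 3
    # Assumption: data rows begin after the two header rows (lines 1 and 2).
    for line_no, row in enumerate(rows[2:], start=3):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} columns")
            continue
        if not VALID_NAME.match(row[0]):
            problems.append(f"line {line_no}: invalid cell name {row[0]!r}")
        for value in row[1:n_coord]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric coordinate {value!r}")
    return problems


# The second row here is a placeholder, since its required content is not given on this page.
sample = "NAME\tX\tY\nTYPE\tnumeric\tnumeric\ncell_1\t0.5\t1.2\ncell-2\t0.1\tabc\n"
for problem in check_coordinates(io.StringIO(sample)):
    print(problem)
```

The sample deliberately contains a cell name with a hyphen and a non-numeric Y value, so both are reported for line 4.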

Example Coordinates File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --coordinates_file cluster_coordinates_example.txt

Cluster assignments

Purpose: This file indicates which samples are in which cluster and/or subcluster. The current system supports visualizing cells as clusters. In ordinations all members of clusters are painted the same color; when viewing a gene's expression, expression is grouped into box plots per cluster. Optional subclusters can be defined; subclusters are used in tiered analysis when, after large clusters are determined, clusters are decomposed to subgroups of finer resolution.

Format: This is a tab delimited file of three columns: cell name, cluster, and subcluster. The columns contain, in order, a cell id, the major cluster grouping for the cell, and a subcluster grouping for the cell (if given). Please use names and not numbers for the different cluster groupings to better describe what each grouping represents. When defining cluster names, please use alphanumeric characters; underscores are also valid.
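To make the three-column layout concrete, here is a minimal sketch of reading such a file in Python. The function name read_cluster_assignments and its skip_rows parameter are hypothetical, and the sketch assumes by default that the file has no header row, since none is described here.

```python
import csv
import io
import re

VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")  # alphanumeric characters and underscores


def read_cluster_assignments(handle, skip_rows=0):
    """Map each cell id to a (cluster, subcluster) pair; subcluster may be None.

    skip_rows is an assumption of this sketch: set it if your file has header rows.
    """
    assignments = {}
    for index, row in enumerate(csv.reader(handle, delimiter="\t")):
        if index < skip_rows or not row:
            continue
        if len(row) < 2:
            raise ValueError(f"line {index + 1}: need at least cell name and cluster")
        cell, cluster = row[0], row[1]
        # The subcluster column is optional, so treat a missing or empty value as None.
        subcluster = row[2] if len(row) > 2 and row[2] else None
        for name in (cell, cluster, subcluster):
            if name is not None and not VALID_NAME.match(name):
                raise ValueError(f"line {index + 1}: invalid name {name!r}")
        assignments[cell] = (cluster, subcluster)
    return assignments


sample = "cell_1\tNeuron\tExcitatory\ncell_2\tNeuron\tInhibitory\ncell_3\tGlia\n"
print(read_cluster_assignments(io.StringIO(sample)))
```

The cluster names in the sample (Neuron, Glia) are illustrative only; they follow the advice above to prefer descriptive names over numbers.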

Example Cluster Assignments File

Note: Required for study visualization.
Note: If this file is deleted, the cluster coordinates file is deleted as well. This ensures that both files are updated together, since they are closely related.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --cluster_file cluster_assignments_example.txt

Expression matrices

Purpose: This file contains the RNA-Seq expression of a study. The values are used throughout the study in many visualizations. Although the form of the expression data is ultimately up to the author of the study, we recommend some variant of log2(TPM +1).

Format: This is a tab or comma delimited file, with cells as columns and genes as rows. The first column and first row should be gene ids and cell ids respectively. When defining ids, please use alphanumeric characters; underscores are also valid.
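As an illustration of the layout and of the recommended log2(TPM + 1) transform, the sketch below writes a small tab delimited matrix from TPM values. The function name write_expression_matrix, the gene and cell ids, and the GENE label in the top-left cell are assumptions made for this example; this page does not specify what the top-left cell should contain.

```python
import csv
import io
import math


def write_expression_matrix(tpm, cells, handle):
    """Write log2(TPM + 1) values with genes as rows and cells as columns.

    tpm maps each gene id to a list of TPM values, one per entry in cells.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumption: the top-left cell is labelled GENE; the page does not specify it.
    writer.writerow(["GENE"] + list(cells))
    for gene, values in tpm.items():
        if len(values) != len(cells):
            raise ValueError(f"{gene}: expected {len(cells)} values")
        writer.writerow([gene] + [f"{math.log2(v + 1):.4f}" for v in values])


out = io.StringIO()
write_expression_matrix({"Actb": [0.0, 3.0], "Gapdh": [1.0, 7.0]}, ["cell_1", "cell_2"], out)
print(out.getvalue())
```

Adding 1 before taking the logarithm keeps zero counts at zero instead of producing an undefined value.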

Example Expression Matrices File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --expression expression_matrix_example.txt

Gene lists

Purpose: Panels of genes are often important results of an analysis. They can be derived by many methods, including differential expression, enrichment analysis, or expert knowledge. Multiple gene lists can be uploaded, allowing others to explore the expression pattern of the genes in each list. Parts of the portal visualize these panels as grouped expression within clusters or subclusters of cells (e.g. using boxplots). The per-cluster expression value for each gene is under the control of the author and is supplied in this file. Different authors may wish to use different methods to define a measurement that works like an average expression of the gene in each cluster.

Format: This is a tab delimited file of at least 2 columns. The first column contains the gene names. Each remaining column holds the measurement (like an average expression) of the genes in one cluster or subcluster (the details of this summary measurement are up to the author of the study). There can be many columns representing different clusters or subclusters.
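One possible way to produce such a file is sketched below, using the arithmetic mean per cluster as the summary measurement. The mean is only one choice among many, and the function names, the header row, and its GENE NAMES label are assumptions of this example rather than requirements stated on this page.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def cluster_means(expression, cell_to_cluster, genes):
    """Return (cluster names, rows) where each row is [gene, mean per cluster...].

    expression maps gene -> {cell id: value}; cell_to_cluster maps cell id -> cluster.
    """
    cells_by_cluster = defaultdict(list)
    for cell, cluster in cell_to_cluster.items():
        cells_by_cluster[cluster].append(cell)
    names = sorted(cells_by_cluster)
    rows = []
    for gene in genes:
        # Arithmetic mean is an assumed summary; authors may prefer another measure.
        summary = [mean([expression[gene][c] for c in cells_by_cluster[n]]) for n in names]
        rows.append([gene] + summary)
    return names, rows


def write_gene_list(names, rows, handle):
    """Write the gene list as tab delimited text with one column per cluster."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumption: a header row naming the clusters; not specified on this page.
    writer.writerow(["GENE NAMES"] + names)
    for row in rows:
        writer.writerow([row[0]] + [f"{value:.4f}" for value in row[1:]])


expression = {"Actb": {"c1": 2.0, "c2": 4.0, "c3": 1.0}}
clusters = {"c1": "Neuron", "c2": "Neuron", "c3": "Glia"}
names, rows = cluster_means(expression, clusters, ["Actb"])
out = io.StringIO()
write_gene_list(names, rows, out)
print(out.getvalue())
```

Swapping mean for a median or a trimmed mean only requires changing the one line that computes summary.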

Example Marker Gene Lists File

Note: Required for study visualization.

Other

Purpose: There is an option to upload an "Other" type file. Please use this to upload files you find important to communicate or document the findings of your study that are not currently supported. Please make sure the description of the file is clear so other scientists can interpret the file correctly.

Note: Not required for study visualization.