
File Definitions

Timothy Tickle edited this page Jan 3, 2017 · 62 revisions

Several file types are associated with different aspects of a study. Each section here describes a file, indicates its use, and provides an example. Depending on the wishes of the study owner, these files can be supplied and downloaded as part of a study.

Primary sequencing files

Fastq.gz or Fastq.tar.gz

Purpose: Fastq files contain sequence and sequence quality information. Making these files available allows an analysis to be repeated and explored at the sequence level. Fastq files must be gzip compressed before uploading, which makes uploads and downloads faster for users of the portal. We encourage you to load primary fastq.gz files of non-human samples. Sequence files derived from human samples should be placed in a biological sequence archive and linked to your study. Both options are supported: loading a non-human fastq.gz file, or linking to a human fastq file in an external archive.

If you are loading many files, you are welcome to use tarred archives of fastq files. Please make sure they have the extension fastq.tar.gz.
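As a rough illustration of the compression and naming requirements above, the following sketch uses only the Python standard library. The helper names `gzip_fastq` and `bundle_fastqs` are hypothetical and not part of the portal tooling, and the page does not say whether files inside a tarred archive should themselves be gzipped, so the sketch leaves that choice to the caller.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gzip_fastq(fastq_path):
    """Write a gzip-compressed copy of a fastq file as <name>.gz beside it."""
    fastq_path = Path(fastq_path)
    gz_path = fastq_path.with_name(fastq_path.name + ".gz")
    with open(fastq_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def bundle_fastqs(fastq_paths, archive_path):
    """Bundle several fastq files into one gzip-compressed tar archive.

    The archive name is checked against the extension requested on this
    page (fastq.tar.gz); everything else here is an assumption.
    """
    archive_path = Path(archive_path)
    if not archive_path.name.endswith("fastq.tar.gz"):
        raise ValueError("archive name must end with fastq.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in fastq_paths:
            # Store each file under its base name, without local directories.
            tar.add(path, arcname=Path(path).name)
    return archive_path
```

For example, `bundle_fastqs(["sample_1.fastq", "sample_2.fastq"], "run.fastq.tar.gz")` would produce a single archive to upload (file names here are placeholders).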

Format: To learn more about fastq files and their format, see the Wikipedia entry on the FASTQ format.

Note: Not required for study visualization.

Analysis description files

Cluster coordinates file

Purpose: This file creates the main ordination of cells in the study.

Format: This is a tab delimited file of three columns: cell name, x coordinate, and y coordinate. Each row of this file is a point in the main study ordination. When defining cell names, please use alphanumeric characters; underscores are also valid.
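To make the three-column layout concrete, here is a minimal sketch of a per-row check. It is not the logic of verify_portal_file.py; the function name `check_coordinates_line` is hypothetical, and the sketch assumes rows carry no header line, which this page does not specify.

```python
import re

# Cell names: alphanumeric characters and underscores, as recommended above.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_coordinates_line(line):
    """Parse one 'cell<TAB>x<TAB>y' row and return (cell, x, y).

    Raises ValueError if the row does not have exactly three fields,
    the cell name has unexpected characters, or x/y are not numbers.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-delimited fields, got {len(fields)}")
    cell, x, y = fields
    if not VALID_NAME.fullmatch(cell):
        raise ValueError(f"invalid cell name: {cell!r}")
    # float() raises ValueError itself for non-numeric text.
    return cell, float(x), float(y)
```

A row such as `cell_1<TAB>0.5<TAB>-1.25` would pass this check, while a row with a missing coordinate would be rejected.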

Example Cluster Coordinates File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --coordinates_file cluster_coordinates_example.txt

Cluster assignments

Purpose: This file indicates which cells are in which cluster and/or subcluster. The current system supports visualizing cells as clusters. In ordinations, all members of a cluster are painted the same color; when viewing a gene's expression, expression is grouped into box plots per cluster. Optional subclusters can be defined. Subclusters are used in tiered analysis, where large clusters are determined first and then decomposed into subgroups of finer resolution.

Format: This is a tab delimited file of three columns: cell name, cluster, and subcluster. The columns contain, in order, a cell id, the major cluster grouping for the cell, and a subcluster grouping for the cell (if given). Please use names rather than numbers for the cluster groupings, so that each name describes what the grouping represents. When defining cluster names, please use alphanumeric characters; underscores are also valid.
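The row layout can be sketched as a small parser. This is an illustration only: `parse_assignment` is a hypothetical name, and the handling of a missing subcluster (an empty or absent third field becomes `None`) is an assumption, since the page only says the subcluster is optional.

```python
import re

# Names: alphanumeric characters and underscores, as recommended above.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def parse_assignment(line):
    """Parse one 'cell<TAB>cluster<TAB>subcluster' row.

    Returns (cell, cluster, subcluster), with subcluster set to None
    when the third field is empty or missing (an assumption).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited fields, got {len(fields)}")
    cell, cluster = fields[0], fields[1]
    subcluster = fields[2] if len(fields) == 3 and fields[2] else None
    for name in (cell, cluster, subcluster):
        if name is not None and not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid name: {name!r}")
    return cell, cluster, subcluster
```

For instance, a row assigning `cell_1` to cluster `Neurons` and subcluster `Interneurons` would parse into a three-element tuple (the cluster names are made up for illustration).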

Example Cluster Assignments File

Note: Required for study visualization.
Note: Deleting this file also deletes the cluster coordinates file. This ensures that the two closely related files are always updated together.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --cluster_file cluster_assignments_example.txt

Expression matrices

Purpose: This file contains the RNA-Seq expression of a study. The values are used throughout the study in many visualizations. Although the form of the expression data is ultimately up to the author of the study, we recommend some variant of log2(TPM +1).
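The recommended transform is simply $\log_2(\mathrm{TPM} + 1)$ applied to each value. A minimal sketch follows, with the hypothetical helper name `log2_tpm_plus_one`; how TPM values are computed in the first place is outside the scope of this page.

```python
import math


def log2_tpm_plus_one(tpm):
    """Return log2(TPM + 1) for a single non-negative TPM value.

    Adding 1 keeps zero expression at zero and avoids log2(0).
    """
    if tpm < 0:
        raise ValueError("TPM values cannot be negative")
    return math.log2(tpm + 1.0)
```

Under this transform a TPM of 0 maps to 0 and a TPM of 3 maps to 2.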

Format: This is a tab or comma delimited file in which columns are cells and rows are genes. The first column should contain gene ids and the first row should contain cell ids. When defining ids, please use alphanumeric characters; underscores are also valid.
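The genes-by-cells layout can be read with the standard `csv` module, as in the sketch below. The function name `read_expression_matrix` and its `delimiter` parameter are hypothetical, and the sketch assumes the first field of the header row is just a label for the gene column and can be ignored, which this page does not state.

```python
import csv
import io


def read_expression_matrix(text, delimiter="\t"):
    """Read a genes-by-cells matrix from delimited text.

    Returns (cell_ids, matrix) where matrix maps each gene id to a
    dict of {cell_id: expression value}.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    cell_ids = header[1:]  # assumption: header[0] labels the gene column
    matrix = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(cell_ids):
            raise ValueError(f"row for {gene!r} has {len(values)} values, "
                             f"expected {len(cell_ids)}")
        matrix[gene] = dict(zip(cell_ids, (float(v) for v in values)))
    return cell_ids, matrix
```

Every gene row is required to have one value per cell, which catches ragged rows before they reach the upload step.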

Example Expression Matrices File

Note: Required for study visualization.

# To check the format of your file use the script verify_portal_file.py
# Requires python 3.x
python verify_portal_file.py --expression expression_matrix_example.txt

Gene lists

Purpose: Panels of genes are often important results of an analysis. They can be derived by many methods, including differential expression, enrichment analysis, or expert knowledge. Multiple gene lists can be uploaded, which allows others to explore the expression pattern of the genes in each list. Parts of the portal visualize these panels as grouped expression within clusters or subclusters of cells (e.g. using boxplots). The per-cluster expression value of each gene is under the control of the author and is supplied in this file, because different authors may wish to use different methods to define a measurement that works like an average expression of the gene in each cluster.

Format: This is a tab delimited file of at least 2 columns. The first column contains the gene names. The next columns are the measurement (like an average expression) of the genes in a cluster or subcluster (the details of this summary measurement are up to the author of the study). There can be many columns representing different clusters or subclusters.
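As one possible way to produce such a file, the sketch below uses the arithmetic mean per cluster as the summary measurement. This is only one choice among many, since the measurement is left to the study author. The function name `gene_list_rows`, the `GENE` header label, and the three-decimal formatting are all assumptions, and the sketch expects every cell in the expression data to have a cluster assignment.

```python
from collections import defaultdict
from statistics import mean


def gene_list_rows(expression, cell_to_cluster, genes):
    """Build tab-delimited gene list rows: gene name, then one mean per cluster.

    expression:      {gene: {cell: value}}
    cell_to_cluster: {cell: cluster name}
    genes:           the panel of genes to include, in output order
    """
    clusters = sorted(set(cell_to_cluster.values()))
    lines = ["\t".join(["GENE"] + clusters)]  # header row is an assumption
    for gene in genes:
        by_cluster = defaultdict(list)
        for cell, value in expression[gene].items():
            by_cluster[cell_to_cluster[cell]].append(value)
        means = [f"{mean(by_cluster[c]):.3f}" for c in clusters]
        lines.append("\t".join([gene] + means))
    return lines
```

Swapping `mean` for a median or another summary changes only one line, which matches the intent that the measurement is up to the author.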

Example Marker Gene Lists File

Note: Required for study visualization.

Other

Purpose: There is an option to upload an "Other" type file. Please use this to upload files of types not currently supported that you find important for communicating or documenting the findings of your study. Please make sure the description of the file is clear so other scientists can interpret the file correctly.

Note: Not required for study visualization.
