Variant Storage Engine
- Study oriented
- Cohort definition
Different VCF types:
- Aggregated VCFs: variant files with no sample-specific values, only aggregated data.
- Merged VCFs: variant files containing a batch of samples, with sample-specific data.
- gVCFs: single-sample files with information for all positions.
Split into steps:
- Transform
- Load
- Annotate
- Calculate Stats
- Validation
- Variant Normalization
- Variant Merging (plugin dependent)
Variants are annotated using the CellBase annotator. Other annotators, such as VEP, can also be used. The variant annotation model is defined in the Biodata project, in variantAnnotation.avdl.
The VariantAnnotation model includes a field for adding extra annotation attributes. This field is intended to contain custom annotation provided by the end user.
Additional attributes can be grouped by source. Each source will contain a set of key-value attributes creating this structure:
VariantAnnotation = {
    // ...
    "additionalAttributes" : {
        "<source1>" : {
            "attribute" : {
                "<key1>":"<value>",
                "<key2>":"<value>",
                "<key3>":"<value>"
            }
        },
        "<source2>" : {
            "attribute" : {
                "<key1>":"<value>",
                "<key2>":"<value>",
                "<key3>":"<value>"
            }
        }
    }
}
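As a rough illustration of this structure, the sketch below models additionalAttributes as nested Python dictionaries and reads one value back. The dictionary contents and the helper name `get_custom_attribute` are hypothetical and exist only for this example; they are not part of OpenCGA or Biodata.

```python
from typing import Optional

# Plain-dict stand-in for the additionalAttributes field of VariantAnnotation.
# Layout: source name -> "attribute" -> key/value pairs (all values are strings).
additional_attributes = {
    "source1": {"attribute": {"key1": "value1", "key2": "value2"}},
    "source2": {"attribute": {"key1": "other"}},
}


def get_custom_attribute(attrs: dict, source: str, key: str) -> Optional[str]:
    """Return the value stored under source/key, or None if either is missing."""
    return attrs.get(source, {}).get("attribute", {}).get(key)


print(get_custom_attribute(additional_attributes, "source1", "key2"))  # value2
print(get_custom_attribute(additional_attributes, "source3", "key1"))  # None
```

Grouping by source means two annotation files can use the same key without overwriting each other.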
OpenCGA Storage can load this custom annotation from three file formats: GFF, BED and VCF. When loading new annotation data, the user must provide a name for the custom annotation. Because these formats are structured differently, the information loaded from each one differs.
GFF and BED files describe features within a region, providing a chromosome, start and end. All the variants between the start and end will be annotated with the information.
- GFF: From this file format, only the third column, which contains the feature type, is extracted and loaded with the key "feature".
The following GFF line generates these additionalAttributes:
chr22 TeleGene enhancer 16053659 16063659 500 + . touch1
VariantAnnotation = {
    // ...
    "additionalAttributes" : {
        "myGff" : {
            "attribute" : {
                "feature":"enhancer"
            }
        }
    }
}
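The GFF case can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not the OpenCGA loader: the function names are hypothetical, and treating the region as inclusive of both start and end is an assumption of this sketch.

```python
def split_columns(line: str) -> list:
    """Split on tabs when present, otherwise on any whitespace."""
    line = line.rstrip("\n")
    return line.split("\t") if "\t" in line else line.split()


def gff_to_annotation(gff_line: str, name: str):
    """Return (chromosome, start, end, additionalAttributes) for one GFF line."""
    cols = split_columns(gff_line)
    chrom, feature = cols[0], cols[2]        # 1st column: sequence, 3rd: feature
    start, end = int(cols[3]), int(cols[4])  # 4th and 5th columns: region bounds
    return chrom, start, end, {name: {"attribute": {"feature": feature}}}


def in_region(chrom: str, pos: int, region_chrom: str, start: int, end: int) -> bool:
    """Assumption of this sketch: both region bounds are inclusive."""
    return chrom == region_chrom and start <= pos <= end


line = "chr22 TeleGene enhancer 16053659 16063659 500 + . touch1"
chrom, start, end, attrs = gff_to_annotation(line, "myGff")
print(attrs)
print(in_region("chr22", 16060000, chrom, start, end))
```

Every variant for which `in_region` is true would receive the same "myGff" entry.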
- BED: From the BED format, the columns name (4th), score (5th) and strand (6th) are loaded.
The following BED line generates these additionalAttributes:
chr22 16053659 16063659 Pos1 353 + 127471196 127472363 255,0,0 0 A A
VariantAnnotation = {
    // ...
    "additionalAttributes" : {
        "myBed" : {
            "attribute" : {
                "name":"Pos1",
                "score":"353",
                "strand":"+"
            }
        }
    }
}
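A minimal Python sketch of the BED mapping follows. The function name is hypothetical, and silently skipping optional columns that are absent from a short BED line is an assumption of this example rather than documented OpenCGA behaviour.

```python
def bed_to_attributes(bed_line: str, name: str) -> dict:
    """Map BED columns 4-6 (name, score, strand) to additionalAttributes."""
    line = bed_line.rstrip("\n")
    cols = line.split("\t") if "\t" in line else line.split()
    keys = ("name", "score", "strand")
    # zip stops at the shorter sequence, so missing optional columns are skipped.
    attribute = dict(zip(keys, cols[3:6]))
    return {name: {"attribute": attribute}}


line = "chr22 16053659 16063659 Pos1 353 + 127471196 127472363 255,0,0 0 A A"
print(bed_to_attributes(line, "myBed"))
```

Values are kept as strings, matching the "score":"353" entry in the example above.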
- VCF: This format is not region based, so each line modifies a single variant. All entries in the INFO column are loaded as additional attributes.
The following VCF line generates these additionalAttributes:
chr22 16050075 . A G 100 PASS FEATURE=specific;SCORE=300;STRAND=+
VariantAnnotation = {
    // ...
    "additionalAttributes" : {
        "myVcf" : {
            "attribute" : {
                "FEATURE":"specific",
                "SCORE":"300",
                "STRAND":"+"
            }
        }
    }
}
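The INFO-to-attributes mapping can be sketched in Python as shown below. The function name is hypothetical, the example line uses the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), and storing value-less INFO flags as the string "true" is an assumption of this sketch, not documented OpenCGA behaviour.

```python
def vcf_info_to_attributes(vcf_line: str, name: str) -> dict:
    """Turn the INFO column (8th) of one VCF data line into additionalAttributes."""
    line = vcf_line.rstrip("\n")
    cols = line.split("\t") if "\t" in line else line.split()
    info = cols[7]
    attribute = {}
    if info != ".":  # "." marks an empty INFO column
        for entry in info.split(";"):
            if not entry:
                continue
            key, sep, value = entry.partition("=")
            # Assumption: flags without "=" are stored as the string "true".
            attribute[key] = value if sep else "true"
    return {name: {"attribute": attribute}}


line = "chr22\t16050075\t.\tA\tG\t100\tPASS\tFEATURE=specific;SCORE=300;STRAND=+"
print(vcf_info_to_attributes(line, "myVcf"))
```

Unlike the GFF and BED cases, the keys here come from the file itself, so they keep whatever case the INFO column uses.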
- Example with multiple sources: When custom annotations are loaded from more than one source, each source appears as a separate entry in the additionalAttributes field:
VariantAnnotation = {
    // ...
    "additionalAttributes" : {
        "myVcf" : {
            "attribute" : {
                "FEATURE":"specific",
                "SCORE":"300",
                "STRAND":"+"
            }
        },
        "myBed" : {
            "attribute" : {
                "name":"Pos1",
                "score":"353",
                "strand":"+"
            }
        }
    }
}
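Combining sources amounts to adding one more named entry to the same map. The sketch below shows this with plain Python dictionaries. The helper `add_source` is hypothetical, and replacing an existing source that has the same name is a choice made for this example only.

```python
def add_source(additional_attributes: dict, name: str, attribute: dict) -> dict:
    """Return a copy of additional_attributes with one more named source."""
    merged = dict(additional_attributes)  # leave the caller's dict untouched
    # Choice made for this sketch: a source with the same name is replaced.
    merged[name] = {"attribute": dict(attribute)}
    return merged


annotation = {}
annotation = add_source(
    annotation, "myVcf", {"FEATURE": "specific", "SCORE": "300", "STRAND": "+"}
)
annotation = add_source(
    annotation, "myBed", {"name": "Pos1", "score": "353", "strand": "+"}
)
print(sorted(annotation))  # ['myBed', 'myVcf']
```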
🚧
- Variant stats (cohorts)
- Global stats
- Sample stats (pending)
Once variants have been loaded, they can be queried to obtain filtered results. This can be done using any of the available clients (Java, Python, JavaScript, R, ...). Read more about the available filters at Querying Variant Data.
OpenCGA is an open source project and it is freely available.