Indexing Variant Data v0.6.0
VCF files can be indexed either through the pipeline implemented in the CLI or through the Java API. The aim of indexing is to allow queries over the indexed data. Indexing happens in two consecutive steps: transformation and load. During transformation, the VCF data is normalized and converted into an internal variant data model (see Data Models). During load, the normalized and validated file is loaded into the active storage engine plugin. For more information about the indexing process, see OpenCGA Storage Engine.
For this tutorial we use sample VCF data from the 1000 Genomes Project. You can use any other file, but all the examples below use the VCF file ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.
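Before indexing, it can be worth checking that the input really is a gzip-compressed VCF. The following is a minimal sketch and not part of OpenCGA: `check_vcf` is a hypothetical helper, and the demo builds a tiny throwaway file rather than using the real 1000 Genomes data. It relies only on the fact that a VCF file starts with a `##fileformat=` header line.

```shell
#!/bin/sh
# check_vcf FILE (hypothetical helper): succeed only if FILE is a readable,
# valid gzip archive whose first line is a VCF "##fileformat=" header.
check_vcf() {
    [ -r "$1" ] || { echo "error: cannot read $1" >&2; return 1; }
    gzip -t "$1" 2>/dev/null || { echo "error: $1 is not valid gzip" >&2; return 1; }
    first_line="$(gzip -dc "$1" | head -n 1)"
    case "$first_line" in
        "##fileformat=VCF"*) return 0 ;;
        *) echo "error: $1 has no VCF header" >&2; return 1 ;;
    esac
}

# Demo on a tiny throwaway file (not real data).
tmpdir="$(mktemp -d)"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' \
    | gzip -c > "$tmpdir/tiny.vcf.gz"

if check_vcf "$tmpdir/tiny.vcf.gz"; then
    vcf_ok=yes
else
    vcf_ok=no
fi
echo "tiny.vcf.gz passes the check: $vcf_ok"
rm -rf "$tmpdir"
```

To check the tutorial file, you would call `check_vcf ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz` from the directory that contains it.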
OpenCGA provides two ways of indexing VCF data:
- OpenCGA Analysis: this is a high level CLI that uses OpenCGA Catalog.
- OpenCGA Storage: this is a low level CLI completely independent of OpenCGA Catalog. You should use this one if you already have a metadata server and only need the Storage indexing capabilities.
The complete description of the OpenCGA command line interface is at Command Line; this is just a quick start example using the high level CLI. First of all, you need a user account. The user information is stored only in your Catalog. For instance:
./opencga.sh users create -u myuser -p mypass -e my@e.mail -n "my name"
Then you can organize your data into several projects, with several studies in each project. Here we will have only one of each:
./opencga.sh projects create -a myproject -n "Default project" -d "First project created." -u myuser -p mypass
./opencga.sh studies create -a mystudy --project-id myuser@myproject -n "Default study" -d "First study created." -u myuser -p mypass
Now we can do the actual indexing. First, we have to tell Catalog where the file we want to operate on is located:
./opencga.sh files create -u myuser -p mypass -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --study-id myuser@myproject/mystudy --bioformat VARIANT
Then request the indexing. If you have properly configured your storage engine (currently MongoDB or HBase) and want to run the transformation and load without annotation, you can run directly:
./opencga.sh files index -u myuser -p mypass -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -Dannotate=false -- --include-genotypes --compress-genotypes
If you also want to annotate the variants and compute statistics about the genotypes, run instead:
./opencga.sh files index -u myuser -p mypass -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -- --include-genotypes --compress-genotypes --calculate-stats
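The Catalog identifiers used in the commands above appear to follow a `user@project/study/file` pattern. As an illustration only, the sketch below composes them from shell variables so that they are written once and reused. The variable names are hypothetical, and the pattern is inferred from the example commands rather than from a formal specification.

```shell
#!/bin/sh
# Hypothetical values; replace them with your own user, project and study aliases.
user="myuser"
project="myproject"
study="mystudy"
vcf="ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz"

# Identifier patterns as they appear in the commands of this tutorial
# (inferred from the examples, not from a formal specification).
project_id="${user}@${project}"
study_id="${project_id}/${study}"
file_id="${study_id}/$(basename "$vcf")"

echo "project: $project_id"
echo "study:   $study_id"
echo "file:    $file_id"
```

The resulting `$study_id` and `$file_id` values match the arguments passed to `--study-id` and `-id` in the commands above.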
Now your VCF should be indexed and ready to receive queries. You can query it with:
./opencga.sh files info -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -u myuser -p mypass --output-format IDS > myfileid.txt
./opencga-storage.sh fetch-variants -r 20:16050050-16050100 -a $(cat myfileid.txt) --database opencga_myuser_myproject
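The query above depends on `myfileid.txt` holding a file ID. If the `files info` call fails, that file may be empty and `$(cat myfileid.txt)` expands to nothing. A small guard can catch this before the query runs. This is only a sketch: `read_file_id` is a hypothetical helper, and the demo uses a temporary file with a made-up ID instead of calling OpenCGA.

```shell
#!/bin/sh
# read_file_id FILE (hypothetical helper): print the first line of FILE,
# or fail with an error message if FILE is missing or empty.
read_file_id() {
    [ -s "$1" ] || { echo "error: $1 is missing or empty" >&2; return 1; }
    head -n 1 "$1"
}

# Demo with a temporary file holding a made-up id (not a real Catalog id).
tmp="$(mktemp)"
echo "42" > "$tmp"
file_id="$(read_file_id "$tmp")"
echo "fetch-variants would be called with -a $file_id"
rm -f "$tmp"
```

In the real workflow you would call `read_file_id myfileid.txt` and only run `fetch-variants` if it succeeds.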
We are still deciding which options should be the defaults, so that the CLI can be used as intuitively as possible.
With the OpenCGA Storage CLI, a VCF can be indexed in one or two steps, depending on whether you want to delay the database load.
A simple indexing run looks like the following command. Note that at this level you must manage your own IDs. We will use study ID 1 and file ID 2, for instance:
./opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --database chr22_test_db
If your dataset is big and you want to proceed in smaller steps, it is recommended to split the ETL process in two:
./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --transform
./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.variants.json.gz --database chr22_test_db --load
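The two commands above can be chained so that the load runs only if the transformation succeeded. The wrapper below is a minimal sketch under a few assumptions: the `run` function and the `DRY_RUN` variable are hypothetical conveniences, and the transformed file name is taken from the load command above (the input name plus `.variants.json.gz`). By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Two-step indexing sketch: transform first, then load.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"
STORAGE_CLI="./bin/opencga-storage.sh"   # path as used in this tutorial
vcf="ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz"
database="chr22_test_db"

# Name of the transform output, as it appears in the load command above.
transformed="${vcf}.variants.json.gz"

# run CMD ARGS... (hypothetical helper): echo the command, and execute it
# only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# '&&' makes the load step run only if the transform step exited successfully.
run "$STORAGE_CLI" index-variants --studyId 1 --file-id 2 -i "$vcf" --transform &&
run "$STORAGE_CLI" index-variants --studyId 1 --file-id 2 -i "$transformed" \
    --database "$database" --load
```

Keeping the steps separate lets you rerun only the load if the database step fails, without repeating the transformation.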
OpenCGA is an open source project and it is freely available.