README for directory "*snp_sites"
by David Brown (db) 20221206
Files created by, or processed from the outputs of, the 'snp-sites' tool available at https://github.com/sanger-pathogens/snp-sites
***
Directory Contents:
- execute_snp_sites.slurm == SLURM script to call "run_snpsites.py"
- ns_prot_mutS.fasta == FASTA protein alignment containing only SNPs (see the 'snp-sites' documentation), generated from "subset_900_ns_protein_mutS.faa" below
- ns_prot_mutS.forMesquite.tsv == tab-separated matrix of characters and states formatted (hopefully) for Mesquite (rows are sequences/end nodes, columns are characters)
- ns_prot_mutS.original.tsv == tab-separated matrix of characters and states formatted for human eyes (same as above, but including column names)
-- columns are (in order from left to right):
* "asm_acc" == sequence identifier (no underscores)
* "seq_str" == sequence (amino acid) as a string (generated by 'snp-sites' from a longer set of aligned input sequences)
* "snp_grp" == SNP pattern group identifier (usable either as a Mesquite categorical character or as an integer identifier)
* "snp_var" == number of positions in each sequence that differ from the most commonly observed residue (amino acid) at that position, summed over all positions of the sequence (continuous/count character)
* "pos_###" == all remaining columns before the final column; each represents one position (in the original alignment processed by 'snp-sites') and records the binary presence/absence of the residue most commonly observed at that position
* "MDR_bin" == final column; binary presence/absence of a multidrug resistance (MDR) genotype, defined as resistance to 3 or more classes of drugs tested by CDC NARMS according to CLSI standards
- ns_prot_mutS.original.aa_sequence.tsv == tab-separated matrix of ONLY the amino acid sequence formatted for human eyes (subset from original above, but including column names)
- ns_prot_mutS.original.binary_chars.tsv == tab-separated matrix of ONLY binary characters and states formatted for human eyes (subset from original above, but including column names)
- ns_prot_mutS.original.categorical_chars.tsv == tab-separated matrix of ONLY categorical characters and states formatted for human eyes (subset from original above, but including column names)
- ns_prot_mutS.original.continuous_chars.tsv == tab-separated matrix of ONLY continuous characters and states formatted for human eyes (subset from original above, but including column names)
- ns_prot_mutS.phy == relaxed Phylip format protein alignment containing only SNPs (see the 'snp-sites' documentation)
- ns_prot_mutS.SNPmatrix.ftr == feather format SNP matrix
- ns_prot_mutS.SNPmatrix.tsv == tab-separated SNP matrix
- ns_prot_mutS.vcf == variant call format (VCF) file of SNP data (see the 'snp-sites' documentation)
- run_snpsites.py == script that calls 'snp-sites', processes its outputs for Mesquite, and creates the SNP matrices
- snpsites_3348834.out == outputs from the SLURM job
- subset_900_no_split_core_and_soft_core.ftr == feather format pandas dataframe containing gene presence/absence information for the "core" and "soft core" segments of the pangenome calculated by Roary (https://sanger-pathogens.github.io/Roary/)
- subset_900_ns_protein_mutS.faa == FASTA file of aligned protein translations from loci annotated with the "mutS" gene name keyword on the original nucleotide sequences; used as input for "run_snpsites.py"
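
Illustration of the "snp_grp", "snp_var" and "pos_###" columns described above. This is a minimal sketch, not the code in "run_snpsites.py". The function names are hypothetical, ties for the most common residue are broken by first appearance (an assumption), and gap or unknown characters are treated like any other character (also an assumption).

```python
from collections import Counter


def consensus(seqs):
    """Return the most commonly observed character at each alignment column.

    Ties go to the character seen first in that column (assumption).
    """
    return [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]


def binary_chars(seq, cons):
    """1 where the sequence matches the most common character, else 0 (pos_###)."""
    return [1 if a == b else 0 for a, b in zip(seq, cons)]


def snp_var(seq, cons):
    """Count of positions that differ from the most common character (snp_var)."""
    return sum(a != b for a, b in zip(seq, cons))


def snp_groups(seqs):
    """Give identical SNP patterns the same integer identifier (snp_grp)."""
    ids = {}
    return [ids.setdefault(s, len(ids) + 1) for s in seqs]


# Toy SNP-only alignment (made-up data, two variable positions).
toy = ["AT", "AS", "GT", "AT"]
cons = consensus(toy)
for seq, grp in zip(toy, snp_groups(toy)):
    print(seq, grp, snp_var(seq, cons), binary_chars(seq, cons))
```

In this toy data the first and last sequences share one pattern and therefore one group identifier, and each of the two middle sequences differs from the most common residues at exactly one position.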
***
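Sketch of how the three 'snp-sites' outputs listed above (FASTA, VCF, relaxed Phylip) could be requested from Python. This is an assumption about the general approach, not a copy of "run_snpsites.py". The helper name "snp_sites_command" is hypothetical. The flags used (-m, -v, -p, -o) are the ones given in the 'snp-sites' usage text; check them against the installed version before running.

```python
import shlex

# Output-format flags from the 'snp-sites' usage text.
FORMAT_FLAGS = {"fasta": "-m", "vcf": "-v", "phylip": "-p"}


def snp_sites_command(input_aln, output_path, fmt="fasta"):
    """Build (but do not run) a 'snp-sites' command for one output format."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unknown format: {fmt!r}")
    return ["snp-sites", FORMAT_FLAGS[fmt], "-o", output_path, input_aln]


# File names are taken from the directory listing in this README.
jobs = [
    ("fasta", "ns_prot_mutS.fasta"),
    ("vcf", "ns_prot_mutS.vcf"),
    ("phylip", "ns_prot_mutS.phy"),
]
for fmt, out in jobs:
    cmd = snp_sites_command("subset_900_ns_protein_mutS.faa", out, fmt)
    print(shlex.join(cmd))
    # To actually run it (requires 'snp-sites' on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list and printing it first makes it easy to inspect what will be run inside the SLURM job before anything is executed.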