# config.asv.double_bc.yaml · 1274 lines (1218 loc) · 86.2 KB
################################################################################
# CONFIGURATION FILE #
#------------------------------------------------------------------------------#
# Configuration file for the CASCABEL pipeline. #
# Set the parameters below, save the file and run Snakemake. #
# The file format is yaml (http://www.yaml.org/). In this file, you specify #
# your input data, barcode mapping file and you can choose tools and parameters#
# according to your needs. Most rules and parameters have default settings. #
# It is very important to keep the indentation of the file (don’t change the #
# tabs and spaces), as well as the name of the parameters/variables. But you #
# can of course change the values of the parameters to deviate from the default#
# settings. Any text after a hashtag (#) is considered a comment and will be #
# ignored by Snakemake. #
# #
# @Author: Julia Engelmann and Alejandro Abdala #
# @Last update: 13/12/2022 #
################################################################################
################################################################################
# GENERAL PARAMETERS SECTION #
#------------------------------------------------------------------------------#
# The general parameters section defines parameters that are global or general #
# for the complete workflow. #
################################################################################
#------------------------------------------------------------------------------#
# Execution mode #
#------------------------------------------------------------------------------#
# This parameter allows the user to inspect intermediate files in order to #
# finetune some downstream analyses, re-do previous steps or exit the workflow.#
# #
#----------------------------- PARAMS -----------------------------#
# #
# -interactive Set this flag to "T" (default) in order to interact at some #
# specific steps with the pipeline. "F" will try to run all the #
# pipeline without communicating intermediate results until the #
# report. #
# For a list of all the interactive checkpoints take a look at the following #
# link: https://github.com/AlejandroAb/CASCABEL/wiki#5-interactive-mode #
#------------------------------------------------------------------------------#
interactive : "T"
#------------------------------------------------------------------------------#
# Project Name #
#------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter with the init_sample.sh script #
# (if used for multiple libraries). #
#------------------------------------------------------------------------------#
PROJECT: ""
#------------------------------------------------------------------------------#
# LIBRARIES/SAMPLES #
#------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis. #
# Use the same library names as with the init_sample.sh script. #
# Include each library name surrounded by quotes, and comma separated. #
# e.g. LIBRARY: ["LIB_1","LIB_2",..."LIB_N"] #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples #
# must have the same configuration; use: #
# "PE" for paired-end reads [Default]. #
# "SE" for single-end reads. #
#------------------------------------------------------------------------------#
LIBRARY: [""]
LIBRARY_LAYOUT: "PE"
#------------------------------------------------------------------------------#
# RUN #
#------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces. #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally). #
# The RUN parameter can be set here or remain empty, in the latter case, the #
# user must assign this value via the command line. #
# i.e: --config RUN=run_name #
#------------------------------------------------------------------------------#
RUN: ""
#------------------------------------------------------------------------------#
# Description #
#------------------------------------------------------------------------------#
# Brief description of the run. Any description written here will be included #
# in the final report. This field is not mandatory so it can remain empty. #
#------------------------------------------------------------------------------#
description: ""
#------------------------------------------------------------------------------#
# INPUT TYPE #
#------------------------------------------------------------------------------#
# Cascabel supports two types of input files, fastq and gzipped fastq files. #
# This parameter can take the values "T" if the input files are gzipped #
# (only the reads; the metadata file always needs to be uncompressed) or "F" #
# if the input files are regular fastq files. #
#------------------------------------------------------------------------------#
gzip_input: "F"
#------------------------------------------------------------------------------#
# INTERMEDIATE FILES #
#------------------------------------------------------------------------------#
# By default CASCABEL will delete most of the generated intermediate files. #
# Set this flag to true ("T") in order to keep all the intermediate files. #
# Default: "F" remove the files! #
#------------------------------------------------------------------------------#
KEEP_TMP: "F"
#------------------------------------------------------------------------------#
# INPUT FILES #
#------------------------------------------------------------------------------#
# To run Cascabel for multiple libraries you can provide an input file, tab #
# separated with the following columns: #
# - Library: Name of the library (this has to match the values entered #
# in the LIBRARY variable described above). #
# - Forward reads: Full path to the forward reads. #
# - Reverse reads: Full path to the reverse reads (only for paired-end). #
# - metadata: Full path to the file with the information for #
# demultiplexing the samples (only if needed). #
# The full path of this file should be supplied in the input_files variable, #
# otherwise, you have to enter the FULL PATH for both: the raw reads and the #
# metadata file (barcode mapping file). The metadata file is only needed if #
# you want to perform demultiplexing. #
# If you want to avoid the creation of this file, a third solution is available#
# using the script init_sample.sh. More info at the project Wiki: #
# https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - fw_reads: Full path to the raw reads in forward direction (R1) #
# - rv_reads: Full path to the raw reads in reverse direction (R2) #
# - metadata: Full path to the metadata file with barcodes for each sample #
# to perform library demultiplexing #
# - input_files: Full path to a file with the information for the library(s) #
# #
# ** Please supply only one of the following: #
# - fw_reads, rv_reads and metadata #
# - input_files #
# - or use init_sample.sh script directly #
#------------------------------------------------------------------------------#
fw_reads: ""
rv_reads: ""
metadata: ""
#OR
input_files: ""
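#------------------------------------------------------------------------------#
# As an illustration only (hypothetical library names and paths), an #
# input_files file for two paired-end libraries could look like this, with #
# one tab-separated line per library: #
# #
# LIB_1 /path/to/LIB_1_R1.fastq /path/to/LIB_1_R2.fastq /path/to/LIB_1_map.txt #
# LIB_2 /path/to/LIB_2_R1.fastq /path/to/LIB_2_R2.fastq /path/to/LIB_2_map.txt #
# #
# (columns are separated by tabs; omit the reverse-reads column for "SE" #
# layouts and the metadata column if no demultiplexing is needed) #
#------------------------------------------------------------------------------#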
#------------------------------------------------------------------------------#
# ASV_WF: Binned qualities and Big data workflow #
#------------------------------------------------------------------------------#
# For fastq files with binned qualities (e.g. NovaSeq and NextSeq) the error #
# learning process within dada2 can be affected, and some data scientists #
# suggest that enforcing monotonicity could be beneficial for the analysis. #
# In this section, you can modify key parameters to enforce monotonicity and #
# also go through a big data workflow when the number of reads may exceed the #
# physical memory limit. #
# More on binned qualities: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
# You can also follow this excellent thread about binned qualities and Dada2: https://forum.qiime2.org/t/novaseq-and-dada2-incompatibility/25865/8
#------------------------------------------------------------------------------#
binned_q_scores: "F" #Binned quality scores. Set this to "T" if your data comes from e.g. a NextSeq sequencing machine.
big_data_wf: "F" #Set to true when your sequencing run contains more than 10^9 reads (depends on RAM availability!)
################################################################################
# REPORT PARAMETER SECTION #
#------------------------------------------------------------------------------#
# This section defines parameters that will influence the type of report to be #
# generated at the end of the workflow. #
################################################################################
#------------------------------------------------------------------------------#
# PDF Report #
#------------------------------------------------------------------------------#
# By default, Cascabel creates the final report in HTML format. In order to #
# create the report also as pdf file, set this flag to "T". #
# Important! in order to convert the file to pdf format, it is necessary to #
# execute the pipeline within an Xserver session, e.g. MobaXterm or ssh -X. #
# To check whether your active session uses an Xserver, run 'echo $DISPLAY' #
# in a command line terminal. If this returns nothing, you do not have an #
# Xserver session. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - pdfReport "T" to generate the pdf report or "F" to skip it. #
# - wkhtmltopdf_command Command (with or without path) to convert html to pdf.#
#------------------------------------------------------------------------------#
pdfReport: "T"
wkhtmltopdf_command: "wkhtmltopdf -T 10mm -B 30mm"
#------------------------------------------------------------------------------#
# Portable Report #
#------------------------------------------------------------------------------#
# Cascabel creates the final report in HTML format, containing references to #
# other images or links. Therefore just copying the HTML files for sharing or #
# inspecting the results will break these links. #
# By setting 'portableReport' to true "T", CASCABEL will generate a zip file #
# with all the resources necessary to share and distribute CASCABEL's report. #
#------------------------------------------------------------------------------#
portableReport: "T"
#------------------------------------------------------------------------------#
# Krona Report #
#------------------------------------------------------------------------------#
# Krona allows hierarchical data to be explored with zooming, multi-layered pie#
# charts. The interactive charts are self-contained and can be viewed with any #
# modern web browser. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - report Indicate with "T"/"F" if CASCABEL should generate a Krona #
# chart. #
# - command Command to invoke the krona utility [default: "ktImportText"]. #
# - samples Indicate the samples to be included in the chart, use comma #
# separated values of samples (same name as the ones supplied #
# in the metadata barcode file). Or "all" to include all the #
# samples. #
# - otu_table Target OTU table for the report. Use "default" for the #
# filtered OTU table (singletons excluded), or "singletons" #
# for the unfiltered OTU table (singletons included). #
# - extra_params Any other extra parameter from ktImportText tool. default #
# "-n root_extra" #
#------------------------------------------------------------------------------#
krona:
report: "T"
command: "ktImportText"
samples: "all"
otu_table: "default"
extra_params: "-n root_extra"
################################################################################
# Specific Parameters Section #
#------------------------------------------------------------------------------#
# In this section of the configuration file, you can find all the parameters #
# used to run the different rules during the execution of the pipeline. #
# Some of the entries below contain a parameter called "extra_params". #
# This parameter is designed to allow the user to pass any other extra #
# parameter to the program invoked by the rule, as some rules do not list all #
# the parameters of the underlying tool explicitly. In these cases, the user #
# can specify any other parameter using "extra_params". #
# IMPORTANT NOTE: #
# After defining the type of analysis, in the header of the comments for each #
# set of parameters, you will see a prefix indicating if the parameters/options#
# apply to the OTU workflow, the ASV workflow or both. For more information on #
# the types of analysis, please refer to the next section "ANALYSIS TYPE". #
################################################################################
#------------------------------------------------------------------------------#
# ANALYSIS TYPE #
# rules: #
#------------------------------------------------------------------------------#
# Cascabel supports two main types of analysis: #
# 1) Analysis based on traditional OTUs (Operational Taxonomic Units) which #
# are mostly generated by clustering sequences based on a shared #
# similarity threshold. #
# 2) Analysis based on ASVs (Amplicon sequence variants). This kind of #
# analysis tries to distinguish errors in the sequence reads from true #
# sequence variants, down to the level of single-nucleotide differences. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - ANALYSIS_TYPE "OTU" or "ASV". Defines the type of analysis. #
#------------------------------------------------------------------------------#
ANALYSIS_TYPE: "ASV"
#------------------------------------------------------------------------------#
# BOTH_WF: UNPAIRED DATA WORK FLOW #
#------------------------------------------------------------------------------#
# A regular workflow for marker gene analysis using paired-end data, #
# comprehends the merging of forward and reverse reads, prior to continuing #
# with downstream analysis implemented within this pipeline. #
# However, primers can intentionally amplify fragments which are so large that #
# forward and reverse reads do not overlap. A regular analysis would discard #
# all those "unpaired" reads during the FW and RV read assembly. For this #
# scenario, Cascabel implements an alternative flow, where instead of #
# continuing with assembled reads, un-assembled reads are concatenated together#
# with a degenerate base 'N' between the FW read and the reverse complemented #
# RV read (which does not significantly influence k-mer based classification #
# methods such as RDP). #
#----------------------------- PARAMS -----------------------------#
# #
# #
# - UNPAIRED_DATA_PIPELINE "T" or "F". True to work with the "un-assembled" #
# reads. If ANALYSIS_TYPE = "ASV" this option uses #
# the "justConcatenate" option from the #
# mergePairs() function from the dada2 package. #
# - CHAR_TO_PAIR In-silico base used to pair both fragments. #
# Valid values A|T|G|C|N. You can use more than one#
# base, e.g. "GGGGG" pair with 5 Gs. #
# This parameter only applies for the OTU-WF. For #
# the ASV-WF, dada2 merges both reads with 10 Ns. #
# - QUALITY_CHAR The forward and reverse reads are paired #
# in-silico in a fastq file, thus the quality for #
# the pairing bases must be provided. If more than #
# one CHAR_TO_PAIR is used, you only need to #
# specify one and only one QUALITY_CHAR. #
# This parameter only applies for the OTU-WF. For #
# the ASV-WF, dada2 merges both reads with 10 Ns. #
#------------------------------------------------------------------------------#
UNPAIRED_DATA_PIPELINE: "F"
CHAR_TO_PAIR: "T"
QUALITY_CHAR: "G"
#------------------------------------------------------------------------------#
# BOTH_WF: Quality Control tool #
#------------------------------------------------------------------------------#
# Select the tool for quality control analysis. #
# Valid options are: FASTQC, SEQUALI, BOTH #
# If you wish to get warnings on interactive behavior, the FASTQC or BOTH #
# options should be used. #
#------------------------------------------------------------------------------#
QC: "sequali"
#------------------------------------------------------------------------------#
# BOTH_WF: Quality control with Sequali #
# rules: sequali #
#------------------------------------------------------------------------------#
# Sequali (https://github.com/rhpvorderman/sequali). #
# Sequence quality metrics for FASTQ and uBAM files. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - threads Number of threads to run sequali. #
# - extra_params Extra parameters. See sequali --help for more info. #
#------------------------------------------------------------------------------#
sequali:
threads: 10
extra_params: ""
#------------------------------------------------------------------------------#
# BOTH_WF: Quality control with FastQC #
# rules: fast_qc, validateQC #
#------------------------------------------------------------------------------#
# FastQC evaluates 12 main concepts on the sequences: basic statistics, per #
# base sequence quality, per tile sequence quality, per sequence quality #
# scores, per base sequence content, per sequence GC content, per base N #
# content, sequence length distribution, sequence duplication levels, #
# overrepresented sequences, adapter content and Kmer content. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - command Command needed to invoke fastqc [default: "fastqc"]. #
# - extra_params Extra parameters. See fastqc --help for more info. #
# - threads Number of threads to run FastQC. #
# - qcLimit Set the maximum number of FastQC FAILS (from the 12 tests #
# evaluated) accepted before interrupting the workflow if the #
# 'interactive' mode is equal to "T". #
#------------------------------------------------------------------------------#
fastQC:
command: "fastqc"
extra_params: ""
threads: 10
qcLimit : 3
#------------------------------------------------------------------------------#
# BOTH_WF: Assemble fragments (merge forward with reverse reads) #
# rule: pear #
#------------------------------------------------------------------------------#
# This step is performed to merge paired reads. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - t Minimum length of reads after trimming low quality bases. #
# - v Minimum overlap size. The minimum overlap may be set to 1 #
# when the statistical test is used. (default in pear: 10) #
# - j Number of threads to use. #
# - p The p-value cutoff used in the statistical test. Valid #
# options are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 #
# disables the test. (default: 0.01) #
# - extra_params Extra parameters. See pear --help for more info. #
# - prcpear The minimum percentage of expected paired reads; if the #
# actual percentage is lower and the 'interactive' parameter #
# is set to "T", a warning message will be shown. #
#------------------------------------------------------------------------------#
pear:
command: "pear"
t: 100
v: 10
j: 6
p: 0.05
extra_params: ""
prcpear: 90
#------------------------------------------------------------------------------#
# BOTH_WF: FastQC on merged/assembled fragments #
# rule: fastQCPear #
#------------------------------------------------------------------------------#
# Once the paired-end reads have been merged into one fragment, run FastQC #
# again to check their quality. Set this option to "T" (true) or "F" (false) #
# in order to execute or skip this step. #
#------------------------------------------------------------------------------#
fastQCPear: "T"
#------------------------------------------------------------------------------#
# BOTH_WF: QIIME #
# rule: bc_mapping_validation, extract_barcodes, extract_barcodes_unassigned, #
# split_libraries, split_libraries_rc, search_chimera, cluster_OTUs, #
# pick_representatives, assign_taxonomy, make_otu_table, summarize_taxa, #
# filter_rep_seqs, align_rep_seqs, filter_alignment, make_tree #
#------------------------------------------------------------------------------#
# Different QIIME scripts are used along the pipeline and in order to execute #
# these scripts, they need to be located in the user's PATH environment #
# variable, or QIIME's bin directory needs to be included in the parameter: #
# 'path' below. This parameter will be used by the pipeline for all the rules #
# that use a Qiime script. #
#------------------------------------------------------------------------------#
qiime:
path: ""
#------------------------------------------------------------------------------#
# BOTH_WF: R #
# rules: correct_barcodes, correct_barcodes_unassigned, histogram_chart #
#------------------------------------------------------------------------------#
# R is used by different rules within the pipeline. In order to run these #
# rules, CASCABEL uses the Rscript command. #
# Here you can change the command to call Rscript. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - command Rscript command, [default "Rscript"]. #
#------------------------------------------------------------------------------#
Rscript:
command: "Rscript"
#------------------------------------------------------------------------------#
# BOTH_WF: JAVA #
# rules: write_dmx_files, degap_alignment, remap_clusters #
#------------------------------------------------------------------------------#
# Java is used by different rules within the pipeline. In order to run java, #
# the pipeline needs to know how to invoke it. #
# Here you can change the command to invoke java. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - command java command and/or path to binaries if needed [default: "java"]#
#------------------------------------------------------------------------------#
java:
command: "java"
#------------------------------------------------------------------------------#
# BOTH_WF: Demultiplex input files #
# rule: write_dmx_files, correct_barcodes #
#------------------------------------------------------------------------------#
# Cascabel optionally performs library demultiplexing for barcoded reads. #
# This feature can be turned ON/OFF with the following options: #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - demultiplex "T". If the pipeline is going to demultiplex the input #
# files. In this case, the metadata file has to be #
# provided. #
# "F". If the input files are already demultiplexed. #
# - create_fastq_files "T" or "F". If 'demultiplex' = T and this is also T, #
# the pipeline will create demultiplexed fastq files per #
# sample. #
# - remove_bc The demultiplexed fastq files are created using the #
# raw data, thus they may contain artificial sequences #
# like the barcodes. This option trims the first N bases #
# from each read. #
# - order_by_strand During demultiplexing it is possible to identify the #
# current strand of the read according to the barcode. #
# Set this option to "T" in order to interchange FW with #
# RV reads when barcodes are found on opposite strands. #
# - add_unpair "T". Un-assembled reads are also included within the #
# demultiplexing process so that they can be assigned #
# to their samples. #
# "F". Only carry out the demultiplexing process with #
# paired reads. #
# - dmx_params Parameters to pass on to the demultiplexing script. #
# For example, the user can change the prefix and suffix #
# of the output files. To see the available parameters, #
# run: java -cp Scripts DemultiplexQiime. #
# - bc_mismatch Number of allowed mismatches for barcode correction #
# (computed using the Levenshtein distance). #
# bc_mismatch = 0: don't allow mismatches in the barcode#
# bc_mismatch > 0: correct bc_mismatch bases at maximum.#
# - bcc_params: Extra parameters for the barcode corrector tool. #
# Use: java -jar Scripts/BarcodeCorrector.jar to see #
# available options. #
# - create_tag_pairs_heatmap: "T" The barcode correction tool is useful for #
# detecting tag jumps. Set this to true "T" to create a #
# heatmap with the different tag pairs along a paired-end#
# library. #
#------------------------------------------------------------------------------#
demultiplexing:
demultiplex: "T"
create_fastq_files: "T"
remove_bc: 12
order_by_strand: "T"
add_unpair: "T"
dmx_params: "--remove-header"
bc_mismatch: 0
bcc_params: ""
create_tag_pairs_heatmap: "T"
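#------------------------------------------------------------------------------#
# Sketch of a metadata/barcode mapping file (hypothetical sample names, #
# barcodes and primer; check the CASCABEL wiki for the exact format expected #
# by your setup). It is tab-separated with a QIIME-style header line: #
# #
# #SampleID BarcodeSequence LinkerPrimerSequence Description #
# Sample1 ACGTACGTACGT GTGYCAGCMGCCGCGGTAA sample_one #
# Sample2 TGCATGCATGCA GTGYCAGCMGCCGCGGTAA sample_two #
#------------------------------------------------------------------------------#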
#------------------------------------------------------------------------------#
# BOTH_WF: Remove adapters / primers #
# rule: cutadapt #
#------------------------------------------------------------------------------#
# This rule runs Cutadapt. Cutadapt searches for primers in the reads and #
# removes them when it finds any. #
# For details, see: http://cutadapt.readthedocs.io/en/stable/guide.html #
# - remove "F" do not remove primers. #
# "CFG" Remove adapters using values from the configuration file #
# "METADATA" Adapters are taken from the metadata/mapping file #
# [default: F]. #
# ** fw_primer Sequence to be removed from forward reads. It accepts #
# IUPAC wildcards, e.g., "^GTGYCAGCMGCCGCGGTAA". #
# Only use it with - remove: "CFG" - #
# ** rv_primer Sequence to be removed from reverse reads. It accepts #
# IUPAC wildcards, e.g., "^GGACTACNVGGGTWTCTAAT". #
# Only use it with - remove: "CFG" - #
# ** min_overlap Minimum overlap between a primer and a sequence to be #
# identified. #
# ** min_length Minimum sequence length after removing primers. #
# ** min_prc Minimum percentage of reads that must pass the filters; #
# below this, Cascabel stops the workflow. [def. 50%] #
# ** threads Number of cpus for cutadapt. [def. 10] #
# ** extra_params Any extra parameter that you wish to supply to the #
# cutadapt command. #
# By default Cascabel includes the values #
# "--discard-untrimmed" and "--match-read-wildcards". The former#
# will discard reads without adapters and the latest interprets #
# IUPAC wildcards in the reads. Notice that when using #
# "--discard-untrimmed" Cascabel will redirect the untrimmed #
# sequences to an output file so no need to add Cutadapt's #
# option "--untrimmed-output". #
#------------------------------------------------------------------------------#
primers:
remove: "F"
fw_primer: ""
rv_primer: ""
min_overlap: "5"
min_length: "100"
min_prc: 50
threads: 10
extra_params: "--discard-untrimmed --match-read-wildcards "
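#------------------------------------------------------------------------------#
# A minimal sketch of config-based primer removal (the 515F/806R primer pair #
# below is only an illustration; substitute your own primers): #
# #
# primers: #
# remove: "CFG" #
# fw_primer: "^GTGYCAGCMGCCGCGGTAA" #
# rv_primer: "^GGACTACNVGGGTWTCTAAT" #
#------------------------------------------------------------------------------#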
#------------------------------------------------------------------------------#
# BOTH_WF: Extract barcodes #
# rules: extract_barcodes, extract_barcodes_unassigned #
#------------------------------------------------------------------------------#
# This rule extracts barcodes from the reads, and generates two files: one #
# with only the extracted barcodes and a second one with the sequences without #
# the barcodes. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - c This parameter allows you to choose the barcode configuration: #
# "barcode_single_end". The merged fragments start with the #
# barcode sequence. #
# "barcode_paired_stitched". Input has barcodes at the #
# beginning and end of the merged fragment. #
# "barcode_paired_end". This option is not valid here since #
# the reads have already been merged to one fragment. #
# - bc_length If 'c' is "barcode_paired_stitched" use both: #
# --bc1_len X and --bc2_len Y. #
# If 'c' is "barcode_single_end" use only --bc1_len X. #
# X and Y are the nucleotide lengths of the barcodes. #
# - extra_params Extra parameters. See extract_barcodes.py -h. #
#------------------------------------------------------------------------------#
ext_bc:
c: "barcode_paired_stitched"
bc_length: "--bc1_len 12 --bc2_len 12"
extra_params: ""
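#------------------------------------------------------------------------------#
# Example (illustrative; Cascabel builds the actual command and file names #
# are placeholders): the settings above correspond to a QIIME call like: #
# extract_barcodes.py -f seqs.fastq -c barcode_paired_stitched \ #
# --bc1_len 12 --bc2_len 12 -o extracted_barcodes/ #
#------------------------------------------------------------------------------#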
#------------------------------------------------------------------------------#
# BOTH_WF: Split libraries #
# rule: split_libraries, split_libraries_rc #
#------------------------------------------------------------------------------#
# These rules perform demultiplexing of Fastq sequence data where barcodes and#
# sequences are contained in two separate fastq files. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - q Maximum unacceptable Phred quality score (e.g., for Q20 and #
# better, specify -q 19). [default: 19] #
# - r Maximum number of consecutive low quality base calls allowed #
# before truncating a read. [default: 3] #
# - barcode_type The type of barcode used. This can be an integer, e.g., "6" #
# or "golay_12" for Golay error-correcting barcodes. #
# - extra_params Any extra parameter. Run split_libraries_fastq -h to see all #
# options. #
#------------------------------------------------------------------------------#
split:
q: "19"
r: "5"
barcode_type: "24"
extra_params: " --phred_offset 33"
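#------------------------------------------------------------------------------#
# Example (illustrative; file names are placeholders): the settings above #
# correspond to a QIIME call roughly like: #
# split_libraries_fastq.py -i seqs.fastq -b barcodes.fastq -m mapping.txt \ #
# -q 19 -r 5 --barcode_type 24 --phred_offset 33 -o split_dir/ #
#------------------------------------------------------------------------------#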
#------------------------------------------------------------------------------#
# ASV_WF: dada2 trim and filter reads #
# rule: dada2Filter #
#------------------------------------------------------------------------------#
# These parameters take effect during the quality trimming and filtering steps #
# implemented within dada2 using the filterAndTrim() function. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - generateQAplots Cascabel already generates FastQC reports, however dada2 #
# quality plots can be generated by passing "T". #
# Default "T". #
# - truncFW Truncate forward reads after truncFW bases. Reads shorter#
# than this are discarded. Your reads must still overlap #
# after truncation in order to merge them later. #
# - truncRV Truncate reverse reads after truncRV bases. Reads shorter#
# than this are discarded. Your reads must still overlap #
# after truncation in order to merge them later. #
# - maxEE_FW After truncation, forward reads with higher than #
# maxEE_FW "expected errors" will be discarded. #
# - maxEE_RV After truncation, reverse reads with higher than #
# maxEE_RV "expected errors" will be discarded. #
# - cpus Number of threads|cpus to be used. #
# - extra_params Any extra parameter belonging to dada2's function #
# filterAndTrim(). The value passed through this variable #
# is sent directly to the function in R. Therefore, if your#
# extra_params involves more than one argument, separate #
# them with commas, e.g., suppose you want to pass truncQ=2 #
# and rm.phix=TRUE arguments. In R the function may look #
# like this: filterAndTrim(...,truncQ=2, rm.phix=TRUE) #
# Thus, the extra_params should look like the following: #
# "truncQ=2, rm.phix=TRUE". #
# Note from dada2's tutorial: #
# "The standard filtering parameters are starting points, not set in stone. #
# If you want to speed up downstream computation, consider tightening maxEE. #
# If too few reads are passing the filter, consider relaxing maxEE, perhaps #
# especially on the reverse reads (maxEE_RV), and reducing the truncLen to #
# remove low quality tails. Remember though, when choosing truncLen for #
# paired-end reads you must maintain overlap after truncation in order to #
# merge them later." #
#------------------------------------------------------------------------------#
dada2_filter:
generateQAplots: "T"
truncFW: 0
truncRV: 0
maxEE_FW: 3
maxEE_RV: 5
cpus: 10
extra_params: ",truncQ=2, rm.phix=TRUE"
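#------------------------------------------------------------------------------#
# Example (illustrative; variable names are placeholders): the settings above #
# translate into an R call roughly like: #
# filterAndTrim(fwd, filtFs, rev, filtRs, truncLen=c(0, 0), #
# maxEE=c(3, 5), truncQ=2, rm.phix=TRUE, multithread=10) #
#------------------------------------------------------------------------------#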
#------------------------------------------------------------------------------#
# ASV_WF: dada2 merge pairs #
# rule: run_dada2 #
#------------------------------------------------------------------------------#
# After denoising forward and reverse reads, they are assembled with the #
# mergePairs() function from dada2 package. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - minOverlap Default 12. The minimum length of the overlap required for #
# merging the forward and reverse reads. #
# - maxMismatch Default 0. The maximum mismatches allowed in the overlap #
# region. #
#------------------------------------------------------------------------------#
dada2_merge:
minOverlap: 12
maxMismatch: 0
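#------------------------------------------------------------------------------#
# Example (illustrative; variable names are placeholders): the settings above #
# translate into an R call roughly like: #
# merged <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, #
# minOverlap=12, maxMismatch=0) #
#------------------------------------------------------------------------------#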
#------------------------------------------------------------------------------#
# ASV_WF: dada2 generates ASV #
# rule: run_dada2 #
#------------------------------------------------------------------------------#
# Find true Amplicon Sequence Variants by denoising sequences with the dada() #
# function. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - generateErrPlots Default "T". Use "T" to plot the error rates #
# learned by the learnErrors() function. #
# - nbases Number of bases used for learning the errors. [Def 1e8]#
# - pool If pool = "TRUE", the algorithm will pool together all #
# samples prior to sample inference. If pool = "FALSE", #
# sample inference is performed on each sample #
# individually. If pool = "pseudo", the algorithm will #
# perform pseudo-pooling between individually processed #
# samples. #
# - cpus Number of threads|cpus to be used. #
# - chimeras If "T", the samples in a sequence table are #
# independently checked for bimeras, and a consensus #
# decision on each sequence variant is made. #
# - extra_params Any extra parameter belonging to dada2's function #
# dada(). e.g., "selfConsist=FALSE". #
# The value passed through this variable is sent directly#
# to the function in R. Therefore, if your extra_params #
# involves more than one argument, separate them with #
# commas, as the example above. #
#------------------------------------------------------------------------------#
dada2_asv:
generateErrPlots: "T"
nbases: "1e8"
pool: "pseudo"
cpus: 15
chimeras: "T"
chimeras_method: "consensus"
chimeras_taxonomy: "T"
extra_params: "selfConsist=FALSE"
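#------------------------------------------------------------------------------#
# Example (illustrative; variable names are placeholders): the settings above #
# translate into R calls roughly like: #
# errF <- learnErrors(filtFs, nbases=1e8, multithread=15) #
# dadaFs <- dada(derepFs, err=errF, pool="pseudo", multithread=15) #
# seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus") #
#------------------------------------------------------------------------------#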
#------------------------------------------------------------------------------#
# ASV_WF: Assign taxonomy #
# rule: run_dada2 #
#------------------------------------------------------------------------------#
# The taxonomy assignment for the ASVs is performed within the dada2 package, #
# by using the assignTaxonomy() function which uses the RDP Naive Bayesian #
# Classifier algorithm. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - db Full path to reference database training files: #
# https://benjjneb.github.io/dada2/training.html #
# - extra_params Any extra parameter belonging to dada2's function #
# assignTaxonomy(). e.g., "minBoot=45, tryRC=TRUE". #
# The value passed through this variable is sent directly to #
# the function in R. Therefore, if your extra_params involves #
# more than one argument, separate them with commas, as the #
# example above. #
# An important parameter to consider is minBoot, which sets #
# the minimum bootstrapping support required to return a #
# taxonomic classification. The original paper recommended a #
# threshold of 50 for sequences of 250nts or less. We set 70 #
# as the default and advise to increase it to 80 for longer #
# fragments. #
# - seed RDP's bootstrap seed #
# - add_sps Arguments for assigning genus-species binomials to the input#
# sequences by exact matching against a reference fasta using #
# the addSpecies() function from dada2. #
# + add "T" or "F" Whether or not to assign genus-species binomials. #
# + db_sps If add = "T" full path to the reference species file. #
# + extra_params Any extra parameter belonging to dada2's function #
# addSpecies(). e.g., "allowMultiple=TRUE". #
# The value passed through this variable is sent directly to #
# the function in R. Therefore, if your extra_params involves #
# more than one argument, separate them with commas, as the #
# example above. #
# NOTE: Find available databases at: #
# https://benjjneb.github.io/dada2/training.html #
# Files "silva_nr_v132_train_set.fa.gz" & "silva_species_assignment_v132.fa.gz"#
# are just a suggestion and they must be downloaded from the reference above #
# prior to use. Rename {FULL_PATH} to the path where your files were #
# downloaded. #
#------------------------------------------------------------------------------#
dada2_taxonomy:
db: "/Absolute/Path/to/db/i.e./silva_nr_v132_train_set.fa.gz"
extra_params: "minBoot=70"
seed: 4249
add_sps:
add: "F"
db_sps: "/Absolute/Path/to/sp_db/i.e./silva_species_assignment_v132.fa.gz"
extra_params: "allowMultiple=TRUE"
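#------------------------------------------------------------------------------#
# Example (illustrative; variable names are placeholders): with add: "T", the #
# settings above translate into R calls roughly like: #
# set.seed(4249) #
# taxa <- assignTaxonomy(seqs, "silva_nr_v132_train_set.fa.gz", minBoot=70) #
# taxa <- addSpecies(taxa, "silva_species_assignment_v132.fa.gz", #
# allowMultiple=TRUE) #
#------------------------------------------------------------------------------#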
#------------------------------------------------------------------------------#
# OTU_WF: Align reads vs a reference database #
# rule: align_vs_reference #
#------------------------------------------------------------------------------#
# In some cases you may want to align the reads against a reference database #
# before generating OTUs. This helps to remove technical sequences #
# (primers, adapters) at the beginning and/or end of the reads, and to filter #
# out potential chimeras or sequences of no interest, which the primers #
# amplified but which are not part of the study (e.g., reads of human origin).#
# Therefore, this alignment step can improve the OTU clustering and the #
# subsequent taxonomy assignment. In order to run the alignment here, use #
# 'align': "T", and bear in mind that you should only do this with small- to #
# medium-sized databases, because sequence alignment is computationally #
# costly. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - mothur_cmd: Enter the command to call mothur [default: "mothur"]. #
# - align: "T" or "F" to run or skip this rule respectively. #
# - dbAligned: Database to perform the alignment with. #
# - cpus: Number of CPUs to perform the alignment with. #
#------------------------------------------------------------------------------#
align_vs_reference:
mothur_cmd: "mothur"
align: "F"
dbAligned: ""
cpus: 4
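#------------------------------------------------------------------------------#
# Example (illustrative; file names are placeholders): with align: "T", the #
# settings above correspond to a mothur call roughly like: #
# mothur "#align.seqs(fasta=seqs.fasta, reference=ref.align, processors=4)" #
#------------------------------------------------------------------------------#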
#------------------------------------------------------------------------------#
# OTU_WF: Identify chimeric sequences #
# rule: search_chimera #
#------------------------------------------------------------------------------#
# This rule will be executed if and only if the option 'search' is set to "T"! #
# Chimeric sequences are predicted using usearch61. This algorithm performs #
# both de novo (abundance based) chimera and reference based detection. #
# Unclustered sequences are used as input rather than a representative sequence#
# set, as the sequences will be clustered to get abundance data. #
# The results are all input sequences not flagged as chimeras. #
# This rule implements different methods for identifying chimeras, below #
# more details about the available options. #
# For details about usearch, see: http://drive5.com/usearch/usearch_docs.html #
# If you are using usearch61, bear in mind that you can use a reference #
# database via extra_params, e.g., "-r /path/to/gold_db/gold.fa". #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - method Select the method for chimera identification: #
# - "usearch61" This algorithm performs both de novo #
# (abundance based) chimera and reference based detection. #
# This method uses the usearch implementation within qiime's#
# script identify_chimeric_seqs.py. #
# To use reference based detection supply the reference #
# database via extra_params, e.g.:"-r /dbs/gold_db/gold.fa" #
# For details, see: http://drive5.com/usearch/usearch_docs.html#
# - "uchime_denovo" detect chimeras de novo (uses vsearch) #
# - "uchime_ref" detect chimeras using a reference database. #
# In this latter case, the user MUST supply extra_params: #
# "--db </full/path/to/db.fasta>" (i.e., path to gold db). #
# - threads Number of threads to use. #
# - extra_params Any extra parameter. It is recommended to run the chimeric #
# search against a chimera database, e.g., #
# "-r /export/data/databases/gold_db/gold.fa" for "usearch61" #
# or "--db /export/data/databases/gold_db/gold.fa" for #
# "uchime_ref". #
# - search "T" | "F" (true or false) to execute chimera checking or not.#
#------------------------------------------------------------------------------#
chimera:
search: "F"
method: "uchime_ref"
threads: 10
extra_params: "--db /Absolute/Path/gold.fa"
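#------------------------------------------------------------------------------#
# Example (illustrative; file names are placeholders): with search: "T" and #
# method: "uchime_ref", the settings above correspond to a vsearch call like: #
# vsearch --uchime_ref seqs.fasta --db /Absolute/Path/gold.fa \ #
# --threads 10 --nonchimeras seqs.nochim.fasta #
#------------------------------------------------------------------------------#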
#------------------------------------------------------------------------------#
# BOTH_WF: Length filtering for OTU or ASV #
# rule: remove_short_long_reads #
#------------------------------------------------------------------------------#
# This rule executes a script in order to filter the reads based on their #
# length. #
# First, this script generates a histogram based on the read lengths #
# distribution. Next, this histogram is used by the same script in different #
# ways depending on the pipeline's execution mode (interactive or automatic). #
# If the pipeline is executed in "Interactive" mode, it will always stop at #
# this step and let the user choose between the following options: #
# * Use the values specified in the configuration file (the ones from the #
# parameters specified here, 'longs' and 'shorts'). #
# * Filter the reads based on the median of the sequence length #
# distribution +/- an offset value. #
# * Do not filter any sequence. #
# * Stop the pipeline. #
# If the pipeline is executed in non-interactive/automatic mode, the pipeline #
# will not stop and the reads will be filtered according to the #
# 'non_interactive_behaviour' value. #
# - non_interactive_behaviour Behaviour for the non-interactive/automatic #
# mode. Valid options are: #
# * CFG: use the values from the configuration #
# file ('longs' and 'shorts'). #
# * AVG: use the values from the median #
# distribution. #
# * NONE: do not filter any read. #
# [default: CFG] #
# - offset value used to determine the bounds for the #
# filtering when 'interactive' is set to 'F' and #
# non_interactive_behaviour is equal to 'AVG'. #
# [default: 10]. #
# - longs Maximum read length. #
# - shorts Minimum read length. #
#------------------------------------------------------------------------------#
# REMARK: If your library contains more than one expected fragment length, you #
# can either: #
# A) Do not filter any length. #
# B) Use inclusive boundaries. #
# C) Rerun the pipeline for all the expected fragment lengths, e.g., #
# "--forcerun remove_short_long_reads". In this case, do not forget to #
# backup previous results, otherwise, they will be overwritten. #
#------------------------------------------------------------------------------#
rm_reads:
non_interactive_behaviour: "AVG"
offset: 10
longs: 260
shorts: 220
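#------------------------------------------------------------------------------#
# Worked example: in automatic mode with non_interactive_behaviour: "AVG" and #
# offset: 10, if the median read length is 253, only reads between 243 and #
# 263 nucleotides are kept; with "CFG", the bounds are the 'shorts' (220) and #
# 'longs' (260) values above. #
#------------------------------------------------------------------------------#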
#------------------------------------------------------------------------------#
# OTU_WF: Dereplicate #
# rule: dereplicate, pick_derep_representatives #
#------------------------------------------------------------------------------#
# These parameters allow the user to dereplicate the sequences over the FULL #
# LENGTH (100% identity) before applying any OTU picking strategy. This is #
# advised for very large datasets, when OTU picking methods take too long or #
# have memory issues. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - dereplicate Dereplicate sequences over their full length (F/T) #
# [default: "F"]. #
# - vsearch_cmd Command for calling vsearch [default: "vsearch"]. #
# - min_abundance Minimum abundance for output from dereplication. #
# - strand: plus|both, search "plus" or "both" strands #
# [default: "both"]. #
# - extra_params Dereplication is performed using vsearch, you can add #
# different options described by vsearch --help. #
#------------------------------------------------------------------------------#
derep:
dereplicate: "T"
vsearch_cmd: "vsearch"
min_abundance: 1
strand: "both"
extra_params: ""
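#------------------------------------------------------------------------------#
# Example (illustrative; file names are placeholders): the settings above #
# correspond to a vsearch call roughly like: #
# vsearch --derep_fulllength seqs.fasta --minuniquesize 1 --strand both \ #
# --output derep.fasta --sizeout --uc derep_clusters.uc #
#------------------------------------------------------------------------------#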
#------------------------------------------------------------------------------#
# OTU_WF: OTU picking #
# rule: cluster_OTUs #
#------------------------------------------------------------------------------#
# The OTU picking step assigns similar sequences to operational taxonomic #
# units, or OTUs, by clustering sequences based on a user-defined similarity #
# threshold. Sequences which are similar at or above the threshold level are #
# taken to represent the presence of a taxonomic unit (e.g., approximately at #
# genus level, when the similarity threshold is set at 0.94) in the sequence #
# collection. Swarm takes a different clustering approach which does not #
# require setting a threshold. Instead, clusters are formed using sequence #
# graphs and abundance. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - s Sequence similarity threshold between 0 and 1. This applies #
# for the following methods 'm': uclust, uclust_ref, usearch, #
# usearch_ref, usearch61, usearch61_ref, sumaclust, and #
# sortmerna. [default:0.97]. #
# For Swarm this option must be supplied; it corresponds to #
# the "distance" (option -d) and takes integer values. #
# [default:1] #
# - m OTU picking method. Valid choices are: sortmerna, mothur, #
# trie, uclust, uclust_ref, usearch, usearch_ref, swarm, #
# cdhit, sumaclust, prefix_suffix. #
# - cpus Number of cpus to use. #
# - extra_params Any extra parameter. Run 'pick_otus.py -h' to see all #
# options. #
# For swarm, you can also add any extra parameter. Run #
# 'swarm -h' to see the available options. When running swarm #
# with distance (-d resolution) equal to 1 we recommend adding#
# extra_param "-f" (--fastidious) to link nearby #
# low-abundance swarms. #
#------------------------------------------------------------------------------#
pickOTU:
s: "0.97"
m: "uclust"
cpus: "6"
extra_params: ""
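#------------------------------------------------------------------------------#
# Example (illustrative; file names are placeholders): with m: "swarm", the #
# clustering step corresponds to a call roughly like: #
# swarm -d 1 -f -t 6 -z derep.fasta -o otu_clusters.txt #
# (-z expects abundance annotations such as ";size=N" in the headers) #
#------------------------------------------------------------------------------#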
#------------------------------------------------------------------------------#
# OTU_WF: Select representative sequences #
# rule: pick_representatives #
#------------------------------------------------------------------------------#
# After picking OTUs, this rule picks a representative sequence for each OTU. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# - m Method for picking representative sequences. Valid choices #
# are: random, longest, most_abundant, first #
# [default: "most_abundant"]. #
# Note: "first" chooses the cluster seed when picking OTUs with#
# uclust. #
# - extra_params Any extra parameter. Run 'pick_rep_set.py -h' to see all #
# options. #
#------------------------------------------------------------------------------#
pickRep:
m: "most_abundant"
extra_params: ""
################################################################################
# OTU_WF: Assign taxonomy #
# rule: assign_taxonomy #
#------------------------------------------------------------------------------#
# Performs taxonomy assignment for the representative sequences. #
# #
#----------------------------- PARAMS -----------------------------#
# #
# This step can be performed using three different tools (parameter 'tool'): #
# 1) VSEARCH. Compare target sequences 'db_file' to the query sequences to #
# assign taxonomy, using global pairwise alignment. #
# 2) BLAST. This uses BLAST+. #
# 3) QIIME. In this case, CASCABEL runs the assign_taxonomy.py script which #
# can use any of the following methods: #
# 3.1) BLAST. To use blast via QIIME, use "blast" as 'method' below. #
# This uses the old version of BLAST (not BLAST+). #
# This method can work with either a BLAST database, setting #
# 'dbType' to "-b" and 'dbFile' with the full path to a BLAST #
# database, or with a fasta file with 'dbType' set to "-r" and #
# 'dbFile' pointing to the fasta file (full path must be used). #
# 3.2) UCLUST. A method based on sequence clustering. To use this method type #
# "uclust" as 'method' below. This method ONLY works with #
# 'dbType' set to "-r" and 'dbFile' full path to a fasta file. #
# 3.3) RDP. The Ribosomal Database Project (RDP) Classifier, a naive #
# Bayesian classifier, can rapidly and accurately classify #
# bacterial 16S rRNA sequences. To use this method, use "rdp" as #
# 'method' below. #
# NOTE: In order to run RDP it is necessary to provide the #
# classifier path. To do so, please include the following line #
# on the "extra_params": #
# '--rdp_classifier_fp /path/to/rdp_classifier-2.2.jar' #
# Due to some compatibility issues, RDP needs a custom fasta file#
# and taxonomy mapping file. You can find the appropriate files #
# at the following references: #
# 'dbFile' "/gg/gg_13_8_otus/rep_set/97_otus.fasta" #
# 'mappFile' "/gg/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt" #
# These 3 different methods can use different sets of options in #
# order to fine tune the taxonomy assignment. These options can be #
# supplied via the extra_params parameter. To see what options are #
# available for each 'method', type on the command line #
# 'parallel_assign_taxonomy_<method>.py -h'. #
# - tool Which tool to use, options are "vsearch", "qiime" and "blast". #
# - map_lca For Vsearch and BLAST methods, the pipeline will map to its LCA #