Parameter files

parameters.yaml

Defines paths for both local and remote binaries and databases. A template is provided in the examples directory.

ReadSoustraction:
  db:
    vitis: '/media/data/db/ncbi/vitis/vitis'
    phiX:  '/media/data/db/ncbi/phiX/phiX174'
bin:
  bowtie: '/usr/local/bin/bowtie2'
  samtools: '/usr/bin/samtools'
  bedtools: '/usr/bin/bedtools'
  prinseq: '/usr/local/bin/prinseq-lite.pl'
  merge-paired-reads: '/home/stheil/softwares/sortmerna-2.1-linux-64/scripts/merge-paired-reads.sh'
  unmerge-paired-reads: '/home/stheil/softwares/sortmerna-2.1-linux-64/scripts/unmerge-paired-reads.sh'
  sortmerna: '/home/stheil/softwares/sortmerna/sortmerna'
servers:
  enki:
    db:
      nt: '/media/data/db/ncbi/nt/nt'
      nr: '/media/data/db/ncbi/nr/nr'
      refseq_vir_nucl: '/media/data/db/ncbi/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/media/data/db/ncbi/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/save/db/pfam/pfam_viruses_rpsdb'
      all_vir_nucl: '/media/data/db/ncbi/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/media/data/db/ncbi/all_vir/all_vir_prot.faa'
  genotoul:
    adress: 'genotoul.toulouse.inra.fr'
    username: 'stheil'
    db:
      nr: '/bank/blastdb/nr'
      nt: '/bank/blastdb/nt'
      refseq_vir_nucl: '/save/stheil/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/save/stheil/db/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/save/db/pfam/pfam_viruses_rpsdb'
      all_vir_nucl: '/home/stheil/save/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/stheil/save/db/all_vir/all_vir_prot.faa'
    scratch: '/work/stheil'
    bin:
      blastx: 'blastx+'
      blastn: 'blastn+'
  genologin:
    adress: 'genologin.toulouse.inra.fr'
    username: 'mlefebvre'
    db:
      nr: '/bank/ncbi/blast/nr/current/blast/nr'
      nt: '/bank/ncbi/blast/nr/current/blast/nt'
      dmd_nr: '/bank/diamonddb/nr'
      refseq_vir_nucl: '/save/mlefebvre/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/save/mlefebvre/db/refseq_vir/viral.protein.faa'
      pfam: '/home/mlefebvre/work/pfam/Pfam'
      all_vir_nucl: '/home/mlefebvre/save/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/mlefebvre/save/db/all_vir/all_vir_prot.faa'
    scratch: '/work/mlefebvre'
    bin:
      blastx: 'blastx'
      blastn: 'blastn'
  avakas:
    adress: 'avakas.mcia.univ-bordeaux.fr'
    username: 'stheil'
    db:
      nr: '/home/stheil/db/nr/nr'
      nt: '/home/stheil/db/nt/nt'
      all_vir_nucl: '/home/stheil/scratch/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/stheil/scratch/db/all_vir/all_vir_prot.faa'
      refseq_vir_nucl: '/home/stheil/scratch/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/home/stheil/scratch/db/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/db/pfam/pfam_viruses_rpsdb'
    scratch: '/scratch/stheil'
    bin:
      blastx: 'blastx'
      blastn: 'blastn'
Diamond:
  db:
    all_vir_prot: /media/db/ncbi/all_vir/all_vir_prot
SortMeRna:
  db:
    silva-arc-16s-id95: /media/data/db/rRNA_databases/silva-arc-16s-id95
    silva-arc-23s-id98: /media/data/db/rRNA_databases/silva-arc-23s-id98
    silva-bac-16s-id90: /media/data/db/rRNA_databases/silva-bac-16s-id90
    silva-bac-23s-id98: /media/data/db/rRNA_databases/silva-bac-23s-id98
    silva-euk-18s-id95: /media/data/db/rRNA_databases/silva-euk-18s-id95
    silva-euk-28s-id98: /media/data/db/rRNA_databases/silva-euk-28s-id98
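
As a rough illustration of how such a file might be consumed, the sketch below resolves a database name to its path. It works on a plain Python dict, which is what a YAML loader would typically return, and copies only a few entries from the example above. The helper name `resolve_db` and its lookup rule (server section first, module section otherwise) are assumptions made for illustration, not the pipeline's actual code.

```python
# Hypothetical helper, for illustration only: not part of the pipeline.
# `params` mimics what a YAML loader would return for parameters.yaml.
params = {
    "ReadSoustraction": {"db": {"phiX": "/media/data/db/ncbi/phiX/phiX174"}},
    "servers": {
        "enki": {"db": {"nr": "/media/data/db/ncbi/nr/nr"}},
        "genotoul": {
            "adress": "genotoul.toulouse.inra.fr",
            "db": {"nr": "/bank/blastdb/nr"},
        },
    },
}


def resolve_db(params, name, server=None, module=None):
    """Return the path registered for database `name`.

    Looks under servers -> <server> -> db when a server is given,
    otherwise under <module> -> db.
    """
    section = params["servers"][server] if server else params[module]
    try:
        return section["db"][name]
    except KeyError:
        raise KeyError(f"database '{name}' is not defined for {server or module}")


print(resolve_db(params, "nr", server="genotoul"))  # /bank/blastdb/nr
print(resolve_db(params, "phiX", module="ReadSoustraction"))
```

Under this assumed rule, a step declaring `server: genotoul` and `db: nr` would get the path from the genotoul section, while a local step such as ReadSoustraction would read its own `db` section.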

step.yaml

Defines the steps that the pipeline will execute. A template is provided in the /examples directory.

Each step name corresponds to a Python module that launches the step. Step names are split on the ‘_’ character, so you can launch multiple instances of the same module. For example, to run both blastx and blastn, the step names could be ‘Blast_N’ and ‘Blast_X’. What follows the underscore does not matter; it only serves to differentiate the two steps.
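
The naming rule can be sketched in a few lines of Python. The function `module_name` is a hypothetical name used here for illustration, and the sketch assumes that only the part before the first underscore selects the module.

```python
def module_name(step_name):
    """Return the part of a step name that precedes the first underscore."""
    return step_name.split("_", 1)[0]


# Both Blast steps resolve to the same module; the suffix only tells them apart.
for step in ("Blast_N", "Blast_X", "Demultiplex"):
    print(step, "->", module_name(step))
```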

Special words in brackets are used as substitution strings:

- (file), (file1) and (file2)
- (SampleID)
- (library)

ReadSoustraction_phiX:
  i1: (file1)
  i2: (file2)
  db: phiX
  o1: (library)_phiX.r1.fq
  o2: (library)_phiX.r2.fq
  sge: True
  n_cpu: 5
  iter: library
Demultiplex:
  i1: (library)_phiX.r1.fq
  i2: (library)_phiX.r2.fq
  adapters: adapters.fna
  middle: 1
  min_qual: 20
  polyA: True
  min_len: 70
  iter: library
  sge: True
DemultiplexHtml:
  csv: (library)_demultiplex.stats.csv
  id: (library)
  out: stat_demultiplex
  iter: global
  sge: True
Normalization:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  o1: (SampleID)_truePairs_norm_r1.fq
  o2: (SampleID)_truePairs_norm_r2.fq
  num: 40000
  iter: sample
  n_cpu: 5
  sge: True
drVM:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  n_cpu: 20
  identity: 70
  min_len: 300
  sge: True
Assembly_idba:
  prog: idba
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_idba.scaffold.fa
  sge: True
Assembly_spades:
  prog: spades
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_spades.scaffold.fa
  sge: True
Map_idba:
  contigs: (SampleID)_idba.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_idba.scaffold.bam
  rn: (SampleID)_idba.scaffold.rn
  sge: True
  n_cpu: 16
Map_spades:
  contigs: (SampleID)_spades.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_spades.scaffold.bam
  rn: (SampleID)_spades.scaffold.rn
  sge: True
  n_cpu: 16
Diamond:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  n_cpu: 10
  sge: True
  score: 50
  evalue: 0.0001
  qov: 50
  hov: 5
  db: all_vir_prot
Diamond_singletons_nr:
  contigs: (SampleID)_idba.scaffold.fa
  db: nr
  ising: (SampleID)_singletons.fq
  n_cpu: 10
  sge: True
  out: (SampleID)_singletons_test.nr.dmdx.xml
  evalue: 0.001
  iter: sample
  score: 10
  qov: 10
Diamond2blast:
  i: (SampleID)_idba.scaffold.dmdx.nr.csv
  contigs: (SampleID)_idba.scaffold.dmdx2bltx.fa
  out: (SampleID)_idba.scaffold.dmdx2bltx.nr.xml
  type: blastx
  db: nr
  evalue: 0.0001
  server: genologin
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_allvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: all_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  sge: True
  num_chunk: 1000
  tc: 50
Blast_nr:
  type: blastx
  contigs: (SampleID)_idba.scaffold.fa
  db: nr
  out: (SampleID)_idba.scaffold.bltx.nr.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_refvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: refseq_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  sge: True
Blast_singleton_nr:
  type: blastx
  contigs: (SampleID)_singletons.fa
  db: nr
  out: (SampleID)_singletons.bltx.nr.xml
  evalue: 0.0001
  server: genologin
  n_cpu: 8
  tc: 10
  num_chunk: 1000
  sge: True
Blast_RPS:
  type: rpstblastn
  contigs: (SampleID)_idba.scaffold.fa
  db: pfam
  evalue: 0.0001
  out: (SampleID)_idba.scaffold.rps.pfam.xml
  server: genotoul
  n_cpu: 8
  sge: True
Blast2ecsv_allvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  vs: True
  out: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  sge: True
  type: TBLASTX
  score: 50
  qov: 20
Blast2ecsv_refvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.0001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  vs: True
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  sge: True
  type: TBLASTX
  score: 50
  qov: 50
  hov: 5
Blast2ecsv_nr:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.bltx.nr.xml
  vs: True
  out: (SampleID)_idba.scaffold.bltx.nr.csv
  sge: True
  type: BLASTX
  score: 50
  qov: 5
  hov: 5
Blast2ecsv_dmd:
  evalue: 0.01
  fhit: True
  pm: global
  if: xml
  r: True
  b: (SampleID)_dmd.xml
  out: (SampleID)_dmd.allVirProt.csv
  sge: True
  type: BLASTX
  pd: True
Blast2ecsv_dmdx_singletons_nr:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_singletons.nr.dmdx.xml
  vs: True
  out: (SampleID)_singletons_test.nr.dmdx.csv
  sge: True
  type: DIAMONDX
  pd: True
Rps2ecsv:
  b: (SampleID)_idba.scaffold.rps.pfam.xml
  out: (SampleID)_idba.scaffold.rps.pfam.csv
  evalue: 0.0001
  sge: True
Ecsv2excel:
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  r:  (SampleID)_idba.scaffold.rps.pfam.csv
  out:  (SampleID)_idba.scaffold.xlsx
  sge: True
Ecsv2compare:
  b1: (SampleID)_idba.scaffold.bltx.nr.csv
  r:  (SampleID)_idba.scaffold.rps.pfam.csv
  out:  (SampleID)_idba.scaffold.comparison.xlsx
  sge: True
Blast2hist:
  id1: (SampleID)_refseq_tbltx
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  id2: (SampleID)_allvir_tbltx
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  id3: (SampleID)_nr_bltx
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  id4: (SampleID)_dmd
  b4: (SampleID)_dmd.allVirProt.csv
  iter: global
  sge: True
  out: blast_hist
Ecsv2krona:
  id1: (SampleID)_refseq_tbltx
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  x1: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  id2: (SampleID)_allvir_tbltx
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  x2: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  id3: (SampleID)_nr_bltx
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  x3: (SampleID)_idba.scaffold.bltx.nr.xml
  outdir: krona_blast
  out: blast.global.krona.html
  data: both
  r: True
  c: identity
  iter: global
  sge: True
Ecsv2krona_dmd:
  id1: (SampleID)
  b1: (SampleID)_dmd.allVirProt.csv
  outdir: krona_diamond
  out: global_krona_dmd.html
  data: contig
  r: True
  c: identity
  iter: global
  sge: True
Automapper_nr:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.bltx.nr.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_nr
  sge: True
  ref: nt
Automapper_allvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_allvir
  sge: True
  ref: all_vir_nucl
Automapper_refseqTX:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_refseq
  sge: True
  ref: refseq_vir_nucl
Rps2tree:
  pfam: (SampleID)_idba.scaffold.rps.pfam.csv
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.bltx.nr.csv
  id: (SampleID)
  out: rps2tree_global
  min_prot: 100
  viral_portion: 0.3
  perc: 90
  iter: global
  sge: True
Getresults:
  global_dir1: rps2tree_global
  global_dir2: krona_blast
  global_dir3: krona_diamond
  global_dir4: blast_hist
  global_dir5: stat_demultiplex
  sample_dir1: (SampleID)_autoMapper_nr
  sample_dir2: (SampleID)_autoMapper_refseq
  sample_dir3r: (SampleID)_autoMapper_allvir
  sample_file1: (SampleID)_idba.scaffold.xlsx
  sample_file2: (SampleID)_idba.scaffold.fa
  sample_file3: (SampleID)_spades.scaffold.fa
  sample_file4: (SampleID)_truePairs_r1.fq
  sample_file5: (SampleID)_truePairs_r2.fq
  out: results
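
To make the substitution strings used in the steps above more concrete, here is a minimal sketch of how a placeholder such as (SampleID) or (library) could be expanded. It assumes that the values come from the matching columns of map.txt; the function `substitute` and the sample dict are hypothetical and only illustrate the idea.

```python
def substitute(value, sample):
    """Replace each (key) placeholder in `value` with the sample's field."""
    if not isinstance(value, str):
        return value  # numbers and booleans such as n_cpu or sge are left alone
    for key, replacement in sample.items():
        value = value.replace(f"({key})", replacement)
    return value


# One row of map.txt, reduced to the columns used as placeholders.
sample = {
    "SampleID": "ds2016-121",
    "library": "lib1",
    "file1": "Lib1_phiX.R1.fastq",
    "file2": "Lib1_phiX.R2.fastq",
}

# A few options taken from the ReadSoustraction_phiX step.
step = {"i1": "(file1)", "o1": "(library)_phiX.r1.fq", "n_cpu": 5}

resolved = {option: substitute(value, sample) for option, value in step.items()}
print(resolved)
# {'i1': 'Lib1_phiX.R1.fastq', 'o1': 'lib1_phiX.r1.fq', 'n_cpu': 5}
```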

map.txt

The map file describes the experiment. It is a tab-delimited file whose first line contains the headers and starts with ‘#’. It must contain at least two columns: SampleID and file. A template is provided in the examples directory. This is a minimal map.txt file:

#SampleID	mid	common	file1	file2	library
ds2016-121	AACCGCAA	TGTGTTGGGTGTGTTTGG	Lib1_phiX.R1.fastq	Lib1_phiX.R2.fastq	lib1
ds2016-132	AACTAGTA	TGTGTTGGGTGTGTTTGG	Lib1_phiX.R1.fastq	Lib1_phiX.R2.fastq	lib1
ds2016-122	AGGCGCCT	TGTGTTGGGTGTGTTTGG	Lib2_phiX.R1.fastq	Lib2_phiX.R2.fastq	lib2
ds2016-133	ATTAGCTA	TGTGTTGGGTGTGTTTGG	Lib2_phiX.R1.fastq	Lib2_phiX.R2.fastq	lib2
ds2016-123	CAAGAGTT	TGTGTTGGGTGTGTTTGG	Lib3_phiX.R1.fastq	Lib3_phiX.R2.fastq	lib3
ds2016-55	CAAGCAGG	TGTGTTGGGTGTGTTTGG	Lib3_phiX.R1.fastq	Lib3_phiX.R2.fastq	lib3
ds2016-124	CCAACCAT	TGTGTTGGGTGTGTTTGG	Lib4_phiX.R1.fastq	Lib4_phiX.R2.fastq	lib4
ds2016-56	CGATAGAG	TGTGTTGGGTGTGTTTGG	Lib4_phiX.R1.fastq	Lib4_phiX.R2.fastq	lib4
ds2016-125	GCTCTACC	TGTGTTGGGTGTGTTTGG	Lib5_phiX.R1.fastq	Lib5_phiX.R2.fastq	lib5
ds2016-57	GCTGCGGT	TGTGTTGGGTGTGTTTGG	Lib5_phiX.R1.fastq	Lib5_phiX.R2.fastq	lib5
ds2016-58	GGCCAGAA	TGTGTTGGGTGTGTTTGG	Lib6_phiX.R1.fastq	Lib6_phiX.R2.fastq	lib6
ds2016-10	GGTACTCC	TGTGTTGGGTGTGTTTGG	Lib6_phiX.R1.fastq	Lib6_phiX.R2.fastq	lib6
ds2016-11	TCGGATGC	TGTGTTGGGTGTGTTTGG	Lib7_phiX.R1.fastq	Lib7_phiX.R2.fastq	lib7
ds2015-149	TCTATGAC	TGTGTTGGGTGTGTTTGG	Lib7_phiX.R1.fastq	Lib7_phiX.R2.fastq	lib7
ds2015-162	TTCTGGCT	TGTGTTGGGTGTGTTTGG	Lib8_phiX.R1.fastq	Lib8_phiX.R2.fastq	lib8
ds2015-170	TTGCGTCA	TGTGTTGGGTGTGTTTGG	Lib8_phiX.R1.fastq	Lib8_phiX.R2.fastq	lib8

You can add categories for each sample so that they can be used to color sequences in the trees produced by the Rps2tree module. One library can be shared by multiple samples, as shown in the example; the demultiplexing step can then tell the samples apart and separate them.
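
For readers who want to inspect a map file outside the pipeline, the following sketch reads the tab-delimited format described above and groups samples by library. It is an illustration under assumptions, not the pipeline's own parser: `read_map` is a hypothetical name, and the embedded text reuses the first three rows of the example.

```python
import csv
import io
from collections import defaultdict

# First rows of the example map.txt, tab-delimited.
MAP_TEXT = (
    "#SampleID\tmid\tcommon\tfile1\tfile2\tlibrary\n"
    "ds2016-121\tAACCGCAA\tTGTGTTGGGTGTGTTTGG\tLib1_phiX.R1.fastq\tLib1_phiX.R2.fastq\tlib1\n"
    "ds2016-132\tAACTAGTA\tTGTGTTGGGTGTGTTTGG\tLib1_phiX.R1.fastq\tLib1_phiX.R2.fastq\tlib1\n"
    "ds2016-122\tAGGCGCCT\tTGTGTTGGGTGTGTTTGG\tLib2_phiX.R1.fastq\tLib2_phiX.R2.fastq\tlib2\n"
)


def read_map(handle):
    """Return one dict per sample, keyed by the header names without '#'."""
    rows = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(rows)]
    return [dict(zip(header, row)) for row in rows if row]


samples = read_map(io.StringIO(MAP_TEXT))

# Group sample IDs by library to see which samples share the same read files.
by_library = defaultdict(list)
for sample in samples:
    by_library[sample["library"]].append(sample["SampleID"])

print(dict(by_library))
# {'lib1': ['ds2016-121', 'ds2016-132'], 'lib2': ['ds2016-122']}
```

To read a real file, `io.StringIO(MAP_TEXT)` can be replaced with `open("map.txt", newline="")`.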