_images/INRA_logo.jpg _images/ubx-logo.png

Welcome to VirAnnot’s documentation!

VirAnnot was build to ease the assembly, blast search and taxonomic annotation of metagenomic multi-sample datasets. It is used in the Virologie team of UMR1332 BFP laboratory at INRA.

It was designed to identify viruses in plants but it can be used to assemble and then annotate any sequences with the NCBI taxonomy.

NR and NT must be present localy and/or on distant servers and NCBI taxonomy is loaded in SQLITE database with a provided script.

Blast step is the most time consumming step and the use of large computer cluster is clearly an advantage. Here we used two clusters :

CURTA at Bordeaux University.

GENOTOUL at Toulouse INRA datacenter.

Pipeline general scheme:

_images/dia-intro.png

Guide

Prerequisite

External databases

Perl external libraries

  • Getopt::Long
  • File::Basename
  • DBI
  • Data::Dumper
  • Bioperl
  • Color::Rgb
  • List::Util
  • Spreadsheet::WriteExcel
  • Log::Log4perl
  • DBD::SQLite
  • SQL::SplitStatement
  • Math::Round
  • String::Random
  • Bio::SearchIO:blastxml
  • Bio::SeqIO

Perl included libraries

  • Tools::Fasta
  • Tools::Fastq
  • Tools::Blast
  • Tools::Taxonomy
  • Logger::Logger

Python library

  • os
  • call
  • logging
  • random
  • string
  • argparse
  • re
  • sys
  • Bio
  • time
  • glob
  • shutil
  • yaml
  • csv
  • importlib
  • matplotlib

Install

Add tools and launchers folders to your $PATH.

export PATH=/path/to/tools:/path/to/launchers:$PATH

Add lib folder to your $PERL5LIB.

export PERL5LIB=/path/to/lib:$PERL5LIB

Database

NCBI taxonomy and the homemade per domain Pfam taxonomy are stored in a simple SQLite database.

Schema:

_images/taxo_sql.svg

NCBI Taxonomy

  • Download and extract NCBI taxonomy files.
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz ; gunzip taxdump.tar.gz; tar -xf taxdump.tar;
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz ; gunzip prot.accession2taxid.gz;
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz ; gunzip nucl_gb.accession2taxid.gz;
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz ; gunzip dead_prot.accession2taxid.gz;
cat prot.accession2taxid dead_prot.accession2taxid > acc2taxid.prot
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz ; gunzip nucl_wgs.accession2taxid.gz;
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz ; gunzip dead_wgs.accession2taxid.gz
cat nucl_wgs.accession2taxid nucl_gb.accession2taxid dead_wgs.accession2taxid > acc2taxid.nucl
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz; gunzip dead_nucl.accession2taxid.gz;
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz; gunzip gi_taxid_prot.dmp.gz;

Optionally you can combine multiple accession2taxid file with a simple cat. But keep separated nucl and prot accessions as they will be loaded in two different tables.

Launch the loadTaxonomy.pl script that will create the sqlite database. The script needs two provided sqlite files: taxonomyIndex.sql and taxonomyStructure.sql that describe the database struture. All these files are in virAnnot/db/.

./loadTaxonomy.pl -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_prot gi_taxid_prot.dmp

PFAM taxonomy

The pipeline modules rps2ecsv and rps2tree need taxonomic information of the PFAM domains to work. You need to extract these informations and load it into the sqlite database. Be carefull, the files you will download have a size of ~900Mo.

  • Download and extract Pfam FASTA files:
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/fasta.tar.gz
tar -xzf fasta.tar.gz;
mkdir pfam
mv pfam*.FASTA fasta/
cd pfam/
  • Extract taxonomic information for each sequence of each PFAM domain and store it in *.tax.txt files:
ls -1 pfam*.FASTA | sed 's,^\(.*\)\.FASTA,./gi2taxonomy.pl -i & -o \1.tax.txt -db taxonomy.tmp.sqlite -r,' | bash
  • Create a file of file for the *.tax.txt files:
listPath.pl -d . | grep 'tax.txt' > idx
  • Compute taxonomy statistic for each domain and create a sql file to load into the database:
taxo_profile_to_sql.pl -i idx > taxo_profile.sql
  • Load into the database:
sqlite3 taxonomy.tmp.sqlite < taxo_profile.sql
  • Modify path to the database by editing the following scripts:
./tools/rps2ecsv.pl:my $db = '/path/to/taxonomy.tmp.sqlite';
./tools/2krona_new.pl:my $db = '/path/to/taxonomy.tmp.sqlite';
./tools/ecsv2krona.pl:my $db = '/path/to/taxonomy.tmp.sqlite';
./tools/blast2ecsv.pl:my $db = '/path/to/taxonomy.tmp.sqlite';
./tools/rps2tree.pl:my $db = '/path/to/taxonomy.tmp.sqlite';
./tools/autoMapper.pl:  'taxonomyDatabase'  => '/path/to/taxonomy.tmp.sqlite'

NCBI Blast database

NCBI non redundant databases are very large and similarity search using Blast is an intensive task. I recommand to use those databases on computer clusters.

  • Download NCBI nr et nt Blast files.
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz

Modify the parameters.yaml to fit your configuration.

servers:
genotoul:
  adress: 'genotoul.toulouse.inra.fr'
  username: 'stheil'
  db:
    nr: '/bank/blastdb/nr'
    nt: '/bank/blastdb/nt'

Reduced databases are a good choice for limited computer ressources and drastically faster similarity search. Here are some example commands using NCBI tools to download sequences.

  • Reduced NCBI databases:

Get all viroids nucleotide sequence from genbank:

esearch -db "nucleotide" -query "txid12884[Organism]" | efetch -format fasta > viroids_nucl.fna

Get all viruses nucleotide sequences from genbank:

esearch -db "nucleotide" -query "txid10239[Organism]" | efetch -format fasta > viruses_nucl.fna

Create Blast DB example:

makeblastdb -in viruses_nucl.fna -parse_seqids -dbtype nucl
  • Download PFAM files for RPSBLAST.
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/Pfam_LE.tar.gz
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/fasta.tar.gz
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz

Here I use only PFAM domains but fasta.tar.gz and cdd.tar.gz contains files for the entire CDD database. You can either delete files that are not from PFAM database or use the complete CDD.

  • Delete file that are not from PFAM:
\ls -1 | grep -v 'pfam' | sed 's,^.*$,rm &,'

Add ‘| bash’ if correct.

  • Download entire CDD database:
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/CDD_LE.tar.gz

Parameters files

parameters.yaml

Defines paths for both local and remote binaries and databases. A template is provided in the examples directory.

ReadSoustraction:
  db:
    vitis: '/media/data/db/ncbi/vitis/vitis'
    phiX:  '/media/data/db/ncbi/phiX/phiX174'
bin:
  bowtie: '/usr/local/bin/bowtie2'
  samtools: '/usr/bin/samtools'
  bedtools: '/usr/bin/bedtools'
  prinseq: '/usr/local/bin/prinseq-lite.pl'
  merge-paired-reads: '/home/stheil/softwares/sortmerna-2.1-linux-64/scripts/merge-paired-reads.sh'
  unmerge-paired-reads: '/home/stheil/softwares/sortmerna-2.1-linux-64/scripts/unmerge-paired-reads.sh'
  sortmerna: '/home/stheil/softwares/sortmerna/sortmerna'
servers:
  enki:
    db:
      nt: '/media/data/db/ncbi/nt/nt'
      nr: '/media/data/db/ncbi/nr/nr'
      refseq_vir_nucl: '/media/data/db/ncbi/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/media/data/db/ncbi/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/save/db/pfam/pfam_viruses_rpsdb'
      all_vir_nucl: '/media/data/db/ncbi/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/media/data/db/ncbi/all_vir/all_vir_prot.faa'
  genotoul:
    adress: 'genotoul.toulouse.inra.fr'
    username: 'stheil'
    db:
      nr: '/bank/blastdb/nr'
      nt: '/bank/blastdb/nt'
      refseq_vir_nucl: '/save/stheil/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/save/stheil/db/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/save/db/pfam/pfam_viruses_rpsdb'
      all_vir_nucl: '/home/stheil/save/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/stheil/save/db/all_vir/all_vir_prot.faa'
    scratch: '/work/stheil'
    bin:
      blastx: 'blastx+'
      blastn: 'blastn+'
  genologin:
    adress: 'genologin.toulouse.inra.fr'
    username: 'mlefebvre'
    db:
      nr: '/bank/ncbi/blast/nr/current/blast/nr'
      nt: '/bank/ncbi/blast/nr/current/blast/nt'
      dmd_nr: '/bank/diamonddb/nr'
      refseq_vir_nucl: '/save/mlefebvre/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/save/mlefebvre/db/refseq_vir/viral.protein.faa'
      pfam: '/home/mlefebvre/work/pfam/Pfam'
      all_vir_nucl: '/home/mlefebvre/save/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/mlefebvre/save/db/all_vir/all_vir_prot.faa'
    scratch: '/work/mlefebvre'
    bin:
      blastx: 'blastx'
      blastn: 'blastn'
  avakas:
    adress: 'avakas.mcia.univ-bordeaux.fr'
    username: 'stheil'
    db:
      nr: '/home/stheil/db/nr/nr'
      nt: '/home/stheil/db/nt/nt'
      all_vir_nucl: '/home/stheil/scratch/db/all_vir/all_vir_nucl.fna'
      all_vir_prot: '/home/stheil/scratch/db/all_vir/all_vir_prot.faa'
      refseq_vir_nucl: '/home/stheil/scratch/db/refseq_vir/viral.genomic.fna'
      refseq_vir_prot: '/home/stheil/scratch/db/refseq_vir/viral.protein.faa'
      pfam: '/home/stheil/db/pfam/pfam_viruses_rpsdb'
    scratch: '/scratch/stheil'
    bin:
      blastx: 'blastx'
      blastn: 'blastn'
Diamond:
  db:
    all_vir_prot: /media/db/ncbi/all_vir/all_vir_prot
SortMeRna:
  db:
    silva-arc-16s-id95: /media/data/db/rRNA_databases/silva-arc-16s-id95
    silva-arc-23s-id98: /media/data/db/rRNA_databases/silva-arc-23s-id98
    silva-bac-16s-id90: /media/data/db/rRNA_databases/silva-bac-16s-id90
    silva-bac-23s-id98: /media/data/db/rRNA_databases/silva-bac-23s-id98
    silva-euk-18s-id95: /media/data/db/rRNA_databases/silva-euk-18s-id95
    silva-euk-28s-id98: /media/data/db/rRNA_databases/silva-euk-28s-id98

step.yaml

Defines the steps that the pipeline will execute. A template is provided in the /examples directory.

Step names correspond to a python module that will launch the step. Step names are split based on the ‘_’ character so you can launch multiple instance. For example you might want to launch blastx and blastn, so step names could be ‘Blast_N’ and ‘Blast_X’. What is after the underscore do not matters, it is just used to differanciate the two steps.

Special words in bracket are used as substitution string. - (file), (file1) and (file2) - (SampleID) - (library)

ReadSoustraction_phiX:
  i1: (file1)
  i2: (file2)
  db: phiX
  o1: (library)_phiX.r1.fq
  o2: (library)_phiX.r2.fq
  sge: True
  n_cpu: 5
  iter: library
Demultiplex:
  i1: (library)_phiX.r1.fq
  i2: (library)_phiX.r2.fq
  adapters: adapters.fna
  middle: 1
  min_qual: 20
  polyA: True
  min_len: 70
  iter: library
  sge: True
DemultiplexHtml:
  csv: (library)_demultiplex.stats.csv
  id: (library)
  out: stat_demultiplex
  iter: global
  sge: True
Normalization:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  o1: (SampleID)_truePairs_norm_r1.fq
  o2: (SampleID)_truePairs_norm_r2.fq
  num: 40000
  iter: sample
  n_cpu: 5
  sge: True
drVM:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  n_cpu: 20
  identity: 70
  min_len: 300
  sge: True
Assembly_idba:
  prog: idba
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_idba.scaffold.fa
  sge: True
Assembly_spades:
  prog: spades
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_spades.scaffold.fa
  sge: True
Map_idba:
  contigs: (SampleID)_idba.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_idba.scaffold.bam
  rn: (SampleID)_idba.scaffold.rn
  sge: True
  n_cpu: 16
Map_spades:
  contigs: (SampleID)_spades.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_spades.scaffold.bam
  rn: (SampleID)_spades.scaffold.rn
  sge: True
  n_cpu: 16
Diamond:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  n_cpu: 10
  sge: True
  score: 50
  evalue: 0.0001
  qov: 50
  hov: 5
  db: all_vir_prot
Diamond_singletons_nr:
  contigs: (SampleID)_idba.scaffold.fa
  db: nr
  ising: (SampleID)_singletons.fq
  n_cpu: 10
  sge: True
  out: (SampleID)_singletons_test.nr.dmdx.xml
  evalue: 0.001
  iter: sample
  score: 10
  qov: 10
Diamond2blast:
  i: (SampleID)_idba.scaffold.dmdx.nr.csv
  contigs: (SampleID)_idba.scaffold.dmdx2bltx.fa
  out: (SampleID)_idba.scaffold.dmdx2bltx.nr.xml
  type: blastx
  db: nr
  evalue: 0.0001
  server: genologin
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_allvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: all_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  sge: True
  num_chunk: 1000
  tc: 50
Blast_nr:
  type: blastx
  contigs: (SampleID)_idba.scaffold.fa
  db: nr
  out: (SampleID)_idba.scaffold.bltx.nr.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_refvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: refseq_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  sge: True
Blast_singleton_nr:
  type: blastx
  contigs: (SampleID)_singletons.fa
  db: nr
  out: (SampleID)_singletons.bltx.nr.xml
  evalue: 0.0001
  server: genologin
  n_cpu: 8
  tc: 10
  num_chunk: 1000
  sge: True
Blast_RPS:
  type: rpstblastn
  contigs: (SampleID)_idba.scaffold.fa
  db: pfam
  evalue: 0.0001
  out: (SampleID)_idba.scaffold.rps.pfam.xml
  server: genotoul
  n_cpu: 8
  sge: True
Blast2ecsv_allvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  vs: True
  out: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  sge: True
  type: TBLASTX
  score: 50
  qov: 20
Blast2ecsv_refvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.0001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  vs: True
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  sge: True
  type: TBLASTX
  score: 50
  qov: 50
  hov: 5
Blast2ecsv_nr:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_idba.scaffold.bltx.nr.xml
  vs: True
  out: (SampleID)_idba.scaffold.bltx.nr.csv
  sge: True
  type: BLASTX
  score: 50
  qov: 5
  hov: 5
Blast2ecsv_dmd:
  evalue: 0.01
  fhit: True
  pm: global
  if: xml
  r: True
  b: (SampleID)_dmd.xml
  out: (SampleID)_dmd.allVirProt.csv
  sge: True
  type: BLASTX
  pd: True
Blast2ecsv_dmdx_singletons_nr:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global
  if: xml
  rn: (SampleID)_idba.scaffold.rn
  r: True
  b: (SampleID)_singletons.nr.dmdx.xml
  vs: True
  out: (SampleID)_singletons_test.nr.dmdx.csv
  sge: True
  type: DIAMONDX
  pd: True
Rps2ecsv:
  b: (SampleID)_idba.scaffold.rps.pfam.xml
  out: (SampleID)_idba.scaffold.rps.pfam.csv
  evalue: 0.0001
  sge: True
Ecsv2excel:
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  r:  (SampleID)_idba.scaffold.rps.pfam.csv
  out:  (SampleID)_idba.scaffold.xlsx
  sge: True
Ecsv2compare:
  b1: (SampleID)_idba.scaffold.bltx.nr.csv
  r:  (SampleID)_idba.scaffold.rps.pfam.csv
  out:  (SampleID)_idba.scaffold.comparison.xlsx
  sge: True
Blast2hist:
  id1: (SampleID)_refseq_tbltx
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  id2: (SampleID)_allvir_tbltx
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  id3: (SampleID)_nr_bltx
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  id4: (SampleID)_dmd
  b4: (SampleID)_dmd.allVirProt.csv
  iter: global
  sge: True
  out: blast_hist
Ecsv2krona:
  id1: (SampleID)_refseq_tbltx
  b1: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  x1: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  id2: (SampleID)_allvir_tbltx
  b2: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  x2: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  id3: (SampleID)_nr_bltx
  b3: (SampleID)_idba.scaffold.bltx.nr.csv
  x3: (SampleID)_idba.scaffold.bltx.nr.xml
  outdir: krona_blast
  out: blast.global.krona.html
  data: both
  r: True
  c: identity
  iter: global
  sge: True
Ecsv2krona_dmd:
  id1: (SampleID)
  b1: (SampleID)_dmd.allVirProt.csv
  outdir: krona_diamond
  out: global_krona_dmd.html
  data: contig
  r: True
  c: identity
  iter: global
  sge: True
Automapper_nr:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.bltx.nr.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_nr
  sge: True
  ref: nt
Automapper_allvirTX:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.tbltx.all_vir.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_allvir
  sge: True
  ref: all_vir_nucl
Automapper_refseqTX:
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_autoMapper_refseq
  sge: True
  ref: refseq_vir_nucl
Rps2tree:
  pfam: (SampleID)_idba.scaffold.rps.pfam.csv
  contigs: (SampleID)_idba.scaffold.fa
  ecsv: (SampleID)_idba.scaffold.bltx.nr.csv
  id: (SampleID)
  out: rps2tree_global
  min_prot: 100
  viral_portion: 0.3
  perc: 90
  iter: global
  sge: True
Getresults:
  global_dir1: rps2tree_global
  global_dir2: krona_blast
  global_dir3: krona_diamond
  global_dir4: blast_hist
  global_dir5: stat_demultiplex
  sample_dir1: (SampleID)_autoMapper_nr
  sample_dir2: (SampleID)_autoMapper_refseq
  sample_dir3r: (SampleID)_autoMapper_allvir
  sample_file1: (SampleID)_idba.scaffold.xlsx
  sample_file2: (SampleID)_idba.scaffold.fa
  sample_file3: (SampleID)_spades.scaffold.fa
  sample_file4: (SampleID)_truePairs_r1.fq
  sample_file5: (SampleID)_truePairs_r2.fq
  out: results

map.txt

The map file describe the experiment. It is a tabulated file with the first line containing headers starting with ‘#’. It must contain at least two column: SampleID and file. A template is provided in the examples directory. This is a minimum map.txt file:

#SampleID	mid	common	file1	file2	library
ds2016-121	AACCGCAA	TGTGTTGGGTGTGTTTGG	Lib1_phiX.R1.fastq	Lib1_phiX.R2.fastq	lib1
ds2016-132	AACTAGTA	TGTGTTGGGTGTGTTTGG	Lib1_phiX.R1.fastq	Lib1_phiX.R2.fastq	lib1
ds2016-122	AGGCGCCT	TGTGTTGGGTGTGTTTGG	Lib2_phiX.R1.fastq	Lib2_phiX.R2.fastq	lib2
ds2016-133	ATTAGCTA	TGTGTTGGGTGTGTTTGG	Lib2_phiX.R1.fastq	Lib2_phiX.R2.fastq	lib2
ds2016-123	CAAGAGTT	TGTGTTGGGTGTGTTTGG	Lib3_phiX.R1.fastq	Lib3_phiX.R2.fastq	lib3
ds2016-55	CAAGCAGG	TGTGTTGGGTGTGTTTGG	Lib3_phiX.R1.fastq	Lib3_phiX.R2.fastq	lib3
ds2016-124	CCAACCAT	TGTGTTGGGTGTGTTTGG	Lib4_phiX.R1.fastq	Lib4_phiX.R2.fastq	lib4
ds2016-56	CGATAGAG	TGTGTTGGGTGTGTTTGG	Lib4_phiX.R1.fastq	Lib4_phiX.R2.fastq	lib4
ds2016-125	GCTCTACC	TGTGTTGGGTGTGTTTGG	Lib5_phiX.R1.fastq	Lib5_phiX.R2.fastq	lib5
ds2016-57	GCTGCGGT	TGTGTTGGGTGTGTTTGG	Lib5_phiX.R1.fastq	Lib5_phiX.R2.fastq	lib5
ds2016-58	GGCCAGAA	TGTGTTGGGTGTGTTTGG	Lib6_phiX.R1.fastq	Lib6_phiX.R2.fastq	lib6
ds2016-10	GGTACTCC	TGTGTTGGGTGTGTTTGG	Lib6_phiX.R1.fastq	Lib6_phiX.R2.fastq	lib6
ds2016-11	TCGGATGC	TGTGTTGGGTGTGTTTGG	Lib7_phiX.R1.fastq	Lib7_phiX.R2.fastq	lib7
ds2015-149	TCTATGAC	TGTGTTGGGTGTGTTTGG	Lib7_phiX.R1.fastq	Lib7_phiX.R2.fastq	lib7
ds2015-162	TTCTGGCT	TGTGTTGGGTGTGTTTGG	Lib8_phiX.R1.fastq	Lib8_phiX.R2.fastq	lib8
ds2015-170	TTGCGTCA	TGTGTTGGGTGTGTTTGG	Lib8_phiX.R1.fastq	Lib8_phiX.R2.fastq	lib8

You can add categories for each sample so they can be used when coloring sequences in trees from the Rps2tree module. One library can be attributed to multiple samples, as shown in the example. Thus the demultiplexing step will be able to differentiate each sample and separate them.

Modules description

ReadSoustraction

ReadSoustraction module use bowtie2 to map reads against a reference. The output is directly piped to samtools for bam convertion and again piped to samtools to select unmapped paired reads (-f 12 -F 256). The pipe continues with bamtofastq tool to create two fastq files (r1 and r2).

def _create_cmd (self):
  cmd = ''
  cmd += self.params['bin']['bowtie'] + ' -p ' + str(self.n_cpu)
  cmd += ' -x ' + self.db
  cmd += ' -1 ' + self.i1 + ' -2 ' + self.i2 + ' | '
  cmd += self.params['bin']['samtools'] + ' view -bS - '
  cmd += ' | ' + self.params['bin']['samtools'] + ' view -u -f 12 -F 256 - | ' + self.params['bin']['bedtools'] + ' bamtofastq -i - -fq ' + self.o1 + ' -fq2 ' + self.o2
  log.debug(cmd)
  return cmd

Options

  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • db: Bowtie database.
  • o1: R1 output fastq file.
  • o2: R2 output fastq file.
  • sge: [BOOL]
  • n_cpu: [INT] Number of CPU to use.
  • iter: [SampleID]

Demultiplex

_images/demultiplex_0.png

Warning: This demultiplex procedure is specifc to our sequencing methods.

The demultiplex.pl script uses cutadapt to demultiplex and trim step by step sequences used by the sequencing technology and the dsRNA extraction protocol.

Our extraction and sequencing protocol:

Informations about indexes, common sequences, and sample ids are stored in the map.txt file which must be tab delimited file. The script produce lots of temporary file (*_step_*) that can be deleted at the end of the execution. Each step produce 3 type of files:

.info:
the matching information.
.log:
the execution information.
.out:
the fastq file.

Options

  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • adapters: Fasta file of adapters.
  • middle: Check for MIDs in middle of the reads and 1 : trim the reads or 2: exclude the read.
  • min_qual: Trim the read if the quality is below this threshold.
  • polyA: Trim poly-A tail.
  • min_len: Exclude reads under this size threshold.
  • iter: [SampleID]

Step 02: 5’ common sequence

_images/demultiplex_2.png
_launchCutAdapt($self,$files->{1},$self->{_common},$tmp_file_prefix . "_step.02_R1",'k','k','-g','0.1',scalar(keys(%{$self->{_common}})),'0.7');
  • 'k','k': Keep reads that contains or not the common part.
  • '-g': search in 5’ part.
  • '0.1': 10% of sequencing errors.
  • scalar(keys(%{$self->{_common}})): will search as many common part as provided.
  • '0.7': 70% of the common part length to be considered as a match.

Step 03: 5’ common sequence fragments

_images/demultiplex_3.png
_launchCutAdapt($self,$files->{1},$self->{_common},$tmp_file_prefix . "_step.021_R1",'k','k','-g','0.2',scalar(keys(%{$self->{_common}})),'0.5');
  • 'k','k': Keep reads that contains or not the common part.
  • '-g': search in 5’ part.
  • '0.2': 20% of sequencing errors.
  • scalar(keys(%{$self->{_common}})): will search as many common part as provided.
  • '0.5': 50% of the common part length to be considered as a match.

Step 04: Trimming sequencing adapters

_images/demultiplex_4.png
_launchCutAdapt($self,$files->{1},$self->{illuminaAdapter},$tmp_file_prefix . "_step.03_R1",'k','k','-b','0.2',scalar(keys(%{$self->{illuminaAdapter}})),'0.6');
  • 'k','k': Keep reads that contains or not the common part.
  • '-b': search adapters anywhere in the read.
  • '0.2': 20% of sequencing errors.
  • scalar(keys(%{$self->{illuminaAdapter}})): will search as many adapters as provided.
  • '0.6': 60% of the adapters length to be considered as a match.

Step 05: Search for hybrid reads

_images/demultiplex_5.png

This step is really specific to our extraction method since very short DNA fragment can be link together during the aspecific adapters ligation step of the Illumina kits. This creating reads composed of two different PCR product. Thus our program search for index sequence in the middle of the read and trim it to keep the 5’ part or exlude the read. The research is done both on provided indexes sequences and reverse complement of thoses sequences.

-middle [1|2] Search for common tag in the middle of the read. 1: trim the read. 2: exclude the read.

_launchCutAdapt($self,$files->{1},$h,$tmp_file_prefix . "_step.04_R1",'k','k','-b','0.1','1','0.5');
  • 'k','k': Keep reads that contains or not the index , or 'k','d' if the -middle option is provided.
  • '-b': search adapters anywhere in the read.
  • '0.1': 20% of sequencing errors.
  • '1': search for one index in each read.
  • '0.5': 50% of the adapters length to be considered as a match.

Step 06: Search for polyA (optional)

_images/demultiplex_6.png

In Illumina technology, if the sequencing matrix is too short compared to the sequencing length, the sequencing machine adds a bunch of A’s and then radom sequence.

_launchCutAdapt($self,$files->{1}, $h, $tmp_file_prefix . "_step.05_R1",'k','k','-a','0','1','0.8');
  • 'k','k': Keep reads that contains or not the index , or 'k','d' if the -middle option is provided.
  • '-a': search in the 3’ end.
  • '0': no sequencing errors.
  • '1': search for one index in each read.
  • '0.8': 80% of the polyA length to be considered as a match.

Assembly

This module can launch three assemblers, IDBA, MetaSpades and Newbler for single-end data.

Foreach assembler, the module convert reads files to the proper format, launch the assembly in a separate directory, rename scaffolds identifier and move results file to the sample root directory.

Options

Map

This module uses bowtie2, samtools and readPerContig.pl script to map reads back on the assembly and count for each scaffold the number of reads aligned resulting a simple two column file scaffoldID and nb_reads used by other modules.

Options

  • contigs: fasta file of contigs to map reads on. [mandatory]
  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • ising: singletons fastq file
  • n_cpu: [INT] number of CPU to use.
  • sge: [BOOL] use SGE scheduler.
  • bam: BAM file name.
  • rn: output file name.

Normalization

This module randomly select NUM reads from paired-files.

Options

  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • o1: Output R1 normalized file. [mandatory]
  • o2: Output R2 normalized file. [mandatory]
  • num: [INT] Number of reads to randomly select. [mandatory]
  • iter: Iteration on [sample, library].
  • n_cpu: [INT] number of CPU to use.
  • sge: [BOOL] use SGE scheduler.

Diamond

This module launches Diamond similarity search on reads and produce an XML file simalar to what Blast does so it can be treated by the Blast2ecsv module and so on.

Options

  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • db: Values are defined in the parameters.yaml file. [mandatory]
  • ising: singletons fastq file
  • n_cpu: [INT] number of CPU to use.
  • sge: [BOOL] use SGE scheduler.
  • sensitive: [BOOL]
  • more_sensitive: [BOOL]
  • out: XML output file
  • score: Report matches above this score.
  • max_target_seqs: Maximum match per query sequences.
  • evalue: Min e-value.
  • identity: Report matches above this identity percent. 0 > X > 100.
  • qov: Query overlap.
  • hov: Hit overlap.

Diamond2Blast

Options

  • i: CSV file with DIAMOND results. [mandatory]
  • contigs: Fasta file. [mandatory]
  • out: XML output file.
  • type: Blast type. [‘tblastx’,’blastx’,’blastn’,’blastp’,’rpstblastn’]. [mandatory]
  • db: Values are defined in the parameters.yaml file. [mandatory]
  • evalue: Min e-value.
  • server: [‘enki’,’genologin’,’avakas’] Values are defined in the parameters.yaml file.
  • n_cpu: [INT] number of CPU to use.
  • tc: Number of task launched at the same time on SGE.
  • num_chunk: Number of chunks to split the original fasta file for parallel execution.
  • max_target_seqs: Maximum match per query sequences.
  • sge: [BOOL] use SGE scheduler.

Blast

This module launches all type of Blast on local machine or distant servers. This module has been developped for our own local machines and servers, but it can be easly modified to fit your needs.

This module mainly depends on the parameters.yaml file and the blast_launch.py script which has to be present on the server you want to use and modified to fit your server configuration.

Options

  • contigs: Fasta file. [mandatory]
  • db: Values are defined in the parameters.yaml file. [mandatory]
  • type: Blast type. [‘tblastx’,’blastx’,’blastn’,’blastp’,’rpstblastn’]. [mandatory]
  • n_cpu: [INT] number of CPU to use.
  • tc: Number of task launched at the same time on SGE. (Experimental, works on Genotoul)
  • max_target_seqs: Maximum match per query sequences.
  • num_chunk: Number of chunks to split the original fasta file for parallel execution.
  • out: Output file name.
  • server: [‘enki’,’genologin’,’avakas’, ‘curta’] Values are defined in the parameters.yaml file.
  • sge: [BOOL] use SGE scheduler.

This module is able to launch Blast instance on distant servers if the database and the blast_launch.py script is present on the server. Then you have to edit the parameters.yaml file to fit your configuration. The script has been developped to use two computer cluster, Avakas (PBS + Torque) and Genotoul (SGE) but each cluster has its own configuration so you may have to modify this script to adapt it to your configuration.

Blast2ecsv

This module parse Blast xml outputs, filter matches on different criteria and link Accession number to NCBI taxonomy.

Options

  • b: Blast file.
  • if: Input format [‘xml’,’m8’]
  • out: Output file name.
  • evalue: Min e-value.
  • fhit: Only report first hit.
  • fhsp: Only report first hsp.
  • pm:
  • r: Reduced taxonomy. Report only 5 consistent rank.
  • vs: Only report sequences when match is virus or viroids.
  • rn: Read number. File created by the Map module.
  • type: Blast type. [‘tblastx’,’blastx’,’blastn’,’blastp’,’rpstblastn’]
  • score: Report matches above this score.
  • identity: Report matches above this identity percent. 0 > X > 100.
  • qov: Query overlap.
  • hov: Hit overlap.
  • pd: Parse description. Useful when the query ID is stored in the dscription field in the XML file.
  • sge: [BOOL] use SGE scheduler.

Ecsv2excel

This module aggregates multiple ecsv file to create a colored XLSX file. It launches the ecsv2krona.pl script.

Options

  • b[INT]: CSV Blast file from 1 to 10.
  • out: Outpuy file name.
  • r: RPSBLAST csv file.
  • sge: [BOOL] use SGE scheduler.

Ecsv2krona

This module launch the ecsv2krona.pl script. It will aggregate multiple ecsv file into one Krona html file.

Options

  • b: [INT] CSV Blast file.
  • id: [INT] ID wanted corresponding to the Blast file.
  • x: [INT] XML Blast file. If used, this file will be split by species and link in the Krona file.
  • out: Output file name.
  • data: [‘both’,’reads’,’contigs’,’none’]
  • r: Use reduced taxonomy.
  • c: [‘identity’,’taxid’,’none’]
  • iter: [‘global’]
  • sge: [BOOL] use SGE scheduler.

Rps2ecsv

This module launch rps2ecsv.pl script for each sample. This module parse XML files from rpsblast and create csv file as a result.

Options

  • b: RPSBLAST XML file.
  • contigs: Fasta file.
  • sge: [BOOL] use SGE scheduler.
  • out: Output file name.
  • evalue: e-value threshold.

Rps2tree

This module launch rps2tree.pl script for all sample provided. This module generates Operational Taxonomic Unit (OTU) based on RPS-Blast results. For each CDD motifs, contigs are clustered together based on matrix distance. The tree is generated thanks to ete3 toolkit.

Options

  • pfam: CSV file from Rps2ecsv.
  • contigs: Fasta file.
  • ecsv: CSV file from Blast2ecsv.
  • out: Output file name.
  • sge: [BOOL]
  • viral_portion: Minimum percentage of viral sequence in a domain to be selected.
  • min_prot: Minimum protein length to be included in a tree.
  • perc: Percentage of identity. Threshold set to define OTU.
  • iter: [‘global’]

Rps2merge

This module launches rps2merge script. It generates a summary file, merging OTU results, blast results and rps results. For each OTU, if multiple contigs corresponding, one is randomly selected.

Options

  • pfam: CSV file from Rps2ecsv.
  • blastx: CSV file from Blast2ecsv.
  • rps_folder: Name of output folder of Rps2tree.
  • id: Sample or library ID
  • out: CSV output file
  • iter: [‘global’]
  • sge: [BOOL]

AutoMapper

This module launch autoMapper.pl script for every sample.

Options

  • contigs: Fasta file.
  • ecsv: CSV file from Blast2ecsv.
  • i1: R1 fastq file. [mandatory]
  • i2: R2 fastq file. [mandatory]
  • out: Output folder.
  • sge: [BOOL]
  • ref: Blast database name.

Blast2hits

This takes multiple CSV Blast file from Blast2ecsv and draw histograms for by taxonomy.

Options

  • b[INT] CSV Blast file from Blast2ecsv
  • id[INT] ID associated with the Blast file.
  • iter [global]
  • sge [BOOL]
  • out Output file name.

Example execution

Create a directory for your experiment:

mkdir test_virAnnot
cd test_virAnnot

Copy example read files, Illumina adapters fasta file, the sample id mapping file, the step and parameter file:

cp /path/to/virAnnot/examples/reads/*.fq .
cp /path/to/virAnnot/examples/adapters.fa .
cp /path/to/virAnnot/examples/map.txt .
cp /path/to/virAnnot/examples/step.yaml .
cp /path/to/virAnnot/examples/parameters.yaml .

You have to modify this file to fit your configuration.

This example contains all modules and options available and must be used as a template for your own analysis.

Step ReadSoustraction

This module uses bowtie2 to map reads against nucleotide sequence and Samtools to extract unmapped pairs.

Corresponding step.yaml section:

ReadSoustraction_phiX:
  i1: (file1)
  i2: (file2)
  db: phiX
  o1: (library)_phiX.r1.fq
  o2: (library)_phiX.r2.fq
  sge: True
  n_cpu: 5
  iter: library
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n ReadSoustraction_phiX

Step Demultiplex

This module uses cutadapt demultiplex reads from library and also trim reads from adapters and chimeric reads. Each demultiplexing step are described in the module section. Corresponding step.yaml section:

Demultiplex:
  i1: (library)_phiX.r1.fq
  i2: (library)_phiX.r2.fq
  adapters: adapters.fna
  middle: 1
  min_qual: 20
  polyA: True
  min_len: 70
  iter: library
  sge: True
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Demultiplex

Step DemultiplexHtml

This module gather all *_demultiplex.stats.csv files and create and html report. Corresponding step.yaml section:

DemultiplexHtml:
  csv: (library)_demultiplex.stats.csv
  id: (library)
  out: stat_demultiplex
  iter: global
  sge: True
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n DemultiplexHtml

Output example:

Step Diamond

This module launch Diamond similarity search for reads and produce an XML file per sample.

  n_cpu: 16
Map_spades:
  contigs: (SampleID)_spades.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_spades.scaffold.bam
  rn: (SampleID)_spades.scaffold.rn
  sge: True
  n_cpu: 16
Diamond:
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Diamond

Step Assembly

This module simply launch udba_ud, newbler and metaspades assemblers in each sample folder, rename scaffolds id and move the resulting fasta file.

  n_cpu: 5
  sge: True
drVM:
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  n_cpu: 20
  identity: 70
  min_len: 300
  sge: True
Assembly_idba:
  prog: idba
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq

Test both idba and spades:

virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Assembly_idba
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Assembly_spades

Example of idba assembly:

>ds2015-149_0
GTGTAAGGTGGTGAAGG...
>ds2015-149_1
CCTGCGAATTGGGCCAA...

Step Map

This module uses bowtie2 to map reads back on assemblies and samtools will count reads per contig.

Step file:

  out: (SampleID)_idba.scaffold.fa
  sge: True
Assembly_spades:
  prog: spades
  n_cpu: 5
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  out: (SampleID)_spades.scaffold.fa
  sge: True
Map_idba:
  contigs: (SampleID)_idba.scaffold.fa
  i1: (SampleID)_truePairs_r1.fq
  i2: (SampleID)_truePairs_r2.fq
  bam: (SampleID)_idba.scaffold.bam
  rn: (SampleID)_idba.scaffold.rn
  sge: True

Command:

virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Map_idba
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Map_spades

Output a two column tabular file, column 1 sequence ID, column 2 number of reads. Example of .rn file produce:

ds2015-149_0    1179
ds2015-149_1    444
ds2015-149_10   26
ds2015-149_11   44
ds2015-149_12   14
ds2015-149_13   4
ds2015-149_14   6

Step Blast

This module is able to launch Blast(s) against provided databases localy or remotely. The script blast_launch.py must be present on distant servers and parameter.yaml modified to fit your servers.

Step file:

  i: (SampleID)_idba.scaffold.dmdx.nr.csv
  contigs: (SampleID)_idba.scaffold.dmdx2bltx.fa
  out: (SampleID)_idba.scaffold.dmdx2bltx.nr.xml
  type: blastx
  db: nr
  evalue: 0.0001
  server: genologin
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_allvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: all_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.all_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  sge: True
  num_chunk: 1000
  tc: 50
Blast_nr:
  type: blastx
  contigs: (SampleID)_idba.scaffold.fa
  db: nr
  out: (SampleID)_idba.scaffold.bltx.nr.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8
  tc: 50
  num_chunk: 1000
  max_target_seqs: 1
  sge: True
Blast_refvirTX:
  type: tblastx
  contigs: (SampleID)_idba.scaffold.fa
  db: refseq_vir_nucl
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.xml
  evalue: 0.0001
  server: genotoul
  n_cpu: 8

Commands:

virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast_nr
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast_refvirTX
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast_allvirTX
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast_RPS

Step Blast2ecsv

This module uses the XML file generated by the corresponding Blast module and the taxonomy contained in the SQLITE database to create a csv file containing match options, taxonomy string and sequences.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast2ecsv_nr
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast2ecsv_refvirTX
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Blast2ecsv_allvirTX

Example output of ds2015-149_idba.scaffold.tbltx.all_vir.csv:

#algo   query_id        nb_reads        query_length    accession       description     organism        percentIdentity nb_hsps queryOverlap    hitOverlap      evalue  score   tax_id  taxonomy        sequence
"TBLASTX"       "ds2015-149_52" "6"     "117"   "KX274275.1"    "Grapevine rupestris stem pitting associated virus isolate SK704 B, complete genome"    "Grapevine rupestris stem pitting-associated virus"     "95.8"  "3"     "100"   "3"     "7.55823333338424e-05"  "222.2257"      "196400"        "Viruses;ssRNA viruses;Betaflexiviridae;Foveavirus;Grapevine rupestris stem pitting-associated virus"   "GAACACTATGAACGACAACTGGAAATCTGAGCACGCTATAAACACCCACAAACTCAAACGTAGACAAAGCTTTAACTAAGTTATTCATAATAATCACACCATGCCAAACACTTAAGG"

Step Rps2ecsv

This module uses the rpstblastn XML file and the PFAM taxonomy to annotate query sequences and produce a readable CSV file.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Rps2ecsv
  vs: True
  out: (SampleID)_idba.scaffold.tbltx.refseq_vir.csv
  sge: True
  type: TBLASTX
  score: 50

Example output of ds2015-149_idba.scaffold.rps.pfam.csv:

#query_id       query_length    cdd_id  hit_id  evalue  startQ  endQ    frame   description     superkingdom    no rank family  genus
"ds2015-149_0"  "1428"  "pfam01443"     "gnl|CDD|279750"        "1.33194e-06"   "29"    "223"   "2"     "pfam01443, Viral_helicase1, Viral (Superfamily 1) RNA helicase.  Helicase activity for this family has been demonstrated and NTPase activity. This helicase has multiple roles at different stages of viral RNA replication, as dissected by mutational analysis."     "Viruses(1.00);"        "ssRNA viruses(0.99);unclassified viruses(0.01);"       "Alphaflexiviridae(0.36);Virgaviridae(0.24);Betaflexiviridae(0.15);Tymoviridae(0.10);Bromoviridae(0.07);"       "Potexvirus(0.26);Allexivirus(0.10);Carlavirus(0.08);Tymovirus(0.08);Tobamovirus(0.08);"

Step Ecsv2excel

This module takes all csv files and create a compile them in a single Excel file.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Ecsv2excel
  qov: 50
  hov: 5
Blast2ecsv_nr:
  contigs: (SampleID)_idba.scaffold.fa
  evalue: 0.001
  fhit: True
  pm: global

Example output of ds2015-149_idba.scaffold.xlsx:

_images/ecsv2excel.png

Step Ecsv2krona

This module uses CSV files from Blast2ecsv module to create Krona html file.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Ecsv2krona
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Ecsv2krona_dmd

Example output of Krona:

Step Automapper

This module uses Blast CSV annotation file to select reference sequences, map query sequences and produce png of identity plot and alignment file in fasta format.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Automapper_nr
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Automapper_allvirTX
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Automapper_refseqTX

Example output of ds2015-149/ds2015-149_autoMapper_nr:

Step Rps2tree

This module use Rps2ecsv results of all sample to cut and group sequences based on identified domains and create OTUs, identity matrix, tree nexus files and png for each domains colored by SampleID.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Rps2tree

Step Getresults

This module simply copy important results file to a result directory.
virAnnot.py -m map.txt -s step.yaml -p parameters.yaml -n Getresults