MetaVelvet-SL : An extension of Velvet assembler to de novo metagenomic assembler utilizing supervised learning
Introduction
For the graph disconnection task, MetaVelvet identifies nodes shared between two subgraphs (called chimeric nodes) and disconnects the two subgraphs by splitting the shared nodes. To identify chimeric nodes, MetaVelvet uses a simple heuristic based on coverage difference and paired-end information. An important remaining problem of MetaVelvet is its low sensitivity and low accuracy in detecting chimeric nodes, which prevents it from generating longer contigs and scaffolds. We tackled this problem of detecting chimeric nodes by using supervised machine learning.
MetaVelvet-SL identified more chimeric nodes, and identified them more precisely, than MetaVelvet. It outperformed MetaVelvet and the state-of-the-art metagenomic assemblers IDBA-UD, Ray Meta and Omega in reconstructing accurate assemblies with longer N50 for a simulated dataset and for real datasets of human gut microbial short reads.
Modules
MetaVelvet-SL consists of two main modules :
- The supervised learning module
This module develops a learning model for classifying chimeric nodes. A Support Vector Machine (SVM) is used as the learning and classification model, and 94 features are extracted for each candidate chimeric node.
- The assembly module
MetaVelvet-SL lets users generate their own learning model from prior knowledge of the taxonomic profile of the target microbial community. The steps are described under "Generate a learning module" below.
The taxonomic profile can also be inferred from the sequence reads with a taxonomic profiling method such as MetaPhlAn (Segata et al., 2012). A learning model customized in this way is expected to fit the assembly of the target metagenome well. The pipeline connecting MetaPhlAn and MetaVelvet-SL is described under "Pipeline connecting MetaPhlAn and MetaVelvet-SL" below.
MetaVelvet-SL also provides a library of pre-trained classification models for several typical environments such as soil, deep sea, mud, human blood, intestine, and mouth.
Install
The MetaVelvet-SL code consists of 2 packages :
- Feature extraction package, written in Perl.
The source codes to generate a learning model, extract features and do classification can be downloaded here.
- Assembly package, you can download the source codes here.
- Decompress the tar ball.
~$ tar zxvf MetaVelvetSL-v1.0.tgz
- Change directory to the MetaVelvetSL directory.
~$ cd MetaVelvetSL-v1.0
- Compile MetaVelvetSL execution files.
~/MetaVelvetSL-v1.0$ make ['MAXKMERLENGTH = k'] ['CATEGORIES = cat']
k - the maximum k-mer size you would like to use.
cat - the maximum number of read categories (i.e., the maximum number of libraries with different insert lengths).
Then two executable files, meta-velvete and meta-velvetg, will be created.
- Copy the two executable files to /usr/bin or to a directory where you would like to install them.
~/MetaVelvetSL-v1.0$ cp meta-velvete /usr/bin/
~/MetaVelvetSL-v1.0$ cp meta-velvetg /usr/bin/
- Velvet
MetaVelvet-SL uses Velvet functions to construct a de Bruijn graph. Please use the latest version of Velvet, which you can download here.
For how to install Velvet, please read the Velvet manual here.
- DWGSIM package in the DNAA package
DWGSIM is used to generate simulated sequence reads; you can download it here.
Steps to install the DWGSIM package :
- Extract the DWGSIM package
- Download SAMTools from here and extract it
~$ tar -xjf samtools-0.1.18.tar.bz2
- Place SAMTools into the DWGSIM directory
- Rename the extracted SAMTools directory to samtools
~$ mv samtools-0.1.18 samtools
- Compile DWGSIM execution files
~/DWGSIM$ make
- LIBSVM
MetaVelvet-SL uses LIBSVM to develop a learning model. You can download it here.
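After installation, it can help to confirm that the executables used later in this document are reachable. The following is a minimal sketch, not part of the MetaVelvet-SL distribution: the function name missing_tools is hypothetical, and the sketch assumes that the Velvet and MetaVelvet-SL executables were copied to a directory on your PATH as described above. DWGSIM and the LIBSVM tools are run from their own directories in this document, so they are not checked here.

```shell
# Print the name of every listed command that cannot be found on PATH.
# The body runs in a subshell so the loop variable does not leak.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Usage (prints nothing when everything is installed):
# missing_tools velveth velvetg meta-velvete meta-velvetg perl python
```

If a name is printed, revisit the corresponding install step before building a learning model.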
The supervised learning module
MetaVelvet-SL lets users generate their own learning model from prior knowledge of the taxonomic profile of the target microbial community. The taxonomic profile can also be inferred from the sequence reads with a taxonomic profiling method such as MetaPhlAn (Segata et al., 2012).
Generate a learning module
There are 5 steps to generate a learning module :
- Generate simulated sequence reads from reference genomes.
You can use any simulator; we recommend DWGSIM, a frequently used read simulator in the DNAA package.
The main inputs are the reference genomes and their coverages.
- Generate reads for each reference genome
~$ ./dwgsim -1 [length-of-thefirst-read] -2 [length-of-thesecond-read] \
-e [error-rate-of-thefirst-read] -E [error-rate-of-thesecond-read] \
-C [coverage] [referencegenome-species-1] [output:reads-species-1]
The default insert size : (mean,std dev)=(500,50)
The default length of the first reads = 70bp
The default length of the second reads = 70bp
The default coverage = 100
- Mix the reads of multiple species
~$ cat [reads-species-1] [reads-species-2] ... [reads-species-n] > [output:reads-mix-multiple-species]
- Construct a de Bruijn graph using Velvet functions.
~$ velveth [out-dir] [kmer] -shortPaired -fastq [read file]
~$ velvetg [out-dir] -ins_length [insertlength] -read_trkg yes -exp_cov auto
- Extract candidates of chimeric nodes.
~$ meta-velvete [out-dir] -ins_length [insert-length]
- Extract features from candidates of chimeric nodes.
- Mix the reference genomes
~$ cat [reference-genome-species-1] [reference-genome-species-2] ... [reference-genome-species-n] > [output:mixed-reference-genomes-]
- Align each candidate of chimeric nodes to the mixed reference genomes. Please use the codes in the BLAST_map directory in the learning module codes
~$ perl eval.pl -i [out-dir]/meta-velvetg.subgraph__ChimeraNodeCandidates -n [projectname] \
-d [referencegenomemic] -p [projectname] -L 0
- Extract Features
~$ perl FeatureExtract.perl [out-dir]/meta-velvetg.subgraph__TitleChimeraNodeCandidates \
[mappingresult : *.long-scafs.blast] [out-dir]/Features3Class [out-dir]/ChimeraTrue [out-dir]/ChimeraNodeName
- Learning using LIBSVM.
First, change to the tools directory inside the LIBSVM directory, then do learning :
~$ python easy.py [out-dir]/Features3Class
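Step 1 above runs dwgsim once per reference genome, usually with a different coverage for each species. The following is a minimal sketch of how these command lines could be generated for a whole community. It only prints the commands and does not run them. The function names, the "reference:coverage" argument convention, the reads-[name] output prefix and the example error rate of 0.02 are all assumptions made for illustration; only the dwgsim options shown in step 1 are used.

```shell
# Print (do not run) the dwgsim command for one reference genome.
# Arguments: read length, error rate, coverage, reference FASTA, output prefix.
dwgsim_cmd() {
    printf './dwgsim -1 %s -2 %s -e %s -E %s -C %s %s %s\n' \
        "$1" "$1" "$2" "$2" "$3" "$4" "$5"
}

# Print one dwgsim command per species.
# Each argument has the hypothetical form "reference.fa:coverage".
# Read length (70) and error rate (0.02) are example values only.
simulate_community_cmds() (
    for spec in "$@"; do
        ref=${spec%%:*}                          # part before the colon
        cov=${spec##*:}                          # part after the colon
        prefix=reads-$(basename "${ref%.*}")     # e.g. reads-ecoli
        dwgsim_cmd 70 0.02 "$cov" "$ref" "$prefix"
    done
)

# Usage: review the printed commands, then run them from the DWGSIM directory.
# simulate_community_cmds genomes/ecoli.fa:100 genomes/bsub.fa:20
```

Printing the commands first makes it easy to check the coverage assigned to each species before any reads are generated. The mixing step then concatenates the read files that dwgsim writes for each prefix, as in the cat command of step 1.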
Library of pre-trained classification models
MetaVelvet-SL also provides a library of pre-trained classification models for several typical environments such as soil, deep sea, mud, human blood, intestine, and mouth. For each environment, the library contains the list of species and the model for classification (*.model and *.range). The library can be downloaded here.
The assembly module
- Construct a de Bruijn graph using Velvet functions.
~$ velveth [out-dir] [kmer] -shortPaired -fastq [read file]
~$ velvetg [out-dir] -ins_length [insertlength] -read_trkg yes -exp_cov auto
- Extract candidates of chimeric nodes.
~$ meta-velvete [out-dir] -ins_length [insert-length]
- Extract features from candidates of chimeric nodes.
- Extract Features
~$ perl FeatureExtractPredict.perl [out-dir]/meta-velvetg.subgraph__TitleChimeraNodeCandidates \
[out-dir]/Features [out-dir]/Features3Class [out-dir]/ChimeraNodeName
- Do classification using a learning model
First, change to the LIBSVM directory, then do classification :
~$ ./svm-scale -r tools/[learning].range [out-dir]/Features3Class > [out-dir]/Features3Class.scale
~$ ./svm-predict [out-dir]/Features3Class.scale tools/[learning].model [out-dir]/ClassificationPredict
- Do assembly
~$ meta-velvetg [out-dir] -ins_length [insert-length]
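Before running the assembly, it can be useful to look at what the classification step produced. svm-predict writes one predicted label per line to its output file, so the labels can be tallied with standard tools. The following is a minimal sketch; the function name count_predictions is hypothetical, and no meaning is assigned to the individual class labels here.

```shell
# Count how many candidate nodes were assigned to each predicted label.
# Input: the svm-predict output file (one label per line).
# Output: one line per label, "label<TAB>count", sorted by label.
count_predictions() {
    sort "$1" | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Usage:
# count_predictions [out-dir]/ClassificationPredict
```

The counts should add up to the number of lines in [out-dir]/Features3Class.scale, since each candidate node receives exactly one prediction.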
Pipeline connecting MetaPhlAn and MetaVelvet-SL
MetaVelvet-SL lets users generate their own learning model from prior knowledge of the taxonomic profile of the target microbial community. The taxonomic profile can be inferred from the sequence reads with MetaPhlAn (Segata et al., 2012). A classification model customized in this way is expected to suit the assembly of the target metagenome. To try the pipeline, you can use this small dataset (HMP.small).
- Get the taxonomy profile from sequence reads by using MetaPhlAn
- Get the MetaPhlAn
$ hg clone https://hg@bitbucket.org/nsegata/metaphlan
- Make a directory in the metaphlan directory for the profile results from MetaPhlAn
$ mkdir profil
- Do Metaphlan profiling
$ tar zxvf HMP.small.tar.gz --to-stdout | ./metaphlan.py --bowtie2db bowtie2db/mpa --bt2_ps very-sensitive --input_type multifasta > profil/HMPSmall.txt
Please refer to the MetaPhlAn help ($ ./metaphlan.py -h) or to the MetaPhlAn wiki for specific information about other strategies and additional MetaPhlAn options.
- Generate a learning module
After obtaining the taxonomy profile, the next steps are (i) obtain the reference genomes and (ii) generate a learning module. The steps to generate a learning module are described under "Generate a learning module" above.
- Do assembly
The last step is assembly. The steps for assembly are described under "The assembly module" above.
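To decide which reference genomes to obtain, the species-level entries of the taxonomy profile are the relevant ones. The following is a minimal sketch for listing them. It assumes that the profile has two tab-separated columns (a clade path whose levels are joined by "|" with prefixes such as s__ for species, and a relative abundance) and that comment lines start with "#". Please check this against the output of your MetaPhlAn version. The function name species_from_profile is hypothetical.

```shell
# List species names and relative abundances from a taxonomy profile.
# Assumed input: "clade path<TAB>abundance" per line (see the note above).
# Output: "species<TAB>abundance" per line, in the order of the input.
species_from_profile() {
    awk -F '\t' '
        /^#/ { next }                      # skip comment or header lines
        $1 ~ /s__/ && $1 !~ /t__/ {        # species level, not strain level
            n = split($1, parts, "|")      # last element is the species
            name = parts[n]
            sub(/^s__/, "", name)          # drop the level prefix
            print name "\t" $2
        }' "$1"
}

# Usage:
# species_from_profile profil/HMPSmall.txt
```

The listed abundances can also guide the choice of per-species coverage when generating simulated reads for the learning module.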
To use the pipeline :
- Get the required tools, put them in the pipeline directory and install them
- MetaVelvet-SL (both assembly module and learning module)
- MetaPhlAn
- DWGSIM
- Velvet
- LIBSVM
- Put the reads (input) in a directory in the pipeline directory
- Run the pipeline. The syntax is as follows
perl GenerateCommand.perl [the path of pipeline directory in your machine]
[the name of DWGSIM directory in your machine] [the name of Velvet directory in your machine]
[the name of MetaVelvet-SL directory in your machine] [the name of LIBSVM directory in your machine]
[the name of input directory] [k-mer size] [[insert size 1] ... [insert size n]]
[[-file_format] [-read_type] filename]
For example, using HMP small dataset:
perl GenerateCommand.perl /home/machine/Pipeline DWGSIM-dwgsim.0.1.10 velvet_1.2.10
MetaVelvetSLv1.0 libsvm-3.18 HMP.small 51 260
-fasta -shortPaired HMP.small/SRR041654_shuffled.fasta HMP.small/SRR041655_shuffled.fasta
The output is a shell script (CommandAll.sh); please run it on your machine. The final assembly output (scaffolds, meta-velvetg.contigsSL.fa) is found in the [input]Assembly directory (e.g., HMP.smallAssembly) inside the pipeline directory.
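Once CommandAll.sh has finished, a quick check is to locate the final FASTA file and count the sequences in it. The following is a minimal sketch based only on the naming described above; the two function names are hypothetical helpers, not part of the pipeline.

```shell
# Build the expected path of the final assembly from the pipeline
# directory and the input directory name ([input]Assembly convention).
final_assembly_path() {
    printf '%s/%sAssembly/meta-velvetg.contigsSL.fa\n' "${1%/}" "$2"
}

# Count the sequences in a FASTA file (one header line starting with ">" each).
count_sequences() {
    grep -c '^>' "$1"
}

# Usage with the HMP small example:
# fa=$(final_assembly_path /home/machine/Pipeline HMP.small)
# count_sequences "$fa"
```

A missing file at this path usually means that one of the earlier commands in CommandAll.sh failed, so its output is the first place to look.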
Supplementary
- The supplementary file can be downloaded here