Motivation: An important step of "metagenomics" analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines employ a single-genome assembler with carefully optimized parameters and post-process the resulting scaffolds to correct assembly errors. Limitations of the use of a single-genome assembler for de novo metagenome assembly are that highly conserved sequences shared between different species often causes chimera contigs, and sequences of highly abundant species are likely mis-identified as repeats in a single genome.
Methods: We modified and extended a single-genome and de Bruijn-graph based assembler, Velvet, for de novo metagenome assembly. Our fundamental ideas are first decomposing de Bruijn graph constructed from mixed short reads into individual sub-graphs and second building scaffolds based on every decomposed de Bruijn sub-graph as isolate species genome.
Longer N50 size: On artificially generated simulation datasets of short sequence reads, it is proven that MetaVelvet assembled metegenomic read data with significantly longer N50 sizes than single-genome assemblers (Velvet and SOAPdenovo) and another meta-genome assembler (MetaIDBA) as shown in the table below.
Improved gene prediction: We applied Velvet and MetaVelvet to real metagenomic dataset (SRR041654 and SRR041655; human gut microbiome data downloaded from NCBI sequence read archive). Then, MetaGeneAnnotator was used to predict a gene set of the microbiome. From the results, it is demonstrated that MetaVelvet can efficiently avoid chimeric scaffolds, resulting in longer scaffolds, increased number of predicted genes, and decreased fraction of partial genes.
Widely applicable to matagenomic analyses: We also compared the results of metagenomic analyses, such as taxonomic content analysis, functional composition analysis, and pathway mapping analysis, between Velvet and MetaVelvet. The results demonstrate that taxonomic content can be estimated more clearly when based on the MetaVelvet scaffolds than when based on the Velvet scaffolds as shown in the figure below left (the number of genes that cannot be assigned to phylum-level taxonomy is substantially decreased). Moreover, the result of functional composition analysis also shows that MetaVelvet is more appropriate for metagenomics purpose than Velvet as shown in the figure below right (the number of genes assigned to any functional category is improved).
The following source codes is freely available under GNU General Public License.
Source codes at these versions can be downloaded from here.
~$ tar zxvf MataVelvet-v1.1.01.tgz
~$ git clone git://github.com/hacchy/MetaVelvet.git
~$ cd MetaVelvet-v1.1.01
~/MetaVelvet-v1.1.01$ make ['MAXKMERLENGTH = k'] ['CATEGORIES = cat']k - the maximum k-mer size you like to use.
~/MetaVelvet-v1.1.01$ cp meta-velvetg /usr/bin/
MetaVelvet (>= 1.1.01) uses velveth and velvetg outputs for metagenomics assembly. Three files ("Sequences", "Roadmaps", and "Graph2") are needed for running meta-velvetg.
~$ tar zxvf HMP.small.tar.gz
~$ velveth out-dir 51 \Please check that "out-dir/Sequences" and "out-dir/Roadmaps" files are created. Note that k-mer size has important effect on assembly results. If your short read data is longer than 65bp, we recommend k-mer longer than 51.
-fasta -shortPaired \
HMP.small/SRR041654_shuffled.fasta \
HMP.small/SRR041655_shuffled.fasta
~$ velvetg out-dir -exp_cov auto -ins_length 260Please check that "out-dir/Graph2" file is created. Please use -exp_cov auto option, otherwise meta-velvetg cannot be executed.
~$ meta-velvetg out-dir [-ins_length 260] | tee logfilePlease check that "out-dir/meta-velvetg.contigs.fa" file is created. The FASTA file is the main assembling results.
Contrary to single-genome assemblers (e.g., Velvet and SOAPdenovo), MetaVelvet assumes metagenomics settings. Accordingly, k-mer coverage histogram might be multi-modal rather than uni-modal. Please note that the coverage peak parameters can largely affect MetaVelvet assembling results. Although simple and automatic peak detection algorithm is implemented in meta-velvetg, we strongly recommend manual inspection of k-mer coverage peaks and manual setting of the coverage peak parameters. For manually setting the coverage peak parameters, please execute the following procedures.
~$ RPlease determine the coverage peak values from the histogram.
(R) > install.packages("plotrix")
(R) > library(plotrix)
(R) > data = read.table("out-dir/meta-velvetg.Graph2-stats.txt", header=TRUE)
(R) > weighted.hist(data$short1_cov, data$lgth, breaks=seq(0, 200, by=1))
~$ scriptEstimateCovMulti.py out-dir/stats_EstimateCovMulti.txtWhen applying the script to the above histogram, six peaks (214x, 122x, 70x, 43x, 25x, and 13.5x) are detected by the script. 7th-peak around 6x is removed because such low coverage peak can be caused by sequencing errors.
~$ meta-velvetg out-dir -ins_length 260 \Please note that meta-velvetg assumes that the peak values are sorted in a descending order.
-exp_covs 214_122_70_43_25_13.5
Bambus2 is a scaffolding module that can be applicable to metagenomics settings. MetaVelvet uses a novel graph splitting algorithm during contiging process, and uses the scaffolding module of Velvet (RockBand and Pebble) during scaffolding process. Alternatively, users can output MetaVelvet contigs and uses Bambus2 scaffolding module instead of using RockBand and Pebble.
To install Bambus2 programs, please see their web site http://www.cbcb.umd.edu/software/bambus/.
~$ meta-velvetg out-dir -ins_length 260 \Please check that "out-dir/meta-velvetg.asm.afg" file is created. The file is meta-velvetg contiging result in an AMOS format.
-exp_covs 214_122_70_43_25_13.5 \
-amos_file yes -scaffolding no
~$ bank-transact -m out-dir/meta-velvetg.asm.afg -b bank-dir -cf
~$ clk -b bank-dir 2> /dev/null
~$ Bundler -b bank-dir -t M 2> /dev/null
~$ MarkRepeats -b bank-dir \
-redundancy num-pairs \
2> /dev/null \
| tee repeat-file
~$ OrientContigs -b bank-dir \
-redundancy num-pairs \
-prefix output-prefix \
-repeats repeat-file \
-all \
2> /dev/null
~$ untangle -e output-prefix.evidence.xml \
-s output-prefix.out.xml \
-o output-prefix.untangle.xml \
2> /dev/null
~$ bank2fasta -d -b bank-dir \
| tee contig-file
~$ printScaff -merge -e output-prefix.evidence.xml \Then, scaffold-file is the final scaffolding results in the FastA format.
-o output-prefix \
-s output-prefix.untangle.xml \
-l output-prefix.library \
-f scaffold-file \
2> /dev/null
In order to decompose a complecated metagenomic de Bruijn graph into simpler subgraphs,
MetaVelvet searches the following X-structure, i.e., chimeric nodes have two incoming edges and two outgoing edges.
Then, MetaVelvet split the X-structures when the following conditions are satisfied:
> weighted.hist(data$shot1_cov,data$lgth,breaks=seq(0,200,1))Solution: This warning (error) is caused by "Inf" values in the Graph2 node stats. Accordingly, by removing "Inf" values from the Graph2 stats, the error is resolved:
Warning messages:
1: In min(x, na.rm = na.rm) :
no non-missing arguments to min; returning Inf
2: In max(x, na.rm = na.rm) :
no non-missing arguments to max; returning -Inf
3: In weighted.hist(data$shot1_cov, data$lgth, breaks = seq(0, 200, :
Areas will not relate to frequencies
$ head -n 1 meta-velvetg.Graph2-stats.txt \
> meta-velvetg.Graph2-stats.rmInf.txt
$ perl -ne '{print $_ unless /Inf/;}' \
meta-velvetg.Graph2-stats.txt \
>> meta-velvetg.Graph2-stats.rmInf.txt
$ R
> library(plotrix)
> data = read.table("meta-velvetg.Graph2-stats.rmInf.txt", header=TRUE)
> weighted.hist(data$shot1_cov,data$lgth,breaks=seq(0,200,1))