MetaVelvet: a short read assember for metagenomics

Introduction

Motivation: An important step of "metagenomics" analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines employ a single-genome assembler with carefully optimized parameters and post-process the resulting scaffolds to correct assembly errors. Limitations of the use of a single-genome assembler for de novo metagenome assembly are that highly conserved sequences shared between different species often causes chimera contigs, and sequences of highly abundant species are likely mis-identified as repeats in a single genome.

Methods: We modified and extended a single-genome and de Bruijn-graph based assembler, Velvet, for de novo metagenome assembly. Our fundamental ideas are first decomposing de Bruijn graph constructed from mixed short reads into individual sub-graphs and second building scaffolds based on every decomposed de Bruijn sub-graph as isolate species genome.


What is improved?

Longer N50 size: On artificially generated simulation datasets of short sequence reads, it is proven that MetaVelvet assembled metegenomic read data with significantly longer N50 sizes than single-genome assemblers (Velvet and SOAPdenovo) and another meta-genome assembler (MetaIDBA) as shown in the table below.


Improved gene prediction: We applied Velvet and MetaVelvet to real metagenomic dataset (SRR041654 and SRR041655; human gut microbiome data downloaded from NCBI sequence read archive). Then, MetaGeneAnnotator was used to predict a gene set of the microbiome. From the results, it is demonstrated that MetaVelvet can efficiently avoid chimeric scaffolds, resulting in longer scaffolds, increased number of predicted genes, and decreased fraction of partial genes.


Widely applicable to matagenomic analyses: We also compared the results of metagenomic analyses, such as taxonomic content analysis, functional composition analysis, and pathway mapping analysis, between Velvet and MetaVelvet. The results demonstrate that taxonomic content can be estimated more clearly when based on the MetaVelvet scaffolds than when based on the Velvet scaffolds as shown in the figure below left (the number of genes that cannot be assigned to phylum-level taxonomy is substantially decreased). Moreover, the result of functional composition analysis also shows that MetaVelvet is more appropriate for metagenomics purpose than Velvet as shown in the figure below right (the number of genes assigned to any functional category is improved).



Source Code (Release Notes)

The following source codes is freely available under GNU General Public License.

Source codes at these versions can be downloaded from here.


Requirements (>= 1.1.01)


Install MetaVelvet

  1. Download source code from here, and decompress the tar ball.
    ~$ tar zxvf MataVelvet-v1.1.01.tgz

    Alternatively, up-to-date source code is available from github MetaVelvet project.
    ~$ git clone git://github.com/hacchy/MetaVelvet.git

  2. Change directory to the MetaVelvet directory.
    ~$ cd MetaVelvet-v1.1.01

  3. Compile MetaVelvet execution files.
    ~/MetaVelvet-v1.1.01$ make ['MAXKMERLENGTH = k'] ['CATEGORIES = cat']
    k - the maximum k-mer size you like to use.
    cat - the maximum number of read categories (i.e., maximum number of libraries in different insert lengths)
    Then, an executable files, meta-velvegh will be created (>= 1.1.01).

  4. Copy the two executable files to /usr/bin or a directory you like to install.
    ~/MetaVelvet-v1.1.01$ cp meta-velvetg /usr/bin/

Getting started (>= 1.1.01)

MetaVelvet (>= 1.1.01) uses velveth and velvetg outputs for metagenomics assembly. Three files ("Sequences", "Roadmaps", and "Graph2") are needed for running meta-velvetg.

  1. Download a small dataset from here, and decompress the tar ball.
    ~$ tar zxvf HMP.small.tar.gz

  2. Execute velveth to import read sequences and construct k-mer hash table.
    ~$ velveth out-dir 51 \
            -fasta -shortPaired \
            HMP.small/SRR041654_shuffled.fasta \
            HMP.small/SRR041655_shuffled.fasta
    Please check that "out-dir/Sequences" and "out-dir/Roadmaps" files are created. Note that k-mer size has important effect on assembly results. If your short read data is longer than 65bp, we recommend k-mer longer than 51.

  3. Execute velvetg to construct an initial de Bruijin graph.
    ~$ velvetg out-dir -exp_cov auto -ins_length 260
    Please check that "out-dir/Graph2" file is created. Please use -exp_cov auto option, otherwise meta-velvetg cannot be executed.

  4. Execute meta-velvetg to output scaffold sequences.
    ~$ meta-velvetg out-dir [-ins_length 260] | tee logfile
    Please check that "out-dir/meta-velvetg.contigs.fa" file is created. The FASTA file is the main assembling results.

Advanced topics 1: Manual setting of k-mer coverage peaks (>= 1.1.01)

Contrary to single-genome assemblers (e.g., Velvet and SOAPdenovo), MetaVelvet assumes metagenomics settings. Accordingly, k-mer coverage histogram might be multi-modal rather than uni-modal. Please note that the coverage peak parameters can largely affect MetaVelvet assembling results. Although simple and automatic peak detection algorithm is implemented in meta-velvetg, we strongly recommend manual inspection of k-mer coverage peaks and manual setting of the coverage peak parameters. For manually setting the coverage peak parameters, please execute the following procedures.

  1. Execute velveth, velvetg, and meta-velvetg as in the "Getting started" section. Please check that "out-dir/meta-velvetg.Graph2-stats.txt" is created.

  2. Draw a k-mer coverage histogram and manually determine the k-mer coverage peaks.
    ~$ R
    (R) > install.packages("plotrix")
    (R) > library(plotrix)
    (R) > data = read.table("out-dir/meta-velvetg.Graph2-stats.txt", header=TRUE)
    (R) > weighted.hist(data$short1_cov, data$lgth, breaks=seq(0, 200, by=1))
    Please determine the coverage peak values from the histogram.

    For example, from the above histogram, seven coverage peaks are observed (around 210x, 120x, 70x, 45x, 23x, 12x, and 6x).
    If errors are occurred in this step, please see the "Trouble shootings" section.

  3. Alternatively, a Python script "scripts/scriptEstimateCovMulti.py" included in the MetaVelvet package can be used to determine coverage peak values.
    ~$ scriptEstimateCovMulti.py out-dir/stats_EstimateCovMulti.txt
    When applying the script to the above histogram, six peaks (214x, 122x, 70x, 43x, 25x, and 13.5x) are detected by the script. 7th-peak around 6x is removed because such low coverage peak can be caused by sequencing errors.

  4. Run meta-velvetg with manual setting of coverage peaks.
    ~$ meta-velvetg out-dir -ins_length 260 \
            -exp_covs 214_122_70_43_25_13.5
    Please note that meta-velvetg assumes that the peak values are sorted in a descending order.

Advanced topics 2: Use Bambus2 scaffolding module (>= 1.1.01)

Bambus2 is a scaffolding module that can be applicable to metagenomics settings. MetaVelvet uses a novel graph splitting algorithm during contiging process, and uses the scaffolding module of Velvet (RockBand and Pebble) during scaffolding process. Alternatively, users can output MetaVelvet contigs and uses Bambus2 scaffolding module instead of using RockBand and Pebble.

To install Bambus2 programs, please see their web site http://www.cbcb.umd.edu/software/bambus/.

  1. Execute velveth, velvetg as in the "Getting started" section. Please check that "out-dir/Sequences", "out-dir/Roadmaps", and "out-dir/Graph2" files are created.

  2. Execute meta-velvetg with -amos_file and -scaffolding options.
    ~$ meta-velvetg out-dir -ins_length 260 \
            -exp_covs 214_122_70_43_25_13.5 \
            -amos_file yes -scaffolding no
    Please check that "out-dir/meta-velvetg.asm.afg" file is created. The file is meta-velvetg contiging result in an AMOS format.
  3. Create amos bank.
    ~$ bank-transact -m out-dir/meta-velvetg.asm.afg -b bank-dir -cf
  4. Use the paired-end information to construct a collection of contig links
    ~$ clk -b bank-dir 2> /dev/null
  5. Bundle the contig links into a collection of contig edges
    ~$ Bundler -b bank-dir -t M 2> /dev/null
  6. Identify genomic repeats and output them
    ~$ MarkRepeats -b bank-dir \
            -redundancy num-pairs \
            2> /dev/null \
            | tee repeat-file
  7. Order and orient contigs according to repeat and link information
    ~$ OrientContigs -b bank-dir \
            -redundancy num-pairs \
            -prefix output-prefix \
            -repeats repeat-file \
            -all \
            2> /dev/null
  8. Linearize the scaffolds
    ~$ untangle -e output-prefix.evidence.xml \
            -s output-prefix.out.xml \
            -o output-prefix.untangle.xml \
            2> /dev/null
  9. Output contig FastA file
    ~$ bank2fasta -d -b bank-dir \
            | tee contig-file
  10. Output scaffold FastA file
    ~$ printScaff -merge -e output-prefix.evidence.xml \
            -o output-prefix \
            -s output-prefix.untangle.xml \
            -l output-prefix.library \
            -f scaffold-file \
            2> /dev/null
    Then, scaffold-file is the final scaffolding results in the FastA format.

Advanced topics 3: How to use paired-end information for graph decomposition (>= 1.2.01)

In order to decompose a complecated metagenomic de Bruijn graph into simpler subgraphs, MetaVelvet searches the following X-structure, i.e., chimeric nodes have two incoming edges and two outgoing edges.
Then, MetaVelvet split the X-structures when the following conditions are satisfied:


In order to accurately avoid chimeric contigs/scaffolds caused by subgraph decomposition, we added the following two conditions (>= 1.2.01): Here, we denote paired-end connections between Primary-IN and Primary-OUT or between Secondary-IN and Secondary-OUT as consistent paired-end connections. We also denote paired-end connections between Primary-IN and Secondary-OUT or between Secondary-IN and Primary-OUT as inconsistent paired-end connections.

Based upon our accuracy evaluation, -valid_connections 1 -noise_connections 0 is an appropriate setting when very similar species co-exist in an environment (MetaVelvet can efficiently avoid misassemblies). Otherwise, -valid_connections 0 -noise_connections 0 is a more appropriate setting (MetaVelvet can achieve a greater contig/scaffold N50 while avoiding misassemblies).


Frequently asked questions


Trouble shootings


Contact us


Publications