Version Differences for E. coli 100 bp

Line 1:
    + ''Escherichia coli'' K12 MG1655. The ''E. coli'' MG1655 consists of a circular chromosome of 4,639,675 bp in length.  
       
    + == Read source ==  
       
    + : The illuminia read data of ''E. coli'' (Paired-end sequencing library with 200 bp inserts) were downloaded from Illumina ([ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/Ecoli_MG1655_PE_bfast.bam]). More than 20.8 M reads  
       
       
       
    + == Sequence assembly ==  
       
    + * '''Set1''' (Different Assemblers)  
       
    + <table border=1>  
    + <tr>  
    + <td>'''''Software'''''</td><td>'''''Version'''''</td><td>'''''Parameters'''''</td><td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>ABySS</td><td>1.3.0</td><td>k=31</td><td>[[Media:Abyss contigs.fa|Abyss]]</td></tr>  
    + <tr><td>Velvet</td><td>1.1.04</td><td>k=29 ins_length=215 cov_cutoff=12 exp_cov=24 min_contig_lgth=100 scaffolding=no</td><td>[[Media:Velvetcontigs.fa|Velvet]]</td></tr>  
    + <tr><td>Edena</td><td>3</td><td>m=30</td><td>[[Media:Edena contigs.fa|Edena]]</td></tr>  
    + <tr><td>SOAPdenovo</td><td>1.05</td><td>K=29 M=3</td><td>[[Media:SOAPdenovo contigs.fa|SOAPdenovo]]</td></tr>  
    + <tr><td>CLC</td><td>4.7.2</td><td>insert_size_range=194,236 minimum_contig_length=100</td><td>[[Media:CLC contigs.fa|CLC]]</td></tr>  
    + </table>  
    + Merged File: [[Media:Set1 Contigs.fa|Set1_Contig]]  
    + : The above assemblers together with the parameter setting have been selected for ''de novo'' assembling of ''E. coli''. After assembly, we discarded contigs with less than 100bp and evaluated the accuracy of the assemblies based on the Mauve Assembly Matrices (the results are shown below). In this set of data, we have different sequence assemblies, each was generated by the different assembler. Because the same assembler performs differently over varying parameter settings, such as kmer, we have tried different parameters for Abyss and SOAPdenovo in the following sets.  
       
    +  
    + * '''Set2''' (Different parameters for Abyss - the assembler provides the lowest number of contigs in Set1)  
       
    + <table border=1>  
    + <tr>  
    + <td>'''''Abyss parameter'''''</td><td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>k=29</td><td>[[Media:Abyss contigs 29.fa|Abyss_k29]]</td></tr>  
    + <tr><td>k=31</td><td>[[Media:Abyss contigs 31.fa|Abyss_k31]]</td></tr>  
    + <tr><td>k=33</td><td>[[Media:Abyss contigs 33.fa|Abyss_k33]]</td></tr>  
    + </table>  
    + Merged File: [[Media:Set2 Contigs.fa|Set2_Contig]]  
       
    + * '''Set3''' (Different parameters for SOAPdenovo - the assembler provides the largest number of contigs in Set1)  
       
    + <table border=1>  
    + <tr>  
    + <td>'''''SOAPdenovo parameter'''''</td><td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>k=29</td><td>[[Media:Soap kmer29.fa|SOAP_k29]]</td></tr>  
    + <tr><td>k=31</td><td>[[Media:Soap kmer31.fa|SOAP_k31]]</td></tr>  
    + <tr><td>k=33</td><td>[[Media:Soap kmer33.fa|SOAP_k33]]</td></tr>  
    + </table>  
    + Merged File: [[Media:Set3 Contigs.fa|Set3_Contig]]  
       
       
    + *'''Scaffold the contigs using SSPACE'''  
    + <table border=1>  
    + <tr>  
    + <td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>[[Media:CISA_SSPACE.fa|CISA+SSPACE]]</td></tr>  
    + <tr><td>[[Media:ABySS_SSPACE.fa|ABySS+SSPACE]]</td></tr>  
    + <tr><td>[[Media:M2_SSPACE.fa|Minimus2+SSPACE]]</td></tr>  
    + </table>  
       
    + == Contig integrator ==  
       
    + * '''CISA'''  
       
    + <table border=1>  
    + <tr>  
    + <td>'''''Input'''''</td><td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>[[Media:Set1 Contigs.fa|Set1]]</td><td>[[Media:CISA_Set1_Contigs.fa|CISA_Set1]]</td></tr>  
    + <tr><td>[[Media:Set2 Contigs.fa|Set2]]</td><td>[[Media:CISA_Set2_Contigs.fa|CISA_Set2]]</td></tr>  
    + <tr><td>[[Media:Set3 Contigs.fa|Set3]]</td><td>[[Media:CISA_Set3_Contigs.fa|CISA_Set3]]</td></tr>  
    + <tr><td>[[Media:Set_2_3_Contigs.fa|Set2+Set3]]</td><td>[[Media:CISA_Set2_3_Contigs.fa|CISA_Set2_3]]</td></tr>  
    + <tr><td>[[Media:Set_1_2_3_23_Contigs.fa|Set1+2+3+2_3]]</td><td>[[Media:CISA_Set_1_2_3_23_Contigs.fa|CISA_Set1+2+3+2_3]]</td></tr>  
    + </table>  
    + The input file for CISA is an assembled set of contigs, e.g., the set1 contains the contigs obtained from Abyss, CLC, Edena, SOAPdenovo, and Velvet.  
       
    + The integrated contigs generated by CISA can be directly downloaded via the Download link.  
       
    + * '''MAIA'''  
    + maia('./assembly_list.txt','./NC_000913.fna')  
       
    + assembly_list.txt:  
    + Edena ./Edena_contigs.fa 2  
    + SOAP ./SOAP_contigs.fa 2  
    + Velvet ./Velvet.fa 2  
    + CLC ./CLC.fa 2  
    + Abyss ./Abyss_contigs.fa 3  
       
    + A file named [[Media:maia_assembly.fa|maia_assembly.fa]] was generated, containing a draft genome of 4640055 bp with 126761 uncalled bases (Ns). We split the genome into contigs (at >10Ns) and got the set of contigs [[Media: maia.fa|maia.fa]].  
    +  
    + * '''minimus2'''  
       
    + <table border=1>  
    + <tr>  
    + <td>'''''Input'''''</td><td>'''''Download'''''</td>  
    + </tr>  
    + <tr><td>Set1</td><td>[[Media:minimus2.fasta|minimus2_Set1(A,C,E,S,V)]]</td></tr>  
    + <tr><td>Set1</td><td>[[Media:minimus2(2).fasta|minimus2_Set1(S,C,V,E,A)]]</td></tr>  
    + </table>  
    + Since minimus2 can only merge two sequence sets at a time, we used minimus2 to integrate the set1 in alphabetical order (i.e., ACESV) and in descending order of contig number (i.e., SCEVA).  
       
    + *'''GAA'''  
       
    + After formatting the contig names in the format of >Contig1.1, we conducted GAA to integrate the assembly of Abyss with the assembly of Edena by using the following script and got the set of contigs [[Media: gaa2.fa|gaa2.fa]].  
    + perl gaa.pl -t ../Abyss_contigs_1.fa -q ../CLC_1.fa -o AC  
       
    + Similarly, we can iteratively integrate the output assembly with another assembly to get the set of contigs [[Media:gaa5.fa|gaa5.fa]] (A+C+E+S+V).  
    + == Evaluation ==  
       
    + * '''Benchmark genome'''  
       
    + : [[Media:NC 000913.gbk|Eshcherichia coli K12 MG1655]]  
       
    + * '''Evaluate by Mauve Assembly Metrics'''  
       
    + : [http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve How to score genome assemblies using the Mauve system]  
       
    + *'''Score with Mauve metrics:'''  
    + '''''Set1'''''  
    + {| {{table}} border="1"  
    + | align="center" style="background:#f0f0f0;"|'''Name'''  
    + | align="center" style="background:#f0f0f0;"|'''NumContigs'''  
    + | align="center" style="background:#f0f0f0;"|'''NumAssemblyBases'''  
    + | align="center" style="background:#f0f0f0;"|'''DCJ_Distance'''  
    + | align="center" style="background:#f0f0f0;"|'''NumMisCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumUnCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsRef'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsAssembly'''  
    + | align="center" style="background:#f0f0f0;"|'''TotalBasesMissed'''  
    + | align="center" style="background:#f0f0f0;"|'''%Missed'''  
    + | align="center" style="background:#f0f0f0;"|'''ExtraBases'''  
    + | align="center" style="background:#f0f0f0;"|'''%Extra'''  
    + | align="center" style="background:#f0f0f0;"|'''BrokenCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''IntactCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''ContigN50'''  
    + | align="center" style="background:#f0f0f0;"|'''ContigN90'''  
    + | align="center" style="background:#f0f0f0;"|'''MaxContigLength'''  
    + | align="center" style="background:#eee685;"|'''Blast_IntactCDS'''  
    + |-  
    + | Abyss||133||4626205||108||334||69||123||119||57847||1.2468||29424||0.636||57||4263||96157||26096||222425||4257  
    + |-  
    + | CLC||379||4546926||304||100||0||288||287||130550||2.8138||3405||0.0749||62||4258||29767||8447||107342||4233  
    + |-  
    + | Edena||211||4569446||154||17||0||129||125||86780||1.8704||2078||0.0455||66||4254||54405||13642||186686||4204  
    + |-  
    + | SOAPdenovo||553||4547211||475||36||0||461||412||124407||2.6814||6972||0.1533||100||4220||17902||5384||103369||4146  
    + |-  
    + | Velvet||283||4550675||207||138||0||208||203||116542||2.5119||2783||0.0612||74||4246||52474||12537||166094||4204  
    + |-bgcolor="#66CCFF"  
    + | CISA_Set1||72||4627549||70||241||50||91||92||49487||1.0666||32028||0.6921||44||4276||119107||32288||312018||4290  
    + |-  
    + |-  
    + | MAIA||110||4513348||96||82||54||100||95||129936||2.8005||1090||0.0242||48||4272||112717||30950||312145||4222  
    + |-  
    + | minimus2(A,C,E,S,V)||74||4608653||68||285||0||97||78||76881||1.657||35464||0.7695||50||4270||126075||34542||417704||4268  
    + |-  
    + | minimus2(S,C,V,E,A)||69||4215087||69||214||249||90||78||548181||11.8151||113137||2.6841||51||4269||119108||35441||312145||3869  
    + |-  
    + | GAA (Abyss,Edena)||133||4637982||102||328||93||118||112||54835||1.1819||28888||0.6229||57||4263||96157||26096||222425||4267  
    + |-  
    + | GAA (A,C,E,S,V)||138||4639673||103||305||93||119||113||54254||1.1693||29292||0.6311||57||4263||96157||26096||222425||4267  
    + |-  
    + |}  
       
    + : CISA integrates the five assemblies (Abyss, CLC, Edena, SOAPdenovo and Velvet from the outer to the inner) and sequentially generates the processed contigs after phase (1), (2), (3) and (4):  
       
    + : [[Image:Five assemblies.png||250px]][[Image: Phase1.png||250px]][[Image: Phase2.png||250px]][[Image: Phase3.png||250px]][[Image: Phase4.png||250px]]  
       
       
       
       
       
    + : We have visually inspected the assemblies against the reference genome (NC_000913) by using graphic representations, e.g. dot plots. Therefore, we knew that the largest contig generated by minimus2(A,C,E,S,V) was misassembled. Moreover, we found that the merging order dramatically influences the result of minimus2.  
       
    + : [[Image:Dotplot.jpg||500px]][[Image:Misassembled.jpg||500px]]  
       
       
       
       
       
    + : Minimus2 firstly merges SOAPdenovo (the inner) with CLC (the outer), then merges the output (the inner) with Velvet (the outer) in the second run, merges the output (the inner) with Edena (the outer) in the third run, finally merges the output (the inner) with Abyss (the outer) in the fourth run and generates the integrated contigs.  
       
    + : [[Image:S_C.png||250px]][[Image:output_V.png||250px]][[Image:output_E.png||250px]][[Image:output_A.png||250px]][[Image:output.png||250px]]  
       
    + '''''Set2-Set3'''''  
    + {| {{table}} border="1"  
    + | align="center" style="background:#f0f0f0;"|'''Name'''  
    + | align="center" style="background:#f0f0f0;"|'''NumContigs'''  
    + | align="center" style="background:#f0f0f0;"|'''NumAssemblyBases'''  
    + | align="center" style="background:#f0f0f0;"|'''DCJ_Distance'''  
    + | align="center" style="background:#f0f0f0;"|'''NumMisCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumUnCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsRef'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsAssembly'''  
    + | align="center" style="background:#f0f0f0;"|'''TotalBasesMissed'''  
    + | align="center" style="background:#f0f0f0;"|'''%Missed'''  
    + | align="center" style="background:#f0f0f0;"|'''ExtraBases'''  
    + | align="center" style="background:#f0f0f0;"|'''%Extra'''  
    + | align="center" style="background:#f0f0f0;"|'''BrokenCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''IntactCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''ContigN50'''  
    + | align="center" style="background:#f0f0f0;"|'''ContigN90'''  
    + | align="center" style="background:#f0f0f0;"|'''MaxContigLength'''  
    + | align="center" style="background:#eee685;"|'''Blast_IntactCDS'''  
    + |-  
    + | Abyss_k29||130||4634010||108||322||30||118||115||61835||1.3327||40405||0.8719||54||4266||95691||26567||268182||4267  
    + |-  
    + | Abyss_k31||133||4626205||107||334||69||123||119||57847||1.2468||29424||0.636||57||4263||96157||26096||222425||4257  
    + |-  
    + | Abyss_k33||135||4644184||106||354||338||139||119||66355||1.4302||44937||0.9676||78||4242||89001||24907||268398||4263  
    + |-bgcolor="#66CCFF"  
    + | CISA_Set2||106||4635648||93||321||131||117||104||55343||1.1928||39776||0.858||64||4256||113377||27272||222663||4268  
    + |-  
    + | SOAP_k29||1373||4582756||479||48||0||466||415||124372||2.6806||7247||0.1581||100||4220||17892||5276||103369||4146  
    + |-  
    + | SOAP_k31||1295||4583165||519||56||0||510||466||121606||2.621||9201||0.2008||121||4199||17003||4286||77302||4094  
    + |-  
    + | SOAP_k33||2170||4608265||1519||105||0||1470||1380||126273||2.7216||41165||0.8933||507||3813||5391||1449||22953||3379  
    + |-bgcolor="#66CCFF"  
    + | CISA_Set3||440||4532470||383||35||0||379||338||133920||2.8864||5809||0.1282||88||4232||23332||6264||103369||4165  
    + |-bgcolor="#66CCHH"  
    + | CISA_Set2&3||105||4636952||91||344||159||117||103||57192||1.2327||40304||0.8692||60||4260||113377||27272||222663||4269  
    + |-bgcolor="#66CCHH"  
    + | CISA_Set_1_2_3_2&3||72||4637760||72||554||53||112||100||43060||0.9281||37481||0.8082||44||4276||115185||35678||310691||4291  
    + |-  
    + |}  
    + : As can be seen here, CISA can successfully integrate the sets of contigs and reduce the number of contigs. However, in comparison with integrating assemblies generated from different assemblers (Set 1), varying assemble parameters is less efficient (in the case of Set 2 or Set3) in completing the genome.  
       
    + *'''Scaffold the contigs using SSPACE'''  
       
    + : Since we have the paired-end reads of ''E. coli'', it is possible to assess the order, distance and orientation of contigs and combine them into scaffolds. We, therefore, used [http://www.ncbi.nlm.nih.gov/pubmed/21149342 SSPACE] to scaffold the contigs and quantified the scaffolds by Mauve assembly metrics.  
       
    + {| {{table}} border="1"  
    + | align="center" style="background:#f0f0f0;"|'''Name'''  
    + | align="center" style="background:#f0f0f0;"|'''NumContigs'''  
    + | align="center" style="background:#f0f0f0;"|'''NumAssemblyBases'''  
    + | align="center" style="background:#f0f0f0;"|'''DCJ_Distance'''  
    + | align="center" style="background:#f0f0f0;"|'''NumMisCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumUnCalled'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsRef'''  
    + | align="center" style="background:#f0f0f0;"|'''NumGapsAssembly'''  
    + | align="center" style="background:#f0f0f0;"|'''TotalBasesMissed'''  
    + | align="center" style="background:#f0f0f0;"|'''%Missed'''  
    + | align="center" style="background:#f0f0f0;"|'''ExtraBases'''  
    + | align="center" style="background:#f0f0f0;"|'''%Extra'''  
    + | align="center" style="background:#f0f0f0;"|'''BrokenCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''IntactCDS'''  
    + | align="center" style="background:#f0f0f0;"|'''ContigN50'''  
    + | align="center" style="background:#f0f0f0;"|'''MaxContigLength'''  
    + | align="center" style="background:#eee685;"|'''Blast_IntactCDS'''  
    + |-  
    + | CISA+SSPACE||69||4627867||70||237||50||90||91||52150||1.124||37320||0.8064||44||4276||134584||418148||4290  
    + |-  
    + | Abyss+SSPACE||101||4627104||89||393||735||114||119||54956||1.1845||33747||0.7293||57||4263||107040||268750||4272  
    + |-  
    + | minimus2+SSPACE||64||4608774||64||337||54||93||76||75502||1.6273||36021||0.7816||49||4271||150458||420117||4268  
    + |-  
    + |}  
       
    + : The results show:  
    + # The integrated contigs output by CISA can be scaffolded by SSPACE to a limited extent (from 72 to 69), which suggests that our CISA can indeed integrate the sequence information from different assemblies.  
       
    + # To introduce the paired-end reads to the pre-assembled contigs generated by Abyss using SSPACE (Abyss+SSPACE) can only reduce the number of contigs from 133 to 101, smaller than the effect made by CISA (from 133 to 72), which suggests that contig integration prior to scaffolding can further enhance the result.  
       
    + # The problem of misassembled contigs generated by minimus2 is not yet solved by SSPACE, which suggests that we should integrate contigs with caution.