E. coli 36 bp

Revision as of 26 February 2013 18:34 by admin (Comments | Contribs) | (Evaluation)

Escherichia coli K12 MG1655. The E. coli MG1655 consists of a circular chromosome of 4,639,675 bp in length.

Read source

The illuminia read data of E. coli (Paired-end 36 bp sequencing library with 200 bp inserts) were downloaded from Sequence Read Archive (SRA). More than 20.8 M reads

Sequence assembly

  • Set1 (Different Assemblers)
Software Version Parameters Download
ABySS 1.3.0 k=31 Abyss
Velvet 1.1.04 k=29 ins_length=215 cov_cutoff=12 exp_cov=24 min_contig_lgth=100 scaffolding=no Velvet
Edena 3 m=30 Edena
SOAPdenovo 1.05 K=29 M=3 SOAPdenovo
CLC 4.7.2 insert_size_range=194,236 minimum_contig_length=100 CLC

Merged File: Set1_Contig

The above assemblers together with the parameter setting have been selected for de novo assembling of E. coli. After assembly, we discarded contigs with less than 100bp and evaluated the accuracy of the assemblies based on the Mauve Assembly Matrices (the results are shown below). In this set of data, we have different sequence assemblies, each was generated by the different assembler. Because the same assembler performs differently over varying parameter settings, such as kmer, we have tried different parameters for Abyss and SOAPdenovo in the following sets.
  • Set2 (Different parameters for Abyss - the assembler provides the lowest number of contigs in Set1)
Abyss parameter Download
k=29 Abyss_k29
k=31 Abyss_k31
k=33 Abyss_k33

Merged File: Set2_Contig

  • Set3 (Different parameters for SOAPdenovo - the assembler provides the largest number of contigs in Set1)
SOAPdenovo parameter Download
k=29 SOAP_k29
k=31 SOAP_k31
k=33 SOAP_k33

Merged File: Set3_Contig

Contig integrator

  • CISA
Input Download
Set1 CISA_Set1
Set2 CISA_Set2
Set3 CISA_Set3
Set2+Set3 CISA_Set2_3
Set1+2+3+2_3 CISA_Set1+2+3+2_3

The input file for CISA is an assembled set of contigs, e.g., the set1 contains the contigs obtained from Abyss, CLC, Edena, SOAPdenovo, and Velvet.

The integrated contigs generated by CISA can be directly downloaded via the Download link.

  • MAIA
    maia('./assembly_list.txt','./NC_000913.fna')
    

assembly_list.txt:

Edena	./Edena_contigs.fa	2
SOAP    ./SOAP_contigs.fa   2
Velvet  ./Velvet.fa 2
CLC ./CLC.fa    2
Abyss	./Abyss_contigs.fa	3

A file named maia_assembly.fa was generated, containing a draft genome of 4640055 bp with 126761 uncalled bases (Ns). We split the genome into contigs (at >10Ns) and got the set of contigs maia.fa.

In using MAIA for other examples, we found that Matlab crashed while the recursion reached its limitation. We therefore split the reference genome into several parts, and then performed MAIA.

 maia('./assembly_list.txt','./NC_000913_1.fna')
 maia('./assembly_list.txt','./NC_000913_2.fna')
 maia('./assembly_list.txt','./NC_000913_3.fna')

The split references and the integrated results can be downloaded maia_ecoli_36.

  • minimus2 and GAA

All_M2_GAA
Beacuase minimus2 and GAA merge two assemblies at a time, we iteratively integrate the five assemblies in random order.
minimus2: C_S_V_A_E, A_C_V_S_E, S_V_E_C_A, A_E_C_V_S, A_E_S_C_V, E_V_S_C_A, E_V_C_A_S, V_E_S_A_C, V_S_E_A_C, A_E_V_C_S
GAA: E_S_C_V_A, V_S_E_C_A, V_S_C_E_A, S_V_C_E_A, V_E_A_S_C, E_C_V_S_A, V_A_S_C_E, V_C_E_S_A, A_S_C_V_E, S_C_V_E_A

Evaluation

  • Benchmark genome
Eshcherichia coli K12 MG1655
  • Evaluated by Mauve Assembly Metrics to calculate the values for the left columns of "Blast_IntactCDS"
How to score genome assemblies using the Mauve system
  • Evaluated by Blast with Features
  • Evaluated by GAGE to calculate the values for the right columns of "Blast_IntactCDS"
Gage
  • Score with Mauve Assembly Metrics, N50, Blast and GAGE:

Set1

Name NumContigs NumAssemblyBases DCJ_Distance NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed %Missed ExtraBases %Extra BrokenCDS IntactCDS ContigN50^ ContigN90 MaxContigLength N50^ Blast_IntactCDS Units(>200) N50^ cor.Units cor.N50^ Errors,(Indel>=5,Inv,Rel)
Abyss 133 4626205 107 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425 96511 4249 108 96511 116 92933 8,(6,0,2)
CLC 379 4546926 304 100 0 288 287 130550 2.8138 3405 0.0749 62 4258 29767 8447 107342 29905 4228 288 28450 290 28036 2,(0,1,1)
Edena 211 4569446 154 17 0 129 125 86780 1.8704 2078 0.0455 66 4254 54405 13642 186686 57790 4191 182 54405 186 52796 4,(2,1,1)
SOAPdenovo 553 4547211 475 36 0 461 412 124407 2.6814 6972 0.1533 100 4220 17902 5384 103369 17944 4131 450 17892 451 17892 1,(0,0,1)
Velvet 283 4550675 207 138 0 208 203 116542 2.5119 2783 0.0612 74 4246 52474 12537 166094 54359 4194 217 52474 224 49022 8,(5,0,3)
CISA_Set1 72 4627549 70 241 50 91 92 49487 1.0666 32028 0.6921 44 4276 119107 32288 312018 126254 4276 72 126254 83 113511 11,(8,0,3)
GAA# 314 4578451 245 148 7 240 227 97081 2.0924 10516 0.2284 72 4248 49430 13068 157184 51218 4205 249 50305 251 48138 4,(2,0,2)
GAA* 311 4602917 224 156 3 225 216 93476 2.0147 11942 0.2591 76 4244 49990 12208 163308 51085 4208 245 51075 238 47954 5,(3,0,2)
MAIA 110 4513348 96 82 54 100 95 129936 2.8005 1090 0.0242 48 4272 112717 30950 312145 126075 4212 95 126075 97 107674 5,(2,0,3)
minimus2# 155 4598769 133 206 0 143 138 90163 1.9433 32604 0.7058 58 4262 82947 21666 202745 86241 4243 137 85880 144 80493 8,(5,1,2)
minimus2* 73 4597392 67 323 0 96 80 155862 3.3593 102792 2.2503 52 4268 121942 35207 296685 129557 4199 72 127420 83 113511 11,(7,1,3)
GAA (Abyss,Edena) 133 4637982 102 328 93 118 112 54835 1.1819 28888 0.6229 57 4263 96157 26096 222425 96511 4258 108 96511 115 92933 8,(6,0,2)
GAA (A,C,E,S,V) 138 4639673 103 305 93 119 113 54254 1.1693 29292 0.6311 57 4263 96157 26096 222425 96511 4258 108 96511 115 92933 8,(6,0,2)
MAIA (split3) 3 4915920 3 68 9573 89 90 213328 4.5979 464447 9.4478 55 4265 1422748 1422748 1927448 1565724 4217 - - - - -
MAIA (split3&n) 145 4800506 112 68 66 116 110 159557 3.439 124964 2.6031 50 4270 126075 30950 318482 132974 4212 121 145106 103 107674 5,(3,0,2)
minimus2(A,C,E,S,V) 74 4608653 68 285 0 97 78 76881 1.657 35464 0.7695 50 4270 126075 34542 417704 134584 4262 73 134584 83 113511 10,(7,1,2)
minimus2(S,C,V,E,A) 69 4215087 69 214 249 90 78 548181 11.8151 113137 2.6841 51 4269 119108 35441 312145 126233 3855 69 115198 79 105796 10,(5,2,3)

[^] Please note that the ContigN50 calculated by Mauve Assembly Metrics is incorrect (off-by-one error). We have followed the definition of N50 (A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list.) http://www.ncbi.nlm.nih.gov/pubmed/22510764 ref to calculate N50s. The N50 contig size calculated by GAGE using the total reference genome length rather than the sum total of contig lengths. GAGE's cor.N50 values were computed after correcting contigs by breaking them at each error.

[#] Please note that GAA and minimus2 were designed to merge two assemblies at a time, we thus performed all runs and took the average scores.

[*] Please note that the scores of minimus2 and GAA were taken from the average of ten random combinations (details).

CISA integrates the five assemblies (Abyss, CLC, Edena, SOAPdenovo and Velvet from the outer to the inner) and sequentially generates the processed contigs after phase (1), (2), (3) and (4):
Five assemblies.pngPhase1.pngPhase2.pngPhase3.pngPhase4.png
We have visually inspected the assemblies against the reference genome (NC_000913) by using graphic representations, e.g. dot plots. Therefore, we knew that the largest contig generated by minimus2(A,C,E,S,V) was misassembled. Moreover, we found that the merging order dramatically influences the result of minimus2.
Dotplot.jpgMisassembled.jpg
Minimus2 firstly merges SOAPdenovo (the inner) with CLC (the outer), then merges the output (the inner) with Velvet (the outer) in the second run, merges the output (the inner) with Edena (the outer) in the third run, finally merges the output (the inner) with Abyss (the outer) in the fourth run and generates the integrated contigs.
S C.pngoutput V.pngoutput E.pngoutput A.pngoutput.png

Set2-Set3

Name NumContigs NumAssemblyBases DCJ_Distance NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed %Missed ExtraBases %Extra BrokenCDS IntactCDS ContigN50 ContigN90 MaxContigLength !N50 Blast_IntactCDS Units(>200) N50 cor.Units cor.N50 Errors,(Indel>=5,Inv,Rel)
Abyss_k29 130 4634010 108 322 30 118 115 61835 1.3327 40405 0.8719 54 4266 95691 26567 268182 96157 4259 105 96157 111 89001 6,(4,0,2)
Abyss_k31 133 4626205 107 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425 96511 4249 108 96511 116 92933 8,(6,0,2)
Abyss_k33 135 4644184 106 354 338 139 119 66355 1.4302 44937 0.9676 78 4242 89001 24907 268398 96157 4255 112 96157 119 89001 8,(5,0,3)
CISA_Set2 106 4635648 93 321 131 117 104 55343 1.1928 39776 0.858 64 4256 113377 27272 222663 113512 4260 94 113512 101 105936 7,(5,0,2)
SOAP_k29 1373 4582756 479 48 0 466 415 124372 2.6806 7247 0.1581 100 4220 17892 5276 103369 17902 4131 450 17892 451 17892 1,(0,0,1)
SOAP_k31 1295 4583165 519 56 0 510 466 121606 2.621 9201 0.2008 121 4199 17003 4286 77302 17052 4073 502 16924 503 16924 1,(0,0,1)
SOAP_k33 2170 4608265 1519 105 0 1470 1380 126273 2.7216 41165 0.8933 507 3813 5391 1449 22953 5394 3314 1459 5368 1459 5365 1,(0,0,1)
CISA_Set3 440 4532470 383 35 0 379 338 133920 2.8864 5809 0.1282 88 4232 23332 6264 103369 23344 4142 385 23328 385 23328 2,(1,0,1)
CISA_Set2&3 105 4636952 91 344 159 117 103 57192 1.2327 40304 0.8692 60 4260 113377 27272 222663 113512 4261 96 113512 102 105936 6,(5,0,1)
CISA_Set_1_2_3_2&3 72 4637760 72 554 53 112 100 43060 0.9281 37481 0.8082 44 4276 115185 35678 310691 119686 4277 71 119686 80 105929 9,(7,0,2)
As can be seen here, CISA can successfully integrate the sets of contigs and reduce the number of contigs. However, in comparison with integrating assemblies generated from different assemblers (Set 1), varying assemble parameters is less efficient (in the case of Set 2 or Set3) in completing the genome.