E. coli 36 bp

Revision as of 15 November 2012 20:13 by admin (Comments | Contribs)

Escherichia coli K12 MG1655. The E. coli MG1655 consists of a circular chromosome of 4,639,675 bp in length.

Read source

The illuminia read data of E. coli (Paired-end 36 bp sequencing library with 200 bp inserts) were downloaded from Sequence Read Archive (SRA). More than 20.8 M reads

Sequence assembly

  • Set1 (Different Assemblers)
Software Version Parameters Download
ABySS 1.3.0 k=31 Abyss
Velvet 1.1.04 k=29 ins_length=215 cov_cutoff=12 exp_cov=24 min_contig_lgth=100 scaffolding=no Velvet
Edena 3 m=30 Edena
SOAPdenovo 1.05 K=29 M=3 SOAPdenovo
CLC 4.7.2 insert_size_range=194,236 minimum_contig_length=100 CLC

Merged File: Set1_Contig

The above assemblers together with the parameter setting have been selected for de novo assembling of E. coli. After assembly, we discarded contigs with less than 100bp and evaluated the accuracy of the assemblies based on the Mauve Assembly Matrices (the results are shown below). In this set of data, we have different sequence assemblies, each was generated by the different assembler. Because the same assembler performs differently over varying parameter settings, such as kmer, we have tried different parameters for Abyss and SOAPdenovo in the following sets.
  • Set2 (Different parameters for Abyss - the assembler provides the lowest number of contigs in Set1)
Abyss parameter Download
k=29 Abyss_k29
k=31 Abyss_k31
k=33 Abyss_k33

Merged File: Set2_Contig

  • Set3 (Different parameters for SOAPdenovo - the assembler provides the largest number of contigs in Set1)
SOAPdenovo parameter Download
k=29 SOAP_k29
k=31 SOAP_k31
k=33 SOAP_k33

Merged File: Set3_Contig


  • Scaffold the contigs using SSPACE
Download
CISA+SSPACE
ABySS+SSPACE
Minimus2+SSPACE

Contig integrator

  • CISA
Input Download
Set1 CISA_Set1
Set2 CISA_Set2
Set3 CISA_Set3
Set2+Set3 CISA_Set2_3
Set1+2+3+2_3 CISA_Set1+2+3+2_3

The input file for CISA is an assembled set of contigs, e.g., the set1 contains the contigs obtained from Abyss, CLC, Edena, SOAPdenovo, and Velvet.

The integrated contigs generated by CISA can be directly downloaded via the Download link.

  • MAIA
    maia('./assembly_list.txt','./NC_000913.fna')
    

assembly_list.txt:

Edena	./Edena_contigs.fa	2
SOAP    ./SOAP_contigs.fa   2
Velvet  ./Velvet.fa 2
CLC ./CLC.fa    2
Abyss	./Abyss_contigs.fa	3

A file named maia_assembly.fa was generated, containing a draft genome of 4640055 bp with 126761 uncalled bases (Ns). We split the genome into contigs (at >10Ns) and got the set of contigs maia.fa.

In using MAIA for other examples, we found that Matlab crashed while the recursion reached its limitation. We therefore split the reference genome into several parts, and then performed MAIA.

 maia('./assembly_list.txt','./NC_000913_1.fna')
 maia('./assembly_list.txt','./NC_000913_2.fna')
 maia('./assembly_list.txt','./NC_000913_3.fna')

The split references and the integrated results can be downloaded maia_ecoli_36.

  • minimus2 and GAA

All_M2_GAA
Beacuase minimus2 and GAA merge two assemblies at a time, we iteratively integrate the five assemblies in random order.
minimus2: C_S_V_A_E, A_C_V_S_E, S_V_E_C_A, A_E_C_V_S, A_E_S_C_V, E_V_S_C_A, E_V_C_A_S, V_E_S_A_C, V_S_E_A_C, A_E_V_C_S
GAA: E_S_C_V_A, V_S_E_C_A, V_S_C_E_A, S_V_C_E_A, V_E_A_S_C, E_C_V_S_A, V_A_S_C_E, V_C_E_S_A, A_S_C_V_E, S_C_V_E_A

Evaluation

  • Benchmark genome
Eshcherichia coli K12 MG1655
  • Evaluated by Mauve Assembly Metrics
How to score genome assemblies using the Mauve system
  • Evaluated by Blast with Features
  • Evaluated by Gage
Gage
  • Score with Mauve metrics:

Set1

Name NumContigs NumAssemblyBases DCJ_Distance NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed %Missed ExtraBases %Extra BrokenCDS IntactCDS ContigN50 ContigN90 MaxContigLength Blast_IntactCDS Units(>200) N50 cor.Units cor.N50 Errors,(Indel>=5,Inv,Rel)
Abyss 133 4626205 107 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425 4249 108 96511 116 92933 8,(6,0,2)
CLC 379 4546926 304 100 0 288 287 130550 2.8138 3405 0.0749 62 4258 29767 8447 107342 4228 288 28450 290 28036 2,(0,1,1)
Edena 211 4569446 154 17 0 129 125 86780 1.8704 2078 0.0455 66 4254 54405 13642 186686 4191 182 54405 186 52796 4,(2,1,1)
SOAPdenovo 553 4547211 475 36 0 461 412 124407 2.6814 6972 0.1533 100 4220 17902 5384 103369 4131 450 17892 451 17892 1,(0,0,1)
Velvet 283 4550675 207 138 0 208 203 116542 2.5119 2783 0.0612 74 4246 52474 12537 166094 4194 217 52474 224 49022 8,(5,0,3)
CISA_Set1 72 4627549 70 241 50 91 92 49487 1.0666 32028 0.6921 44 4276 119107 32288 312018 4276 72 126254 83 113511 11,(8,0,3)
GAA# 314 4578451 245 148 7 240 227 97081 2.0924 10516 0.2284 72 4248 49430 13068 157184 4205 249 50305 251 48138 4,(2,0,2)
GAA* 311 4602917 224 156 3 225 216 93476 2.0147 11942 0.2591 76 4244 49990 12208 163308 4208 245 51075 238 47954 6,(3,0,2)
MAIA 110 4513348 96 82 54 100 95 129936 2.8005 1090 0.0242 48 4272 112717 30950 312145 4212 95 126075 97 107674 5,(2,0,3)
minimus2# 155 4598769 133 206 0 143 138 90163 1.9433 32604 0.7058 58 4262 82947 21666 202745 4243 137 85880 144 80493 8,(5,1,2)
minimus2* 73 4597392 67 323 0 96 80 155862 3.3593 102792 2.2503 52 4268 121942 35207 296685 4199 72 127420 83 113511 11,(7,1,3)
GAA (Abyss,Edena) 133 4637982 102 328 93 118 112 54835 1.1819 28888 0.6229 57 4263 96157 26096 222425 4258 108 96511 115 92933 8,(6,0,2)
GAA (A,C,E,S,V) 138 4639673 103 305 93 119 113 54254 1.1693 29292 0.6311 57 4263 96157 26096 222425 4258 108 96511 115 92933 8,(6,0,2)
MAIA (split3) 3 4915920 3 68 9573 89 90 213328 4.5979 464447 9.4478 55 4265 1422748 1422748 1927448 4217 - - - - -
MAIA (split3&n) 145 4800506 112 68 66 116 110 159557 3.439 124964 2.6031 50 4270 126075 30950 318482 4212 121 145106 103 107674 5,(3,0,2)
minimus2(A,C,E,S,V) 74 4608653 68 285 0 97 78 76881 1.657 35464 0.7695 50 4270 126075 34542 417704 4262 73 134584 83 113511 10,(7,1,2)
minimus2(S,C,V,E,A) 69 4215087 69 214 249 90 78 548181 11.8151 113137 2.6841 51 4269 119108 35441 312145 3855 69 115198 79 105796 10,(5,2,3)

[#] Please note that GAA and minimus2 were designed to merge two assemblies at a time, we thus performed all runs and took the average scores.

[*] Please note that the scores of minimus2 and GAA were taken from the average of ten random combinations (details).

CISA integrates the five assemblies (Abyss, CLC, Edena, SOAPdenovo and Velvet from the outer to the inner) and sequentially generates the processed contigs after phase (1), (2), (3) and (4):
Five assemblies.pngPhase1.pngPhase2.pngPhase3.pngPhase4.png
We have visually inspected the assemblies against the reference genome (NC_000913) by using graphic representations, e.g. dot plots. Therefore, we knew that the largest contig generated by minimus2(A,C,E,S,V) was misassembled. Moreover, we found that the merging order dramatically influences the result of minimus2.
Dotplot.jpgMisassembled.jpg
Minimus2 firstly merges SOAPdenovo (the inner) with CLC (the outer), then merges the output (the inner) with Velvet (the outer) in the second run, merges the output (the inner) with Edena (the outer) in the third run, finally merges the output (the inner) with Abyss (the outer) in the fourth run and generates the integrated contigs.
S C.pngoutput V.pngoutput E.pngoutput A.pngoutput.png

Set2-Set3

Name NumContigs NumAssemblyBases DCJ_Distance NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed %Missed ExtraBases %Extra BrokenCDS IntactCDS ContigN50 ContigN90 MaxContigLength Blast_IntactCDS Units(>200) N50 cor.Units cor.N50 Errors,(Indel>=5,Inv,Rel)
Abyss_k29 130 4634010 108 322 30 118 115 61835 1.3327 40405 0.8719 54 4266 95691 26567 268182 4259 105 96157 111 89001 6,(4,0,2)
Abyss_k31 133 4626205 107 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425 4249 108 96511 116 92933 8,(6,0,2)
Abyss_k33 135 4644184 106 354 338 139 119 66355 1.4302 44937 0.9676 78 4242 89001 24907 268398 4255 112 96157 119 89001 8,(5,0,3)
CISA_Set2 106 4635648 93 321 131 117 104 55343 1.1928 39776 0.858 64 4256 113377 27272 222663 4260 94 113512 101 105936 7,(5,0,2)
SOAP_k29 1373 4582756 479 48 0 466 415 124372 2.6806 7247 0.1581 100 4220 17892 5276 103369 4131 450 17892 451 17892 1,(0,0,1)
SOAP_k31 1295 4583165 519 56 0 510 466 121606 2.621 9201 0.2008 121 4199 17003 4286 77302 4073 502 16924 503 16924 1,(0,0,1)
SOAP_k33 2170 4608265 1519 105 0 1470 1380 126273 2.7216 41165 0.8933 507 3813 5391 1449 22953 3314 1459 5368 1459 5365 1,(0,0,1)
CISA_Set3 440 4532470 383 35 0 379 338 133920 2.8864 5809 0.1282 88 4232 23332 6264 103369 4142 385 23328 385 23328 2,(1,0,1)
CISA_Set2&3 105 4636952 91 344 159 117 103 57192 1.2327 40304 0.8692 60 4260 113377 27272 222663 4261 96 113512 102 105936 6,(5,0,1)
CISA_Set_1_2_3_2&3 72 4637760 72 554 53 112 100 43060 0.9281 37481 0.8082 44 4276 115185 35678 310691 4277 71 119686 80 105929 9,(7,0,2)
As can be seen here, CISA can successfully integrate the sets of contigs and reduce the number of contigs. However, in comparison with integrating assemblies generated from different assemblers (Set 1), varying assemble parameters is less efficient (in the case of Set 2 or Set3) in completing the genome.
  • Scaffold the contigs using SSPACE
Since we have the paired-end reads of E. coli, it is possible to assess the order, distance and orientation of contigs and combine them into scaffolds. We, therefore, used SSPACE to scaffold the contigs and quantified the scaffolds by Mauve assembly metrics.
Name NumContigs NumAssemblyBases DCJ_Distance NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed %Missed ExtraBases %Extra BrokenCDS IntactCDS ContigN50 MaxContigLength Blast_IntactCDS Units(>200) N50 cor.Units cor.N50 Errors,(Indel>=5,Inv,Rel)
CISA+SSPACE 61 4628707 68 362 188 91 93 50162 1.0812 37845 0.8176 45 4275 135136 323055 4276 61 150749 82 113511 17,(9,1,7)
Abyss+SSPACE 97 4628151 81 341 685 114 111 54656 1.178 33130 0.7158 56 4264 126232 289699 4259 85 134584 108 96149 27,(16,1,10)
Abyss scaffolding 181 4653848 72 377 71 82 86 40880 0.8811 40409 0.8683 50 4270 133908 316093 4266 65 135136 78 113739 14,(6,1,7)
minimus2+SSPACE 62 4608909 64 407 219 112 86 65837 1.419 31972 0.6937 49 4271 150458 417726 4262 61 177539 78 119105 16,(8,3,5)
The results show:
  1. The integrated contigs output by CISA can be scaffolded by SSPACE to a limited extent (from 72 to 61), which suggests that our CISA can indeed integrate the sequence information from different assemblies.
  2. To introduce the paired-end reads to the pre-assembled contigs generated by Abyss using SSPACE (Abyss+SSPACE) can only reduce the number of contigs from 133 to 97, smaller than the effect made by CISA (from 133 to 72), which suggests that contig integration prior to scaffolding can further enhance the result.
  3. The problem of misassembled contigs generated by minimus2 is not yet solved by SSPACE, which suggests that we should integrate contigs with caution.