E. coli

Revision as of 24 December 2011 01:09 by admin

Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a single circular chromosome 4,639,675 bp in length.

Read source

The Illumina read data for E. coli (paired-end sequencing library with 200 bp inserts) were downloaded from the Sequence Read Archive (SRA). The data set comprises more than 20.8 M reads.

Sequence assembly

  • Set1 (Different Assemblers)
Software | Version | Parameters | Download
ABySS | 1.3.0 | k=31 | Abyss
Velvet | 1.1.04 | k=29 ins_length=215 cov_cutoff=12 exp_cov=24 min_contig_lgth=100 scaffolding=no | Velvet
Edena | 3 | m=30 | Edena
SOAPdenovo | 1.05 | K=29 M=3 | SOAPdenovo
CLC | 4.7.2 | insert_size_range=194,236 minimum_contig_length=100 | CLC

Merged File: Set1_Contig

The assemblers above, with the listed parameter settings, were selected for de novo assembly of E. coli. After assembly, we discarded contigs shorter than 100 bp and evaluated the accuracy of the assemblies with the Mauve Assembly Metrics (the results are shown below). This set therefore contains several sequence assemblies, each generated by a different assembler. Because even the same assembler performs differently under varying parameter settings such as the k-mer size, we tried different parameter settings for Abyss and SOAPdenovo in the following sets.
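The length filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for this page: the function names `read_fasta` and `filter_contigs` are hypothetical, and the sketch assumes that contigs of exactly 100 bp are kept.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_length: int = 100) -> List[Tuple[str, str]]:
    """Keep contigs of at least min_length bp (assumed cutoff: 100 bp)."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


# Toy input: c1 is 120 bp (two lines), c2 is 40 bp, c3 is exactly 100 bp.
example = [">c1", "A" * 60, "A" * 60, ">c2", "ACGT" * 10, ">c3", "G" * 100]
kept = filter_contigs(example)
print([header for header, _ in kept])
```

For a real assembly, the same function could be applied to an open file handle instead of the toy list.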
  • Set2 (Different parameters for Abyss, the assembler that produced the fewest contigs in Set1)
Abyss parameter Download
k=29 Abyss_k29
k=31 Abyss_k31
k=33 Abyss_k33

Merged File: Set2_Contig

  • Set3 (Different parameters for SOAPdenovo, the assembler that produced the most contigs in Set1)
SOAPdenovo parameter Download
k=29 SOAP_k29
k=31 SOAP_k31
k=33 SOAP_k33

Merged File: Set3_Contig


Contig integrator

  • CISA
Input Download
Set1 CISA_Set1
Set2 CISA_Set2
Set3 CISA_Set3
Set2+Set3 CISA_Set2_3
Set1+2+3+2_3 CISA_Set1+2+3+2_3
  • minimus2
Input Download
Set1 minimus2_Set1


Evaluation

  • Benchmark genome
Escherichia coli K12 MG1655
  • Evaluate by Mauve Assembly Metrics
How to score genome assemblies using the Mauve system
  • Score with Mauve metrics:

Set1

Name NumContigs NumAssemblyBases NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed PercBasesMissed ExtraBases PercExtraBases BrokenCDS IntactCDS ContigN50 ContigN90 MaxContigLength
Abyss 133 4626205 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425
CLC 379 4546926 100 0 288 287 130550 2.8138 3405 0.0749 62 4258 29767 8447 107342
Edena 211 4569446 17 0 129 125 86780 1.8704 2078 0.0455 66 4254 54405 13642 186686
SOAPdenovo 553 4547211 36 0 461 412 124407 2.6814 6972 0.1533 100 4220 17902 5384 103369
Velvet 283 4550675 138 0 208 203 116542 2.5119 2783 0.0612 74 4246 52474 12537 166094
CISA_Set1 77 4625581 288 73 93 96 52449 1.1304 32037 0.6926 44 4276 115197 32288 310695
minimus2 74 4608653 285 0 97 78 76881 1.657 35464 0.7695 50 4270 126075 34542 417704
We visually inspected the assemblies against the reference genome (NC_000913) using graphical representations such as dot plots. This inspection showed that the largest contig generated by minimus2 was misassembled.
[Figure: Dotplot.jpg]
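The ContigN50 and ContigN90 columns summarise contig contiguity. As a rough sketch, assuming the common definition (the length of the contig at which the running total of the longest-first sorted contigs reaches 50% or 90% of all assembled bases), they could be computed as follows. The helper name `nx` and the toy lengths are hypothetical, and the exact definition used by Mauve may differ in edge cases.

```python
from typing import Sequence


def nx(lengths: Sequence[int], percent: int) -> int:
    """Return the Nx contig length (percent=50 gives N50, percent=90 gives N90)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths (not taken from the assemblies on this page); total = 300.
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy, 50), nx(toy, 90))
```

In the toy example, the two longest contigs together reach half of the 300 assembled bases, so N50 is the length of the second contig.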

Set2-Set3

Name NumContigs NumAssemblyBases NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed PercBasesMissed ExtraBases PercExtraBases BrokenCDS IntactCDS ContigN50 ContigN90 MaxContigLength
Abyss_k29 130 4634010 322 30 118 115 61835 1.3327 40405 0.8719 54 4266 95691 26567 268182
Abyss_k31 133 4626205 334 69 123 119 57847 1.2468 29424 0.636 57 4263 96157 26096 222425
Abyss_k33 135 4644184 354 338 139 119 66355 1.4302 44937 0.9676 78 4242 89001 24907 268398
CISA_Set2 105 4635199 332 130 117 103 55567 1.1976 39517 0.8525 63 4257 113377 27272 222663
SOAP_k29 1373 4582756 48 0 466 415 124372 2.6806 7247 0.1581 100 4220 17892 5276 103369
SOAP_k31 1295 4583165 56 0 510 466 121606 2.621 9201 0.2008 121 4199 17003 4286 77302
SOAP_k33 2170 4608265 105 0 1470 1380 126273 2.7216 41165 0.8933 507 3813 5391 1449 22953
CISA_Set3 465 4546819 117 0 402 366 133247 2.8719 19266 0.4237 95 4225 21543 6065 103369
CISA_Set2&3 105 4636783 351 160 118 104 54999 1.1854 39905 0.8606 60 4260 113377 27272 222663
CISA_Set_1_2_3_2&3 72 4637107 529 53 109 97 43390 0.9352 37158 0.8013 44 4276 115185 35678 310556
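The two percentage columns can be related to the raw counts with simple arithmetic. The sketch below is an inference from the tabulated values, not a statement of how Mauve is implemented: PercBasesMissed appears to be TotalBasesMissed divided by the reference genome length, and PercExtraBases appears to be ExtraBases divided by NumAssemblyBases. The helper name `percent` is hypothetical.

```python
REFERENCE_LENGTH = 4_639_675  # E. coli K12 MG1655 chromosome length in bp


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Values copied from the Abyss_k31 row of the table above.
assembly_bases, bases_missed, extra_bases = 4_626_205, 57_847, 29_424

# Assumed denominators (inferred from the table, see the note above).
perc_missed = percent(bases_missed, REFERENCE_LENGTH)  # table lists 1.2468
perc_extra = percent(extra_bases, assembly_bases)      # table lists 0.636

print(f"PercBasesMissed ~ {perc_missed:.4f}, PercExtraBases ~ {perc_extra:.4f}")
```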
  • Scaffold the contigs using SSPACE
Because paired-end reads are available for E. coli, the order, distance and orientation of the contigs can be assessed and the contigs combined into scaffolds. We therefore used SSPACE to scaffold the contigs and evaluated the scaffolds with the Mauve Assembly Metrics.
Name NumContigs NumAssemblyBases NumMisCalled NumUnCalled NumGapsRef NumGapsAssembly TotalBasesMissed PercBasesMissed ExtraBases PercExtraBases BrokenCDS IntactCDS ContigN50 MaxContigLength
CISA+SSPACE 69 4625880 362 157 93 98 52261 1.1264 34643 0.7489 43 4277 126254 316040
Abyss+SSPACE 101 4627104 393 735 114 119 54956 1.1845 33747 0.7293 57 4263 107040 268750
minimus2+SSPACE 64 4608774 337 54 93 76 75502 1.6273 36021 0.7816 49 4271 150458 420117
The results show:
  1. The integrated contigs output by CISA can be scaffolded further by SSPACE only to a limited extent (from 77 to 69 contigs), which suggests that CISA does integrate the sequence information from the different assemblies.
  2. Applying the paired-end reads to the pre-assembled Abyss contigs with SSPACE (Abyss+SSPACE) reduces the number of contigs only from 133 to 101, a smaller effect than that achieved by CISA (from 133 to 77), which suggests that contig integration prior to scaffolding can further improve the result.
  3. The misassembled contig generated by minimus2 is not corrected by SSPACE, which suggests that contigs should be integrated with caution.
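The contig counts quoted in the points above can be compared as relative reductions. This is a small arithmetic sketch using only the NumContigs values from the tables. The helper name `reduction` and the step labels are hypothetical.

```python
def reduction(before: int, after: int) -> float:
    """Percentage by which the contig count drops from before to after."""
    return 100.0 * (before - after) / before


# (before, after) contig counts taken from the tables on this page.
steps = {
    "Abyss -> Abyss+SSPACE": (133, 101),
    "Abyss -> CISA_Set1": (133, 77),
    "CISA_Set1 -> CISA+SSPACE": (77, 69),
}

for name, (before, after) in steps.items():
    print(f"{name}: {before} -> {after} ({reduction(before, after):.1f}% fewer contigs)")
```

On these numbers, integration with CISA removes a larger share of the Abyss contigs than scaffolding the Abyss contigs directly, which matches point 2.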