Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a single circular chromosome 4,639,675 bp in length.
Read source
- The Illumina read data for E. coli (paired-end sequencing library with 200 bp inserts) were downloaded from the Sequence Read Archive (SRA). The data set contains more than 20.8 M reads.
Sequence assembly
- Set1 (Different Assemblers)
| Software | Version | Parameters | Download |
|---|---|---|---|
| ABySS | 1.3.0 | k=31 | Abyss |
| Velvet | 1.1.04 | k=29 ins_length=215 cov_cutoff=12 exp_cov=24 min_contig_lgth=100 scaffolding=no | Velvet |
| Edena | 3 | m=30 | Edena |
| SOAPdenovo | 1.05 | K=29 M=3 | SOAPdenovo |
| CLC | 4.7.2 | insert_size_range=194,236 minimum_contig_length=100 | CLC |
Merged File: Set1_Contig
- The assemblers and parameter settings listed above were used for de novo assembly of E. coli. After assembly, we discarded contigs shorter than 100 bp and evaluated the accuracy of the assemblies with the Mauve Assembly Metrics (the results are shown below). This set therefore contains several sequence assemblies, each generated by a different assembler. Because the same assembler performs differently under different parameter settings, such as the k-mer size, we also tried different parameters for Abyss and SOAPdenovo in the following sets.
- Set2 (Different parameters for Abyss, the assembler that produced the fewest contigs in Set1)
Merged File: Set2_Contig
- Set3 (Different parameters for SOAPdenovo, the assembler that produced the most contigs in Set1)
Merged File: Set3_Contig
- Scaffold the contigs using SSPACE
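The length filter mentioned in the assembly notes above (discarding contigs shorter than 100 bp) could be sketched as follows. This is only an illustration under stated assumptions: the contigs are assumed to be in plain FASTA text, and the function names `parse_fasta` and `filter_contigs` are hypothetical, not part of any assembler or of CISA.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(text: str, min_length: int = 100) -> List[Tuple[str, str]]:
    """Keep only contigs whose sequence is at least `min_length` bp long."""
    return [(h, s) for h, s in parse_fasta(text) if len(s) >= min_length]
```

With the default threshold, a 99 bp contig is dropped and a 100 bp contig is kept, matching the "shorter than 100 bp" rule.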
Contig integrator
The input to CISA is a set of assembled contigs. For example, Set1 contains the contigs obtained from Abyss, CLC, Edena, SOAPdenovo, and Velvet.
The integrated contigs generated by CISA can be directly downloaded via the Download link.
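To illustrate what a merged contig set such as Set1 might look like, the sketch below concatenates several FASTA texts and prefixes each contig header with the name of the assembler it came from, so that contigs from different assemblies remain distinguishable. This is not CISA's own merge step; the function name `merge_contig_sets` and the header naming scheme are assumptions made purely for illustration.

```python
from typing import Dict, List


def merge_contig_sets(assemblies: Dict[str, str]) -> str:
    """Concatenate FASTA texts, prefixing each header with its assembly label."""
    out: List[str] = []
    for label, fasta_text in assemblies.items():
        for line in fasta_text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Tag the contig with the assembler that produced it.
                out.append(f">{label}_{line[1:]}")
            else:
                out.append(line)
    return "\n".join(out) + "\n"
```

Prefixing the headers avoids name clashes, since different assemblers often number their contigs starting from the same value.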
Evaluation
- Escherichia coli K12 MG1655
- Evaluate by Mauve Assembly Metrics
- How to score genome assemblies using the Mauve system
- Score with Mauve metrics:
Set1
| Name | NumContigs | NumAssemblyBases | NumMisCalled | NumUnCalled | NumGapsRef | NumGapsAssembly | TotalBasesMissed | PercBasesMissed | ExtraBases | PercExtraBases | BrokenCDS | IntactCDS | ContigN50 | ContigN90 | MaxContigLength | Blast_IntactCDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abyss | 133 | 4626205 | 334 | 69 | 123 | 119 | 57847 | 1.2468 | 29424 | 0.636 | 57 | 4263 | 96157 | 26096 | 222425 | 4257 |
| CLC | 379 | 4546926 | 100 | 0 | 288 | 287 | 130550 | 2.8138 | 3405 | 0.0749 | 62 | 4258 | 29767 | 8447 | 107342 | 4233 |
| Edena | 211 | 4569446 | 17 | 0 | 129 | 125 | 86780 | 1.8704 | 2078 | 0.0455 | 66 | 4254 | 54405 | 13642 | 186686 | 4204 |
| SOAPdenovo | 553 | 4547211 | 36 | 0 | 461 | 412 | 124407 | 2.6814 | 6972 | 0.1533 | 100 | 4220 | 17902 | 5384 | 103369 | 4146 |
| Velvet | 283 | 4550675 | 138 | 0 | 208 | 203 | 116542 | 2.5119 | 2783 | 0.0612 | 74 | 4246 | 52474 | 12537 | 166094 | 4204 |
| CISA_Set1 | 72 | 4626972 | 243 | 50 | 91 | 90 | 49312 | 1.0628 | 31326 | 0.677 | 45 | 4275 | 119107 | 32288 | 310578 | 4290 |
| minimus2 | 74 | 4608653 | 285 | 0 | 97 | 78 | 76881 | 1.657 | 35464 | 0.7695 | 50 | 4270 | 126075 | 34542 | 417704 | 4268 |
- We visually inspected the assemblies against the reference genome (NC_000913) using graphical representations such as dot plots, and found that the largest contig generated by minimus2 was misassembled.
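The ContigN50 and ContigN90 columns summarize the contig length distribution. A minimal sketch of the usual definition is given below: contigs are sorted from longest to shortest, and Nxx is the length of the contig at which the running total first reaches xx% of the total assembly length. The function name `nxx` is hypothetical, and Mauve's own implementation may differ in details (for example, in which total length it uses as the denominator).

```python
from typing import Sequence


def nxx(lengths: Sequence[int], fraction: float) -> int:
    """Return the contig length at which the cumulative length of contigs,
    taken from longest to shortest, first reaches `fraction` of the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    # Only reachable through floating-point rounding when fraction is ~1.0.
    return min(lengths)


# Toy example with five contigs totalling 300 bp.
toy_lengths = [100, 80, 60, 40, 20]
n50 = nxx(toy_lengths, 0.5)  # 100 + 80 = 180 >= 150, so N50 is 80
n90 = nxx(toy_lengths, 0.9)  # 100 + 80 + 60 + 40 = 280 >= 270, so N90 is 40
```

A higher N50 with a similar total length indicates that the assembly is made of fewer, longer contigs, which is why CISA_Set1 and minimus2 stand out in that column.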
Set2-Set3
| Name | NumContigs | NumAssemblyBases | NumMisCalled | NumUnCalled | NumGapsRef | NumGapsAssembly | TotalBasesMissed | PercBasesMissed | ExtraBases | PercExtraBases | BrokenCDS | IntactCDS | ContigN50 | ContigN90 | MaxContigLength | Blast_IntactCDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abyss_k29 | 130 | 4634010 | 322 | 30 | 118 | 115 | 61835 | 1.3327 | 40405 | 0.8719 | 54 | 4266 | 95691 | 26567 | 268182 | 4267 |
| Abyss_k31 | 133 | 4626205 | 334 | 69 | 123 | 119 | 57847 | 1.2468 | 29424 | 0.636 | 57 | 4263 | 96157 | 26096 | 222425 | 4257 |
| Abyss_k33 | 135 | 4644184 | 354 | 338 | 139 | 119 | 66355 | 1.4302 | 44937 | 0.9676 | 78 | 4242 | 89001 | 24907 | 268398 | 4263 |
| CISA_Set2 | 106 | 4635666 | 327 | 146 | 116 | 102 | 55420 | 1.1945 | 39743 | 0.8573 | 64 | 4256 | 113377 | 27272 | 222663 | 4269 |
| SOAP_k29 | 1373 | 4582756 | 48 | 0 | 466 | 415 | 124372 | 2.6806 | 7247 | 0.1581 | 100 | 4220 | 17892 | 5276 | 103369 | 4146 |
| SOAP_k31 | 1295 | 4583165 | 56 | 0 | 510 | 466 | 121606 | 2.621 | 9201 | 0.2008 | 121 | 4199 | 17003 | 4286 | 77302 | 4094 |
| SOAP_k33 | 2170 | 4608265 | 105 | 0 | 1470 | 1380 | 126273 | 2.7216 | 41165 | 0.8933 | 507 | 3813 | 5391 | 1449 | 22953 | 3379 |
| CISA_Set3 | 440 | 4532901 | 41 | 0 | 379 | 338 | 132794 | 2.8621 | 5627 | 0.1241 | 87 | 4233 | 23332 | 6355 | 103369 | 4166 |
| CISA_Set2&3 | 105 | 4636950 | 350 | 160 | 118 | 104 | 55228 | 1.1903 | 40097 | 0.8647 | 61 | 4259 | 113377 | 27272 | 222663 | 4269 |
| CISA_Set_1_2_3_2&3 | 72 | 4637760 | 521 | 53 | 109 | 97 | 43006 | 0.9269 | 37427 | 0.807 | 44 | 4276 | 115185 | 35678 | 310691 | 4291 |
- As can be seen here, CISA successfully integrates each set of contigs and reduces the number of contigs. However, compared with integrating assemblies generated by different assemblers (Set1), integrating assemblies that differ only in their assembly parameters (Set2 or Set3) is less effective at completing the genome.
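The completeness comparison can be made concrete with the PercBasesMissed column. The reported values appear to be consistent with TotalBasesMissed divided by the reference length of 4,639,675 bp; this relationship is inferred from the numbers in the tables and is not stated as Mauve's documented formula. Under that assumption, a short sketch reproduces the figures for the three integrated sets:

```python
REFERENCE_LENGTH = 4_639_675  # E. coli K12 MG1655 chromosome length (bp)

# TotalBasesMissed values taken from the Set1 and Set2-Set3 tables.
total_bases_missed = {
    "CISA_Set1": 49312,
    "CISA_Set2": 55420,
    "CISA_Set3": 132794,
}


def percent_missed(missed: int, reference_length: int = REFERENCE_LENGTH) -> float:
    """Percentage of reference bases not covered by the assembly
    (assumed definition, inferred from the tabulated values)."""
    return 100.0 * missed / reference_length


perc_missed = {
    name: round(percent_missed(missed), 4)
    for name, missed in total_bases_missed.items()
}
```

Set1 misses the smallest share of the reference and Set3 the largest, which is the sense in which varying parameters alone is less effective than combining different assemblers.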
- Scaffold the contigs using SSPACE
- Since paired-end reads are available for E. coli, it is possible to assess the order, distance, and orientation of contigs and combine them into scaffolds. We therefore used SSPACE to scaffold the contigs and evaluated the scaffolds with the Mauve assembly metrics.
| Name | NumContigs | NumAssemblyBases | NumMisCalled | NumUnCalled | NumGapsRef | NumGapsAssembly | TotalBasesMissed | PercBasesMissed | ExtraBases | PercExtraBases | BrokenCDS | IntactCDS | ContigN50 | MaxContigLength | Blast_IntactCDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CISA+SSPACE | 69 | 4627290 | 237 | 50 | 94 | 93 | 52804 | 1.1381 | 37397 | 0.8082 | 44 | 4276 | 134584 | 416708 | 4290 |
| Abyss+SSPACE | 101 | 4627104 | 393 | 735 | 114 | 119 | 54956 | 1.1845 | 33747 | 0.7293 | 57 | 4263 | 107040 | 268750 | 4272 |
| minimus2+SSPACE | 64 | 4608774 | 337 | 54 | 93 | 76 | 75502 | 1.6273 | 36021 | 0.7816 | 49 | 4271 | 150458 | 420117 | 4268 |
- The results show:
- The integrated contigs output by CISA can be scaffolded further by SSPACE only to a limited extent (from 72 to 69 contigs), which suggests that CISA does indeed integrate the sequence information from the different assemblies.
- Using SSPACE to add paired-end information to the contigs assembled by Abyss (Abyss+SSPACE) reduces the number of contigs only from 133 to 101, a smaller reduction than that achieved by CISA (from 133 to 72), which suggests that integrating contigs before scaffolding can further improve the result.
- SSPACE does not resolve the misassembled contigs generated by minimus2, which suggests that contigs should be integrated with caution.
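The contig-count comparisons in these conclusions can be expressed as relative reductions. The sketch below uses only the NumContigs values quoted above; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction in contig count, as a percentage of `before`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


# NumContigs values quoted in the conclusions above.
reductions = {
    "Abyss -> Abyss+SSPACE": round(percent_reduction(133, 101), 1),
    "Abyss -> CISA_Set1": round(percent_reduction(133, 72), 1),
    "CISA_Set1 -> CISA+SSPACE": round(percent_reduction(72, 69), 1),
}
```

Scaffolding the Abyss contigs removes roughly a quarter of them, integration with CISA removes nearly half, and scaffolding after integration removes only a few percent more.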