SCA

Revision as of 13 March 2014 23:49 by admin

The self-correction approach (SCA) was proposed in "Reducing assembly complexity of microbial genomes with single-molecule sequencing" (Genome Biology, 2013).

Dataset 5 (E. coli K-12 MG1655, 17 SMRT cells)

We randomly selected four, six and eight SMRT cells, three times each, and assessed the correctness of the resulting assemblies with QUAST.
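The page does not say how the random subsets were drawn. As a minimal sketch, assuming the 17 SMRT cells carry the hypothetical identifiers `cell_01` to `cell_17` and that a fixed seed is wanted for reproducibility, three replicate subsets per size could be drawn as follows:

```python
import random


def draw_cell_subsets(cell_ids, sizes=(4, 6, 8), replicates=3, seed=0):
    """Draw `replicates` random subsets of each size, without replacement.

    The seed and the identifiers are illustrative assumptions; they are not
    taken from the original experiment.
    """
    rng = random.Random(seed)
    return {
        size: [sorted(rng.sample(cell_ids, size)) for _ in range(replicates)]
        for size in sizes
    }


# Hypothetical names for the 17 SMRT cells of Dataset 5.
cells = [f"cell_{i:02d}" for i in range(1, 18)]
subsets = draw_cell_subsets(cells)

for size, replicate_sets in subsets.items():
    for index, chosen in enumerate(replicate_sets, start=1):
        print(f"{size} SMRT cells, set {index}: {', '.join(chosen)}")
```

Each subset is sampled independently, so two replicates of the same size may share cells.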

Performance

| Statistics without reference | 4 SMRT cells: 1st set | 4 SMRT cells: 2nd set | 4 SMRT cells: 3rd set | 6 SMRT cells: 1st set | 6 SMRT cells: 2nd set | 6 SMRT cells: 3rd set | 8 SMRT cells: 1st set | 8 SMRT cells: 2nd set | 8 SMRT cells: 3rd set | 17 SMRT cells |
|---|---|---|---|---|---|---|---|---|---|---|
| # contigs | 1 | 1 | 5 | 2 | 2 | 1 | 4 | 1 | 2 | 1 |
| Largest contig | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 679 082 | 4 649 323 | 4 189 785 | 4 651 604 |
| Total length | 4 647 117 | 4 648 057 | 4 661 453 | 4 645 941 | 4 657 272 | 4 649 699 | 4 655 949 | 4 649 323 | 4 652 482 | 4 651 604 |
| N50 | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 159 845 | 4 649 323 | 4 189 785 | 4 651 604 |
| Misassemblies | | | | | | | | | | |
| # misassemblies | 7 | 8 | 7 | 7 | 10 | 10 | 6 | 10 | 8 | 9 |
| Misassembled contigs length | 4 647 117 | 4 648 057 | 3 447 068 | 6 645 941 | 4 657 272 | 4 649 699 | 2 143 406 | 4 649 323 | 4 189 785 | 4 651 604 |
Mismatches
# mismatches per 100kbp 0.47 0.56 0.37 0.19 0.11 0.15 0.13 0.43 0.17
# indels per 100kbp 1.08 4.44 0.22 1.66 0.63 0.65 0.19 4.59 0.56
# N's per 100kbp 0 0 0 0 0 0 0 0 0
Genome Statistics
Genome fraction(%) 100 100 99.994 99.999 100 100 100 99.99 100
Duplication ratio 1.01 1.018 1.007 1.021 1.031 1.015 1.012 1.02 1.011
# genes 4495+2 part 4495+2 part 4493+3 part 4494+3 part 4495+2 part 4495+2 part 4495+2 part 4494+3 part 4495+2 part
NGA50 1 207 217 2 558 505 1 640 882 2 888 022 2 834 458 1 298 912 1 477 605 1 344 200 2 995 586
Running Time ?hr ?m ?hr ?m ?hr ?m 21hr 05m 19hr 32m 21hr 01m 26hr 46m 27hr 52m 26hr 13m


Dataset 6 (E. coli K-12 MG1655, 8 SMRT cells)

We assembled all SMRT cells, and also randomly selected four and six SMRT cells, three times each. The assemblies were evaluated with QUAST against the reference genome (NC_000913) and Ec_gene_list.

Performance

| Statistics without reference | All data | 4 SMRT cells: 1st set | 4 SMRT cells: 2nd set | 4 SMRT cells: 3rd set | 6 SMRT cells: 1st set | 6 SMRT cells: 2nd set | 6 SMRT cells: 3rd set |
|---|---|---|---|---|---|---|---|
| # contigs | 2 | 8 | 10 | 14 | 1 | 1 | 4 |
| Largest contig | 4 278 957 | 2 277 010 | 1 213 670 | 984 459 | 4 641 350 | 4 640 250 | 3 162 440 |
| Total length | 4 650 771 | 4 648 304 | 4 644 602 | 4 656 274 | 4 641 350 | 4 640 250 | 4 653 394 |
| N50 | 4 278 957 | 2 043 590 | 2 044 147 | 2 135 225 | 3 162 440 | 4 640 250 | 4 641 350 |
| Misassemblies | | | | | | | |
| # misassemblies | 8 | 10 | 8 | 6 | 7 | 7 | 8 |
| Misassembled contigs length | 4 278 957 | 2 809 129 | 2 085 482 | 1 947 163 | 4 641 350 | 4 640 250 | 3 209 090 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 0.37 | 2.49 | 1.88 | 5.38 | 0.69 | 0.67 | 0.86 |
| # indels per 100kbp | 3.64 | 56.81 | 47.62 | 77.31 | 10.67 | 12.87 | 11 |
| # N's per 100kbp | 0 | 0.04 | 0.02 | 0.09 | 0 | 0 | 0 |
| Genome Statistics | | | | | | | |
| Genome fraction (%) | 99.93 | 99.733 | 99.67 | 99.693 | 99.972 | 99.946 | 99.968 |
| Duplication ratio | 1.003 | 1.006 | 1.005 | 1.008 | 1.001 | 1.001 | 1.005 |
| # genes | 4492+5 part | 4475+10 part | 4467+12 part | 4469+13 part | 4492+4 part | 4491+4 part | 4492+4 part |
| NGA50 | 1 207 233 | 531 351 | 721 189 | 565 251 | 2 499 057 | 2 499 697 | 1 267 262 |
| Running Time | 15hr 41m | 7hr 32m | 7hr 10m | 5hr 42m | 15hr 44m | 16hr 02m | 13hr 27m |

Dataset 7 (M. ruber DSM1279, 4 SMRT cells)

We used all SMRT cells for the assembly and evaluated it with QUAST against the reference genome (NC_013946).

Performance

| Statistics without reference | All data |
|---|---|
| # contigs | 2 |
| Largest contig | 2 974 307 |
| Total length | 3 100 289 |
| N50 | 2 974 307 |
| Misassemblies | |
| # misassemblies | 3 |
| Misassembled contigs length | 2 974 307 |
| Mismatches | |
| # mismatches per 100kbp | 0.23 |
| # indels per 100kbp | 5.04 |
| # N's per 100kbp | 0.03 |
| Genome Statistics | |
| Genome fraction (%) | 99.883 |
| Duplication ratio | 1.002 |
| # genes | 3093+4 part |
| NGA50 | 1 715 029 |
| Running Time | 8hr 7m |
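For readers unfamiliar with the N50 row, the sketch below uses the standard definition: the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length. The second contig length is not reported on this page; it is derived here as the total length minus the largest contig, which holds because the assembly has exactly two contigs. The function name `n50` is illustrative and is not part of QUAST.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest and stop once the
    # cumulative length covers at least half of the assembly.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Dataset 7: two contigs, largest 2 974 307 bp, total 3 100 289 bp.
largest = 2_974_307
total_length = 3_100_289
dataset7_contigs = [largest, total_length - largest]  # second contig is derived

print(n50(dataset7_contigs))
```

Because the largest contig alone covers more than half of the assembly, the N50 equals the largest contig length, which matches the table.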

Dataset 8 (P. heparinus DSM 2366, 7 SMRT cells)

We assembled all SMRT cells, and also randomly selected four SMRT cells three times. The assemblies were evaluated with QUAST against the reference genome (NC_013061).

Performance

| Statistics without reference | All data | 4 SMRT cells: 1st set | 4 SMRT cells: 2nd set | 4 SMRT cells: 3rd set |
|---|---|---|---|---|
| # contigs | 1 | 3 | 3 | 3 |
| Largest contig | 5 163 983 | 2 232 679 | 2 236 613 | 2 237 949 |
| Total length | 5 163 983 | 5 161 276 | 5 165 518 | 5 166 563 |
| N50 | 5 163 983 | 2 043 590 | 2 044 147 | 2 135 225 |
| Misassemblies | | | | |
| # misassemblies | 1 | 0 | 0 | 0 |
| Misassembled contigs length | 5 163 983 | 0 | 0 | 0 |
| Mismatches | | | | |
| # mismatches per 100kbp | 8.41 | 9.960 | 8.27 | 10.29 |
| # indels per 100kbp | 2.19 | 21.34 | 13.29 | 14.78 |
| # N's per 100kbp | 0 | 0 | 0 | 0 |
| Genome Statistics | | | | |
| Genome fraction (%) | 99.919 | 99.864 | 99.907 | 99.89 |
| Duplication ratio | 1.001 | 1.001 | 1.002 | 1.002 |
| # genes | 4335+3 part | 4330+5 part | 4333+5 part | 4333+3 part |
| NGA50 | 4 300 532 | 2 043 590 | 2 044 147 | 2 135 225 |
| Running Time | 21hr 36m | 11hr 39m | 12hr 26m | 12hr 12m |