The Hierarchical Genome Assembly Process (HGAP) was proposed by Chin et al. in "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data" (Nature Methods, 2013).
Dataset 6 (E. coli K-12 MG1655, 8 SMRT cells)
We assembled the data from all eight SMRT cells, and also from randomly selected subsets of four and six SMRT cells (three random draws for each subset size), and assessed the correctness of each assembly with QUAST.
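The random draws of SMRT cells can be sketched as below. This is only an illustration under stated assumptions: the cell names, the function `draw_subsets`, and the fixed seeds are hypothetical and are not taken from the original experiments.

```python
import random


def draw_subsets(cells, size, repeats, seed=0):
    """Draw `repeats` random subsets of `size` SMRT cells each (without replacement)."""
    rng = random.Random(seed)  # fixed seed so that the draws can be reproduced
    return [sorted(rng.sample(cells, size)) for _ in range(repeats)]


# Hypothetical names standing in for the 8 SMRT cells of the dataset.
cells = [f"cell_{i}" for i in range(1, 9)]

# Three draws of four cells and three draws of six cells.
subsets = {size: draw_subsets(cells, size, repeats=3, seed=size) for size in (4, 6)}

for size, draws in subsets.items():
    for n, subset in enumerate(draws, start=1):
        print(f"{size} SMRT cells, set {n}: {subset}")
```

Each draw samples cells without replacement, so no cell appears twice within one set, while different sets of the same size may overlap.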
Performance
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set |
# contigs | 16 | 10 | 14 | 16 | 9 | 18 | 13 |
Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551 |
Total length | 4 808 733 | 4 706 800 | 4 705 398 | 4 745 036 | 4 741 512 | 4 814 718 | 4 749 785 |
N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010 |
Misassemblies | | | | | | | |
# misassemblies | 19 | 9 | 12 | 15 | 14 | 17 | 11 |
Misassembled contigs length | 2 939 040 | 3 530 352 | 2 949 761 | 3 653 461 | 3 820 624 | 2 387 129 | 3 986 402 |
Mismatches | | | | | | | |
# mismatches per 100kbp | 0.8 | 0.43 | 0.58 | 1.36 | 0.15 | 0.95 | 0.58 |
# indels per 100kbp | 5.71 | 2.98 | 4.45 | 9.56 | 1.77 | 8.02 | 6.88 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | |
Genome fraction (%) | 100 | 100 | 99.815 | 99.87 | 100 | 99.995 | 99.979 |
Duplication ratio | 1.037 | 1.016 | 1.017 | 1.025 | 1.022 | 1.038 | 1.025 |
# genes | 4494+3 part | 4494+3 part | 4480+7 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part |
NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242 |
Running Time | 19hr 06m | 13hr 34m | 13hr 21m | 12hr 38m | 21hr 28m | 22hr 56m | 22hr 07m |
Discard Unconvincing Contigs
We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.
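The filtering step can be sketched as follows. The original text does not say which aligner or counting tool was used, so this sketch assumes that per-contig read counts are available in the four-column, tab-separated layout printed by `samtools idxstats` (reference name, length, mapped count, unmapped count). The function name `contigs_to_keep` and the example numbers are hypothetical.

```python
def contigs_to_keep(idxstats_text, min_reads=100):
    """Return the names of contigs with at least `min_reads` aligned reads.

    `idxstats_text` is assumed to be in `samtools idxstats` layout:
    name <TAB> length <TAB> mapped <TAB> unmapped, one contig per line.
    """
    keep = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue  # the "*" row summarises reads not placed on any contig
        if int(mapped) >= min_reads:  # contigs with fewer than 100 reads are discarded
            keep.append(name)
    return keep


# Made-up counts, for illustration only.
example = (
    "contig_1\t2198457\t51234\t0\n"
    "contig_2\t10321\t37\t0\n"
    "contig_3\t15210\t100\t0\n"
    "*\t0\t0\t412\n"
)
kept = contigs_to_keep(example)
print(kept)
```

The threshold is inclusive on the keep side: a contig with exactly 100 aligned reads is retained, matching "fewer than 100 reads" as the discard rule.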
Performance
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set |
# contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12 |
Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551 |
Total length | 4 706 061 | 4 674 582 | 4 659 277 | 4 682 754 | 4 680 475 | 4 702 993 | 4 739 366 |
N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010 |
Misassemblies | | | | | | | |
# misassemblies | 10 | 7 | 8 | 9 | 9 | 8 | 10 |
Misassembled contigs length | 2 836 368 | 3 498 134 | 2 903 640 | 3 591 179 | 3 759 587 | 2 275 404 | 3 975 983 |
Mismatches | | | | | | | |
# mismatches per 100kbp | 0.8 | 0.43 | 0.45 | 1.27 | 0.15 | 0.75 | 0.58 |
# indels per 100kbp | 5.71 | 2.98 | 3.56 | 8.72 | 1.77 | 6.06 | 6.88 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | |
Genome fraction (%) | 100 | 100 | 99.798 | 99.87 | 100 | 99.995 | 99.979 |
Duplication ratio | 1.014 | 1.009 | 1.006 | 1.012 | 1.009 | 1.014 | 1.023 |
# genes | 4494+3 part | 4494+3 part | 4479+8 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part |
NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242 |
Discard Lower-case bases
After discarding the unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each remaining contig.
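A minimal sketch of this trimming step is given below. It assumes that only the lower-case runs at the two ends of a contig are removed and that lower-case bases in the interior are left untouched. The function name `trim_lowercase_ends` is hypothetical, and the sequence is a toy example rather than real assembly output.

```python
def trim_lowercase_ends(seq):
    """Strip the leading and trailing runs of lower-case bases from a contig."""
    start, end = 0, len(seq)
    while start < end and seq[start].islower():
        start += 1  # advance past low-quality bases at the left end
    while end > start and seq[end - 1].islower():
        end -= 1  # retreat past low-quality bases at the right end
    return seq[start:end]


# Toy contig: lower-case ends are removed, the interior "ac" is kept.
trimmed = trim_lowercase_ends("acgtACGTacGTTAgg")
print(trimmed)  # ACGTacGTTA
```

Because only the ends are trimmed, contig counts stay the same while contig lengths shrink slightly, which is the pattern visible when comparing the statistics before and after this step.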
Performance
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set |
# contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12 |
Largest contig | 2 196 495 | 3 478 799 | 1 936 007 | 1 948 495 | 2 100 388 | 1 165 497 | 1 438 506 |
Total length | 4 694 972 | 4 662 655 | 4 649 216 | 4 657 587 | 4 668 899 | 4 681 301 | 4 714 790 |
N50 | 1 005 009 | 3 478 799 | 964 998 | 1 433 016 | 1 654 501 | 375 502 | 1 266 511 |
Misassemblies | | | | | | | |
# misassemblies | 9 | 9 | 7 | 8 | 9 | 7 | 8 |
Misassembled contigs length | 2 210 994 | 3 490 490 | 2 901 005 | 3 496 520 | 3 754 889 | 2 256 498 | 3 197 010 |
Mismatches | | | | | | | |
# mismatches per 100kbp | 0.63 | 0.28 | 0.22 | 0.91 | 0.15 | 0.54 | 0.47 |
# indels per 100kbp | 5.02 | 2.55 | 1.84 | 7.08 | 1.68 | 4.91 | 6.12 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | |
Genome fraction (%) | 100 | 99.842 | 99.776 | 99.889 | 100 | 99.985 | 99.979 |
Duplication ratio | 1.012 | 1.008 | 1.005 | 1.006 | 1.006 | 1.009 | 1.018 |
# genes | 4494+3 part | 4485+6 part | 4478+9 part | 4482+11 part | 4494+3 part | 4493+4 part | 4492+5 part |
NGA50 | 614 657 | 1 088 544 | 572 342 | 875 453 | 843 983 | 632 720 | 1 265 743 |
Dataset 7 (M. ruber DSM1279, 4 SMRT cells)
We assembled the data from all SMRT cells and assessed the correctness of the assembly with QUAST.
Performance
Statistics without reference | All Data |
# contigs | 3 |
Largest contig | 2 548 031 |
Total length | 3 121 070 |
N50 | 2 548 031 |
Misassemblies | |
# misassemblies | 1 |
Misassembled contigs length | 2 548 031 |
Mismatches | |
# mismatches per 100kbp | 0.52 |
# indels per 100kbp | 2.71 |
# N's per 100kbp | 0 |
Genome Statistics | |
Genome fraction (%) | 99.986 |
Duplication ratio | 1.017 |
# genes | 3103+2 part |
NGA50 | 1 155 126 |
Running Time | 18hr 19m |
Discard Lower-case bases
Statistics without reference | All Data |
# contigs | 3 |
Largest contig | 2 545 501 |
Total length | 3 115 015 |
N50 | 2 545 501 |
Misassemblies | |
# misassemblies | 1 |
Misassembled contigs length | 2 545 501 |
Mismatches | |
# mismatches per 100kbp | 0.42 |
# indels per 100kbp | 2.52 |
# N's per 100kbp | 0 |
Genome Statistics | |
Genome fraction (%) | 99.986 |
Duplication ratio | 1.015 |
# genes | 3103+2 part |
NGA50 | 1 153 096 |
Dataset 3
Performance
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set |
# contigs | 3 | 3 | 3 | 6 |
Largest contig | 2 934 267 | 2 927 454 | 2 929 942 | 2 226 051 |
Total length | 5 178 932 | 5 176 592 | 5 176 771 | 5 182 410 |
N50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457 |
Misassemblies | | | | |
# misassemblies | 0 | 1 | 0 | 1 |
Misassembled contigs length | 0 | 2 240 169 | 0 | 13 124 |
Mismatches | | | | |
# mismatches per 100kbp | 0 | 0.02 | 0.06 | 6.45 |
# indels per 100kbp | 1.05 | 0.54 | 0.6 | 1.88 |
# N's per 100kbp | 0 | 0 | 0 | 0 |
Genome Statistics | | | | |
Genome fraction (%) | 100 | 100 | 100 | 99.936 |
Duplication ratio | 1.003 | 1.003 | 1.003 | 1.006 |
# genes | 4338+1 part | 4338+1 part | 4338+1 part | 4335+4 part |
NGA50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457 |
Running Time | 24hr 56m | 17hr 41m | 18hr 14m | 17hr 04m |
Discard Contigs
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set |
# contigs | 3 | 2 | 2 | 5 |
Largest contig | 2 934 267 | 2 927 454 | 2 929 942 | 2 226 051 |
Total length | 5 178 932 | 5 167 623 | 5 167 190 | 5 172 946 |
N50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457 |
Misassemblies | | | | |
# misassemblies | 0 | 1 | 0 | 1 |
Misassembled contigs length | 0 | 2 240 169 | 0 | 13 124 |
Mismatches | | | | |
# mismatches per 100kbp | 0 | 0.04 | 0.08 | 6.45 |
# indels per 100kbp | 1.05 | 0.68 | 0.6 | 1.82 |
# N's per 100kbp | 0 | 0 | 0 | 0 |
Genome Statistics | | | | |
Genome fraction (%) | 100 | 99.951 | 99.916 | 99.878 |
Duplication ratio | 1.003 | 1.002 | 1.002 | 1.004 |
# genes | 4338+1 part | 4336+2 part | 4335+3 part | 4333+5 part |
NGA50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457 |
Discard Lower-case bases
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set |
# contigs | 3 | 2 | 2 | 5 |
Largest contig | 2 932 503 | 2 925 498 | 2 925 998 | 2 225 051 |
Total length | 5 175 001 | 5 163 999 | 5 162 498 | 5 161 405 |
N50 | 2 932 503 | 2 925 498 | 2 925 998 | 2 131 500 |
Misassemblies | | | | |
# misassemblies | 0 | 1 | 0 | 0 |
Misassembled contigs length | 0 | 2 238 501 | 0 | 0 |
Mismatches | | | | |
# mismatches per 100kbp | 0.02 | 0.06 | 9.98 | 6.42 |
# indels per 100kbp | 0.77 | 0.56 | 0.85 | 1.45 |
# N's per 100kbp | 0 | 0 | 0 | 0 |
Genome Statistics | | | | |
Genome fraction (%) | 100 | 99.931 | 99.869 | 99.782 |
Duplication ratio | 1.002 | 1.001 | 1.002 | 1.002 |
# genes | 4338+1 part | 4336+2 part | 4331+4 part | 4328+7 part |
NGA50 | 2 932 503 | 2 925 498 | 2 925 998 | 2 131 500 |
Dataset 4
We randomly selected four, six, and eight SMRT cells, three times for each subset size, and assessed the correctness of each assembly with QUAST.
Performance
Statistics without reference | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set | 8 SMRT cells : 1st Set | 8 SMRT cells : 2nd Set | 8 SMRT cells : 3rd Set |
# contigs | 5 | 10 | 4 | 11 | 7 | 8 | 6 | 10 | 5 |
Largest contig | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308 |
Total length | 4 684 069 | 4 723 363 | 4 671 153 | 4 736 342 | 4 711 060 | 4 708 831 | 4 706 433 | 4 731 334 | 4 691 736 |
N50 | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308 |
Misassemblies | | | | | | | | | |
# misassemblies | 10 | 13 | 13 | 15 | 12 | 11 | 11 | 16 | 12 |
Misassembled contigs length | 3 788 648 | 4 700 016 | 4 671 153 | 4 726 005 | 4 685 712 | 3 339 030 | 4 694 303 | 4 698 068 | 4 649 308 |
Mismatches | | | | | | | | | |
# mismatches per 100kbp | 0.47 | 0.56 | 0.37 | 0.19 | 0.11 | 0.15 | 0.13 | 0.43 | 0.17 |
# indels per 100kbp | 1.08 | 4.44 | 0.22 | 1.66 | 0.63 | 0.65 | 0.19 | 4.59 | 0.56 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | | | |
Genome fraction (%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100 |
Duplication ratio | 1.01 | 1.018 | 1.007 | 1.021 | 1.031 | 1.015 | 1.012 | 1.02 | 1.011 |
# genes | 4495+2 part | 4495+2 part | 4493+3 part | 4494+3 part | 4495+2 part | 4495+2 part | 4495+2 part | 4494+3 part | 4495+2 part |
NGA50 | 1 207 217 | 2 558 505 | 1 640 882 | 2 888 022 | 2 834 458 | 1 298 912 | 1 477 605 | 1 344 200 | 2 995 586 |
Running Time | ?hr ?m | ?hr ?m | ?hr ?m | 21hr 05m | 19hr 32m | 21hr 01m | 26hr 46m | 27hr 52m | 26hr 13m |
Discard Unconvincing Contigs
We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.
Performance
Statistics without reference | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set | 8 SMRT cells : 1st Set | 8 SMRT cells : 2nd Set | 8 SMRT cells : 3rd Set |
# contigs | 2 | 6 | 1 | 5 | 2 | 4 | 2 | 3 | 2 |
Largest contig | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308 |
Total length | 4 651 736 | 4 691 077 | 4 644 754 | 4 675 943 | 4 660 074 | 4 671 197 | 4 664 502 | 4 661 980 | 4 661 084 |
N50 | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308 |
Misassemblies | | | | | | | | | |
# misassemblies | 8 | 10 | 10 | 10 | 8 | 7 | 8 | 9 | 9 |
Misassembled contigs length | 3 770 578 | 4 677 561 | 4 644 754 | 4 675 943 | 4 647 724 | 3 301 396 | 4 664 502 | 4 639 404 | 4 649 308 |
Mismatches | | | | | | | | | |
# mismatches per 100kbp | 0.15 | 0.5 | 0.37 | 0.22 | 0.11 | 0.15 | 0.13 | 0.22 | 0.17 |
# indels per 100kbp | 0.47 | 3.34 | 0.22 | 1.47 | 0.63 | 0.65 | 0.19 | 1.44 | 0.56 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | | | |
Genome fraction (%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100 |
Duplication ratio | 1.003 | 1.011 | 1.002 | 1.008 | 1.005 | 1.007 | 1.005 | 1.005 | 1.005 |
# genes | 4494+3 part | 4495+2 part | 4493+3 part | 4493+4 part | 4495+2 part | 4495+2 part | 4495+2 part | 4493+4 part | 4495+2 part |
NGA50 | 1 207 217 | 2 558 505 | 1 640 882 | 2 888 022 | 2 834 458 | 1 298 912 | 1 477 605 | 1 344 200 | 2 995 586 |
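As a sanity check on how the N50 row relates to the other rows, the sketch below recomputes N50 for the "4 SMRT cells : 1st Set" column after contig filtering. That assembly has two contigs, so the second contig length follows from the reported total length minus the largest contig. The function `n50` is a hypothetical helper written for this illustration, using the usual definition (the length of the contig at which the running sum of contigs, sorted from longest to shortest, first reaches half of the total length).

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # contigs so far cover at least half the assembly
            return length
    raise ValueError("n50 needs at least one contig")


largest = 3_770_578  # "Largest contig" reported for 4 SMRT cells : 1st Set
total = 4_651_736    # "Total length" reported for the same column
lengths = [largest, total - largest]  # two contigs, so the other one is 881 158 bp

print(n50(lengths))  # 3770578, the N50 reported in the table
```

Since the largest contig alone covers more than half of the assembly, N50 equals the largest contig length, which is why those two rows coincide throughout this table.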
Discard Lower-case bases
After discarding the unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each remaining contig.
Performance
Statistics without reference | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set | 8 SMRT cells : 1st Set | 8 SMRT cells : 2nd Set | 8 SMRT cells : 3rd Set |
# contigs | 2 | 6 | 1 | 4 | 2 | 4 | 2 | 3 | 2 |
Largest contig | 3 768 995 | 4 105 501 | 4 644 254 | 3 784 001 | 4 646 000 | 3 287 004 | 4 646 998 | 4 622 502 | 4 647 000 |
Total length | 4 649 500 | 4 678 503 | 4 644 254 | 4 660 999 | 4 655 498 | 4 667 500 | 4 660 992 | 4 660 836 | 4 656 000 |
N50 | 3 768 995 | 4 105 501 | 4 644 254 | 3 784 001 | 4 646 000 | 3 287 004 | 4 646 998 | 4 622 502 | 4 647 000 |
Misassemblies | | | | | | | | | |
# misassemblies | 8 | 10 | 10 | 9 | 8 | 7 | 8 | 9 | 8 |
Misassembled contigs length | 3 768 995 | 4 666 999 | 4 644 254 | 4 660 999 | 4 646 000 | 3 299 005 | 4 660 992 | 4 638 338 | 4 647 000 |
Mismatches | | | | | | | | | |
# mismatches per 100kbp | 0.15 | 0.5 | 0.37 | 0.19 | 0.11 | 0.11 | 0.13 | 0.22 | 0.17 |
# indels per 100kbp | 0.37 | 2.93 | 0.22 | 1.44 | 0.54 | 0.58 | 0.19 | 1.34 | 0.47 |
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Genome Statistics | | | | | | | | | |
Genome fraction (%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100 |
Duplication ratio | 1.002 | 1.008 | 1.002 | 1.005 | 1.004 | 1.006 | 1.005 | 1.005 | 1.004 |
# genes | 4494+3 part | 4494+3 part | 4493+3 part | 4493+4 part | 4495+2 part | 4495+2 part | 4495+2 part | 4493+4 part | 4495+2 part |
NGA50 | 1 207 217 | 2 558 154 | 1 640 382 | 2 888 022 | 2 833 234 | 1 298 912 | 1 476 281 | 1 344 200 | 2 995 586 |