HGAP

Revision as of 9 October 2013 19:33 by admin

The Hierarchical Genome Assembly Process (HGAP) was proposed in "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data" (Nat Meth 2013).


DataSet1

We ran the assembly on all SMRT cells, and also on randomly selected subsets of four and of six SMRT cells, drawing three random subsets for each size. The correctness of each assembly was assessed with QUAST.
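The random draws can be sketched as follows. This is a minimal illustration, not the script that produced the sets on this page: the function name `draw_subsets` and the `seed` parameter are hypothetical, and the list of cells is assumed to be the eight movie names that appear in the listings below.

```python
import random

# The eight SMRT cell movie names that appear in the subsets listed on this page.
SMRT_CELLS = [
    "m121023_202553_42178_c100389662550000001523034410251200_s1_p0",
    "m121023_224605_42178_c100389662550000001523034410251201_s1_p0",
    "m121024_010654_42178_c100389662550000001523034410251202_s1_p0",
    "m121024_032737_42178_c100389662550000001523034410251203_s1_p0",
    "m121024_074656_42178_c100389662550000001523034410251204_s1_p0",
    "m121024_100442_42178_c100389662550000001523034410251205_s1_p0",
    "m121024_122509_42178_c100389662550000001523034410251206_s1_p0",
    "m121024_144608_42178_c100389662550000001523034410251207_s1_p0",
]


def draw_subsets(cells, subset_size, n_draws, seed=None):
    """Draw n_draws independent random subsets, each without replacement.

    Hypothetical helper: the seed is only there to make a draw repeatable.
    """
    rng = random.Random(seed)
    return [rng.sample(cells, subset_size) for _ in range(n_draws)]


if __name__ == "__main__":
    for size in (4, 6):
        subsets = draw_subsets(SMRT_CELLS, size, n_draws=3, seed=size)
        for i, subset in enumerate(subsets, start=1):
            print(f"{size} SMRT cells, set {i}:")
            for cell in subset:
                print(f"  Random Get @{cell}")
```

Each subset is sampled without replacement, so a cell appears at most once within a set, while the same cell may appear in several sets, as in the listings below.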

Randomly Selected Four SMRT cells

First Set

Random Get @m121024_100442_42178_c100389662550000001523034410251205_s1_p0
Random Get @m121024_122509_42178_c100389662550000001523034410251206_s1_p0
Random Get @m121023_202553_42178_c100389662550000001523034410251200_s1_p0
Random Get @m121024_010654_42178_c100389662550000001523034410251202_s1_p0

Second Set

Random Get @m121024_122509_42178_c100389662550000001523034410251206_s1_p0
Random Get @m121023_202553_42178_c100389662550000001523034410251200_s1_p0
Random Get @m121024_032737_42178_c100389662550000001523034410251203_s1_p0
Random Get @m121024_144608_42178_c100389662550000001523034410251207_s1_p0

Third Set

Random Get @m121024_032737_42178_c100389662550000001523034410251203_s1_p0
Random Get @m121024_010654_42178_c100389662550000001523034410251202_s1_p0
Random Get @m121023_224605_42178_c100389662550000001523034410251201_s1_p0
Random Get @m121024_074656_42178_c100389662550000001523034410251204_s1_p0

Randomly Selected Six SMRT cells

First Set

Random Get @m121024_100442_42178_c100389662550000001523034410251205_s1_p0
Random Get @m121023_224605_42178_c100389662550000001523034410251201_s1_p0
Random Get @m121023_202553_42178_c100389662550000001523034410251200_s1_p0
Random Get @m121024_032737_42178_c100389662550000001523034410251203_s1_p0
Random Get @m121024_074656_42178_c100389662550000001523034410251204_s1_p0
Random Get @m121024_144608_42178_c100389662550000001523034410251207_s1_p0

Second Set

Random Get @m121024_074656_42178_c100389662550000001523034410251204_s1_p0
Random Get @m121023_224605_42178_c100389662550000001523034410251201_s1_p0
Random Get @m121024_032737_42178_c100389662550000001523034410251203_s1_p0
Random Get @m121024_144608_42178_c100389662550000001523034410251207_s1_p0
Random Get @m121024_010654_42178_c100389662550000001523034410251202_s1_p0
Random Get @m121024_100442_42178_c100389662550000001523034410251205_s1_p0

Third Set

Random Get @m121023_224605_42178_c100389662550000001523034410251201_s1_p0
Random Get @m121023_202553_42178_c100389662550000001523034410251200_s1_p0
Random Get @m121024_032737_42178_c100389662550000001523034410251203_s1_p0
Random Get @m121024_122509_42178_c100389662550000001523034410251206_s1_p0
Random Get @m121024_010654_42178_c100389662550000001523034410251202_s1_p0
Random Get @m121024_144608_42178_c100389662550000001523034410251207_s1_p0

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set |
|---|---|---|---|---|---|---|---|
| # contigs | 16 | 10 | 14 | 16 | 9 | 18 | 13 |
| Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551 |
| Total length | 4 808 733 | 4 706 800 | 4 705 398 | 4 745 036 | 4 741 512 | 4 814 718 | 4 749 785 |
| N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010 |
| Misassemblies | | | | | | | |
| # misassemblies | 19 | 9 | 12 | 15 | 14 | 17 | 11 |
| Misassembled contigs length | 2 939 040 | 3 530 352 | 2 949 761 | 3 653 461 | 3 820 624 | 2 387 129 | 3 986 402 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 0.8 | 0.43 | 0.58 | 1.36 | 0.15 | 0.95 | 0.58 |
| # indels per 100kbp | 5.71 | 2.98 | 4.45 | 9.56 | 1.77 | 8.02 | 6.88 |
| # N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Genome Statistics | | | | | | | |
| Genome fraction (%) | 100 | 100 | 99.815 | 99.87 | 100 | 99.995 | 99.979 |
| Duplication ratio | 1.037 | 1.016 | 1.017 | 1.025 | 1.022 | 1.038 | 1.025 |
| # genes | 4494+3 part | 4494+3 part | 4480+7 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part |
| NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242 |

Running Time (four values recorded): 19hr 06m, 13hr 34m, 13hr 21m, 12hr 38m
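For readers unfamiliar with the N50 row, the statistic can be computed as in the sketch below. It follows the usual definition (the contig length L such that contigs of length L or longer cover at least half of the total assembly length); the function name `n50` is illustrative and this is not QUAST's own code.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the assembly.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

When a single contig covers more than half of the assembly, the N50 equals the largest contig. That matches the "4 SMRT cells : 1st Set" column, where the N50 of 3 484 877 is more than half of the 4 706 800 total length.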


Discard Unconvincing Contigs

We aligned the subreads to the contigs and discarded every contig with fewer than 100 aligned reads.
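The filtering step can be sketched as below. This is an illustration under stated assumptions: the per-contig read counts are assumed to be already tallied from the alignment output (how they were obtained is not described on this page), and the names `parse_fasta` and `keep_supported_contigs` are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping contig name -> sequence."""
    parts_by_name = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the contig name.
            name = header[0] if header else ""
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def keep_supported_contigs(contigs, aligned_read_counts, min_reads=100):
    """Keep contigs with at least min_reads aligned reads.

    Contigs missing from aligned_read_counts are treated as having 0 reads.
    """
    return {
        name: seq
        for name, seq in contigs.items()
        if aligned_read_counts.get(name, 0) >= min_reads
    }
```

For example, with counts `{"contig_1": 250, "contig_2": 99}` only `contig_1` is kept; a contig with exactly 100 aligned reads is also kept, since only contigs with fewer than 100 are discarded.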

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set |
|---|---|---|---|---|---|---|---|
| # contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12 |
| Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551 |
| Total length | 4 706 061 | 4 674 582 | 4 659 277 | 4 682 754 | 4 680 475 | 4 702 993 | 4 739 366 |
| N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010 |
| Misassemblies | | | | | | | |
| # misassemblies | 10 | 7 | 8 | 9 | 9 | 8 | 10 |
| Misassembled contigs length | 2 836 368 | 3 498 134 | 2 903 640 | 3 591 179 | 3 759 587 | 2 275 404 | 3 975 983 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 0.8 | 0.43 | 0.45 | 1.27 | 0.15 | 0.75 | 0.58 |
| # indels per 100kbp | 5.71 | 2.98 | 3.56 | 8.72 | 1.77 | 6.06 | 6.88 |
| # N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Genome Statistics | | | | | | | |
| Genome fraction (%) | 100 | 100 | 99.798 | 99.87 | 100 | 99.995 | 99.979 |
| Duplication ratio | 1.014 | 1.009 | 1.006 | 1.012 | 1.009 | 1.014 | 1.023 |
| # genes | 4494+3 part | 4494+3 part | 4479+8 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part |
| NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242 |



Discard Lower-case bases

We trimmed the low-quality bases, which are written in lower case, from both ends of each contig.
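A minimal sketch of this trimming step is shown below. It assumes that only the lower-case runs at the two ends are removed and that any lower-case bases inside a contig are left untouched; the function name `trim_lowercase_ends` is hypothetical.

```python
def trim_lowercase_ends(sequence):
    """Remove the leading and trailing runs of lower-case bases from a contig.

    Lower-case bases in the interior of the sequence are kept as they are.
    """
    start = 0
    end = len(sequence)
    # Skip lower-case bases at the left end.
    while start < end and sequence[start].islower():
        start += 1
    # Skip lower-case bases at the right end.
    while end > start and sequence[end - 1].islower():
        end -= 1
    return sequence[start:end]
```

For example, `trim_lowercase_ends("acgTTGacGTAca")` returns `"TTGacGTA"`: the terminal `acg` and `ca` are dropped, while the internal `ac` stays.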

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set |
|---|---|---|---|---|---|---|---|
| # contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12 |
| Largest contig | 2 196 495 | 3 478 799 | 1 936 007 | 1 948 495 | 2 100 388 | 1 165 497 | 1 438 506 |
| Total length | 4 694 972 | 4 662 655 | 4 649 216 | 4 657 587 | 4 668 899 | 4 681 301 | 4 714 790 |
| N50 | 1 005 009 | 3 478 799 | 964 998 | 1 433 016 | 1 654 501 | 375 502 | 1 266 511 |
| Misassemblies | | | | | | | |
| # misassemblies | 9 | 9 | 7 | 8 | 9 | 7 | 8 |
| Misassembled contigs length | 2 210 994 | 3 490 490 | 2 901 005 | 3 496 520 | 3 754 889 | 2 256 498 | 3 197 010 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 0.63 | 0.28 | 0.22 | 0.91 | 0.15 | 0.54 | 0.47 |
| # indels per 100kbp | 5.02 | 2.55 | 1.84 | 7.08 | 1.68 | 4.91 | 6.12 |
| # N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Genome Statistics | | | | | | | |
| Genome fraction (%) | 100 | 99.842 | 99.776 | 99.889 | 100 | 99.985 | 99.979 |
| Duplication ratio | 1.012 | 1.008 | 1.005 | 1.006 | 1.006 | 1.009 | 1.018 |
| # genes | 4494+3 part | 4485+6 part | 4478+9 part | 4482+11 part | 4494+3 part | 4493+4 part | 4492+5 part |
| NGA50 | 614 657 | 1 088 544 | 572 342 | 875 453 | 843 983 | 632 720 | 1 265 743 |

DataSet2

Discard Contigs

Discard Lower-case bases

DataSet3

Discard Contigs

Discard Lower-case bases