HGAP

Revision as of 13 January 2015 20:01 by admin

The Hierarchical Genome Assembly Process (HGAP) was proposed in "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data" (Nat Meth 2013).

Hierarchical genome-assembly process

We downloaded smrtanalysis-2.0.1 from DevNet. The RS_HGAP_Assembly.1 and RS_Modification_and_Motif_Analysis.1 protocols can be run on SMRT Portal or executed from the command line.


Prepare data for HGAP Protocol
1. Build the input XML file (see the tutorial for the detailed steps).
2. Build the HGAP parameters XML file, HGAP2.0.xml. We mostly used the default parameter settings, and set minSubReadLength = 50, readScore = 0.75, and minLength = 50.
3. Execute the HGAP protocol.

smrtpipe.py --params=HGAP.xml xml:input.xml
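
As a rough illustration of what the three filter values in step 2 mean, the sketch below applies them to a list of subread records. The `Subread` record and the interpretation of each threshold (minimum read length, minimum subread length, minimum read quality score) are assumptions made for illustration; the actual filtering is done inside SMRT Analysis.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; their exact semantics in SMRT Analysis are assumed.
MIN_LENGTH = 50            # minLength: assumed minimum read length
MIN_SUBREAD_LENGTH = 50    # minSubReadLength: assumed minimum subread length
MIN_READ_SCORE = 0.75      # readScore: assumed minimum read quality score


@dataclass
class Subread:
    """Hypothetical record for illustration; not a SMRT Analysis type."""
    name: str
    read_length: int
    subread_length: int
    read_score: float


def passes_filter(s: Subread) -> bool:
    """Return True if the subread meets all three thresholds."""
    return (
        s.read_length >= MIN_LENGTH
        and s.subread_length >= MIN_SUBREAD_LENGTH
        and s.read_score >= MIN_READ_SCORE
    )


def filter_subreads(subreads):
    """Keep only the subreads that pass the filter."""
    return [s for s in subreads if passes_filter(s)]


# Made-up example records.
example = [
    Subread("good", 5000, 2500, 0.85),
    Subread("low_score", 5000, 2500, 0.60),
    Subread("short_subread", 5000, 40, 0.90),
]
kept = [s.name for s in filter_subreads(example)]
print(kept)
```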

Import reference
1. After the HGAP protocol finishes, it generates polished_assembly.fasta.gz in the "data" folder. This file serves as the reference: the single-pass reads, as specified by the original filter parameters, are mapped to this draft assembly so that Quiver can generate a more accurate consensus sequence.
2. Import the reference through SMRT Portal.
3. SMRT Portal will generate a reference folder under /opt/smrtanalysis/common/userdata.d/references/XXXXXX. You can copy the whole folder to your working directory, or assign its path in Quiver.xml.

Prepare for Quiver
1. Build the Quiver parameters XML file, Quiver.xml. We set minSubReadLength = 50, readScore = 0.75, and minLength = 50, and used the default values for the other parameters.
2. Execute the Quiver protocol.

smrtpipe.py --params=Quiver.xml xml:input.xml
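
The HGAP and Quiver runs use the same command shape and differ only in the parameters file. A minimal sketch of scripting both calls from Python follows; the helper name `smrtpipe_command` is hypothetical, and only the arguments shown in the commands above are used.

```python
import shlex


def smrtpipe_command(params_xml, input_xml):
    """Build the smrtpipe.py argument list in the form used on this page."""
    return ["smrtpipe.py", f"--params={params_xml}", f"xml:{input_xml}"]


hgap_cmd = smrtpipe_command("HGAP.xml", "input.xml")
quiver_cmd = smrtpipe_command("Quiver.xml", "input.xml")

# Print the commands for inspection instead of running them.
print(shlex.join(hgap_cmd))
print(shlex.join(quiver_cmd))

# To actually execute a step (requires an SMRT Analysis installation):
# import subprocess
# subprocess.run(hgap_cmd, check=True)
```

Building the command as a list avoids shell quoting issues if the file names contain spaces.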

Dataset 5 (E. coli K-12 MG1655, 17 SMRT cells)

We randomly selected four, six, and eight SMRT cells, three times each, and evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list.
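
The subsampling design can be sketched as follows. This is only an illustration of drawing three random subsets per size; the cell names, the seed, and the helper `draw_cell_subsets` are hypothetical, and the page does not say how the original subsets were drawn.

```python
import random


def draw_cell_subsets(cells, sizes=(4, 6, 8), repeats=3, seed=0):
    """Draw `repeats` random subsets of each size, without replacement within a subset."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        size: [sorted(rng.sample(cells, size)) for _ in range(repeats)]
        for size in sizes
    }


# Hypothetical names for the 17 SMRT cells of Dataset 5.
cells = [f"cell_{i:02d}" for i in range(1, 18)]
subsets = draw_cell_subsets(cells)

for size, sets in subsets.items():
    for n, chosen in enumerate(sets, start=1):
        print(f"{size} SMRT cells, set {n}: {chosen}")
```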

Assembly

Statistics without reference | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set | 8 SMRT cells: 1st Set | 8 SMRT cells: 2nd Set | 8 SMRT cells: 3rd Set
# contigs | 5 | 10 | 4 | 11 | 7 | 8 | 6 | 10 | 5
Largest contig | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308
Total length | 4 684 069 | 4 723 363 | 4 671 153 | 4 736 342 | 4 711 060 | 4 708 831 | 4 706 433 | 4 731 334 | 4 691 736
N50 | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308
Misassemblies
# misassemblies | 10 | 13 | 13 | 15 | 12 | 11 | 11 | 16 | 12
Misassembled contigs length | 3 788 648 | 4 700 016 | 4 671 153 | 4 726 005 | 4 685 712 | 3 339 030 | 4 694 303 | 4 698 068 | 4 649 308
Mismatches
# mismatches per 100kbp | 0.47 | 0.56 | 0.37 | 0.19 | 0.11 | 0.15 | 0.13 | 0.43 | 0.17
# indels per 100kbp | 1.08 | 4.44 | 0.22 | 1.66 | 0.63 | 0.65 | 0.19 | 4.59 | 0.56
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100
Duplication ratio | 1.01 | 1.018 | 1.007 | 1.021 | 1.031 | 1.015 | 1.012 | 1.02 | 1.011
# genes | 4495+2 part | 4495+2 part | 4493+3 part | 4494+3 part | 4495+2 part | 4495+2 part | 4495+2 part | 4494+3 part | 4495+2 part
NGA50 | 1 207 217 | 2 558 505 | 1 640 882 | 2 888 022 | 2 834 458 | 1 298 912 | 1 477 605 | 1 344 200 | 2 995 586
Running Time | ?hr ?m | ?hr ?m | ?hr ?m | 21hr 05m | 19hr 32m | 21hr 01m | 26hr 46m | 27hr 52m | 26hr 13m


Postprocess by discarding unconvincing contigs

We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.
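
A minimal sketch of this filtering rule, assuming the number of aligned subreads per contig has already been counted (for example from the alignment output). The contig names and counts below are made up for illustration.

```python
MIN_ALIGNED_READS = 100  # contigs with fewer aligned reads than this are discarded


def keep_supported_contigs(read_counts, min_reads=MIN_ALIGNED_READS):
    """Return the names of contigs with at least `min_reads` aligned reads."""
    return [name for name, count in read_counts.items() if count >= min_reads]


# Hypothetical per-contig counts of aligned subreads.
read_counts = {"contig_1": 48210, "contig_2": 100, "contig_3": 99, "contig_4": 12}
kept = keep_supported_contigs(read_counts)
print(kept)
```

A contig with exactly 100 aligned reads is kept, since only contigs with fewer than 100 are discarded.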

Statistics without reference | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set | 8 SMRT cells: 1st Set | 8 SMRT cells: 2nd Set | 8 SMRT cells: 3rd Set
# contigs | 2 | 6 | 1 | 5 | 2 | 4 | 2 | 3 | 2
Largest contig | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308
Total length | 4 651 736 | 4 691 077 | 4 644 754 | 4 675 943 | 4 660 074 | 4 671 197 | 4 664 502 | 4 661 980 | 4 661 084
N50 | 3 770 578 | 4 106 852 | 4 644 754 | 3 785 116 | 4 647 724 | 3 287 965 | 4 649 322 | 4 623 068 | 4 649 308
Misassemblies
# misassemblies | 8 | 10 | 10 | 10 | 8 | 7 | 8 | 9 | 9
Misassembled contigs length | 3 770 578 | 4 677 561 | 4 644 754 | 4 675 943 | 4 647 724 | 3 301 396 | 4 664 502 | 4 639 404 | 4 649 308
Mismatches
# mismatches per 100kbp | 0.15 | 0.5 | 0.37 | 0.22 | 0.11 | 0.15 | 0.13 | 0.22 | 0.17
# indels per 100kbp | 0.47 | 3.34 | 0.22 | 1.47 | 0.63 | 0.65 | 0.19 | 1.44 | 0.56
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100
Duplication ratio | 1.003 | 1.011 | 1.002 | 1.008 | 1.005 | 1.007 | 1.005 | 1.005 | 1.005
# genes | 4494+3 part | 4495+2 part | 4493+3 part | 4493+4 part | 4495+2 part | 4495+2 part | 4495+2 part | 4493+4 part | 4495+2 part
NGA50 | 1 207 217 | 2 558 505 | 1 640 882 | 2 888 022 | 2 834 458 | 1 298 912 | 1 477 605 | 1 344 200 | 2 995 586

Postprocess by discarding lower-case bases

After discarding unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each contig.
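
The end trimming can be sketched as below, assuming each contig is available as a plain sequence string in which low-quality bases are written in lower case. The function name `trim_lowercase_ends` is hypothetical; lower-case bases in the interior of a contig are left untouched.

```python
def trim_lowercase_ends(seq):
    """Strip runs of lower-case bases from both ends of a contig sequence."""
    start = 0
    end = len(seq)
    # Advance past leading lower-case bases.
    while start < end and seq[start].islower():
        start += 1
    # Step back over trailing lower-case bases.
    while end > start and seq[end - 1].islower():
        end -= 1
    return seq[start:end]


# Made-up example: interior lower-case bases ("ca") are kept.
print(trim_lowercase_ends("acgTTGcaACGtt"))
```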

Statistics without reference | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set | 8 SMRT cells: 1st Set | 8 SMRT cells: 2nd Set | 8 SMRT cells: 3rd Set
# contigs | 2 | 6 | 1 | 4 | 2 | 4 | 2 | 3 | 2
Largest contig | 3 768 995 | 4 105 501 | 4 644 254 | 3 784 001 | 4 646 000 | 3 287 004 | 4 646 998 | 4 622 502 | 4 647 000
Total length | 4 649 500 | 4 678 503 | 4 644 254 | 4 660 999 | 4 655 498 | 4 667 500 | 4 660 992 | 4 660 836 | 4 656 000
N50 | 3 768 995 | 4 105 501 | 4 644 254 | 3 784 001 | 4 646 000 | 3 287 004 | 4 646 998 | 4 622 502 | 4 647 000
Misassemblies
# misassemblies | 8 | 10 | 10 | 9 | 8 | 8 | 8 | 9 | 8
Misassembled contigs length | 3 768 995 | 4 666 999 | 4 644 254 | 4 660 999 | 4 646 000 | 3 299 005 | 4 660 992 | 4 638 338 | 4 647 000
Mismatches
# mismatches per 100kbp | 0.15 | 0.5 | 0.37 | 0.19 | 0.11 | 0.11 | 0.13 | 0.22 | 0.17
# indels per 100kbp | 0.32 | 2.76 | 0.22 | 1.44 | 0.5 | 0.58 | 0.19 | 1.34 | 0.47
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 99.994 | 99.999 | 100 | 100 | 100 | 99.99 | 100
Duplication ratio | 1.002 | 1.008 | 1.002 | 1.005 | 1.003 | 1.006 | 1.005 | 1.005 | 1.004
# genes | 4494+3 part | 4494+3 part | 4493+3 part | 4493+4 part | 4495+2 part | 4495+2 part | 4495+2 part | 4493+4 part | 4495+2 part
NGA50 | 1 207 217 | 2 558 154 | 1 640 382 | 2 888 022 | 2 833 234 | 1 298 912 | 1 476 281 | 1 344 200 | 2 995 586

Misassemblies for Adobe reader.

Dataset 6 (E. coli K-12 MG1655, 8 SMRT cells)

We used all SMRT cells, and also randomly selected four and six SMRT cells three times each. We evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list.

Assembly

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set
# contigs | 16 | 10 | 14 | 16 | 9 | 18 | 13
Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551
Total length | 4 808 733 | 4 706 800 | 4 705 398 | 4 745 036 | 4 741 512 | 4 814 718 | 4 749 785
N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010
Misassemblies
# misassemblies | 19 | 9 | 12 | 15 | 14 | 17 | 11
Misassembled contigs length | 2 939 040 | 3 530 352 | 2 949 761 | 3 653 461 | 3 820 624 | 2 387 129 | 3 986 402
Mismatches
# mismatches per 100kbp | 0.8 | 0.43 | 0.58 | 1.36 | 0.15 | 0.95 | 0.58
# indels per 100kbp | 5.71 | 2.98 | 4.45 | 9.56 | 1.77 | 8.02 | 6.88
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 99.815 | 99.87 | 100 | 99.995 | 99.979
Duplication ratio | 1.037 | 1.016 | 1.017 | 1.025 | 1.022 | 1.038 | 1.025
# genes | 4494+3 part | 4494+3 part | 4480+7 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part
NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242
Running Time | 19hr 06m | 13hr 34m | 13hr 21m | 12hr 38m | 21hr 28m | 22hr 56m | 22hr 07m


Postprocess by discarding unconvincing contigs

We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set
# contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12
Largest contig | 2 198 457 | 3 484 877 | 1 936 831 | 1 948 632 | 2 104 087 | 1 169 224 | 1 439 551
Total length | 4 706 061 | 4 674 582 | 4 659 277 | 4 682 754 | 4 680 475 | 4 702 993 | 4 739 366
N50 | 1 005 770 | 3 484 877 | 966 809 | 1 434 284 | 1 655 500 | 676 526 | 1 268 010
Misassemblies
# misassemblies | 10 | 7 | 8 | 9 | 9 | 8 | 10
Misassembled contigs length | 2 836 368 | 3 498 134 | 2 903 640 | 3 591 179 | 3 759 587 | 2 275 404 | 3 975 983
Mismatches
# mismatches per 100kbp | 0.8 | 0.43 | 0.45 | 1.27 | 0.15 | 0.75 | 0.58
# indels per 100kbp | 5.71 | 2.98 | 3.56 | 8.72 | 1.77 | 6.06 | 6.88
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 99.798 | 99.87 | 100 | 99.995 | 99.979
Duplication ratio | 1.014 | 1.009 | 1.006 | 1.012 | 1.009 | 1.014 | 1.023
# genes | 4494+3 part | 4494+3 part | 4479+8 part | 4485+9 part | 4494+3 part | 4493+4 part | 4492+5 part
NGA50 | 615 234 | 1 205 052 | 572 342 | 875 953 | 844 482 | 633 220 | 1 267 242



Postprocess by discarding lower-case bases

After discarding unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each contig.

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set
# contigs | 7 | 8 | 10 | 12 | 4 | 9 | 12
Largest contig | 2 196 495 | 3 478 799 | 1 936 007 | 1 948 495 | 2 100 388 | 1 165 497 | 1 438 506
Total length | 4 694 972 | 4 662 655 | 4 649 216 | 4 657 587 | 4 668 899 | 4 681 301 | 4 714 790
N50 | 1 005 009 | 3 478 799 | 964 998 | 1 433 016 | 1 654 501 | 375 502 | 1 266 511
Misassemblies
# misassemblies | 9 | 9 | 8 | 9 | 10 | 9 | 10
Misassembled contigs length | 2 210 994 | 3 490 490 | 2 901 005 | 3 496 520 | 3 754 889 | 2 256 498 | 3 197 010
Mismatches
# mismatches per 100kbp | 0.63 | 0.28 | 0.22 | 0.91 | 0.15 | 0.54 | 0.47
# indels per 100kbp | 5 | 2.5 | 1.84 | 6.8 | 1.64 | 4.63 | 5.99
# N's per 100kbp | 0 | 0 | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 99.842 | 99.776 | 99.859 | 100 | 99.985 | 99.979
Duplication ratio | 1.012 | 1.007 | 1.005 | 1.006 | 1.006 | 1.009 | 1.016
# genes | 4494+3 part | 4485+6 part | 4478+9 part | 4482+11 part | 4494+3 part | 4493+4 part | 4492+5 part
NGA50 | 614 657 | 949 284 | 432 003 | 853 140 | 747 216 | 579 994 | 672 148

Misassemblies for Adobe reader.

Dataset 7 (M. ruber DSM1279, 4 SMRT cells)

We assembled all SMRT cells and evaluated the assembly with QUAST against the reference genome (NC_013946) and Mr_gene_list.

Assembly

Statistics without reference All Data
# contigs 3
Largest contig 2 548 031
Total length 3 121 070
N50 2 548 031
Misassemblies
# misassemblies 1
Misassembled contigs length 2 548 031
Mismatches
# mismatches per 100kbp 0.52
# indels per 100kbp 2.71
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 99.986
Duplication ratio 1.017
# genes 3103+2 part
NGA50 1 155 126
Running Time 18hr 19m

Postprocess by discarding lower-case bases

We trimmed the low-quality bases, which appear in lower case, from both ends of each contig.

Statistics without reference All Data
# contigs 3
Largest contig 2 545 501
Total length 3 115 015
N50 2 545 501
Misassemblies
# misassemblies 1
Misassembled contigs length 2 545 501
Mismatches
# mismatches per 100kbp 0.42
# indels per 100kbp 2.52
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 99.986
Duplication ratio 1.006
# genes 3103+2 part
NGA50 1 153 096

Dataset 8 (P. heparinus DSM2366, 7 SMRT cells)

We used all SMRT cells, and also randomly selected four SMRT cells three times. We evaluated the assemblies with QUAST against the reference genome (NC_013061) and Ph_gene_list.

Assembly

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set
# contigs | 3 | 3 | 3 | 6
Largest contig | 2 934 267 | 2 927 454 | 2 929 942 | 2 226 051
Total length | 5 178 932 | 5 176 592 | 5 176 771 | 5 182 410
N50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457
Misassemblies
# misassemblies | 0 | 1 | 0 | 1
Misassembled contigs length | 0 | 2 240 169 | 0 | 13 124
Mismatches
# mismatches per 100kbp | 0 | 0.02 | 0.06 | 6.45
# indels per 100kbp | 1.05 | 0.54 | 0.6 | 1.88
# N's per 100kbp | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 100 | 99.936
Duplication ratio | 1.003 | 1.003 | 1.003 | 1.006
# genes | 4338+1 part | 4338+1 part | 4338+1 part | 4335+4 part
NGA50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457
Running Time | 24hr 56m | 17hr 41m | 18hr 14m | 17hr 04m

Postprocess by discarding unconvincing contigs

We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set
# contigs | 3 | 2 | 2 | 5
Largest contig | 2 934 267 | 2 927 454 | 2 929 942 | 2 226 051
Total length | 5 178 932 | 5 167 623 | 5 167 190 | 5 172 946
N50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457
Misassemblies
# misassemblies | 0 | 1 | 0 | 1
Misassembled contigs length | 0 | 2 240 169 | 0 | 13 124
Mismatches
# mismatches per 100kbp | 0 | 0.04 | 0.08 | 6.45
# indels per 100kbp | 1.05 | 0.68 | 0.6 | 1.82
# N's per 100kbp | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 99.951 | 99.916 | 99.878
Duplication ratio | 1.003 | 1.002 | 1.002 | 1.004
# genes | 4338+1 part | 4336+2 part | 4335+3 part | 4333+5 part
NGA50 | 2 934 267 | 2 927 454 | 2 929 942 | 2 133 457

Postprocess by discarding lower-case bases

After discarding unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each contig.

Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set
# contigs | 3 | 2 | 2 | 5
Largest contig | 2 932 503 | 2 925 498 | 2 925 998 | 2 225 051
Total length | 5 175 001 | 5 163 999 | 5 162 498 | 5 161 405
N50 | 2 932 503 | 2 925 498 | 2 925 998 | 2 131 500
Misassemblies
# misassemblies | 0 | 1 | 0 | 0
Misassembled contigs length | 0 | 2 238 501 | 0 | 0
Mismatches
# mismatches per 100kbp | 0.02 | 0.06 | 9.98 | 6.42
# indels per 100kbp | 0.77 | 0.52 | 0.85 | 1.44
# N's per 100kbp | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 99.931 | 99.869 | 99.782
Duplication ratio | 1.001 | 1.001 | 1 | 1.001
# genes | 4338+1 part | 4336+2 part | 4331+4 part | 4328+7 part
NGA50 | 2 932 503 | 2 925 498 | 2 925 998 | 2 131 500

Misassemblies for Adobe reader.

Dataset 9 (E. coli K-12, P4-C2 chemistry, 20 Kbp, 1 SMRT cell)

We used the single SMRT cell and evaluated the assembly with QUAST against the reference genome (NC_000913) and Ec_gene_list.

Assembly

We assembled the single SMRT cell and assessed correctness with QUAST.

Statistics without reference All Data
# contigs 2
Largest contig 4 656 681
Total length 4 672 546
N50 4 656 681
Misassemblies
# misassemblies 9
Misassembled contigs length 4 672 546
Mismatches
# mismatches per 100kbp 0.15
# indels per 100kbp 4.87
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 100
Duplication ratio 1.007
# genes 4494+3 part
NGA50 2 995 500
Running Time 16hr 40m

Postprocess by discarding unconvincing contigs

We aligned subreads to contigs, and discarded the contigs with fewer than 100 reads aligned.

Statistics without reference All Data
# contigs 1
Largest contig 4 656 681
Total length 4 656 681
N50 4 656 681
Misassemblies
# misassemblies 8
Misassembled contigs length 4 656 681
Mismatches
# mismatches per 100kbp 0.15
# indels per 100kbp 4.87
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 100
Duplication ratio 1.004
# genes 4494+3 part
NGA50 2 995 500

Postprocess by discarding lower-case bases

After discarding unconvincing contigs, we trimmed the low-quality bases, which appear in lower case, from both ends of each contig.

Statistics without reference All Data
# contigs 1
Largest contig 4 654 377
Total length 4 654 377
N50 4 654 377
Misassemblies
# misassemblies 8
Misassembled contigs length 4 654 377
Mismatches
# mismatches per 100kbp 0.15
# indels per 100kbp 4.81
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 100
Duplication ratio 1.003
# genes 4494+3 part
NGA50 3 026 319

HGAP 3.0 with Dataset 9

We ran Dataset 9 on SMRT Portal with the HGAP3.0.xml protocol.

Assembly

with different genomeSize
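
The genomeSize values tested here correspond to 4 650 000 bp scaled by 100%, 90%, 110%, 80% and 120%. The short sketch below reproduces them; reading the settings as ±10% and ±20% of a base value is an inference from the numbers, not something stated in the source.

```python
BASE_GENOME_SIZE = 4_650_000  # first genomeSize value used in the table


def scaled_genome_sizes(base, percents=(100, 90, 110, 80, 120)):
    """Scale the base genome size by each percentage using integer arithmetic."""
    return [base * p // 100 for p in percents]


sizes = scaled_genome_sizes(BASE_GENOME_SIZE)
print(sizes)
# [4650000, 4185000, 5115000, 3720000, 5580000]
```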

Statistics without reference | genomeSize=4650000 | genomeSize=4185000 | genomeSize=5115000 | genomeSize=3720000 | genomeSize=5580000
# contigs | 1 | 1 | 1 | 1 | 1
Largest contig | 4657584 | 4657584 | 4657492 | 4657578 | 4657479
Total length | 4657584 | 4657584 | 4657492 | 4657578 | 4657479
N50 | 4657584 | 4657584 | 4657492 | 4657578 | 4657479
Misassemblies
# misassemblies | 8 | 8 | 8 | 8 | 8
Misassembled contigs length | 4657584 | 4657584 | 4657492 | 4657578 | 4657479
Mismatches
# mismatches per 100kbp | 0.15 | 0.15 | 0.15 | 0.15 | 0.15
# indels per 100kbp | 0.19 | 0.19 | 0.17 | 0.19 | 0.17
# N's per 100kbp | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 100 | 100 | 100
Duplication ratio | 1.004 | 1.004 | 1.004 | 1.004 | 1.004
# genes | 4494+3 part | 4494+3 part | 4494+3 part | 4494+3 part | 4494+3 part
NGA50 | 3026417 | 3026417 | 3026417 | 3026417 | 3026417
Running Time | 2hr 21m | 2hr 8m | 2hr 13m | 2hr 2m | 2hr 22m

Postprocess by discarding lower-case bases

We trimmed the low-quality bases, which appear in lower case, from both ends of each contig.

Statistics without reference | genomeSize=4650000 | genomeSize=4185000 | genomeSize=5115000 | genomeSize=3720000 | genomeSize=5580000
# contigs | 1 | 1 | 1 | 1 | 1
Largest contig | 4656344 | 4656344 | 4656242 | 4656345 | 4656234
Total length | 4656344 | 4656344 | 4656242 | 4656345 | 4656234
N50 | 4656344 | 4656344 | 4656242 | 4656345 | 4656234
Misassemblies
# misassemblies | 8 | 8 | 8 | 8 | 8
Misassembled contigs length | 4656344 | 4656344 | 4656242 | 4656345 | 4656234
Mismatches
# mismatches per 100kbp | 0.15 | 0.15 | 0.15 | 0.15 | 0.15
# indels per 100kbp | 0.19 | 0.19 | 0.17 | 0.19 | 0.17
# N's per 100kbp | 0 | 0 | 0 | 0 | 0
Genome Statistics
Genome fraction(%) | 100 | 100 | 100 | 100 | 100
Duplication ratio | 1.004 | 1.004 | 1.004 | 1.004 | 1.004
# genes | 4494+3 part | 4494+3 part | 4494+3 part | 4494+3 part | 4494+3 part
NGA50 | 3026417 | 3026417 | 3026417 | 3026417 | 3026417

without genomeSize

Statistics without reference All Data
# contigs 1
Largest contig 4657553
Total length 4657553
N50 4657553
Misassemblies
# misassemblies 8
Misassembled contigs length 4657553
Mismatches
# mismatches per 100kbp 0.15
# indels per 100kbp 0.19
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 100
Duplication ratio 1.004
# genes 4494+3 part
NGA50 3026417

Postprocess by discarding lower-case bases

Statistics without reference All Data
# contigs 1
Largest contig 4656299
Total length 4656299
N50 4656299
Misassemblies
# misassemblies 8
Misassembled contigs length 4656299
Mismatches
# mismatches per 100kbp 0.15
# indels per 100kbp 0.19
# N's per 100kbp 0
Genome Statistics
Genome fraction(%) 100
Duplication ratio 1.004
# genes 4494+3 part
NGA50 3026417