The PBcR pipeline (S) was proposed in [http://www.ncbi.nlm.nih.gov/pubmed/24034426 ref] (''Reducing assembly complexity of microbial genomes with single-molecule sequencing'', Genome Biology 2013).
=Self-Correction Assembly=
The steps below use ''E. coli'' as an example. All the necessary programs were downloaded from [http://www.cbcb.umd.edu/software/PBcR/closure/ cbcb] (or download the [ftp://ftp.cbcb.umd.edu/pub/data/PBcR/closure_paper/wgs-package.tar.gz package] directly); more details are described at [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA PacBioToCA].

1. Self-correct the long reads
 pacBioToCA -length 500 -partitions 200 -l pacbio -t 6 -s pacbio.spec -fastq Filtered_four.fastq longReads=1 genomeSize=4650000

2. Trim the corrected long reads
 python trimFastqByQVWindow.py --qvCut=54.5 --out=trimmed.pacbio.fastq pacbio.fastq

3. Convert the data format from FASTQ to FRG
 java convertFastqToFastaAndQual trimmed.pacbio.fastq trimmed.pacbio.fasta trimmed.pacbio.qual
 convert-fasta-to-v2.pl -l Pacbio -s trimmed.pacbio.fasta -q trimmed.pacbio.qual > trimmed.pacbio.frg

4. Select the longest corrected reads up to 25X coverage (25 × 4,650,000 bp = 116,250,000 bp)
 gatekeeper -T -F -o asm.gkpStore trimmed.pacbio.frg
 gatekeeper -dumpfasta 25X_Clr -longestlength 0 116250000 asm.gkpStore
 gatekeeper -dumpfrg -longestlength 0 116250000 asm.gkpStore > 25X_Clr.frg

5. Assemble
 runCA -p asm -d asm -s asm.spec 25X_Clr.frg
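The base budget passed to <code>gatekeeper -longestlength</code> in step 4 is the target coverage multiplied by the genome size given to pacBioToCA in step 1. The following is a minimal Python sketch of that arithmetic; the function name <code>coverage_budget</code> is hypothetical and is not part of the wgs-assembler package.

```python
def coverage_budget(genome_size_bp: int, coverage: int) -> int:
    """Return the number of bases needed to cover the genome `coverage` times."""
    if genome_size_bp <= 0 or coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    return genome_size_bp * coverage


# genomeSize=4650000 comes from step 1; 25X is the target coverage of step 4.
budget = coverage_budget(4_650_000, 25)
print(budget)  # 116250000
```

For a different genome, only the two arguments should need to change; the result is the value to place after <code>-longestlength 0</code>.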
=Dataset 5 (''E. coli'' K-12 MG1655, 17 SMRT cells)=
We assembled all 17 SMRT cells, as well as three random subsets each of four, six and eight SMRT cells, and evaluated the assemblies with QUAST against the reference genome ([ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/ NC_000913]) and [[Media: Ec_gene_result.ncbi | Ec_gene_list]]. ([http://sb.nhri.org.tw/comps/quast/SCA/D5/report.html more details])
==Performance==
{| {{table}} border="1"
| align="center" style="background:#f0f0f0;"|'''Statistics without reference '''
| align="center" style="background:#f0f0f0;"|'''All Data'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 3rd Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 3rd Set'''
| align="center" style="background:#f0f0f0;"|'''8 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''8 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''8 SMRT cells : 3rd Set'''
|-
|# contigs||[[Media:sca_d5_all.fa | 1]]||[[Media:sca_d5_rg4_1.fa | 1]]||[[Media:sca_d5_rg4_2.fa | 1]]||[[Media:sca_d5_rg4_3.fa | 5]]||[[Media:sca_d5_rg6_1.fa | 2]]||[[Media:sca_d5_rg6_2.fa | 2]]||[[Media:sca_d5_rg6_3.fa | 1]]||[[Media:sca_d5_rg8_1.fa | 4]]||[[Media:sca_d5_rg8_2.fa | 1]]||[[Media:sca_d5_rg8_3.fa | 2]]
|-
|Largest contig||4 651 604||4 647 117||4 648 057||3 447 068||3 749 516||2 770 859||4 649 699||1 679 082||4 649 323||4 189 785
|-
|Total length||4 651 604||4 647 117||4 648 057||4 661 453||4 645 941||4 657 272||4 649 699||4 655 949||4 649 323||4 652 482
|-
|N50||4 651 604||4 647 117||4 648 057||3 447 068||3 749 516||2 770 859||4 649 699||1 159 845||4 649 323||4 189 785
|-
| style="background:#f0f0f0;"| [[Media: SCA_D5.pdf | '''Misassemblies''']]||||||||||||||||||||
|-
|# misassemblies||9||7||8||7||7||10||10||6||10||8
|-
|Misassembled contigs length ||4 651 604||4 647 117||4 648 057||3 447 068||4 645 941||4 657 272||4 649 699||2 143 406||4 649 323||4 189 785
|-
| style="background:#f0f0f0;"| '''Mismatches'''||||||||||||||||||||
|-
|# mismatches per 100kbp||0.34||1.03||0.69||0.78||0.69||0.56||0.58||0.75||0.86||0.75
|-
|# indels per 100kbp||0.65||7.65||5.78||5.78||1.88||2.89||1.75||1.62||2||2.65
|-
|# N's per 100kbp ||0||0.02||0||0||0||0||0||0||0.02||0
|-
| style="background:#f0f0f0;"| '''Genome Statistics'''||||||||||||||||||||
|-
|Genome fraction(%) ||100||100||100||99.949||99.956||100||100||99.959||100||99.993
|-
|Duplication ratio ||1.003||1.002||1.002||1.005||1.002||1.005||1.002||1.004||1.002||1.003
|-
|# genes ||4494+3 part||4494+3 part||4494+3 part||4490+5 part||4491+2 part||4495+2 part||4494+3 part||4489+6 part||4494+3 part||4493+4 part
|-
|NGA50 ||1 207 212||1 428 636||1 432 247||983 650||1 552 642||873 232||2 995 552||1 062 313||2 956 338||1 207 192
|-
| style="background:#f0f0f0;"| '''Running Time'''||||||||||||||||||||
|-
|pacBioToCA||48hr 16m||4hr 58m||5hr 48m||5hr 10m||11hr 09m||9hr 34m||10hr 47m||21hr 06m||22hr 05m||21hr 23m
|-
|runCA||15hr 48m||15hr 22m||13hr 50m||11hr 20m||12hr 38m||11hr 44m||13hr 48m||11hr 37m||14hr 36m||13hr 40m
|-
|Total||64hr 04m||20hr 20m||19hr 38m||16hr 30m||23hr 47m||21hr 18m||24hr 35m||32hr 43m||36hr 41m||35hr 03m
|}
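Each Total in the Running Time rows is the sum of the pacBioToCA and runCA times in the same column. The following Python sketch shows one way to reproduce a total, assuming the "NNhr MMm" notation used in the table; the helper names <code>to_minutes</code>, <code>format_minutes</code> and <code>total_time</code> are hypothetical.

```python
import re


def to_minutes(text: str) -> int:
    """Parse a duration written as 'NNhr MMm' into minutes."""
    match = re.fullmatch(r"\s*(\d+)hr\s+(\d+)m\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = int(match.group(1)), int(match.group(2))
    return hours * 60 + minutes


def format_minutes(total: int) -> str:
    """Format minutes back into the table's 'NNhr MMm' notation."""
    return f"{total // 60}hr {total % 60:02d}m"


def total_time(*parts: str) -> str:
    """Sum any number of 'NNhr MMm' durations."""
    return format_minutes(sum(to_minutes(p) for p in parts))


# 'All Data' column: pacBioToCA 48hr 16m plus runCA 15hr 48m.
print(total_time("48hr 16m", "15hr 48m"))  # 64hr 04m
```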
=Dataset 6 (''E. coli'' K-12 MG1655, 8 SMRT cells)=
We assembled all eight SMRT cells, as well as three random subsets each of four and six SMRT cells, and evaluated the assemblies with QUAST against the reference genome ([ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/ NC_000913]) and [[Media: Ec_gene_result.ncbi | Ec_gene_list]]. ([http://sb.nhri.org.tw/comps/quast/SCA/D6/report.html more details])
==Performance==
{| {{table}} border="1"
| align="center" style="background:#f0f0f0;"|'''Statistics without reference '''
| align="center" style="background:#f0f0f0;"|'''All Data'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 3rd Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''6 SMRT cells : 3rd Set'''
|-
|# contigs||[[Media:sca_d6_all.fa | 2]]||[[Media:sca_d6_rg4_1.fa | 8]]||[[Media:sca_d6_rg4_2.fa | 10]]||[[Media:sca_d6_rg4_3.fa | 14]]||[[Media:sca_d6_rg6_1.fa | 1]]||[[Media:sca_d6_rg6_2.fa | 1]]||[[Media:sca_d6_rg6_3.fa | 4]]
|-
|Largest contig||4 278 957||2 277 010||1 213 670||984 459||4 641 350||4 640 250||3 162 440
|-
|Total length||4 650 771||4 648 304||4 644 602||4 656 274||4 641 350||4 640 250||4 653 394
|-
|N50||4 278 957||2 043 590||2 044 147||2 135 225||4 641 350||4 640 250||3 162 440
|-
| style="background:#f0f0f0;"| [[Media:SCA_d6.pdf | '''Misassemblies''']]||||||||||||||
|-
|# misassemblies||8||10||8||6||7||7||8
|-
|Misassembled contigs length ||4 278 957||2 809 129||2 085 482||1 947 163||4 641 350||4 640 250||3 209 090
|-
| style="background:#f0f0f0;"| '''Mismatches'''||||||||||||||
|-
|# mismatches per 100kbp||0.37||2.49||1.88||5.38||0.69||0.67||0.86
|-
|# indels per 100kbp||3.64||56.81||47.62||77.31||10.67||12.87||11
|-
|# N's per 100kbp ||0||0.04||0.02||0.09||0||0||0
|-
| style="background:#f0f0f0;"| '''Genome Statistics'''||||||||||||||
|-
|Genome fraction(%) ||99.93||99.733||99.67||99.693||99.972||99.946||99.968
|-
|Duplication ratio ||1.003||1.006||1.005||1.008||1.001||1.001||1.005
|-
|# genes ||4492+5 part||4475+10 part||4467+12 part||4469+13 part||4492+4 part||4491+4 part||4492+4 part
|-
|NGA50 ||1 207 233||531 351||721 189||565 251||2 499 057||2 499 697||1 267 262
|-
| style="background:#f0f0f0;"| '''Running Time'''||||||||||||||
|-
|pacBioToCA||20hr 03m||5hr 52m||6hr 05m||5hr 19m||15hr 53m||14hr 47m||15hr 38m
|-
|runCA||15hr 41m||7hr 32m||7hr 10m||5hr 42m||15hr 44m||16hr 02m||13hr 27m
|-
|Total||35hr 44m||13hr 24m||13hr 15m||11hr 01m||31hr 37m||30hr 49m||29hr 05m
|}
[[Media: sca_d6_summary.pdf | '''Misassemblies''']] for Adobe reader.
=Dataset 7 (''M. ruber'' DSM1279, 4 SMRT cells)=
We assembled all four SMRT cells and evaluated the assembly with QUAST against the reference genome ([ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Meiothermus_ruber_DSM_1279_uid46661/ NC_013946]) and [[Media:gene_mruber.ncbi | Mr_gene_list]]. ([http://sb.nhri.org.tw/comps/quast/SCA/D7/report.html more details])
==Performance==
{| {{table}} border="1"
| align="center" style="background:#f0f0f0;"|'''Statistics without reference '''
| align="center" style="background:#f0f0f0;"|'''All Data'''
|-
|# contigs||[[Media:sca_d7_all.fa | 2]]
|-
|Largest contig||2 974 307
|-
|Total length||3 100 289
|-
|N50||2 974 307
|-
| style="background:#f0f0f0;"| [[Media:sca_d7.pdf | '''Misassemblies''']]||
|-
|# misassemblies||3
|-
|Misassembled contigs length ||2 974 307
|-
| style="background:#f0f0f0;"| '''Mismatches'''||
|-
|# mismatches per 100kbp||0.23
|-
|# indels per 100kbp||5.04
|-
|# N's per 100kbp ||0.03
|-
| style="background:#f0f0f0;"| '''Genome Statistics'''||
|-
|Genome fraction(%) ||99.883
|-
|Duplication ratio ||1.002
|-
|# genes ||3093+4 part
|-
|NGA50 ||1 715 029
|-
| style="background:#f0f0f0;"| '''Running Time'''||
|-
|pacBioToCA||7hr 35m
|-
|runCA||8hr 07m
|-
|Total||15hr 42m
|}
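N50 is the length of the contig at which the contigs, sorted from longest to shortest, first cover at least half of the total assembly length. The following is a minimal Python sketch of that definition, not QUAST's own code; the function name <code>n50</code> is hypothetical. The second contig length is inferred as total length minus largest contig (3 100 289 − 2 974 307 = 125 982), which works here only because this assembly has exactly two contigs.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig is required")


# Dataset 7, 'All Data': largest contig and the inferred second contig.
print(n50([2_974_307, 125_982]))  # 2974307
```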
=Dataset 8 (''P. heparinus'' DSM2366, 7 SMRT cells)=
We assembled all seven SMRT cells, as well as three random subsets of four SMRT cells, and evaluated the assemblies with QUAST against the reference genome ([ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Pedobacter_heparinus_DSM_2366_uid59111/ NC_013061]) and [[Media:gene_phep.ncbi | Ph_gene_list]]. ([http://sb.nhri.org.tw/comps/quast/SCA/D8/report.html more details])
==Performance==
{| {{table}} border="1"
| align="center" style="background:#f0f0f0;"|'''Statistics without reference '''
| align="center" style="background:#f0f0f0;"|'''All Data'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 1st Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 2nd Set'''
| align="center" style="background:#f0f0f0;"|'''4 SMRT cells : 3rd Set'''
|-
|# contigs||[[Media: sca_d8_all.fa | 1]]||[[Media: sca_d8_rg4_1.fa | 3]]||[[Media: sca_d8_rg4_2.fa | 3]]||[[Media: sca_d8_rg4_3.fa | 3]]
|-
|Largest contig||5 163 983||2 232 679||2 236 613||2 237 949
|-
|Total length||5 163 983||5 161 276||5 165 518||5 166 563
|-
|N50||5 163 983||2 043 590||2 044 147||2 135 225
|-
| style="background:#f0f0f0;"| [[Media:sca_d8.pdf | '''Misassemblies''']]||||||||
|-
|# misassemblies||1||0||0||0
|-
|Misassembled contigs length ||5 163 983||0||0||0
|-
| style="background:#f0f0f0;"| '''Mismatches'''||||||||
|-
|# mismatches per 100kbp||8.41||9.96||8.27||10.29
|-
|# indels per 100kbp||2.19||21.34||13.29||14.78
|-
|# N's per 100kbp ||0||0||0||0
|-
| style="background:#f0f0f0;"| '''Genome Statistics'''||||||||
|-
|Genome fraction(%) ||99.919||99.864||99.907||99.89
|-
|Duplication ratio ||1.001||1.001||1.002||1.002
|-
|# genes ||4335+3 part||4330+5 part||4333+5 part||4333+3 part
|-
|NGA50 ||4 300 532||2 043 590||2 044 147||2 135 225
|-
| style="background:#f0f0f0;"| '''Running Time'''||||||||
|-
|pacBioToCA||18hr 55m||6hr 27m||6hr 34m||6hr 31m
|-
|runCA||21hr 36m||11hr 39m||12hr 26m||12hr 12m
|-
|Total||40hr 31m||18hr 06m||19hr 00m||18hr 43m
|}
[[Media: sca_d8_summary.pdf | '''Misassemblies''']] for Adobe reader.
=Dataset 9 (''E. coli'' K-12, P4-C2 chemistry, 20 Kbp, 1 SMRT cell)=
We assembled the single SMRT cell and evaluated the assembly with QUAST against the reference genome ([ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/ NC_000913]) and [[Media: Ec_gene_result.ncbi | Ec_gene_list]]. ([http://sb.nhri.org.tw/comps/quast/SCA/D9/report.html more details])
==Performance==
{| {{table}} border="1"
| align="center" style="background:#f0f0f0;"|'''Statistics without reference '''
| align="center" style="background:#f0f0f0;"|'''All Data'''
|-
|# contigs||[[Media:sca_d9_all.fa |1]]
|-
|Largest contig||4 656 257
|-
|Total length||4 656 257
|-
|N50||4 656 257
|-
| style="background:#f0f0f0;"| [[Media: SCA_D9.pdf | '''Misassemblies''']]||
|-
|# misassemblies||8
|-
|Misassembled contigs length ||4 656 257
|-
| style="background:#f0f0f0;"| '''Mismatches'''||
|-
|# mismatches per 100kbp||0.22
|-
|# indels per 100kbp||13.23
|-
|# N's per 100kbp ||0
|-
| style="background:#f0f0f0;"| '''Genome Statistics'''||
|-
|Genome fraction(%) ||100
|-
|Duplication ratio ||1.004
|-
|# genes ||4494+3 part
|-
|NGA50 ||2 995 284
|-
| style="background:#f0f0f0;"| '''Running Time'''||
|-
|pacBioToCA||13hr 01m
|-
|runCA||17hr 58m
|-
|Total||30hr 59m
|}