The PBcR pipeline(S) was proposed in "Reducing assembly complexity of microbial genomes with single-molecule sequencing" (Genome Biology, 2013).
Below we use E. coli as an example to show the steps. All the necessary programs were downloaded from cbcb (or can be downloaded directly as a package), and more detailed information is available at PacBioToCA.
1. Long reads self-correction
pacBioToCA -length 500 -partitions 200 -l pacbio -t 6 -s pacbio.spec -fastq Filtered_four.fastq longReads=1 genomeSize=4650000
2. Trimming the corrected long reads
python trimFastqByQVWindow.py --qvCut=54.5 --out=trimmed.pacbio.fastq pacbio.fastq
3. Convert the data format from fastq to frg
java convertFastqToFastaAndQual trimmed.pacbio.fastq trimmed.pacbio.fasta trimmed.pacbio.qual
convert-fasta-to-v2.pl -l Pacbio -s trimmed.pacbio.fasta -q trimmed.pacbio.qual > trimmed.pacbio.frg
4. Select the longest corrected long reads, up to 25X coverage
gatekeeper -T -F -o asm.gkpStore trimmed.pacbio.frg
gatekeeper -dumpfasta 25X_Clr -longestlength 0 116250000 asm.gkpStore
gatekeeper -dumpfrg -longestlength 0 116250000 asm.gkpStore > 25X_Clr.frg
5. Assemble
runCA -p asm -d asm -s asm.spec 25X_Clr.frg
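The base budget passed to gatekeeper in step 4 is the target coverage multiplied by the genome size given in step 1: 25 × 4,650,000 = 116,250,000. As a rough illustration of that selection, the Python sketch below keeps the longest reads until the budget is reached. The function names and the toy read lengths are hypothetical, and the stopping rule is an assumption about the intent of `-longestlength`, not a description of how gatekeeper implements it.

```python
def coverage_budget(coverage, genome_size):
    """Total number of bases needed for the requested coverage."""
    return coverage * genome_size


def select_longest(read_lengths, budget):
    """Pick reads from longest to shortest until their summed length reaches the budget.

    Assumption: this mimics the intent of step 4, not gatekeeper's exact rule.
    """
    selected = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        selected.append(length)
        total += length
    return selected


# Value used in step 4: 25X of a 4.65 Mbp genome.
budget = coverage_budget(25, 4650000)
print(budget)

# Toy example with hypothetical read lengths and a tiny budget.
print(select_longest([500, 3000, 1200, 800, 2500], 6000))
```

For a different genome or coverage target, only the two arguments to `coverage_budget` would change; the resulting number replaces 116250000 in the gatekeeper commands.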
We assembled all SMRT cells, as well as three random subsets each of four, six and eight SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list.
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set | 8 SMRT cells : 1st Set | 8 SMRT cells : 2nd Set | 8 SMRT cells : 3rd Set |
# contigs | 1 | 1 | 1 | 5 | 2 | 2 | 1 | 4 | 1 | 2 |
Largest contig | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 679 082 | 4 649 323 | 4 189 785 |
Total length | 4 651 604 | 4 647 117 | 4 648 057 | 4 661 453 | 4 645 941 | 4 657 272 | 4 649 699 | 4 655 949 | 4 649 323 | 4 652 482 |
N50 | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 159 845 | 4 649 323 | 4 189 785 |
Misassemblies | ||||||||||
# misassemblies | 9 | 9 | 8 | 9 | 7 | 9 | 10 | 7 | 10 | 8 |
Misassembled contigs length | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 4 645 941 | 4 657 272 | 4 649 699 | 2 143 406 | 4 649 323 | 4 189 785 |
Mismatches | ||||||||||
# mismatches per 100kbp | 0.34 | 1.03 | 0.69 | 0.78 | 0.69 | 0.56 | 0.58 | 0.75 | 0.84 | 0.75 |
# indels per 100kbp | 0.6 | 7.44 | 5.78 | 5.67 | 1.88 | 2.72 | 1.66 | 1.6 | 1.9 | 2.65 |
# N's per 100kbp | 0 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0 |
Genome Statistics | ||||||||||
Genome fraction(%) | 100 | 100 | 100 | 99.949 | 99.956 | 100 | 100 | 99.959 | 100 | 99.993 |
Duplication ratio | 1.003 | 1.002 | 1.002 | 1.005 | 1.002 | 1.005 | 1.002 | 1.004 | 1.002 | 1.003 |
# genes | 4494+3 part | 4494+3 part | 4494+3 part | 4490+5 part | 4491+2 part | 4495+2 part | 4494+3 part | 4489+6 part | 4494+3 part | 4493+4 part |
NGA50 | 2 834 925 | 949 217 | 949 242 | 656 513 | 2 796 469 | 1 040 965 | 3 026 388 | 949 298 | 3 027 267 | 949 289 |
Running Time | ||||||||||
pacBioToCA | 48hr 16m | 4hr 58m | 5hr 48m | 5hr 10m | 11hr 09m | 9hr 34m | 10hr 47m | 21hr 06m | 22hr 05m | 21hr 23m |
runCA | 15hr 48m | 15hr 22m | 13hr 50m | 11hr 20m | 12hr 38m | 11hr 44m | 13hr 48m | 11hr 37m | 14hr 36m | 13hr 40m |
Total | 64hr 04m | 20hr 20m | 19hr 38m | 16hr 30m | 23hr 47m | 21hr 18m | 24hr 35m | 32hr 43m | 36hr 41m | 35hr 03m |
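The Total row is the sum of the pacBioToCA and runCA rows. A minimal Python sketch of that arithmetic follows, assuming durations are written in the table's `NNhr MMm` form; the helper names are hypothetical.

```python
import re


def to_minutes(duration):
    """Convert a string such as '48hr 16m' to a number of minutes."""
    match = re.fullmatch(r"(\d+)hr (\d+)m", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = match.groups()
    return int(hours) * 60 + int(minutes)


def to_duration(total_minutes):
    """Format a number of minutes back into the table's 'NNhr MMm' form."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}hr {minutes:02d}m"


def total_time(correction, assembly):
    """Sum the correction (pacBioToCA) and assembly (runCA) running times."""
    return to_duration(to_minutes(correction) + to_minutes(assembly))


# All Data column of the table above.
print(total_time("48hr 16m", "15hr 48m"))
```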
We assembled all SMRT cells, as well as three random subsets each of four and six SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list.
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set | 6 SMRT cells : 1st Set | 6 SMRT cells : 2nd Set | 6 SMRT cells : 3rd Set |
# contigs | 2 | 8 | 10 | 14 | 1 | 1 | 4 |
Largest contig | 4 278 957 | 2 277 010 | 1 213 670 | 984 459 | 4 641 350 | 4 640 250 | 3 162 440 |
Total length | 4 650 771 | 4 648 304 | 4 644 602 | 4 656 274 | 4 641 350 | 4 640 250 | 4 653 394 |
N50 | 4 278 957 | 622 425 | 800 993 | 565 251 | 4 641 350 | 4 640 250 | 3 162 440 |
Misassemblies | |||||||
# misassemblies | 9 | 9 | 9 | 8 | 8 | 8 | 9 |
Misassembled contigs length | 4 278 957 | 2 809 129 | 2 085 482 | 1 947 163 | 4 641 350 | 4 640 250 | 3 209 090 |
Mismatches | |||||||
# mismatches per 100kbp | 0.37 | 2.49 | 1.88 | 5.38 | 0.69 | 0.67 | 0.86 |
# indels per 100kbp | 3.58 | 53.34 | 45.82 | 73.07 | 10.65 | 11.28 | 10.46 |
# N's per 100kbp | 0 | 0.04 | 0.02 | 0.09 | 0 | 0 | 0 |
Genome Statistics | |||||||
Genome fraction(%) | 99.993 | 99.733 | 99.67 | 99.693 | 99.972 | 99.946 | 99.968 |
Duplication ratio | 1.002 | 1.005 | 1.006 | 1.007 | 1.001 | 1.001 | 1.003 |
# genes | 4492+5 part | 4475+10 part | 4467+12 part | 4469+13 part | 4492+4 part | 4491+4 part | 4492+4 part |
NGA50 | 859 464 | 621 281 | 572 455 | 436 292 | 1 098 529 | 1 096 784 | 859 502 |
Running Time | |||||||
pacBioToCA | 20hr 03m | 5hr 52m | 6hr 05m | 5hr 19m | 15hr 53m | 14hr 47m | 15hr 38m |
runCA | 15hr 41m | 7hr 32m | 7hr 10m | 5hr 42m | 15hr 44m | 16hr 02m | 13hr 27m |
Total | 35hr 44m | 13hr 24m | 13hr 15m | 11hr 01m | 31hr 37m | 30hr 49m | 29hr 05m |
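For readers unfamiliar with the N50 row, it is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length. The Python sketch below is a minimal illustration of that definition, not QUAST's own code; the function name and the multi-contig lengths are hypothetical. NGA50 follows the same idea but, as far as we understand QUAST, is computed over aligned blocks relative to the reference length.

```python
def n50(contig_lengths):
    """Contig length at which the running sum (longest first) reaches half the total."""
    half = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("contig_lengths must not be empty")


# A single-contig assembly: N50 equals the contig length,
# as in the "6 SMRT cells : 1st Set" column above.
print(n50([4641350]))

# Hypothetical multi-contig example.
print(n50([40, 30, 20, 10]))
```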
We assembled all SMRT cells and evaluated the assembly with QUAST against the reference genome (NC_013946) and Mr_gene_list.
Statistics without reference | All Data |
# contigs | 2 |
Largest contig | 2 974 307 |
Total length | 3 100 289 |
N50 | 2 974 307 |
Misassemblies | |
# misassemblies | 3 |
Misassembled contigs length | 2 974 307 |
Mismatches | |
# mismatches per 100kbp | 0.23 |
# indels per 100kbp | 5.01 |
# N's per 100kbp | 0.03 |
Genome Statistics | |
Genome fraction(%) | 99.883 |
Duplication ratio | 1.002 |
# genes | 3093+4 part |
NGA50 | 1 707 938 |
Running Time | |
pacBioToCA | 7hr 35m |
runCA | 8hr 07m |
Total | 15hr 42m |
We assembled all SMRT cells, as well as three random subsets of four SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_013061) and Ph_gene_list.
Statistics without reference | All Data | 4 SMRT cells : 1st Set | 4 SMRT cells : 2nd Set | 4 SMRT cells : 3rd Set |
# contigs | 1 | 3 | 3 | 3 |
Largest contig | 5 163 983 | 2 232 679 | 2 236 613 | 2 237 949 |
Total length | 5 163 983 | 5 161 276 | 5 165 518 | 5 166 563 |
N50 | 5 163 983 | 2 043 590 | 2 044 147 | 2 135 225 |
Misassemblies | ||||
# misassemblies | 0 | 0 | 0 | 0 |
Misassembled contigs length | 0 | 0 | 0 | 0 |
Mismatches | ||||
# mismatches per 100kbp | 8.41 | 9.96 | 8.27 | 10.29 |
# indels per 100kbp | 2.19 | 18.99 | 13.13 | 14.01 |
# N's per 100kbp | 0 | 0 | 0 | 0 |
Genome Statistics | ||||
Genome fraction(%) | 99.919 | 99.864 | 99.907 | 99.89 |
Duplication ratio | 1 | 1 | 1.001 | 1.001 |
# genes | 4335+3 part | 4330+5 part | 4333+5 part | 4333+3 part |
NGA50 | 5 163 983 532 | 2 043 590 | 2 044 147 | 2 135 225 |
Running Time | ||||
pacBioToCA | 18hr 55m | 6hr 27m | 6hr 34m | 6hr 31m |
runCA | 21hr 36m | 11hr 39m | 12hr 26m | 12hr 12m |
Total | 40hr 31m | 18hr 06m | 19hr 00m | 18hr 43m |
We assembled all SMRT cells and evaluated the assembly with QUAST against the reference genome (NC_000913) and Ec_gene_list.
Statistics without reference | All Data |
# contigs | 1 |
Largest contig | 4 656 257 |
Total length | 4 656 257 |
N50 | 4 656 257 |
Misassemblies | |
# misassemblies | 8 |
Misassembled contigs length | 4 656 257 |
Mismatches | |
# mismatches per 100kbp | 0.22 |
# indels per 100kbp | 13.15 |
# N's per 100kbp | 0 |
Genome Statistics | |
Genome fraction(%) | 100 |
Duplication ratio | 1.004 |
# genes | 4494+3 part |
NGA50 | 3 026 094 |
Running Time | |
pacBioToCA | 13hr 01m |
runCA | 17hr 58m |
Total | 30hr 59m |