PBcR pipeline

The PBcR pipeline (S) was proposed in "Reducing assembly complexity of microbial genomes with single-molecule sequencing" (Genome Biology, 2013).

PBcR pipeline

Below we use E. coli as an example to walk through the steps. All the necessary programs were downloaded from cbcb (or can be downloaded directly as a package); more detailed information is given at PacBioToCA.

1. Long reads self-correction

pacBioToCA -length 500 -partitions 200 -l pacbio -t 6 -s pacbio.spec -fastq Filtered_four.fastq longReads=1 genomeSize=4650000

2. Trimming the corrected long reads

python trimFastqByQVWindow.py --qvCut=54.5 --out=trimmed.pacbio.fastq pacbio.fastq

3. Convert the data format from fastq to frg

java convertFastqToFastaAndQual trimmed.pacbio.fastq trimmed.pacbio.fasta trimmed.pacbio.qual
convert-fasta-to-v2.pl -l Pacbio -s trimmed.pacbio.fasta -q trimmed.pacbio.qual > trimmed.pacbio.frg
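
To make the first half of step 3 concrete, here is a minimal Python sketch of what splitting a FASTQ file into a FASTA file and a QUAL file involves. It is an illustration only, not the code of the Java utility used above; the function name `fastq_to_fasta_and_qual` is hypothetical, and the sketch assumes unwrapped four-line FASTQ records with Phred+33 quality encoding.

```python
import io


def fastq_to_fasta_and_qual(fastq_handle, fasta_handle, qual_handle, phred_offset=33):
    """Split 4-line FASTQ records into FASTA and QUAL records.

    Assumption: records are not line-wrapped and qualities are Phred+33.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:  # end of input (or a trailing blank line)
            break
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, not needed
        qual = fastq_handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        name = header[1:]
        # QUAL files store one integer score per base, space separated.
        scores = " ".join(str(ord(char) - phred_offset) for char in qual)
        fasta_handle.write(f">{name}\n{seq}\n")
        qual_handle.write(f">{name}\n{scores}\n")
        count += 1
    return count


# Tiny in-memory demonstration with one made-up read.
demo_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n")
demo_fasta, demo_qual = io.StringIO(), io.StringIO()
fastq_to_fasta_and_qual(demo_fastq, demo_fasta, demo_qual)
print(demo_fasta.getvalue(), end="")
print(demo_qual.getvalue(), end="")
```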

4. Select the longest corrected reads up to 25X coverage

gatekeeper -T -F -o asm.gkpStore trimmed.pacbio.frg
gatekeeper -dumpfasta 25X_Clr -longestlength 0 116250000 asm.gkpStore
gatekeeper -dumpfrg -longestlength 0 116250000 asm.gkpStore > 25X_Clr.frg
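
The value 116250000 passed to gatekeeper is the number of bases that corresponds to 25X coverage of the 4.65 Mbp genome size given in step 1. The sketch below works through that arithmetic and shows a greedy longest-first selection. The function names are hypothetical, and the stopping rule (stop once the running total reaches the target) is an assumption about how such a selection behaves, not a description of gatekeeper's internals.

```python
def target_bases(genome_size, coverage):
    """Number of bases needed to reach the requested coverage."""
    return genome_size * coverage


def select_longest(read_lengths, min_length, total_bases):
    """Pick reads longest-first until the running total reaches total_bases.

    Assumption: reads shorter than min_length are never selected, and the
    read that crosses the threshold is still included.
    """
    selected = []
    running_total = 0
    for length in sorted(read_lengths, reverse=True):
        if length < min_length or running_total >= total_bases:
            break
        selected.append(length)
        running_total += length
    return selected


GENOME_SIZE = 4_650_000  # genomeSize from step 1
COVERAGE = 25            # the "25X" in this step

print(target_bases(GENOME_SIZE, COVERAGE))  # 116250000
```

For a different genome or coverage, only the two constants change; the result is the last numeric argument of the two gatekeeper commands above.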

5. Assemble

runCA -p asm -d asm -s asm.spec 25X_Clr.frg

Dataset 5 (E. coli K-12 MG1655, 17 SMRT cells)

We assembled all 17 SMRT cells together and, in addition, three random subsets each of four, six and eight SMRT cells. All assemblies were evaluated with QUAST against the reference genome (NC_000913) and Ec_gene_list.
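
A subsampling design like the one described here can be sketched in Python as follows. This is only an illustration: the cell names, the seed and the function `draw_subsets` are hypothetical, and the sketch assumes each replicate is drawn independently without replacement from the full set of cells.

```python
import random


def draw_subsets(cells, subset_sizes, replicates, seed=0):
    """Draw `replicates` random subsets of each size, without replacement.

    A fixed seed keeps the draws reproducible between runs.
    """
    rng = random.Random(seed)
    return {
        size: [sorted(rng.sample(cells, size)) for _ in range(replicates)]
        for size in subset_sizes
    }


# Hypothetical labels for the 17 SMRT cells of this dataset.
cells = [f"cell_{index:02d}" for index in range(1, 18)]
subsets = draw_subsets(cells, subset_sizes=(4, 6, 8), replicates=3)

for size, draws in subsets.items():
    for number, draw in enumerate(draws, start=1):
        print(f"{size} SMRT cells, set {number}: {', '.join(draw)}")
```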

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set | 8 SMRT cells: 1st Set | 8 SMRT cells: 2nd Set | 8 SMRT cells: 3rd Set |
|---|---|---|---|---|---|---|---|---|---|---|
| # contigs | 1 | 1 | 1 | 5 | 2 | 2 | 1 | 4 | 1 | 2 |
| Largest contig | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 679 082 | 4 649 323 | 4 189 785 |
| Total length | 4 651 604 | 4 647 117 | 4 648 057 | 4 661 453 | 4 645 941 | 4 657 272 | 4 649 699 | 4 655 949 | 4 649 323 | 4 652 482 |
| N50 | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 3 749 516 | 2 770 859 | 4 649 699 | 1 159 845 | 4 649 323 | 4 189 785 |
| Misassemblies | | | | | | | | | | |
| # misassemblies | 9 | 9 | 8 | 9 | 7 | 9 | 10 | 7 | 10 | 8 |
| Misassembled contigs length | 4 651 604 | 4 647 117 | 4 648 057 | 3 447 068 | 4 645 941 | 4 657 272 | 4 649 699 | 2 143 406 | 4 649 323 | 4 189 785 |
| Mismatches | | | | | | | | | | |
| # mismatches per 100 kbp | 0.34 | 1.03 | 0.69 | 0.78 | 0.69 | 0.56 | 0.58 | 0.75 | 0.84 | 0.75 |
| # indels per 100 kbp | 0.6 | 7.44 | 5.78 | 5.67 | 1.88 | 2.72 | 1.66 | 1.6 | 1.9 | 2.65 |
| # N's per 100 kbp | 0 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0 |
| Genome Statistics | | | | | | | | | | |
| Genome fraction (%) | 100 | 100 | 100 | 99.949 | 99.956 | 100 | 100 | 99.959 | 100 | 99.993 |
| Duplication ratio | 1.003 | 1.002 | 1.002 | 1.005 | 1.002 | 1.005 | 1.002 | 1.004 | 1.002 | 1.003 |
| # genes | 4494+3 part | 4494+3 part | 4494+3 part | 4490+5 part | 4491+2 part | 4495+2 part | 4494+3 part | 4489+6 part | 4494+3 part | 4493+4 part |
| NGA50 | 2 834 925 | 949 217 | 949 242 | 656 513 | 2 796 469 | 1 040 965 | 3 026 388 | 949 298 | 3 027 267 | 949 289 |
| Running Time | | | | | | | | | | |
| pacBioToCA | 48hr 16m | 4hr 58m | 5hr 48m | 5hr 10m | 11hr 09m | 9hr 34m | 10hr 47m | 21hr 06m | 22hr 05m | 21hr 23m |
| runCA | 15hr 48m | 15hr 22m | 13hr 50m | 11hr 20m | 12hr 38m | 11hr 44m | 13hr 48m | 11hr 37m | 14hr 36m | 13hr 40m |
| Total | 64hr 04m | 20hr 20m | 19hr 38m | 16hr 30m | 23hr 47m | 21hr 18m | 24hr 35m | 32hr 43m | 36hr 41m | 35hr 03m |
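
The Total row in the running-time tables is the sum of the two stage times. The short Python sketch below reproduces that sum for the "All Data" column of this dataset; the helper names are hypothetical, and the parser assumes the "Nhr Mm" notation used on this page.

```python
import re

# Matches durations written like "48hr 16m" or "8hr 7m".
_DURATION = re.compile(r"(\d+)hr\s+(\d+)m")


def to_minutes(text):
    """Convert an 'Nhr Mm' string into a number of minutes."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def format_duration(total_minutes):
    """Format minutes back into the 'Nhr MMm' notation."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}hr {minutes:02d}m"


def total_time(*stages):
    """Sum any number of stage durations."""
    return format_duration(sum(to_minutes(stage) for stage in stages))


# Correction stage plus assembly stage, "All Data" column.
print(total_time("48hr 16m", "15hr 48m"))  # 64hr 04m
```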


Dataset 6 (E. coli K-12 MG1655, 8 SMRT cells)

We assembled all 8 SMRT cells together and, in addition, three random subsets each of four and six SMRT cells. All assemblies were evaluated with QUAST against the reference genome (NC_000913) and Ec_gene_list.

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set | 6 SMRT cells: 1st Set | 6 SMRT cells: 2nd Set | 6 SMRT cells: 3rd Set |
|---|---|---|---|---|---|---|---|
| # contigs | 2 | 8 | 10 | 14 | 1 | 1 | 4 |
| Largest contig | 4 278 957 | 2 277 010 | 1 213 670 | 984 459 | 4 641 350 | 4 640 250 | 3 162 440 |
| Total length | 4 650 771 | 4 648 304 | 4 644 602 | 4 656 274 | 4 641 350 | 4 640 250 | 4 653 394 |
| N50 | 4 278 957 | 622 425 | 800 993 | 565 251 | 4 641 350 | 4 640 250 | 3 162 440 |
| Misassemblies | | | | | | | |
| # misassemblies | 9 | 9 | 9 | 8 | 8 | 8 | 9 |
| Misassembled contigs length | 4 278 957 | 2 809 129 | 2 085 482 | 1 947 163 | 4 641 350 | 4 640 250 | 3 209 090 |
| Mismatches | | | | | | | |
| # mismatches per 100 kbp | 0.37 | 2.49 | 1.88 | 5.38 | 0.69 | 0.67 | 0.86 |
| # indels per 100 kbp | 3.58 | 53.34 | 45.82 | 73.07 | 10.65 | 11.28 | 10.46 |
| # N's per 100 kbp | 0 | 0.04 | 0.02 | 0.09 | 0 | 0 | 0 |
| Genome Statistics | | | | | | | |
| Genome fraction (%) | 99.993 | 99.733 | 99.67 | 99.693 | 99.972 | 99.946 | 99.968 |
| Duplication ratio | 1.002 | 1.005 | 1.006 | 1.007 | 1.001 | 1.001 | 1.003 |
| # genes | 4492+5 part | 4475+10 part | 4467+12 part | 4469+13 part | 4492+4 part | 4491+4 part | 4492+4 part |
| NGA50 | 859 464 | 621 281 | 572 455 | 436 292 | 1 098 529 | 1 096 784 | 859 502 |
| Running Time | | | | | | | |
| pacBioToCA | 20hr 03m | 5hr 52m | 6hr 05m | 5hr 19m | 15hr 53m | 14hr 47m | 15hr 38m |
| runCA | 15hr 41m | 7hr 32m | 7hr 10m | 5hr 42m | 15hr 44m | 16hr 02m | 13hr 27m |
| Total | 35hr 44m | 13hr 24m | 13hr 15m | 11hr 01m | 31hr 37m | 30hr 49m | 29hr 05m |

Dataset 7 (M. ruber DSM1279, 4 SMRT cells)

We assembled all SMRT cells together and evaluated the assembly with QUAST against the reference genome (NC_013946) and Mr_gene_list.

Performance

| Statistics without reference | All Data |
|---|---|
| # contigs | 2 |
| Largest contig | 2 974 307 |
| Total length | 3 100 289 |
| N50 | 2 974 307 |
| Misassemblies | |
| # misassemblies | 3 |
| Misassembled contigs length | 2 974 307 |
| Mismatches | |
| # mismatches per 100 kbp | 0.23 |
| # indels per 100 kbp | 5.01 |
| # N's per 100 kbp | 0.03 |
| Genome Statistics | |
| Genome fraction (%) | 99.883 |
| Duplication ratio | 1.002 |
| # genes | 3093+4 part |
| NGA50 | 1 707 938 |
| Running Time | |
| pacBioToCA | 7hr 35m |
| runCA | 8hr 07m |
| Total | 15hr 42m |

Dataset 8 (P. heparinus DSM 2366, 7 SMRT cells)

We assembled all 7 SMRT cells together and, in addition, three random subsets of four SMRT cells. All assemblies were evaluated with QUAST against the reference genome (NC_013061) and Ph_gene_list.

Performance

| Statistics without reference | All Data | 4 SMRT cells: 1st Set | 4 SMRT cells: 2nd Set | 4 SMRT cells: 3rd Set |
|---|---|---|---|---|
| # contigs | 1 | 3 | 3 | 3 |
| Largest contig | 5 163 983 | 2 232 679 | 2 236 613 | 2 237 949 |
| Total length | 5 163 983 | 5 161 276 | 5 165 518 | 5 166 563 |
| N50 | 5 163 983 | 2 043 590 | 2 044 147 | 2 135 225 |
| Misassemblies | | | | |
| # misassemblies | 0 | 0 | 0 | 0 |
| Misassembled contigs length | 0 | 0 | 0 | 0 |
| Mismatches | | | | |
| # mismatches per 100 kbp | 8.41 | 9.960 | 8.27 | 10.29 |
| # indels per 100 kbp | 2.19 | 18.99 | 13.13 | 14.01 |
| # N's per 100 kbp | 0 | 0 | 0 | 0 |
| Genome Statistics | | | | |
| Genome fraction (%) | 99.919 | 99.864 | 99.907 | 99.89 |
| Duplication ratio | 1 | 1 | 1.001 | 1.001 |
| # genes | 4335+3 part | 4330+5 part | 4333+5 part | 4333+3 part |
| NGA50 | 5 163 983 532 | 2 043 590 | 2 044 147 | 2 135 225 |
| Running Time | | | | |
| pacBioToCA | 18hr 55m | 6hr 27m | 6hr 34m | 6hr 31m |
| runCA | 21hr 36m | 11hr 39m | 12hr 26m | 12hr 12m |
| Total | 40hr 31m | 18hr 06m | 19hr 00m | 18hr 43m |

Dataset 9 (E. coli K-12, P4-C2 chemistry, 20 Kbp, 1 SMRT cell)

We assembled the single SMRT cell and evaluated the assembly with QUAST against the reference genome (NC_000913) and Ec_gene_list.

Performance

| Statistics without reference | All Data |
|---|---|
| # contigs | 1 |
| Largest contig | 4 656 257 |
| Total length | 4 656 257 |
| N50 | 4 656 257 |
| Misassemblies | |
| # misassemblies | 8 |
| Misassembled contigs length | 4 656 257 |
| Mismatches | |
| # mismatches per 100 kbp | 0.22 |
| # indels per 100 kbp | 13.15 |
| # N's per 100 kbp | 0 |
| Genome Statistics | |
| Genome fraction (%) | 100 |
| Duplication ratio | 1.004 |
| # genes | 4494+3 part |
| NGA50 | 3 026 094 |
| Running Time | |
| pacBioToCA | 13hr 01m |
| runCA | 17hr 58m |
| Total | 30hr 59m |