PBcR pipeline

The PBcR pipeline (S) was proposed in "Reducing assembly complexity of microbial genomes with single-molecule sequencing" (Genome Biology, 2013).

PBcR pipeline

Below we use E. coli as an example to show the steps. All the necessary programs were downloaded from cbcb (or obtained by downloading the package directly), and more detailed information is given at PacBioToCA.

1. Long reads self-correction

pacBioToCA -length 500 -partitions 200 -l pacbio -t 6 -s pacbio.spec -fastq Filtered_four.fastq longReads=1 genomeSize=4650000

2. Trimming the corrected long reads

python trimFastqByQVWindow.py --qvCut=54.5 --out=trimmed.pacbio.fastq pacbio.fastq

3. Convert the data format from fastq to frg

java convertFastqToFastaAndQual trimmed.pacbio.fastq trimmed.pacbio.fasta trimmed.pacbio.qual
convert-fasta-to-v2.pl -l Pacbio -s trimmed.pacbio.fasta -q trimmed.pacbio.qual > trimmed.pacbio.frg
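
For readers unfamiliar with the formats, the split performed by the first command of this step can be sketched in Python as below. This is only an illustration, not the code of convertFastqToFastaAndQual: it assumes four-line FASTQ records and Phred+33 quality encoding, and the function name `fastq_to_fasta_qual` is hypothetical.

```python
def fastq_to_fasta_qual(fastq_text, offset=33):
    """Split four-line FASTQ records into FASTA text and QUAL text.

    Assumes every record is exactly four lines and that qualities are
    stored as Phred+offset characters (offset=33 is an assumption).
    """
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta, qual = [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, quals = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record %d" % (i // 4 + 1))
        if len(seq) != len(quals):
            raise ValueError("sequence/quality length mismatch")
        name = header[1:]
        # FASTA keeps the bases; QUAL keeps one integer score per base.
        fasta.append(">%s\n%s" % (name, seq))
        scores = " ".join(str(ord(c) - offset) for c in quals)
        qual.append(">%s\n%s" % (name, scores))
    return "\n".join(fasta) + "\n", "\n".join(qual) + "\n"
```

The two returned strings correspond to the .fasta and .qual files that convert-fasta-to-v2.pl then takes through its -s and -q options.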

4. Select 25X longest corrected long reads

gatekeeper -T -F -o asm.gkpStore trimmed.pacbio.frg
gatekeeper -dumpfasta 25X_Clr -longestlength 0 116250000 asm.gkpStore
gatekeeper -dumpfrg -longestlength 0 116250000 asm.gkpStore > 25X_Clr.frg
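
The value 116250000 passed to gatekeeper is the base budget for 25X coverage of the 4 650 000 bp genome size given in step 1. The Python sketch below reproduces that arithmetic and illustrates the idea of keeping the longest reads until the budget is reached. The function names are hypothetical, and gatekeeper's exact behaviour at the boundary may differ from this simple loop.

```python
def coverage_budget(genome_size, coverage):
    """Total number of bases needed for the requested fold coverage."""
    return genome_size * coverage


def select_longest(read_lengths, budget):
    """Pick reads longest-first until their total length reaches the budget.

    Illustration only: this is not gatekeeper's implementation.
    """
    chosen, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        chosen.append(length)
        total += length
    return chosen


print(coverage_budget(4650000, 25))  # 116250000
```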

5. Assemble

runCA -p asm -d asm -s asm.spec 25X_Clr.frg

Dataset 5 (E. coli K-12 MG1655, 17 SMRT cells)

We assembled all SMRT cells and, in addition, three random subsets each of four, six and eight SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list. (more detail)
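
The page does not say how the random subsets were drawn. One possible way to draw three replicates of each subset size is sketched below in Python; the cell names, the seed and the function name `draw_subsets` are placeholders, not part of the original analysis.

```python
import random


def draw_subsets(cells, sizes, replicates, seed=None):
    """Return {size: [subset, ...]} with `replicates` random draws per size."""
    rng = random.Random(seed)
    return {
        size: [sorted(rng.sample(cells, size)) for _ in range(replicates)]
        for size in sizes
    }


# 17 placeholder names standing in for the SMRT cells of this dataset.
cells = ["cell%02d" % i for i in range(1, 18)]
subsets = draw_subsets(cells, sizes=(4, 6, 8), replicates=3, seed=0)
for size, draws in subsets.items():
    for n, draw in enumerate(draws, start=1):
        print(size, n, draw)
```

Fixing the seed makes the draws repeatable, which helps when the same subsets have to be assembled again.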

Performance

Statistics without reference	All Data	4 SMRT cells : 1st Set	4 SMRT cells : 2nd Set	4 SMRT cells : 3rd Set	6 SMRT cells : 1st Set	6 SMRT cells : 2nd Set	6 SMRT cells : 3rd Set	8 SMRT cells : 1st Set	8 SMRT cells : 2nd Set	8 SMRT cells : 3rd Set
# contigs	1	1	1	5	2	2	1	4	1	2
Largest contig	4 651 604	4 647 117	4 648 057	3 447 068	3 749 516	2 770 859	4 649 699	1 679 082	4 649 323	4 189 785
Total length	4 651 604	4 647 117	4 648 057	4 661 453	4 645 941	4 657 272	4 649 699	4 655 949	4 649 323	4 652 482
N50	4 651 604	4 647 117	4 648 057	3 447 068	3 749 516	2 770 859	4 649 699	1 159 845	4 649 323	4 189 785
Misassemblies
# misassemblies	9	9	8	9	7	9	10	7	10	8
Misassembled contigs length	4 651 604	4 647 117	4 648 057	3 447 068	4 645 941	4 657 272	4 649 699	2 143 406	4 649 323	4 189 785
Mismatches
# mismatches per 100kbp	0.34	1.03	0.69	0.78	0.69	0.56	0.58	0.75	0.84	0.75
# indels per 100kbp	0.6	7.44	5.78	5.67	1.88	2.72	1.66	1.6	1.9	2.65
# N's per 100kbp	0	0.02	0	0	0	0	0	0	0.02	0
Genome Statistics
Genome fraction(%)	100	100	100	99.949	99.956	100	100	99.959	100	99.993
Duplication ratio	1.003	1.002	1.002	1.005	1.002	1.005	1.002	1.004	1.002	1.003
# genes	4494+3 part	4494+3 part	4494+3 part	4490+5 part	4491+2 part	4495+2 part	4494+3 part	4489+6 part	4494+3 part	4493+4 part
NGA50	2 834 925	949 217	949 242	656 513	2 796 469	1 040 965	3 026 388	949 298	3 027 267	949 289
Running Time
pacBioToCA	48hr 16m	4hr 58m	5hr 48m	5hr 10m	11hr 09m	9hr 34m	10hr 47m	21hr 06m	22hr 05m	21hr 23m
runCA	15hr 48m	15hr 22m	13hr 50m	11hr 20m	12hr 38m	11hr 44m	13hr 48m	11hr 37m	14hr 36m	13hr 40m
Total	64hr 04m	20hr 20m	19hr 38m	16hr 30m	23hr 47m	21hr 18m	24hr 35m	32hr 43m	36hr 41m	35hr 03m
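
As a reminder of what the N50 row reports: N50 is the contig length L such that contigs of length L or longer together cover at least half of the total assembly length. A minimal Python sketch of that definition follows; the function name is hypothetical, and QUAST's own implementation is not reproduced here.

```python
def n50(contig_lengths):
    """Length L such that contigs >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Single-contig assembly, as in the 'All Data' column above.
print(n50([4651604]))  # 4651604
```

Whenever the largest contig alone covers at least half of the total length, N50 equals the largest contig, which is why the two rows agree in most columns. In the "8 SMRT cells : 1st Set" column the largest contig (1 679 082) is less than half of the total length (4 655 949), so N50 (1 159 845) falls on a smaller contig.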

Dataset 6 (E. coli K-12 MG1655, 8 SMRT cells)

We assembled all SMRT cells and, in addition, three random subsets each of four and six SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_000913) and Ec_gene_list. (more detail)

Performance

Statistics without reference	All Data	4 SMRT cells : 1st Set	4 SMRT cells : 2nd Set	4 SMRT cells : 3rd Set	6 SMRT cells : 1st Set	6 SMRT cells : 2nd Set	6 SMRT cells : 3rd Set
# contigs	2	8	10	14	1	1	4
Largest contig	4 278 957	2 277 010	1 213 670	984 459	4 641 350	4 640 250	3 162 440
Total length	4 650 771	4 648 304	4 644 602	4 656 274	4 641 350	4 640 250	4 653 394
N50	4 278 957	622 425	800 993	565 251	4 641 350	4 640 250	3 162 440
Misassemblies
# misassemblies	9	9	9	8	8	8	9
Misassembled contigs length	4 278 957	2 809 129	2 085 482	1 947 163	4 641 350	4 640 250	3 209 090
Mismatches
# mismatches per 100kbp	0.37	2.49	1.88	5.38	0.69	0.67	0.86
# indels per 100kbp	3.58	53.34	45.82	73.07	10.65	11.28	10.46
# N's per 100kbp	0	0.04	0.02	0.09	0	0	0
Genome Statistics
Genome fraction(%)	99.993	99.733	99.67	99.693	99.972	99.946	99.968
Duplication ratio	1.002	1.005	1.006	1.007	1.001	1.001	1.003
# genes	4492+5 part	4475+10 part	4467+12 part	4469+13 part	4492+4 part	4491+4 part	4492+4 part
NGA50	859 464	621 281	572 455	436 292	1 098 529	1 096 784	859 502
Running Time
pacBioToCA	20hr 03m	5hr 52m	6hr 05m	5hr 19m	15hr 53m	14hr 47m	15hr 38m
runCA	15hr 41m	7hr 32m	7hr 10m	5hr 42m	15hr 44m	16hr 02m	13hr 27m
Total	35hr 44m	13hr 24m	13hr 15m	11hr 01m	31hr 37m	30hr 49m	29hr 05m
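
In the running-time tables on this page, the Total row is the sum of the pacBioToCA and runCA rows. The small Python sketch below shows how such a total can be recomputed from the table's "HHhr MMm" strings; the helper names `to_minutes` and `fmt` are hypothetical.

```python
def to_minutes(text):
    """Convert a string such as '20hr 03m' to a number of minutes."""
    hours, minutes = text.split()
    return int(hours[:-2]) * 60 + int(minutes[:-1])


def fmt(minutes):
    """Format minutes back into the table's 'HHhr MMm' style."""
    return "%dhr %02dm" % divmod(minutes, 60)


# Values copied from the Dataset 6 'All Data' column.
total = to_minutes("20hr 03m") + to_minutes("15hr 41m")
print(fmt(total))  # 35hr 44m
```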

Dataset 7 (M. ruber DSM1279, 4 SMRT cells)

We assembled all SMRT cells and evaluated the assembly with QUAST against the reference genome (NC_013946) and Mr_gene_list. (more detail)

Performance

Statistics without reference	All Data
# contigs	2
Largest contig	2 974 307
Total length	3 100 289
N50	2 974 307
Misassemblies
# misassemblies	3
Misassembled contigs length	2 974 307
Mismatches
# mismatches per 100kbp	0.23
# indels per 100kbp	5.01
# N's per 100kbp	0.03
Genome Statistics
Genome fraction(%)	99.883
Duplication ratio	1.002
# genes	3093+4 part
NGA50	1 707 938
Running Time
pacBioToCA	7hr 35m
runCA	8hr 07m
Total	15hr 42m

Dataset 8 (P. heparinus DSM2366, 7 SMRT cells)

We assembled all SMRT cells and, in addition, three random subsets of four SMRT cells, and evaluated the assemblies with QUAST against the reference genome (NC_013061) and Ph_gene_list. (more detail)

Performance

Statistics without reference	All Data	4 SMRT cells : 1st Set	4 SMRT cells : 2nd Set	4 SMRT cells : 3rd Set
# contigs	1	3	3	3
Largest contig	5 163 983	2 232 679	2 236 613	2 237 949
Total length	5 163 983	5 161 276	5 165 518	5 166 563
N50	5 163 983	2 043 590	2 044 147	2 135 225
Misassemblies
# misassemblies	0	0	0	0
Misassembled contigs length	0	0	0	0
Mismatches
# mismatches per 100kbp	8.41	9.96	8.27	10.29
# indels per 100kbp	2.19	18.99	13.13	14.01
# N's per 100kbp	0	0	0	0
Genome Statistics
Genome fraction(%)	99.919	99.864	99.907	99.89
Duplication ratio	1	1	1.001	1.001
# genes	4335+3 part	4330+5 part	4333+5 part	4333+3 part
NGA50	5 163 983 532	2 043 590	2 044 147	2 135 225
Running Time
pacBioToCA	18hr 55m	6hr 27m	6hr 34m	6hr 31m
runCA	21hr 36m	11hr 39m	12hr 26m	12hr 12m
Total	40hr 31m	18hr 06m	19hr 00m	18hr 43m

Dataset 9 (E. coli K-12, P4-C2 chemistry, 20 Kbp, 1 SMRT cell)

We assembled the data from the single SMRT cell and evaluated the assembly with QUAST against the reference genome (NC_000913) and Ec_gene_list. (more detail)

Performance

Statistics without reference	All Data
# contigs	1
Largest contig	4 656 257
Total length	4 656 257
N50	4 656 257
Misassemblies
# misassemblies	8
Misassembled contigs length	4 656 257
Mismatches
# mismatches per 100kbp	0.22
# indels per 100kbp	13.15
# N's per 100kbp	0
Genome Statistics
Genome fraction(%)	100
Duplication ratio	1.004
# genes	4494+3 part
NGA50	3 026 094
Running Time
pacBioToCA	13hr 01m
runCA	17hr 58m
Total	30hr 59m