PacBioToCA

When running pacBioToCA, we found that the amount of PacBio corrected reads (PBcR) produced depended on the genomeSize parameter, especially when the coverage of the PacBio RS data was high (>50X).

Short reads: 118X. Long reads: reads from one to four SMRT cells, corrected with and without genomeSize specified.


| Name | Metric | m120228_192221 | m120228_210845 | Two SMRT cells | Three SMRT cells | Four SMRT cells |
|---|---|---|---|---|---|---|
| Filtered_subreads | seqs amount | 38542 | 44794 | 77117 | 113284 | 136333 |
| | seq avg len | 2322.679985 | 2334.414140 | 2184.208709 | 2333.977711 | 2386.664674 |
| | total (Mb) | 89.52 | 104.57 | 168.44 | 264.40 | 325.38 |
| | depth (X) | 19.25 | 22.49 | 36.22 | 56.86 | 69.97 |
| without genome size | seqs amount | 35199 | 40811 | 64201 | 99285 | 120296 |
| | seq avg len | 2095.143186 | 2086.568670 | 2150.165184 | 2221.782394 | 2252.656963 |
| | total (Mb) | 73.75 | 85.15 | 138.04 | 220.59 | 270.99 |
| | depth (X) | 15.86 | 18.31 | 29.69 | 47.44 | 58.28 |
| genomeSize=4650000 | seqs amount | 34852 | 40486 | 63411 | 70468 | 56298 |
| | seq avg len | 2130.841559 | 2120.237712 | 2198.455315 | 2815.903020 | 3495.604515 |
| | total (Mb) | 74.26 | 85.84 | 139.41 | 198.43 | 196.80 |
| | depth (X) | 15.97 | 18.46 | 29.98 | 42.67 | 42.32 |
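
The depth values in the table are consistent with dividing the total number of bases by the genome size of 4,650,000 bp. The following is a minimal sketch of that arithmetic, using the filtered subread figures for m120228_192221; the function names are illustrative and not part of pacBioToCA.

```python
GENOME_SIZE_BP = 4_650_000  # the genomeSize value used on this page


def total_bases(num_seqs, avg_len_bp):
    """Total bases = number of sequences x average sequence length."""
    return num_seqs * avg_len_bp


def depth(total_bp, genome_size_bp=GENOME_SIZE_BP):
    """Sequencing depth (X) = total bases / genome size."""
    return total_bp / genome_size_bp


# Filtered subreads of SMRT cell m120228_192221 (values from the table).
subread_total = total_bases(38542, 2322.679985)
subread_depth = depth(subread_total)

print(f"total: {subread_total / 1e6:.2f} Mb")
print(f"depth: {subread_depth:.2f}X")
```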

Here, we adjusted Celera Assembler parameters to make overlap detection more stringent (ref).

runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg

Set <ovl value> to approximately 40% of the average corrected sequence length. As a general rule: if the average corrected length is less than 2.5 kbp, use 1000; if it is less than 3 kbp, use 1500; if it is less than 5.5 kbp, use 2000; if it is between 5.5 kbp and 6.5 kbp, use 2500; and if it is greater than 6.5 kbp, use 3000.
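
The rule of thumb above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and the handling of lengths exactly at 5.5 kbp or 6.5 kbp is an assumption, since the rule does not say which side those boundaries fall on.

```python
def suggest_ovl_min_len(avg_corrected_len_bp):
    """Map the average corrected read length (bp) to an ovlMinLen value.

    Thresholds follow the rule of thumb in the text. Treating exactly
    5500 bp and 6500 bp as belonging to the 2500 band is an assumption.
    """
    if avg_corrected_len_bp < 2500:
        return 1000
    if avg_corrected_len_bp < 3000:
        return 1500
    if avg_corrected_len_bp < 5500:
        return 2000
    if avg_corrected_len_bp <= 6500:
        return 2500
    return 3000


# Illustrative average lengths, one per band of the rule.
for avg_len in (2200.0, 2800.0, 4800.0, 6000.0, 7000.0):
    print(f"avg {avg_len:.0f} bp -> ovlMinLen={suggest_ovl_min_len(avg_len)}")
```

The returned value is what would be substituted for <ovl value> in the runCA command.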


Evaluation

We evaluated the assemblies with QUAST 2.3 (reference genome: NC_000913; gene list: Ec_gene_list).

| Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
|---|---|---|---|---|---|
| # contigs | 57 | 68 | 82 | 95 | 79 |
| Largest contig | 500519 | 1040738 | 671836 | 803212 | 1099298 |
| Total length | 4826341 | 4903100 | 5063756 | 5265232 | 5225872 |
| N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
| Misassemblies | | | | | |
| # misassemblies | 19 | 26 | 21 | 33 | 36 |
| Misassembled contigs length | 987578 | 2460246 | 2026388 | 1623080 | 2824659 |
| Mismatches | | | | | |
| # mismatches per 100 kbp | 9.61 | 9.4 | 9.120 | 10.17 | 7.09 |
| # indels per 100 kbp | 2.91 | 2.5 | 2.89 | 1.23 | 0.65 |
| # N's per 100 kbp | 0.21 | 0.12 | 0.08 | 0.15 | 0.1 |
| Genome statistics | | | | | |
| Genome fraction (%) | 99.989 | 99.972 | 100 | 100 | 100 |
| Duplication ratio | 1.042 | 1.06 | 1.094 | 1.138 | 1.127 |
| # genes | 4486 + 11 part | 4492 + 5 part | 4494 + 3 part | 4495 + 2 part | 4495 + 2 part |
| NGA50 | 187140 | 484760 | 391908 | 506966 | 526088 |
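
As a reminder of what the N50 row measures: N50 is the contig length such that contigs of that length or longer contain at least half of the total assembly length. The sketch below shows the calculation on made-up contig lengths. It is not QUAST's code, and the function name is illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy contig lengths, not taken from the assemblies on this page.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```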

We then discarded the contigs to which fewer than 100 reads aligned.
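
The filtering step can be sketched as follows, assuming a per-contig count of aligned reads is already available. How those counts were obtained is not described here, so the contig names and counts below are invented for illustration.

```python
MIN_ALIGNED_READS = 100  # threshold stated in the text


def keep_supported_contigs(read_counts, min_reads=MIN_ALIGNED_READS):
    """Return names of contigs with at least `min_reads` aligned reads."""
    return [name for name, count in read_counts.items() if count >= min_reads]


# Hypothetical contig -> aligned read count mapping.
toy_counts = {"ctg1": 5230, "ctg2": 99, "ctg3": 100, "ctg4": 12}
kept_contigs = keep_supported_contigs(toy_counts)
print(kept_contigs)
```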

| Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
|---|---|---|---|---|---|
| # contigs | 26 | 23 | 18 | 17 | 14 |
| Largest contig | 500519 | 1040768 | 671836 | 803212 | 1099298 |
| Total length | 4663672 | 4668811 | 4688223 | 4693002 | 4703352 |
| N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
| Misassemblies | | | | | |
| # misassemblies | 8 | 10 | 13 | 9 | 11 |
| Misassembled contigs length | 708152 | 2407429 | 1389781 | 1480741 | 2433888 |
| Mismatches | | | | | |
| # mismatches per 100 kbp | 8.76 | 8.5 | 8.99 | 9.48 | 7.01 |
| # indels per 100 kbp | 2.57 | 1.64 | 2.03 | 0.97 | 0.67 |
| # N's per 100 kbp | 0.09 | 0.04 | 0.02 | 0.02 | 0 |
| Genome statistics | | | | | |
| Genome fraction (%) | 99.895 | 99.596 | 99.746 | 99.548 | 99.931 |
| Duplication ratio | 1.006 | 1.011 | 1.014 | 1.016 | 1.014 |
| # genes | 4478 + 16 part | 4474 + 12 part | 4477 + 9 part | 4470 + 8 part | 4489 + 6 part |
| NGA50 | 187140 | 407626 | 275303 | 477000 | 526088 |


The PBcR were then filtered down to the longest 25X of sequences and assembled with runCA.

| Name | Metric | Two SMRT cells | Three SMRT cells | Four SMRT cells |
|---|---|---|---|---|
| genomeSize=4650000, 25X | seqs amount | 40382 | 24448 | 21641 |
| | seq avg len | 2878.787529 | 4754.996196 | 5371.762719 |
| | total (Mb) | 116.25 | 116.25 | 116.25 |
| | depth (X) | 25.00 | 25.00 | 25.00 |
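
A 25X target on a 4,650,000 bp genome corresponds to 116.25 Mb, which matches the totals in the table. The sketch below shows one way to pick the longest reads up to such a target. It is an illustration under stated assumptions, not the tool that was actually used: the function name is hypothetical, the read lengths are toy values, and this version may overshoot the target by up to one read.

```python
GENOME_SIZE_BP = 4_650_000
TARGET_DEPTH = 25.0


def select_longest(read_lengths, genome_size_bp, target_depth):
    """Keep the longest reads until their total reaches depth x genome size."""
    target_bases = genome_size_bp * target_depth
    kept = []
    kept_bases = 0
    for length in sorted(read_lengths, reverse=True):
        if kept_bases >= target_bases:
            break
        kept.append(length)
        kept_bases += length
    return kept


# Target for this page's data: 116.25 Mb.
target_mb = GENOME_SIZE_BP * TARGET_DEPTH / 1e6
print(f"target: {target_mb:.2f} Mb")

# Toy run: a 400 bp "genome" at 25X needs 10,000 bases.
toy_kept = select_longest([5000, 1000, 3000, 2000, 4000],
                          genome_size_bp=400, target_depth=25.0)
print(toy_kept)
```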


| Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
|---|---|---|---|
| # contigs | 50 | 44 | 42 |
| Largest contig | 1214624 | 1487482 | 2034167 |
| Total length | 4914251 | 4967382 | 4937523 |
| N50 | 672326 | 1340503 | 907105 |
| Misassemblies | | | |
| # misassemblies | 16 | 25 | 26 |
| Misassembled contigs length | 1470460 | 2166145 | 3149834 |
| Mismatches | | | |
| # mismatches per 100 kbp | 7.41 | 5.99 | 6.14 |
| # indels per 100 kbp | 1.47 | 0.71 | 2.76 |
| # N's per 100 kbp | 0 | 0.1 | 0.06 |
| Genome statistics | | | |
| Genome fraction (%) | 100 | 100 | 99.995 |
| Duplication ratio | 1.061 | 1.071 | 1.066 |
| # genes | 4492 + 5 part | 4495 + 2 part | 4495 + 2 part |
| NGA50 | 367128 | 1340503 | 562428 |

We again discarded the contigs to which fewer than 100 reads aligned.

| Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
|---|---|---|---|
| # contigs | 17 | 6 | 10 |
| Largest contig | 1214624 | 1487482 | 2034167 |
| Total length | 4691318 | 4671508 | 4678847 |
| N50 | 672326 | 1340503 | 907105 |
| Misassemblies | | | |
| # misassemblies | 10 | 6 | 10 |
| Misassembled contigs length | 1424020 | 2055756 | 3075921 |
| Mismatches | | | |
| # mismatches per 100 kbp | 7.13 | 6.05 | 6.53 |
| # indels per 100 kbp | 1.55 | 0.67 | 1.27 |
| # N's per 100 kbp | 0 | 0.06 | 0 |
| Genome statistics | | | |
| Genome fraction (%) | 99.997 | 99.824 | 99.961 |
| Duplication ratio | 1.011 | 1.009 | 1.01 |
| # genes | 4488 + 9 part | 4485 + 4 part | 4491 + 4 part |
| NGA50 | 315240 | 1008667 | 572351 |

Summary

  1. If you have high-coverage PacBio RS data (>50X), specify genomeSize when running pacBioToCA.
  2. If you have high-coverage PBcR, keep only the longest 25X of the post-correction sequences for assembly.
  3. Although we had high-coverage PacBio RS data (four SMRT cells, ~70X subreads) and followed the procedure (ref) to process the data, we did not obtain a complete genome assembly.