When running pacBioToCA, we found that the amount of PBcR output depended on the genomeSize parameter, especially when the coverage of the PacBio RS data was high (>50X).
Short reads: 118X; long reads: reads from one to four SMRT cells; each corrected with and without genomeSize set.
Name | m120228_192221 | m120228_210845 | Two SMRT cells | Three SMRT cells | Four SMRT cells |
Filtered_subreads | |||||
Number of sequences | 38542 | 44794 | 77117 | 113284 | 136333 |
Average length (bp) | 2322.679985 | 2334.414140 | 2184.208709 | 2333.977711 | 2386.664674 |
Total (Mb) | 89.52 | 104.57 | 168.44 | 264.40 | 325.38 |
Depth (X) | 19.25 | 22.49 | 36.22 | 56.86 | 69.97 |
without genome size | |||||
Number of sequences | 35199 | 40811 | 64201 | 99285 | 120296 |
Average length (bp) | 2095.143186 | 2086.568670 | 2150.165184 | 2221.782394 | 2252.656963 |
Total (Mb) | 73.75 | 85.15 | 138.04 | 220.59 | 270.99 |
Depth (X) | 15.86 | 18.31 | 29.69 | 47.44 | 58.28 |
genomeSize=4650000 | |||||
Number of sequences | 34852 | 40486 | 63411 | 70468 | 56298 |
Average length (bp) | 2130.841559 | 2120.237712 | 2198.455315 | 2815.903020 | 3495.604515 |
Total (Mb) | 74.26 | 85.84 | 139.41 | 198.43 | 196.80 |
Depth (X) | 15.97 | 18.46 | 29.98 | 42.67 | 42.32 |
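The depth values in the table are consistent with total bases divided by the genome size of 4,650,000 bp. A minimal Python sketch of that arithmetic follows; the function names `depth` and `summarize` are illustrative and not part of pacBioToCA.

```python
GENOME_SIZE = 4_650_000  # bp, the genomeSize value used in this note


def depth(total_bases, genome_size=GENOME_SIZE):
    """Fold coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def summarize(lengths, genome_size=GENOME_SIZE):
    """Compute the four per-dataset statistics listed in the table."""
    total = sum(lengths)
    return {
        "seqs": len(lengths),
        "avg_len": total / len(lengths) if lengths else 0.0,
        "total_mb": total / 1e6,
        "depth": depth(total, genome_size),
    }


# Filtered subreads of m120228_192221: 89.52 Mb in total
print(round(depth(89.52e6), 2))  # 19.25
```

Applying `summarize` to the read lengths of a FASTA file would reproduce one column of the table.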
Here, we adjusted Celera Assembler parameters to make overlap detection more stringent (ref).
runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg
The <ovl value> parameter was set to approximately 40% of the average corrected sequence length. As a general rule: if the average corrected length is less than 2.5 kbp, set it to 1000; if it is less than 3 kbp, set it to 1500; if it is less than 5.5 kbp, set it to 2000; if it is between 5.5 kbp and 6.5 kbp, set it to 2500; and if it is greater than 6.5 kbp, set it to 3000.
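The rule of thumb above can be written as a small lookup. This is only a sketch: the function names are hypothetical, and the handling of the exact boundary values (5.5 kbp and 6.5 kbp fall into the 2500 bin) is an assumption, since the rule does not specify them. The command string reuses the runCA options quoted above unchanged.

```python
def choose_ovl_min_len(avg_corrected_len):
    """Map the average corrected read length (bp) to an ovlMinLen value."""
    if avg_corrected_len < 2500:
        return 1000
    if avg_corrected_len < 3000:
        return 1500
    if avg_corrected_len < 5500:
        return 2000
    if avg_corrected_len <= 6500:  # boundary handling is an assumption
        return 2500
    return 3000


def run_ca_command(avg_corrected_len, frg="viaMiseq.frg"):
    """Build the runCA command line from this note with ovlMinLen filled in."""
    ovl = choose_ovl_min_len(avg_corrected_len)
    return (
        "runCA unitigger=bogart merSize=14 "
        f"ovlMinLen={ovl} utgErrorRate=0.015 utgGraphErrorRate=0.015 "
        "utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 "
        f"-p asm -d asm {frg}"
    )


# Four SMRT cells corrected with genomeSize=4650000: average length 3495.6 bp
print(run_ca_command(3495.604515))
```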
We evaluated the assemblies with QUAST 2.3 (reference genome NC_000913 and Ec_gene_list).
Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
# contigs | 57 | 68 | 82 | 95 | 79 |
Largest contig | 500519 | 1040738 | 671836 | 803212 | 1099298 |
Total length | 4826341 | 4903100 | 5063756 | 5265232 | 5225872 |
N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
Misassemblies | |||||
# misassemblies | 19 | 26 | 21 | 33 | 36 |
Misassembled contigs length | 987578 | 2460246 | 2026388 | 1623080 | 2824659 |
Mismatches | |||||
# mismatches per 100 kbp | 9.61 | 9.4 | 9.120 | 10.17 | 7.09 |
# indels per 100 kbp | 2.91 | 2.5 | 2.89 | 1.23 | 0.65 |
# N's per 100 kbp | 0.21 | 0.12 | 0.08 | 0.15 | 0.1 |
Genome statistics | |||||
Genome fraction (%) | 99.989 | 99.972 | 100 | 100 | 100 |
Duplication ratio | 1.042 | 1.06 | 1.094 | 1.138 | 1.127 |
# genes | 4486 + 11 part | 4492 + 5 part | 4494 + 3 part | 4495 + 2 part | 4495 + 2 part |
NGA50 | 187140 | 484760 | 391908 | 506966 | 526088 |
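For readers unfamiliar with the N50 row: N50 is the contig length such that contigs of that length or longer cover at least half of the total assembly length. The sketch below illustrates this definition only. It is not QUAST's code, and it does not cover NGA50, which additionally depends on alignments to the reference.

```python
def n50(lengths):
    """Smallest contig length L such that contigs >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Toy example: total length 32, and the two longest contigs already cover 16
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```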
We discarded the contigs to which fewer than 100 reads aligned.
Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
# contigs | 26 | 23 | 18 | 17 | 14 |
Largest contig | 500519 | 1040768 | 671836 | 803212 | 1099298 |
Total length | 4663672 | 4668811 | 4688223 | 4693002 | 4703352 |
N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
Misassemblies | |||||
# misassemblies | 8 | 10 | 13 | 9 | 11 |
Misassembled contigs length | 708152 | 2407429 | 1389781 | 1480741 | 2433888 |
Mismatches | |||||
# mismatches per 100 kbp | 8.76 | 8.5 | 8.99 | 9.48 | 7.01 |
# indels per 100 kbp | 2.57 | 1.64 | 2.03 | 0.97 | 0.67 |
# N's per 100 kbp | 0.09 | 0.04 | 0.02 | 0.02 | 0 |
Genome statistics | |||||
Genome fraction (%) | 99.895 | 99.596 | 99.746 | 99.548 | 99.931 |
Duplication ratio | 1.006 | 1.011 | 1.014 | 1.016 | 1.014 |
# genes | 4478 + 16 part | 4474 + 12 part | 4477 + 9 part | 4470 + 8 part | 4489 + 6 part |
NGA50 | 187140 | 407626 | 275303 | 477000 | 526088 |
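The contig filter used for the table above can be sketched as follows. The note does not say how the per-contig read counts were obtained, so the input dictionary and the function name here are hypothetical; only the threshold of 100 reads comes from the text.

```python
MIN_READS = 100  # threshold stated in this note


def keep_contigs(read_counts, min_reads=MIN_READS):
    """Return the names of contigs supported by at least min_reads aligned reads.

    read_counts maps contig name -> number of reads aligned to that contig.
    """
    return [name for name, count in read_counts.items() if count >= min_reads]


# Made-up counts for illustration only
counts = {"ctg1": 250, "ctg2": 99, "ctg3": 100}
print(keep_contigs(counts))  # ['ctg1', 'ctg3']
```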
The PBcR were then filtered down to 25X and assembled with runCA.
Name | Two SMRT cells | Three SMRT cells | Four SMRT cells |
genomeSize=4650000, 25X | |||
Number of sequences | 40382 | 24448 | 21641 |
Average length (bp) | 2878.787529 | 4754.996196 | 5371.762719 |
Total (Mb) | 116.25 | 116.25 | 116.25 |
Depth (X) | 25.00 | 25.00 | 25.00 |
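The 25X target corresponds to 25 × 4,650,000 bp = 116.25 Mb, which matches the totals in the table. The note does not state how the reads were selected. The higher average lengths after filtering suggest that the longest reads were kept, but that is an assumption. Under that assumption, the selection could look like the following sketch, where `downsample_longest` is a hypothetical helper.

```python
def downsample_longest(lengths, genome_size, target_depth):
    """Keep the longest reads until genome_size * target_depth bases are reached.

    Assumes longest-first selection; the note does not confirm this strategy.
    """
    target_bases = genome_size * target_depth
    kept, total = [], 0
    for length in sorted(lengths, reverse=True):
        if total >= target_bases:
            break
        kept.append(length)
        total += length
    return kept


# Target for this note: 25X of a 4.65 Mb genome
print(25 * 4_650_000 / 1e6)  # 116.25
```

Because whole reads are kept, the retained total can exceed the target by up to one read length.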
Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
# contigs | 50 | 44 | 42 |
Largest contig | 1214624 | 1487482 | 2034167 |
Total length | 4914251 | 4967382 | 4937523 |
N50 | 672326 | 1340503 | 907105 |
Misassemblies | |||
# misassemblies | 16 | 25 | 26 |
Misassembled contigs length | 1470460 | 2166145 | 3149834 |
Mismatches | |||
# mismatches per 100 kbp | 7.41 | 5.99 | 6.14 |
# indels per 100 kbp | 1.47 | 0.71 | 2.76 |
# N's per 100 kbp | 0 | 0.1 | 0.06 |
Genome statistics | |||
Genome fraction (%) | 100 | 100 | 99.995 |
Duplication ratio | 1.061 | 1.071 | 1.066 |
# genes | 4492 + 5 part | 4495 + 2 part | 4495 + 2 part |
NGA50 | 367128 | 1340503 | 562428 |
We discarded the contigs to which fewer than 100 reads aligned.
Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
# contigs | 17 | 6 | 10 |
Largest contig | 1214624 | 1487482 | 2034167 |
Total length | 4691318 | 4671508 | 4678847 |
N50 | 672326 | 1340503 | 907105 |
Misassemblies | |||
# misassemblies | 10 | 6 | 10 |
Misassembled contigs length | 1424020 | 2055756 | 3075921 |
Mismatches | |||
# mismatches per 100 kbp | 7.13 | 6.05 | 6.53 |
# indels per 100 kbp | 1.55 | 0.67 | 1.27 |
# N's per 100 kbp | 0 | 0.06 | 0 |
Genome statistics | |||
Genome fraction (%) | 99.997 | 99.824 | 99.961 |
Duplication ratio | 1.011 | 1.009 | 1.01 |
# genes | 4488 + 9 part | 4485 + 4 part | 4491 + 4 part |
NGA50 | 315240 | 1008667 | 572351 |