When running pacBioToCA, we found that the amount of PBcR (PacBio corrected reads) produced depended on the genomeSize parameter, especially when the coverage of the PacBio RS data was high (>50X).
Input data: short reads at 118X; long reads from one to four SMRT cells; correction run with and without genomeSize.

| Name | Metric | m120228_192221 | m120228_210845 | Two SMRT cells | Three SMRT cells | Four SMRT cells |
|---|---|---|---|---|---|---|
| Filtered_subreads | seqs amount | 38542 | 44794 | 77117 | 113284 | 136333 |
| | seq avg len (bp) | 2322.679985 | 2334.414140 | 2184.208709 | 2333.977711 | 2386.664674 |
| | total (Mb) | 89.52 | 104.57 | 168.44 | 264.40 | 325.38 |
| | depth (X) | 19.25 | 22.49 | 36.22 | 56.86 | 69.97 |
| without genome size | seqs amount | 35199 | 40811 | 64201 | 99285 | 120296 |
| | seq avg len (bp) | 2095.143186 | 2086.568670 | 2150.165184 | 2221.782394 | 2252.656963 |
| | total (Mb) | 73.75 | 85.15 | 138.04 | 220.59 | 270.99 |
| | depth (X) | 15.86 | 18.31 | 29.69 | 47.44 | 58.28 |
| genomeSize=4650000 | seqs amount | 34852 | 40486 | 63411 | 70468 | 56298 |
| | seq avg len (bp) | 2130.841559 | 2120.237712 | 2198.455315 | 2815.903020 | 3495.604515 |
| | total (Mb) | 74.26 | 85.84 | 139.41 | 198.43 | 196.80 |
| | depth (X) | 15.97 | 18.46 | 29.98 | 42.67 | 42.32 |

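
The depth values reported above are consistent with dividing the total number of bases by the 4.65 Mb genome size (the same value passed as genomeSize). The sketch below reproduces two of them; the function name `sequencing_depth` is ours, for illustration only, and is not part of pacBioToCA.

```python
def sequencing_depth(total_bases: float, genome_size: int) -> float:
    """Return average depth of coverage (X) as total bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 4_650_000  # genomeSize value used in this write-up

# Totals (in Mb) copied from the Filtered_subreads rows reported above.
examples = [("m120228_192221", 89.52), ("Four SMRT cells", 325.38)]

for label, total_mb in examples:
    depth = sequencing_depth(total_mb * 1e6, GENOME_SIZE)
    print(f"{label}: {depth:.2f}X")
```

The same calculation applied to the corrected-read totals gives the depths listed for the runs with and without genomeSize.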
Here, we adjusted Celera Assembler parameters to make overlap detection more stringent (ref).
runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg
The <ovl value> parameter should be set to approximately 40% of the average corrected sequence length. As a general rule, use 1000 if the average corrected length is below 2.5 kbp, 1500 if it is below 3 kbp, 2000 if it is below 5.5 kbp, 2500 if it is above 5.5 kbp, and 3000 if it is above 6.5 kbp.
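
The rule of thumb above can be written as a small lookup. This is only a sketch of the thresholds as stated; the function name `suggest_ovl_min_len` is hypothetical, and the handling of lengths exactly equal to 5.5 kbp or 6.5 kbp is our assumption, since the rule does not specify it.

```python
def suggest_ovl_min_len(avg_corrected_len: float) -> int:
    """Map an average corrected read length (bp) to an ovlMinLen value.

    Thresholds follow the rule of thumb in the text. Assumption: a length
    of exactly 5500 or exactly 6500 bp falls into the 2500 bucket.
    """
    if avg_corrected_len <= 0:
        raise ValueError("avg_corrected_len must be positive")
    if avg_corrected_len < 2500:
        return 1000
    if avg_corrected_len < 3000:
        return 1500
    if avg_corrected_len < 5500:
        return 2000
    if avg_corrected_len <= 6500:
        return 2500
    return 3000


# Example average lengths (bp), chosen to hit each bucket.
for avg_len in (2200.0, 2800.0, 4700.0, 6000.0, 7000.0):
    print(avg_len, "->", suggest_ovl_min_len(avg_len))
```

The returned value would then be substituted for `<ovl value>` in the runCA command line.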

| Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
|---|---|---|---|---|---|
| # contigs | 57 | 68 | 82 | 95 | 79 |
| Largest contig | 500519 | 1040738 | 671836 | 803212 | 1099298 |
| Total length | 4826341 | 4903100 | 5063756 | 5265232 | 5225872 |
| N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
| Misassemblies | | | | | |
| # misassemblies | 19 | 26 | 21 | 33 | 36 |
| Misassembled contigs length | 987578 | 2460246 | 2026388 | 1623080 | 2824659 |
| Mismatches | | | | | |
| # mismatches per 100 kbp | 9.61 | 9.4 | 9.120 | 10.17 | 7.09 |
| # indels per 100 kbp | 2.91 | 2.5 | 2.89 | 1.23 | 0.65 |
| # N's per 100 kbp | 0.21 | 0.12 | 0.08 | 0.15 | 0.1 |
| Genome statistics | | | | | |
| Genome fraction (%) | 99.989 | 99.972 | 100 | 100 | 100 |
| Duplication ratio | 1.042 | 1.06 | 1.094 | 1.138 | 1.127 |
| # genes | 4486 + 11 part | 4492 + 5 part | 4494 + 3 part | 4495 + 2 part | 4495 + 2 part |
| NGA50 | 187140 | 484760 | 391908 | 506966 | 526088 |


| Statistics without reference | 192221_asm.ctg | 210845_asm.ctg | 2_asm.ctg | 3_asm.ctg | 4_asm.ctg |
|---|---|---|---|---|---|
| # contigs | 26 | 23 | 18 | 17 | 14 |
| Largest contig | 500519 | 1040768 | 671836 | 803212 | 1099298 |
| Total length | 4663672 | 4668811 | 4688223 | 4693002 | 4703352 |
| N50 | 245130 | 484760 | 462620 | 508833 | 544950 |
| Misassemblies | | | | | |
| # misassemblies | 8 | 10 | 12 | 6 | 14 |
| Misassembled contigs length | 953282 | 2407429 | 1989199 | 1480741 | 2703788 |
| Mismatches | | | | | |
| # mismatches per 100 kbp | 8.76 | 8.5 | 8.99 | 9.48 | 7.01 |
| # indels per 100 kbp | 2.72 | 1.82 | 2.59 | 0.97 | 0.69 |
| # N's per 100 kbp | 0.09 | 0.04 | 0.02 | 0.02 | 0 |
| Genome statistics | | | | | |
| Genome fraction (%) | 99.895 | 99.596 | 99.746 | 99.548 | 99.931 |
| Duplication ratio | 1.006 | 1.012 | 1.015 | 1.017 | 1.015 |
| # genes | 4478 + 16 part | 4474 + 12 part | 4477 + 9 part | 4470 + 8 part | 4489 + 6 part |
| NGA50 | 187140 | 484760 | 391908 | 506966 | 526088 |

The PBcR were then filtered down to the longest 25X of sequences and assembled with runCA.

| Name | Metric | Two SMRT cells | Three SMRT cells | Four SMRT cells |
|---|---|---|---|---|
| genomeSize=4650000, 25X | seqs amount | 40382 | 24448 | 21641 |
| | seq avg len (bp) | 2878.787529 | 4754.996196 | 5371.762719 |
| | total (Mb) | 116.25 | 116.25 | 116.25 |
| | depth (X) | 25.00 | 25.00 | 25.00 |

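
How the 25X subsets were extracted is not described here. One plausible way to keep the longest corrected reads up to a target depth is sketched below. It works on read lengths only; the function name `select_longest_to_depth` and the toy input are illustrative assumptions, and a real run would read the lengths from the corrected FASTA/FASTQ file.

```python
from typing import List, Sequence, Tuple


def select_longest_to_depth(
    read_lengths: Sequence[int],
    genome_size: int,
    target_depth: float = 25.0,
) -> Tuple[List[int], int]:
    """Pick the longest reads until their summed length reaches the target.

    Returns the selected lengths (longest first) and their total bases.
    The read that crosses the threshold is kept, so the total can slightly
    exceed target_depth * genome_size.
    """
    target_bases = target_depth * genome_size
    selected: List[int] = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected, total


# Toy example: a 100 bp "genome" and a 10X target, i.e. 1000 bases.
lengths = [100, 500, 300, 200, 400]
kept, total_bases = select_longest_to_depth(lengths, genome_size=100, target_depth=10)
print(kept, total_bases)
```

With genomeSize=4650000 and a 25X target, the threshold is 116.25 Mb, which matches the totals reported for the three 25X subsets.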

| Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
|---|---|---|---|
| # contigs | 50 | 44 | 42 |
| Largest contig | 1214624 | 1487482 | 2034167 |
| Total length | 4914251 | 4967382 | 4937523 |
| N50 | 672326 | 1340503 | 907105 |
| Misassemblies | | | |
| # misassemblies | 16 | 25 | 26 |
| Misassembled contigs length | 1470460 | 2166145 | 3149834 |
| Mismatches | | | |
| # mismatches per 100 kbp | 7.41 | 5.99 | 6.14 |
| # indels per 100 kbp | 1.47 | 0.71 | 2.76 |
| # N's per 100 kbp | 0 | 0.1 | 0.06 |
| Genome statistics | | | |
| Genome fraction (%) | 100 | 100 | 99.995 |
| Duplication ratio | 1.061 | 1.071 | 1.066 |
| # genes | 4492 + 5 part | 4495 + 2 part | 4495 + 2 part |
| NGA50 | 367128 | 1340503 | 562428 |


| Statistics without reference | 2_25X_asm.ctg | 3_25X_asm.ctg | 4_25X_asm.ctg |
|---|---|---|---|
| # contigs | 17 | 6 | 10 |
| Largest contig | 1214624 | 1487482 | 2034167 |
| Total length | 4691318 | 4671508 | 4678847 |
| N50 | 672326 | 1340503 | 904105 |
| Misassemblies | | | |
| # misassemblies | 8 | 7 | 10 |
| Misassembled contigs length | 1424020 | 2055756 | 3075921 |
| Mismatches | | | |
| # mismatches per 100 kbp | 7.13 | 6.05 | 6.53 |
| # indels per 100 kbp | 1.59 | 0.69 | 2.87 |
| # N's per 100 kbp | 0 | 0.06 | 0 |
| Genome statistics | | | |
| Genome fraction (%) | 99.997 | 99.824 | 99.961 |
| Duplication ratio | 1.011 | 1.009 | 1.009 |
| # genes | 4488 + 9 part | 4485 + 4 part | 4491 + 4 part |
| NGA50 | 367128 | 1340503 | 562428 |

Summary
- If you have high-coverage PacBio RS data (>50X), specify genomeSize when running pacBioToCA.
- If you have high-coverage PBcR, keep only the longest post-correction sequences, up to 25X, for assembly.
- Although we had high-coverage PacBio RS data (four SMRT cells, ~70X subreads) and followed the procedure (ref) to process the data, we did not obtain a complete genome assembly.