PacBioToCA

When running pacBioToCA, we found that the amount of corrected reads (PBcR) it produced depended on the genomeSize parameter, especially when the coverage of the PacBio RS data was high (>50X).

Short reads: 118X; long reads: one to four SMRT cells; pacBioToCA run with and without genomeSize.

Name	m120228_192221	m120228_210845	Two SMRT cells	Three SMRT cells	Four SMRT cells
Filtered_subreads	seqs amount:38542	seqs amount:44794	seqs amount:77117	seqs amount:113284	seqs amount:136333
	seq avg len:2322.679985	seq avg len:2334.414140	seq avg len:2184.208709	seq avg len:2333.977711	seq avg len:2386.664674
	total:89.52 Mb	total:104.57 Mb	total:168.44 Mb	total:264.40 Mb	total:325.38 Mb
	depth: 19.25X	depth: 22.49X	depth: 36.22X	depth: 56.86X	depth: 69.97X
PBcR, without genomeSize	seqs amount:35199	seqs amount:40811	seqs amount:64201	seqs amount:99285	seqs amount:120296
	seq avg len:2095.143186	seq avg len:2086.568670	seq avg len:2150.165184	seq avg len:2221.782394	seq avg len:2252.656963
	total:73.75 Mb	total:85.15 Mb	total:138.04 Mb	total:220.59 Mb	total:270.99 Mb
	depth: 15.86X	depth: 18.31X	depth: 29.69X	depth: 47.44X	depth: 58.28X
PBcR, genomeSize=4650000	seqs amount:34852	seqs amount:40486	seqs amount:63411	seqs amount:70468	seqs amount:56298
	seq avg len:2130.841559	seq avg len:2120.237712	seq avg len:2198.455315	seq avg len:2815.903020	seq avg len:3495.604515
	total:74.26 Mb	total:85.84 Mb	total:139.41 Mb	total:198.43 Mb	total:196.80 Mb
	depth: 15.97X	depth: 18.46X	depth: 29.98X	depth: 42.67X	depth: 42.32X
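
The depth values in the table are consistent with dividing the total sequence by a genome size of 4.65 Mb. A minimal sketch of that arithmetic follows. It assumes that 1 Mb means 10^6 bases and that the same genome size (the genomeSize value above) was used for every row; the function name `depth` is only illustrative.

```python
GENOME_SIZE_BP = 4_650_000  # genomeSize value used on this page


def depth(total_mb, genome_size_bp=GENOME_SIZE_BP):
    """Return fold coverage for total_mb megabases of sequence.

    Assumption: 1 Mb = 1,000,000 bases.
    """
    return total_mb * 1_000_000 / genome_size_bp


# Totals (Mb) of the filtered subreads for one to four SMRT cells, from the table.
for total in (89.52, 104.57, 168.44, 264.40, 325.38):
    print(f"{total:.2f} Mb -> {depth(total):.2f}X")
```

The same function applied to the PBcR totals gives the depths listed in the other rows.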

Here, we adjusted Celera Assembler parameters to make overlap detection more stringent (ref).

runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg

Set <ovl value> to approximately 40% of the average corrected sequence length. As a general rule: if the average corrected length is less than 2.5 kbp, set it to 1000; if it is less than 3 kbp, set it to 1500; if it is less than 5.5 kbp, set it to 2000; if it is between 5.5 kbp and 6.5 kbp, set it to 2500; and if it is greater than 6.5 kbp, set it to 3000.
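
The rule of thumb above can be sketched as a small lookup. This is only an illustration: the function names `choose_ovl_min_len` and `runca_command` are hypothetical, and the handling of averages of exactly 5.5 kbp or 6.5 kbp (both mapped to 2500 here) is an assumption, since the rule does not say which side those boundaries fall on.

```python
def choose_ovl_min_len(avg_corrected_len_bp):
    """Map the average corrected read length (bp) to an ovlMinLen value."""
    if avg_corrected_len_bp < 2500:
        return 1000
    if avg_corrected_len_bp < 3000:
        return 1500
    if avg_corrected_len_bp < 5500:
        return 2000
    if avg_corrected_len_bp <= 6500:  # assumption: boundaries map to 2500
        return 2500
    return 3000


def runca_command(avg_corrected_len_bp):
    """Fill the chosen value into the runCA command line quoted above."""
    ovl = choose_ovl_min_len(avg_corrected_len_bp)
    return (
        "runCA unitigger=bogart merSize=14 "
        f"ovlMinLen={ovl} utgErrorRate=0.015 utgGraphErrorRate=0.015 "
        "utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 "
        "-p asm -d asm viaMiseq.frg"
    )


# Example: an average corrected length of 2815.9 bp falls in the 1500 bracket.
print(runca_command(2815.9))
```

Note that the brackets only approximate the 40% figure, so the lookup and a direct 0.4 * average calculation will not always agree.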

Evaluation

We evaluated the assemblies with QUAST 2.3 (reference genome NC_000913 and Ec_gene_list).

Statistics without reference	192221_asm.ctg	210845_asm.ctg	2_asm.ctg	3_asm.ctg	4_asm.ctg
# contigs	57	68	82	95	79
Largest contig	500519	1040738	671836	803212	1099298
Total length	4826341	4903100	5063756	5265232	5225872
N50	245130	484760	462620	508833	544950
Misassemblies
# misassemblies	19	26	21	33	36
Misassembled contigs length	987578	2460246	2026388	1623080	2824659
Mismatches
# mismatches per 100 kbp	9.61	9.4	9.120	10.17	7.09
# indels per 100 kbp	2.91	2.5	2.89	1.23	0.65
# N's per 100 kbp	0.21	0.12	0.08	0.15	0.1
Genome statistics
Genome fraction (%)	99.989	99.972	100	100	100
Duplication ratio	1.042	1.06	1.094	1.138	1.127
# genes	4486 + 11 part	4492 + 5 part	4494 + 3 part	4495 + 2 part	4495 + 2 part
NGA50	187140	484760	391908	506966	526088

We discarded contigs to which fewer than 100 reads aligned.
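
A minimal sketch of this filtering step, assuming the number of aligned reads per contig is already available as a name-to-count mapping. How those counts were obtained is not described here, and the contig names and counts in the example are made up for illustration.

```python
from typing import Dict, List

MIN_READS = 100  # contigs with fewer aligned reads than this are discarded


def keep_contigs(read_counts: Dict[str, int], min_reads: int = MIN_READS) -> List[str]:
    """Return the names of contigs with at least min_reads aligned reads."""
    return [name for name, count in read_counts.items() if count >= min_reads]


# Hypothetical per-contig read counts, not taken from the assemblies above.
example_counts = {"ctg_1": 15230, "ctg_2": 99, "ctg_3": 100, "ctg_4": 7}
print(keep_contigs(example_counts))
```

A contig with exactly 100 aligned reads is kept, because only contigs with fewer than 100 are discarded.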

Statistics without reference	192221_asm.ctg	210845_asm.ctg	2_asm.ctg	3_asm.ctg	4_asm.ctg
# contigs	26	23	18	17	14
Largest contig	500519	1040768	671836	803212	1099298
Total length	4663672	4668811	4688223	4693002	4703352
N50	245130	484760	462620	508833	544950
Misassemblies
# misassemblies	8	10	13	9	11
Misassembled contigs length	708152	2407429	1389781	1480741	2433888
Mismatches
# mismatches per 100 kbp	8.76	8.5	8.99	9.48	7.01
# indels per 100 kbp	2.57	1.64	2.03	0.97	0.67
# N's per 100 kbp	0.09	0.04	0.02	0.02	0
Genome statistics
Genome fraction (%)	99.895	99.596	99.746	99.548	99.931
Duplication ratio	1.006	1.011	1.014	1.016	1.014
# genes	4478 + 16 part	4474 + 12 part	4477 + 9 part	4470 + 8 part	4489 + 6 part
NGA50	187140	407626	275303	477000	526088

The PBcR were reduced to 25X by keeping the longest corrected sequences, and were then assembled with runCA.

Name	Two SMRT cells	Three SMRT cells	Four SMRT cells
genomeSize=4650000, 25X	seqs amount:40382	seqs amount:24448	seqs amount:21641
	seq avg len:2878.787529	seq avg len:4754.996196	seq avg len:5371.762719
	total:116.25 Mb	total:116.25 Mb	total:116.25 Mb
	depth: 25.00X	depth: 25.00X	depth: 25.00X
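
One way to picture the 25X filtering is to sort the corrected reads by length and keep the longest ones until the target depth is reached. The sketch below is an illustration under that assumption, not the procedure actually used; the function name `select_longest` and the toy reads are hypothetical. Unlike the table, where the totals are exactly 25.00X, this version can overshoot the target by part of one read.

```python
from typing import List, Tuple


def select_longest(
    reads: List[Tuple[str, int]],
    genome_size_bp: int,
    target_depth: float = 25.0,
) -> List[Tuple[str, int]]:
    """Keep the longest (name, length) reads until target_depth is reached."""
    budget = genome_size_bp * target_depth  # total bases wanted
    selected = []
    total = 0
    for name, length in sorted(reads, key=lambda r: r[1], reverse=True):
        if total >= budget:
            break
        selected.append((name, length))
        total += length
    return selected


# Toy example: a 100 bp "genome" and a target depth of 8X (800 bases).
toy_reads = [("r1", 300), ("r2", 500), ("r3", 100), ("r4", 200)]
print(select_longest(toy_reads, genome_size_bp=100, target_depth=8))
```

For the data on this page the call would use genome_size_bp=4650000 and the default target depth of 25.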

Statistics without reference	2_25X_asm.ctg	3_25X_asm.ctg	4_25X_asm.ctg
# contigs	50	44	42
Largest contig	1214624	1487482	2034167
Total length	4914251	4967382	4937523
N50	672326	1340503	907105
Misassemblies
# misassemblies	16	25	26
Misassembled contigs length	1470460	2166145	3149834
Mismatches
# mismatches per 100 kbp	7.41	5.99	6.14
# indels per 100 kbp	1.47	0.71	2.76
# N's per 100 kbp	0	0.1	0.06
Genome statistics
Genome fraction (%)	100	100	99.995
Duplication ratio	1.061	1.071	1.066
# genes	4492 + 5 part	4495 + 2 part	4495 + 2 part
NGA50	367128	1340503	562428

We discarded contigs to which fewer than 100 reads aligned.

Statistics without reference	2_25X_asm.ctg	3_25X_asm.ctg	4_25X_asm.ctg
# contigs	17	6	10
Largest contig	1214624	1487482	2034167
Total length	4691318	4671508	4678847
N50	672326	1340503	907105
Misassemblies
# misassemblies	10	6	10
Misassembled contigs length	1424020	2055756	3075921
Mismatches
# mismatches per 100 kbp	7.13	6.05	6.53
# indels per 100 kbp	1.55	0.67	1.27
# N's per 100 kbp	0	0.06	0
Genome statistics
Genome fraction (%)	99.997	99.824	99.961
Duplication ratio	1.011	1.009	1.01
# genes	4488 + 9 part	4485 + 4 part	4491 + 4 part
NGA50	315240	1008667	572351

Summary

If you have high-coverage PacBio RS data (>50X), specify genomeSize when running pacBioToCA.
If you have high-coverage PBcR, keep only the longest 25X of the corrected sequences for assembly.
Although we had high-coverage PacBio RS data (four SMRT cells, ~70X subreads) and followed the procedure (ref) to process the data, we did not obtain a complete genome assembly.