RunCA

We re-ran correction and assembly with the data provided by the PBcR closure project.

We corrected the long-read sequence data (200X) with Illumina short reads (100X), once without and once with the genome size specified.

pacBioToCA -l viaMiseq -s pacbio.spec  -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 miseq.100X.frg.bz2

pacBioToCA -l viaMiseq -s pacbio.spec  -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 genomeSize=4650000 miseq.100X.frg.bz2

Metric	200X filtered long reads	Without genomeSize	genomeSize=4650000
Number of sequences	383482	332880	37879
Average length (bp)	2422.877720	2260.68262	4927.683492
Total (Mb)	929.13	752.54	186.66
Depth (X)	199.81	161.84	40.14
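
The totals and depths in the table can be cross-checked from the sequence counts and average lengths. The following is a minimal sketch, assuming that depth was computed as total bases divided by a 4,650,000 bp genome (the genomeSize value used above); that assumption reproduces the table figures, but the original calculation is not shown here. The function names are illustrative only.

```python
# Assumed genome size in bp (the genomeSize value passed to pacBioToCA).
GENOME_SIZE = 4_650_000

# (number of sequences, average length in bp), copied from the table.
READ_SETS = {
    "200X filtered long reads": (383482, 2422.877720),
    "Without genomeSize": (332880, 2260.68262),
    "genomeSize=4650000": (37879, 4927.683492),
}


def total_bases(n_seqs, avg_len):
    """Total number of bases in a read set."""
    return n_seqs * avg_len


def depth(n_seqs, avg_len, genome_size=GENOME_SIZE):
    """Sequencing depth: total bases divided by the assumed genome size."""
    return total_bases(n_seqs, avg_len) / genome_size


if __name__ == "__main__":
    for name, (n_seqs, avg_len) in READ_SETS.items():
        mb = total_bases(n_seqs, avg_len) / 1e6
        print(f"{name}: total {mb:.2f} Mb, depth {depth(n_seqs, avg_len):.2f}X")
```

Specifying genomeSize reduces the corrected output from roughly 162X to roughly 40X while the average read length more than doubles.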

For assembly, we used either all PBcR reads or a 25X subset, and we ran Celera Assembler with two different parameter sets, as described in ref.

runCA1: runCA -p asm -d asm -s asm.spec viaMiseq.frg

runCA2: runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg

This gave eight assemblies for comparison (2 correction settings x 2 read subsets x 2 runCA parameter sets).

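PacbioToCA	Gatekeeper	RunCA	Name of assembly
without genomeSize	all PBcR	runCA1	wo_all_runCA1
without genomeSize	all PBcR	runCA2	wo_all_runCA2
without genomeSize	25X PBcR	runCA1	wo_25X_runCA1
without genomeSize	25X PBcR	runCA2	wo_25X_runCA2
genomeSize=4650000	all PBcR	runCA1	w_all_runCA1
genomeSize=4650000	all PBcR	runCA2	w_all_runCA2
genomeSize=4650000	25X PBcR	runCA1	w_25X_runCA1
genomeSize=4650000	25X PBcR	runCA2	w_25X_runCA2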
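
The naming scheme is the Cartesian product of the three choices. As a small sketch (the list and function names are ours, introduced only for illustration), the eight assembly names can be generated in the same order as the table:

```python
from itertools import product

# Short labels used in the assembly names.
CORRECTION = ["wo", "w"]        # without / with genomeSize in pacBioToCA
READ_SUBSET = ["all", "25X"]    # all PBcR reads or the 25X subset
RUNCA = ["runCA1", "runCA2"]    # the two Celera Assembler parameter sets


def assembly_names():
    """Return every combination as '<correction>_<subset>_<runCA>'."""
    return ["_".join(combo) for combo in product(CORRECTION, READ_SUBSET, RUNCA)]


if __name__ == "__main__":
    for name in assembly_names():
        print(name)
```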

Evaluation

We evaluated the assemblies with QUAST 2.3 (reference genome NC_000913 and Ec_gene_list).

Statistics without reference	wo_all_runCA1	wo_all_runCA2	wo_25X_runCA1	wo_25X_runCA2	w_all_runCA1	w_all_runCA2	w_25X_runCA1	w_25X_runCA2
# contigs	248	302	15	36	36	76	17	37
Largest contig	762617	578194	1341765	694845	2069213	1204169	2021284	1754178
Total length	5539907	6343154	4656826	4814122	4943194	5322102	4800259	5002926
N50	371897	142793	1155479	460790	1478089	496729	1215597	1253429
Misassemblies
# misassemblies	36	38	8	14	12	17	11	15
Misassembled contigs length	1443585	1419385	2506591	1302083	3703087	1183356	2022749	2142054
Mismatches
# mismatches per 100 kbp	2.81	1.150	0.7	0.76	3.66	3.88	4.310	4.33
# indels per 100 kbp	12.97	8.02	2.19	2.54	0.52	0.86	1.06	0.65
# N's per 100 kbp	1.32	0.38	0	0	0	0.15	0.02	0.2
Genome statistics
Genome fraction (%)	99.858	99.651	99.189	99.196	100	100	100	100
Duplication ratio	1.195	1.373	1.012	1.046	1.066	1.148	1.035	1.079
# genes	4484 + 10 part	4481 + 12 part	4466 + 11 part	4466 + 11 part	4494 + 3 part	4494 + 3 part	4494 + 3 part	4493 + 4 part
NGA50	411646	230980	694825	366166	670145	677677	877912	834940
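
For readers unfamiliar with the N50 row: N50 is the contig length at which contigs of that length or longer contain at least half of the total assembly length. The sketch below implements that standard definition on hypothetical contig lengths; it is not QUAST's own code. NGA50 is a related metric based on aligned blocks and the reference length, so it is not reproduced here.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, not taken from any assembly above.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))
```

A higher N50 alone does not mean a better assembly, which is why the misassembly counts and NGA50 are reported alongside it.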

We then discarded contigs to which fewer than 100 reads aligned and evaluated the remaining contigs again (more detail).

Statistics without reference	wo_all_runCA1	wo_all_runCA2	wo_25X_runCA1	wo_25X_runCA2	w_all_runCA1	w_all_runCA2	w_25X_runCA1	w_25X_runCA2
# contigs	34	56	11	19	7	16	5	7
Largest contig	762617	578194	1341765	694845	2069213	1204169	2021284	1754178
Total length	4798555	4933914	4612517	4629273	4635310	4744173	4666475	4690899
N50	412414	318728	1155479	498347	1478089	678445	1215597	1253429
Misassemblies
# misassemblies	9	8	8	12	10	8	8	8
Misassembled contigs length	1287728	1257103	2497244	1278777	3693737	1111214	1994551	2084097
Mismatches
# mismatches per 100 kbp	1.49	1.06	0.72	0.81	3.38	3.9	4.29	4.18
# indels per 100 kbp	11.04	7.93	2.17	2.35	0.39	0.86	0.5	0.52
# N's per 100 kbp	0.15	0.04	0	0	0	0.08	0.02	0
Genome statistics
Genome fraction (%)	99.769	99.515	99.189	98.957	99.561	99.985	99.944	100
Duplication ratio	1.037	1.07	1.002	1.008	1.003	1.023	1.006	1.011
# genes	4479 + 13 part	4469 + 14 part	4465 + 12 part	4454 + 15 part	4467 + 12 part	4491 + 6 part	4487 + 6 part	4492 + 5 part
NGA50	368011	230980	694825	340107	1392456	496729	768685	834919
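
The read-support filter used for this second table can be sketched as follows. This is only an illustration of the threshold rule: the contig names and read counts are hypothetical, and how the per-contig read counts were obtained is not covered here.

```python
# Contigs with fewer aligned reads than this are discarded.
MIN_READS = 100


def keep_supported_contigs(read_counts, min_reads=MIN_READS):
    """Keep only contigs with at least `min_reads` aligned reads."""
    return {name: n for name, n in read_counts.items() if n >= min_reads}


if __name__ == "__main__":
    # Hypothetical contig names and aligned-read counts.
    counts = {"ctg1": 5230, "ctg2": 100, "ctg3": 99, "ctg4": 12}
    print(sorted(keep_supported_contigs(counts)))
```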