Read Depths

Revision as of 19 March 2014 01:29 by admin

As described in the paper Hybrid error correction, second-generation data can be used to correct PacBio reads, and the PacBio corrected reads (PBcR) can then be used for de novo assembly. Here, we discuss the effects of read depth on (1) hybrid error correction and (2) assembly.

Contents
1 Hybrid error correction
1.1 Short read depth
1.2 Long read depth
1.3 Performance
2 Assembly
3 Evaluation
4 Summary

Hybrid error correction

pacBioToCA -length 500 -partitions 200 -l PacBio_Illumia -s pacbio.spec

The pacbio.spec file was downloaded from PacBioToCA and modified to give pacbio.spec (download).

Short read depth

We used short reads at three depths (raw, 118X, and 100X) to correct the long reads.

Long read depth

We used subreads from one to four SMRT cells to obtain different long-read depths.

We arbitrarily chose the following SMRT cells:
Three single SMRT cells: m120208_071634, m120228_192221, m120228_210845
Two SMRT cells: m120228_210845 + m120208_122534
Three SMRT cells: m120228_115504 + m120228_152936 + m120228_100807
Four SMRT cells: m120228_171636 + m120228_223624 + m120228_100807 + m120228_190630

Performance

Name	m120208_071634	m120228_192221	m120228_210845	Two SMRT cells	Three SMRT cells	Four SMRT cells
	seqs amount:37077	seqs amount:38542	seqs amount:44794	seqs amount:77117	seqs amount:113284	seqs amount:136333
	seq avg len:2023.338161	seq avg len:2322.679985	seq avg len:2334.414140	seq avg len:2184.208709	seq avg len:2333.977711	seq avg len:2386.664674
	total:75.02 Mb	total:89.52 Mb	total:104.57 Mb	total:168.44 Mb	total:264.40 Mb	total:325.38 Mb
	depth: 16.13X	depth: 19.25X	depth: 22.49X	depth: 36.22X	depth: 56.86X	depth: 69.97X
Corrected by raw data
	seqs amount:26492	seqs amount:34981	seqs amount:40666	seqs amount:63760	seqs amount:98165	seqs amount:118901
	seq avg len:2352.489884	seq avg len:2133.783826	seq avg len:2124.597010	seq avg len:2199.845561	seq avg len:2286.482249	seq avg len:2320.548322
	total:62.32 Mb	total:74.64 Mb	total:86.40 Mb	total:140.26 Mb	total:224.45 Mb	total:275.92 Mb
	depth: 13.40X	depth: 16.05X	depth: 18.58X	depth: 30.16X	depth: 48.27X	depth: 59.34X
Corrected by 118X
	seqs amount:26666	seqs amount:35199	seqs amount:40811	seqs amount:64201	seqs amount:99285	seqs amount:120296
	seq avg len:2309.110290	seq avg len:2095.143186	seq avg len:2086.568670	seq avg len:2150.165184	seq avg len:2221.782394	seq avg len:2252.656963
	total:61.57 Mb	total:73.75 Mb	total:85.15 Mb	total:138.04 Mb	total:220.59 Mb	total:270.99 Mb
	depth: 13.24X	depth: 15.86X	depth: 18.31X	depth: 29.69X	depth: 47.44X	depth: 58.28X
Corrected by 100X
	seqs amount:25618			seqs amount:61415	seqs amount:95240	seqs amount:115080
	seq avg len:2315.355024			seq avg len:2165.060164	seq avg len:2247.193879	seq avg len:2283.976060
	total:59.31 Mb			total:132.97 Mb	total:214.02 Mb	total:262.84 Mb
	depth: 12.76X			depth: 28.60X	depth: 46.03X	depth: 56.52X
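
The "total" and "depth" rows of the table can be reproduced from the read count and the average read length. The sketch below is only an illustration: the genome size of 4.65 Mb is an assumption inferred from the ratio of total bases to depth in the table (the page does not state the value that was used), and the function name `read_depth` is hypothetical.

```python
# Assumed genome size in bp; inferred from total/depth in the table,
# not stated anywhere in the text.
ASSUMED_GENOME_SIZE_BP = 4_650_000


def read_depth(num_reads, avg_read_len, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Return (total bases in Mb, depth of coverage) for one read set."""
    total_bases = num_reads * avg_read_len
    return total_bases / 1e6, total_bases / genome_size


# First column, top block of the table (SMRT cell m120208_071634).
total_mb, depth = read_depth(37077, 2023.338161)
print(f"total: {total_mb:.2f} Mb, depth: {depth:.2f}X")
```

With these inputs the script prints a total of 75.02 Mb and a depth of 16.13X, matching the first column of the table.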

Assembly

After read correction, PBcR can be used to de novo assemble the genome using runCA or Mira3.

We assembled the genome with runCA using either all PBcR or the PBcR filtered to 25X with gatekeeper.

runCA -p asm -d asm -s asm.spec PacBio_Illumia.frg > asm.out 2>&1
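
The page does not say how gatekeeper selected the reads for the 25X subset. As a rough illustration of depth-based filtering, the sketch below keeps the longest reads until a target depth is reached. Both the longest-first rule and the function name `keep_longest_to_depth` are assumptions made for this example; this is not gatekeeper's implementation.

```python
def keep_longest_to_depth(read_lengths, genome_size, target_depth):
    """Keep the longest reads until their summed length reaches target_depth.

    Longest-first selection is an assumption for illustration only.
    """
    budget = genome_size * target_depth  # total bases needed
    kept, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept


# Toy example: a 1,000 bp "genome" filtered to 3X.
lengths = [500, 1200, 300, 900, 800, 400]
kept = keep_longest_to_depth(lengths, genome_size=1000, target_depth=3)
print(kept, sum(kept) / 1000)
```

Because whole reads are kept, the resulting depth can slightly exceed the target.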

Evaluation

We evaluated the assemblies with QUAST 2.2 (reference genome NC_000913 and Ec_gene_list).
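
For readers unfamiliar with the contiguity metrics in the tables below, here is a minimal sketch of N50: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly. Passing a reference length instead gives an NG50-style value. The function name `n50` and its `reference_length` parameter are hypothetical, and this is not QUAST's code. NGA50 additionally works on aligned blocks and needs the alignments to the reference, so it is not computed here.

```python
def n50(contig_lengths, reference_length=None):
    """Return N50, or an NG50-style value if reference_length is given."""
    base = reference_length if reference_length is not None else sum(contig_lengths)
    half = base / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # contigs never cover half of the reference


toy = [100, 80, 60, 40, 20]
print(n50(toy))                        # half of 300 is reached at the 80 bp contig
print(n50(toy, reference_length=400))  # half of 400 is reached at the 60 bp contig
```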

Single SMRT cell reads were corrected with raw, 100X and 118X short reads.

Statistics without reference	071634_raw_asm.ctg	192221_raw_asm.ctg	210845_raw_asm.ctg	071634_100X_asm.ctg	071634_118X_asm.ctg	192221_118X_asm.ctg	210845_118X_asm.ctg
# contigs	80	93	83	61	69	68	66
Largest contig	745120	664876	562203	663399	434084	345313	437164
Total length	4975695	5031560	5043217	4804004	4805579	4801310	4733683
N50	356974	221472	324225	295449	179662	207976	186993
Misassemblies
# misassemblies	11	17	21	10	13	20	15
Misassembled contigs length	1552524	976207	2108892	1222277	782726	1156917	527873
Mismatches
# mismatches per 100 kbp	3.32	2.91	3.06	7.08	6.4	10.33	4.13
# indels per 100 kbp	2.98	1.38	1.01	13.15	5.2	5.54	2.69
# N's per 100 kbp	0.38	0.12	0.22	0.4	0.37	0.4	0.23
Genome statistics
Genome fraction (%)	99.97	100	100	99.304	99.424	99.522	98.712
Duplication ratio	1.074	1.086	1.090	1.043	1.047	1.04	1.033
# genes	4489 + 7 part	4490 + 7 part	4495 + 2 part	4461 + 25 part	4451 + 31 part	4459 + 28 part	4412 + 32 part
NGA50	357183	221098	279423	226118	179662	194634	191457

We discarded contigs to which fewer than 100 reads aligned.

Statistics without reference	071634_raw_asm.ctg	192221_raw_asm.ctg	210845_raw_asm.ctg	071634_100X_asm.ctg	071634_118X_asm.ctg	192221_118X_asm.ctg	210845_118X_asm.ctg
# contigs	19	24	21	28	38	29	31
Largest contig	745120	664876	592203	663399	434084	345313	437164
Total length	4669108	4675696	4700617	4636263	4644391	4603072	4578972
N50	356974	222559	399011	295449	180706	207976	191458
Misassemblies
# misassemblies	7	6	11	6	5	7	6
Misassembled contigs length	1539749	936587	2058922	1200212	727024	1097466	478971
Mismatches
# mismatches per 100 kbp	2.75	2.75	3.04	7.08	5.85	8.82	3.69
# indels per 100 kbp	2.23	1.1	1.17	13.46	5.83	2.49	2.37
# N's per 100 kbp	0.19	0.02	0.04	0.26	0.15	0.07	0.02
Genome statistics
Genome fraction (%)	99.639	99.699	99.834	99.984	99.051	98.78	98.159
Duplication ratio	1.011	1.011	1.017	1.01	1.015	1.005	1.006
# genes	4473 + 15 part	4465 + 18 part	4480 + 10 part	4435 + 29 part	4431 + 36 part	4413 + 36 part	4380 + 34 part
NGA50	357183	221098	279423	226118	179662	194634	191457
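
The read-count filter applied before the second table of each pair can be sketched as follows. The page does not describe how the number of aligned reads per contig was obtained, so the example assumes those counts are already available as a dictionary. All names here (`filter_contigs`, `reads_per_contig`, the toy contig IDs) are hypothetical.

```python
MIN_READS = 100  # threshold stated in the text


def filter_contigs(contigs, reads_per_contig, min_reads=MIN_READS):
    """Keep only contigs supported by at least min_reads aligned reads."""
    return {
        name: seq
        for name, seq in contigs.items()
        if reads_per_contig.get(name, 0) >= min_reads
    }


# Toy data: contig name -> sequence, and contig name -> aligned read count.
contigs = {"ctg1": "ACGTACGT", "ctg2": "GGCC", "ctg3": "TTAA"}
reads_per_contig = {"ctg1": 2500, "ctg2": 99}  # ctg3 has no aligned reads
print(sorted(filter_contigs(contigs, reads_per_contig)))
```

In the toy data only `ctg1` survives, because `ctg2` falls one read short of the threshold and `ctg3` has no aligned reads at all.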

Reads from two SMRT cells were corrected with raw, 100X, and 118X short reads. The PBcR were then assembled by runCA, either directly or after filtering to 25X.

Statistics without reference	2_raw_asm.ctg	2_raw_25X_asm.ctg	2_100X_asm.ctg	2_100X_25X_asm.ctg	2_118X_asm.ctg	2_118X_25X_asm.ctg
# contigs	106	80	81	71	81	54
Largest contig	762045	757702	767781	405645	520095	570062
Total length	5168000	5080370	4918962	4799524	4832961	4725680
N50	419161	405539	331262	193986	186504	210927
Misassemblies
# misassemblies	13	18	15	9	16	16
Misassembled contigs length	1591856	1751860	1703983	165469	616747	1468075
Mismatches
# mismatches per 100 kbp	2.44	1.83	5.08	4.34	6.28	6.08
# indels per 100 kbp	0.88	0.82	6.34	2.14	5.61	2.93
# N's per 100 kbp	0.48	0.04	0.94	0.02	0.43	0.08
Genome statistics
Genome fraction (%)	100	100	99.652	98.76	99.567	99.194
Duplication ratio	1.116	1.098	1.065	1.048	1.047	1.028
# genes	4495 + 2 part	4495 + 2 part	4475 + 16 part	4432 + 31 part	4458 + 30 part	4434 + 43 part
NGA50	418393	405538	235822	193833	194196	199657

We discarded contigs to which fewer than 100 reads aligned.

Statistics without reference	2_raw_asm.ctg	2_raw_25X_asm.ctg	2_100X_asm.ctg	2_100X_25X_asm.ctg	2_118X_asm.ctg	2_118X_25X_asm.ctg
# contigs	16	17	22	33	35	32
Largest contig	762045	757702	767781	405645	520095	570062
Total length	4650035	4675233	4651814	4574523	4648591	4588060
N50	514903	405539	331262	193986	194625	223426
Misassemblies
# misassemblies	4	6	10	4	5	8
Misassembled contigs length	1564680	1697372	1683620	141677	569893	1424613
Mismatches
# mismatches per 100 kbp	2.43	2.23	4.94	1.8	5.56	5.97
# indels per 100 kbp	1.76	0.84	6.19	1.8	4.48	2.86
# N's per 100 kbp	0.06	0	0.26	0	0.13	0.04
Genome statistics
Genome fraction (%)	99.371	99.633	99.536	98.364	99.163	98.603
Duplication ratio	1.006	1.013	1.009	1.003	1.011	1.004
# genes	4458 + 13 part	4466 + 8 part	4462 + 22 part	4404 + 39 part	4438 + 32 part	4405 + 42 part
NGA50	418393	405538	235822	193833	194196	199657

Reads from three SMRT cells were corrected with raw, 100X, and 118X short reads. The PBcR were then assembled by runCA, either directly or after filtering to 25X.

Statistics without reference	3_raw_asm.ctg	3_raw_25X_asm.ctg	3_100X_asm.ctg	3_100X_25X_asm.ctg	3_118X_asm.ctg	3_118X_25X_asm.ctg
# contigs	219	74	98	32	86	39
Largest contig	771076	1426293	981874	822480	1091515	520962
Total length	5873961	5171438	5051244	4730819	4906749	4668968
N50	247798	317846	413464	600008	286035	218547
Misassemblies
# misassemblies	25	10	22	8	25	11
Misassembled contigs length	1361077	1372143	2201123	1855654	1800186	1350243
Mismatches
# mismatches per 100 kbp	4.03	2.5	2.030	1.5	4.71	5.45
# indels per 100 kbp	1.68	0.97	5.64	3.3	4.13	3.86
# N's per 100 kbp	0.34	0.140	0.46	0.11	0.18	0.02
Genome statistics
Genome fraction (%)	100	100	99.733	99.197	99.69	98.93
Duplication ratio	1.268	1.116	1.092	1.028	1.063	1.018
# genes	4494 + 3 part	4495 + 2 part	4484 + 9 part	4460 + 19 part	4468 + 20 part	4427 + 35 part
NGA50	286997	323732	348693	599239	286035	193471

We discarded contigs to which fewer than 100 reads aligned.

Statistics without reference	3_raw_asm.ctg	3_raw_25X_asm.ctg	3_100X_asm.ctg	3_100X_25X_asm.ctg	3_118X_asm.ctg	3_118X_25X_asm.ctg
# contigs	29	15	20	18	27	29
Largest contig	771076	1426293	981876	822480	1091515	520962
Total length	4672216	4666874	4656670	4613610	4650236	4593856
N50	316212	323732	413464	600008	286035	218547
Misassemblies
# misassemblies	6	4	10	6	7	7
Misassembled contigs length	1270144	1321332	2130162	1844194	1736227	1326538
Mismatches
# mismatches per 100 kbp	2.07	2.14	1.67	1.48	4.35	5.09
# indels per 100 kbp	0.98	0.65	5.54	3.31	2.88	3.76
# N's per 100 kbp	0.06	0	0.17	0.02	0.04	0
Genome statistics
Genome fraction (%)	99.037	99.636	99.615	99.087	99.478	98.614
Duplication ratio	1.017	1.01	1.008	1.004	1.009	1.005
# genes	4439 + 24 part	4467 + 11 part	4472 + 19 part	4454 + 22 part	4457 + 21 part	4410 + 35 part
NGA50	286997	323732	348693	599239	286035	193471

Reads from four SMRT cells were corrected with raw, 100X, and 118X short reads. The PBcR were then assembled by runCA, either directly or after filtering to 25X.

Statistics without reference	4_raw_asm.ctg	4_raw_25X_asm.ctg	4_100X_asm.ctg	4_100X_25X_asm.ctg	4_118X_asm.ctg	4_118X_25X_asm.ctg
# contigs	286	51	123	23	71	40
Largest contig	532128	1812746	688723	1257198	983533	621920
Total length	6162978	5045811	5144868	4693193	4862387	4665855
N50	147254	834736	398131	694380	412226	285200
Misassemblies
# misassemblies	24	13	26	8	31	13
Misassembled contigs length	800651	3633076	2341550	2708632	2412628	1302367
Mismatches
# mismatches per 100 kbp	3.41	2.240	5.060	1.93	4.45	7.3
# indels per 100 kbp	1.36	0.97	5.21	3.82	4.08	5.71
# N's per 100 kbp	1.2	0.16	0.39	0.04	0.41	0
Genome statistics
Genome fraction (%)	100	100	99.74	99.337	99.798	98.559
Duplication ratio	1.331	1.089	1.112	1.019	1.052	1.022
# genes	4495 + 2 part	4494 + 3 part	4482 + 10 part	4470 + 9 part	4481 + 12 part	4413 + 36 part
NGA50	182232	476726	317322	694380	286770	214828

We discarded contigs to which fewer than 100 reads aligned.

Statistics without reference	4_raw_asm.ctg	4_raw_25X_asm.ctg	4_100X_asm.ctg	4_100X_25X_asm.ctg	4_118X_asm.ctg	4_118X_25X_asm.ctg
# contigs	40	12	21	12	21	26
Largest contig	532128	1812746	688723	1257198	983533	621920
Total length	4726973	4659487	4656544	4602600	4654299	4508804
N50	180844	834736	398131	1071366	412226	285200
Misassemblies
# misassemblies	7	8	8	6	15	9
Misassembled contigs length	736689	3595196	2252884	2698687	2362706	1274915
Mismatches
# mismatches per 100 kbp	2.35	2.28	2.53	2	4.41	6.26
# indels per 100 kbp	0.87	0.75	3.94	3.81	3.31	4.49
# N's per 100 kbp	0.17	0.13	0.02	0.04	0.04	0
Genome statistics
Genome fraction (%)	99.193	99.215	99.62	98.967	99.687	97.023
Duplication ratio	1.028	1.012	1.008	1.003	1.009	1.003
# genes	4443 + 26 part	4456 + 12 part	4474 + 17 part	4452 + 11 part	4474 + 15 part	4344 + 37 part
NGA50	182232	476726	317322	694380	286770	247828

Summary

Short-read depth has only a minor influence on read correction, but correcting with high-quality short reads may improve the assembly (fewer contigs and no overestimation of genome size).
After read correction, filtering the PBcR with gatekeeper can yield a good assembly.
Increasing long-read depth (from one SMRT cell to four) does not always guarantee an improved assembly.