Read Depths

Revision as of 19 March 2014 01:29 by admin

As described in the hybrid error correction paper, second-generation (short-read) data can be used to correct PacBio reads, and the PacBio corrected reads (PBcR) can then be assembled de novo. Here, we discuss the effects of read depth on (1) hybrid error correction and (2) assembly.

Hybrid error correction

pacBioToCA -length 500 -partitions 200 -l PacBio_Illumia -s pacbio.spec

The pacbio.spec file was downloaded from PacBioToCA and modified to give the pacbio.spec used here (download).

Short read depth

We have used three short read depths (Raw, 118X and 100X) to correct long reads.

Long read depth

We used subreads from 1-4 SMRT cells to obtain different long-read depths.

We arbitrarily chose 1-4 SMRT cells:
Three single SMRT cells: m120208_071634, m120228_192221, m120228_210845
Two SMRT cells: m120228_210845 + m120208_122534
Three SMRT cells: m120228_115504 + m120228_152936 + m120228_100807
Four SMRT cells: m120228_171636 + m120228_223624 + m120228_100807 + m120228_190630

Performance

Subreads (before correction)
Name               Seqs amount   Seq avg len (bp)   Total (Mb)   Depth
m120208_071634     37077         2023.338161        75.02        16.13X
m120228_192221     38542         2322.679985        89.52        19.25X
m120228_210845     44794         2334.414140        104.57       22.49X
Two SMRT cells     77117         2184.208709        168.44       36.22X
Three SMRT cells   113284        2333.977711        264.40       56.86X
Four SMRT cells    136333        2386.664674        325.38       69.97X

Corrected by raw data
Name               Seqs amount   Seq avg len (bp)   Total (Mb)   Depth
m120208_071634     26492         2352.489884        62.32        13.40X
m120228_192221     34981         2133.783826        74.64        16.05X
m120228_210845     40666         2124.597010        86.40        18.58X
Two SMRT cells     63760         2199.845561        140.26       30.16X
Three SMRT cells   98165         2286.482249        224.45       48.27X
Four SMRT cells    118901        2320.548322        275.92       59.34X

Corrected by 118X
Name               Seqs amount   Seq avg len (bp)   Total (Mb)   Depth
m120208_071634     26666         2309.110290        61.57        13.24X
m120228_192221     35199         2095.143186        73.75        15.86X
m120228_210845     40811         2086.568670        85.15        18.31X
Two SMRT cells     64201         2150.165184        138.04       29.69X
Three SMRT cells   99285         2221.782394        220.59       47.44X
Four SMRT cells    120296        2252.656963        270.99       58.28X

Corrected by 100X (no values were reported for m120228_192221 and m120228_210845)
Name               Seqs amount   Seq avg len (bp)   Total (Mb)   Depth
m120208_071634     25618         2315.355024        59.31        12.76X
Two SMRT cells     61415         2165.060164        132.97       28.60X
Three SMRT cells   95240         2247.193879        214.02       46.03X
Four SMRT cells    115080        2283.976060        262.84       56.52X
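
The totals and depths in the table above follow from the read counts and average lengths. The sketch below reproduces them for one SMRT cell, assuming a genome size of about 4.65 Mb. That value is inferred from the total/depth ratio in the table (for example 75.02 Mb / 16.13X) and is not stated on this page; the function names are illustrative only.

```python
# Assumed genome size (~4.65 Mb), inferred from total/depth in the table;
# it is not given explicitly on this page.
GENOME_SIZE_BP = 4.65e6


def total_mb(n_reads, avg_len_bp):
    """Total bases in megabases: read count times mean read length."""
    return n_reads * avg_len_bp / 1e6


def depth(n_reads, avg_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth of coverage: total bases divided by genome size."""
    return n_reads * avg_len_bp / genome_size_bp


# m120208_071634: subreads before correction vs. PBcR corrected by raw short reads
raw_total = total_mb(37077, 2023.338161)
pbcr_total = total_mb(26492, 2352.489884)
raw_depth = depth(37077, 2023.338161)
pbcr_depth = depth(26492, 2352.489884)

# Fraction of bases that survive correction
retained = pbcr_total / raw_total

print(f"before: {raw_total:.2f} Mb, {raw_depth:.2f}X")
print(f"after:  {pbcr_total:.2f} Mb, {pbcr_depth:.2f}X, retained {retained:.1%}")
```

With these inputs, roughly 83% of the bases from this SMRT cell remain after correction with the raw short reads.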

Assembly

After read correction, PBcR can be used to de novo assemble the genome using runCA or Mira3.

We assembled the genome with runCA using either all PBcR or the PBcR filtered to 25X with gatekeeper.

runCA -p asm -d asm -s asm.spec PacBio_Illumia.frg > asm.out 2>&1
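
This page does not state how gatekeeper was invoked or which reads it kept when filtering to 25X. One common way to reduce a read set to a target depth is to keep the longest reads until the depth is reached. The sketch below illustrates only that idea; it is not gatekeeper's implementation, and the function name and toy data are hypothetical.

```python
def keep_longest_to_depth(read_lengths, genome_size_bp, target_depth):
    """Return the lengths of the longest reads whose combined bases first
    reach target_depth * genome_size_bp (all reads if they never reach it)."""
    budget = target_depth * genome_size_bp
    kept, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept


# Toy example: a 1,000 bp "genome" filtered to 3X
toy_reads = [900, 800, 700, 600, 500, 400, 300]
kept = keep_longest_to_depth(toy_reads, genome_size_bp=1000, target_depth=3)
print(kept, sum(kept))
```

For a 25X target on a genome of about 4.65 Mb (an assumed size), the budget would be about 116 Mb of corrected reads.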

Evaluation

We evaluated the assemblies with QUAST 2.2 (reference genome NC_000913 and gene list Ec_gene_list).

Single SMRT cell reads were corrected with raw, 100X and 118X short reads.

Assembly 071634_raw_asm.ctg 192221_raw_asm.ctg 210845_raw_asm.ctg 071634_100X_asm.ctg 071634_118X_asm.ctg 192221_118X_asm.ctg 210845_118X_asm.ctg
Statistics without reference
# contigs 80 93 83 61 69 68 66
Largest contig 745120 664876 562203 663399 434084 345313 437164
Total length 4975695 5031560 5043217 4804004 4805579 4801310 4733683
N50 356974 221472 324225 295449 179662 207976 186993
Misassemblies
# misassemblies 11 17 21 10 13 20 15
Misassembled contigs length 1552524 976207 2108892 1222277 782726 1156917 527873
Mismatches
# mismatches per 100 kbp 3.32 2.91 3.06 7.08 6.4 10.33 4.13
# indels per 100 kbp 2.98 1.38 1.01 13.15 5.2 5.54 2.69
# N's per 100 kbp 0.38 0.12 0.22 0.4 0.37 0.4 0.23
Genome statistics
Genome fraction (%) 99.97 100 100 99.304 99.424 99.522 98.712
Duplication ratio 1.074 1.086 1.090 1.043 1.047 1.04 1.033
# genes 4489 + 7 part 4490 + 7 part 4495 + 2 part 4461 + 25 part 4451 + 31 part 4459 + 28 part 4412 + 32 part
NGA50 357183 221098 279423 226118 179662 194634 191457
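
For readers unfamiliar with the N50 row: it is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. The sketch below is a minimal illustration with toy lengths, not QUAST's code. Passing a reference length gives the NG50 variant; NGA50 additionally requires the aligned blocks that QUAST obtains by breaking contigs at misassemblies, which is not attempted here.

```python
def n50(contig_lengths, reference_length=None):
    """Length L such that contigs of length >= L cover at least half of the
    total assembly length (N50) or, if reference_length is given, half of the
    reference length (NG50). Returns None if the threshold is never reached."""
    threshold = (reference_length or sum(contig_lengths)) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


toy_contigs = [80, 70, 50, 40, 30, 20]          # total 290, half 145
print(n50(toy_contigs))                         # 80 + 70 = 150 >= 145
print(n50(toy_contigs, reference_length=400))   # 80 + 70 + 50 = 200 >= 200
```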

We then discarded the contigs to which fewer than 100 reads aligned.

Assembly 071634_raw_asm.ctg 192221_raw_asm.ctg 210845_raw_asm.ctg 071634_100X_asm.ctg 071634_118X_asm.ctg 192221_118X_asm.ctg 210845_118X_asm.ctg
Statistics without reference
# contigs 19 24 21 28 38 29 31
Largest contig 745120 664876 592203 663399 434084 345313 437164
Total length 4669108 4675696 4700617 4636263 4644391 4603072 4578972
N50 356974 222559 399011 295449 180706 207976 191458
Misassemblies
# misassemblies 7 6 11 6 5 7 6
Misassembled contigs length 1539749 936587 2058922 1200212 727024 1097466 478971
Mismatches
# mismatches per 100 kbp 2.75 2.75 3.04 7.08 5.85 8.82 3.69
# indels per 100 kbp 2.23 1.1 1.17 13.46 5.83 2.49 2.37
# N's per 100 kbp 0.19 0.02 0.04 0.26 0.15 0.07 0.02
Genome statistics
Genome fraction (%) 99.639 99.699 99.834 99.984 99.051 98.78 98.159
Duplication ratio 1.011 1.011 1.017 1.01 1.015 1.005 1.006
# genes 4473 + 15 part 4465 + 18 part 4480 + 10 part 4435 + 29 part 4431 + 36 part 4413 + 36 part 4380 + 34 part
NGA50 357183 221098 279423 226118 179662 194634 191457

Two SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.

Assembly 2_raw_asm.ctg 2_raw_25X_asm.ctg 2_100X_asm.ctg 2_100X_25X_asm.ctg 2_118X_asm.ctg 2_118X_25X_asm.ctg
Statistics without reference
# contigs 106 80 81 71 81 54
Largest contig 762045 757702 767781 405645 520095 570062
Total length 5168000 5080370 4918962 4799524 4832961 4725680
N50 419161 405539 331262 193986 186504 210927
Misassemblies
# misassemblies 13 18 15 9 16 16
Misassembled contigs length 1591856 1751860 1703983 165469 616747 1468075
Mismatches
# mismatches per 100 kbp 2.44 1.83 5.08 4.34 6.28 6.08
# indels per 100 kbp 0.88 0.82 6.34 2.14 5.61 2.93
# N's per 100 kbp 0.48 0.04 0.94 0.02 0.43 0.08
Genome statistics
Genome fraction (%) 100 100 99.652 98.76 99.567 99.194
Duplication ratio 1.116 1.098 1.065 1.048 1.047 1.028
# genes 4495 + 2 part 4495 + 2 part 4475 + 16 part 4432 + 31 part 4458 + 30 part 4434 + 43 part
NGA50 418393 405538 235822 193833 194196 199657

We then discarded the contigs to which fewer than 100 reads aligned.

Assembly 2_raw_asm.ctg 2_raw_25X_asm.ctg 2_100X_asm.ctg 2_100X_25X_asm.ctg 2_118X_asm.ctg 2_118X_25X_asm.ctg
Statistics without reference
# contigs 16 17 22 33 35 32
Largest contig 762045 757702 767781 405645 520095 570062
Total length 4650035 4675233 4651814 4574523 4648591 4588060
N50 514903 405539 331262 193986 194625 223426
Misassemblies
# misassemblies 4 6 10 4 5 8
Misassembled contigs length 1564680 1697372 1683620 141677 569893 1424613
Mismatches
# mismatches per 100 kbp 2.43 2.23 4.94 1.8 5.56 5.97
# indels per 100 kbp 1.76 0.84 6.19 1.8 4.48 2.86
# N's per 100 kbp 0.06 0 0.26 0 0.13 0.04
Genome statistics
Genome fraction (%) 99.371 99.633 99.536 98.364 99.163 98.603
Duplication ratio 1.006 1.013 1.009 1.003 1.011 1.004
# genes 4458 + 13 part 4466 + 8 part 4462 + 22 part 4404 + 39 part 4438 + 32 part 4405 + 42 part
NGA50 418393 405538 235822 193833 194196 199657

Three SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.

Assembly 3_raw_asm.ctg 3_raw_25X_asm.ctg 3_100X_asm.ctg 3_100X_25X_asm.ctg 3_118X_asm.ctg 3_118X_25X_asm.ctg
Statistics without reference
# contigs 219 74 98 32 86 39
Largest contig 771076 1426293 981874 822480 1091515 520962
Total length 5873961 5171438 5051244 4730819 4906749 4668968
N50 247798 317846 413464 600008 286035 218547
Misassemblies
# misassemblies 25 10 22 8 25 11
Misassembled contigs length 1361077 1372143 2201123 1855654 1800186 1350243
Mismatches
# mismatches per 100 kbp 4.03 2.5 2.030 1.5 4.71 5.45
# indels per 100 kbp 1.68 0.97 5.64 3.3 4.13 3.86
# N's per 100 kbp 0.34 0.140 0.46 0.11 0.18 0.02
Genome statistics
Genome fraction (%) 100 100 99.733 99.197 99.69 98.93
Duplication ratio 1.268 1.116 1.092 1.028 1.063 1.018
# genes 4494 + 3 part 4495 + 2 part 4484 + 9 part 4460 + 19 part 4468 + 20 part 4427 + 35 part
NGA50 286997 323732 348693 599239 286035 193471

We then discarded the contigs to which fewer than 100 reads aligned.

Assembly 3_raw_asm.ctg 3_raw_25X_asm.ctg 3_100X_asm.ctg 3_100X_25X_asm.ctg 3_118X_asm.ctg 3_118X_25X_asm.ctg
Statistics without reference
# contigs 29 15 20 18 27 29
Largest contig 771076 1426293 981876 822480 1091515 520962
Total length 4672216 4666874 4656670 4613610 4650236 4593856
N50 316212 323732 413464 600008 286035 218547
Misassemblies
# misassemblies 6 4 10 6 7 7
Misassembled contigs length 1270144 1321332 2130162 1844194 1736227 1326538
Mismatches
# mismatches per 100 kbp 2.07 2.14 1.67 1.48 4.35 5.09
# indels per 100 kbp 0.98 0.65 5.54 3.31 2.88 3.76
# N's per 100 kbp 0.06 0 0.17 0.02 0.04 0
Genome statistics
Genome fraction (%) 99.037 99.636 99.615 99.087 99.478 98.614
Duplication ratio 1.017 1.01 1.008 1.004 1.009 1.005
# genes 4439 + 24 part 4467 + 11 part 4472 + 19 part 4454 + 22 part 4457 + 21 part 4410 + 35 part
NGA50 286997 323732 348693 599239 286035 193471

Four SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.

Assembly 4_raw_asm.ctg 4_raw_25X_asm.ctg 4_100X_asm.ctg 4_100X_25X_asm.ctg 4_118X_asm.ctg 4_118X_25X_asm.ctg
Statistics without reference
# contigs 286 51 123 23 71 40
Largest contig 532128 1812746 688723 1257198 983533 621920
Total length 6162978 5045811 5144868 4693193 4862387 4665855
N50 147254 834736 398131 694380 412226 285200
Misassemblies
# misassemblies 24 13 26 8 31 13
Misassembled contigs length 800651 3633076 2341550 2708632 2412628 1302367
Mismatches
# mismatches per 100 kbp 3.41 2.240 5.060 1.93 4.45 7.3
# indels per 100 kbp 1.36 0.97 5.21 3.82 4.08 5.71
# N's per 100 kbp 1.2 0.16 0.39 0.04 0.41 0
Genome statistics
Genome fraction (%) 100 100 99.74 99.337 99.798 98.559
Duplication ratio 1.331 1.089 1.112 1.019 1.052 1.022
# genes 4495 + 2 part 4494 + 3 part 4482 + 10 part 4470 + 9 part 4481 + 12 part 4413 + 36 part
NGA50 182232 476726 317322 694380 286770 214828

We then discarded the contigs to which fewer than 100 reads aligned.

Assembly 4_raw_asm.ctg 4_raw_25X_asm.ctg 4_100X_asm.ctg 4_100X_25X_asm.ctg 4_118X_asm.ctg 4_118X_25X_asm.ctg
Statistics without reference
# contigs 40 12 21 12 21 26
Largest contig 532128 1812746 688723 1257198 983533 621920
Total length 4726973 4659487 4656544 4602600 4654299 4508804
N50 180844 834736 398131 1071366 412226 285200
Misassemblies
# misassemblies 7 8 8 6 15 9
Misassembled contigs length 736689 3595196 2252884 2698687 2362706 1274915
Mismatches
# mismatches per 100 kbp 2.35 2.28 2.53 2 4.41 6.26
# indels per 100 kbp 0.87 0.75 3.94 3.81 3.31 4.49
# N's per 100 kbp 0.17 0.13 0.02 0.04 0.04 0
Genome statistics
Genome fraction (%) 99.193 99.215 99.62 98.967 99.687 97.023
Duplication ratio 1.028 1.012 1.008 1.003 1.009 1.003
# genes 4443 + 26 part 4456 + 12 part 4474 + 17 part 4452 + 11 part 4474 + 15 part 4344 + 37 part
NGA50 182232 476726 317322 694380 286770 247828

Summary

  1. Short-read depth has only a minor influence on read correction, but correcting with high-quality short reads may improve the assembly (fewer contigs and no overestimation of genome size).
  2. After read correction, filtering the PBcR with gatekeeper (here to 25X) can yield a good assembly.
  3. Increasing the long-read depth (from one to four SMRT cells) does not always guarantee an improved assembly.
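
Point 3 can be checked directly against the numbers reported above. The sketch below copies the long-read depth, contig count, and N50 of the raw-corrected, unfiltered assemblies (all contigs, before discarding contigs with few aligned reads) and tests whether N50 rises steadily with depth. It is only a convenience for reading the tables; the variable names are illustrative.

```python
# (dataset, long-read depth before correction, # contigs, N50), copied from
# the tables on this page: raw-corrected PBcR, no 25X filter, all contigs.
raw_assemblies = [
    ("m120208_071634", 16.13, 80, 356974),
    ("m120228_192221", 19.25, 93, 221472),
    ("m120228_210845", 22.49, 83, 324225),
    ("Two SMRT cells", 36.22, 106, 419161),
    ("Three SMRT cells", 56.86, 219, 247798),
    ("Four SMRT cells", 69.97, 286, 147254),
]

# Order by depth, then ask whether N50 never decreases as depth grows
by_depth = sorted(raw_assemblies, key=lambda row: row[1])
n50_values = [row[3] for row in by_depth]
n50_monotonic = all(a <= b for a, b in zip(n50_values, n50_values[1:]))

best = max(raw_assemblies, key=lambda row: row[3])
print(f"N50 increases steadily with depth: {n50_monotonic}")
print(f"Largest N50: {best[0]} at {best[1]}X ({best[3]} bp, {best[2]} contigs)")
```

In these unfiltered assemblies the largest N50 comes from two SMRT cells, while the four-cell assembly has the most contigs and the smallest N50.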