As described in the paper Hybrid error correction, second-generation (short-read) data can be used to correct PacBio reads, and the PacBio corrected reads (PBcR) can then be used for de novo assembly. Here, we discuss the effects of sequencing depth on (1) hybrid error correction and (2) assembly.
pacBioToCA -length 500 -partitions 200 -l PacBio_Illumia -s pacbio.spec
The original pacbio.spec file was downloaded from PacBioToCA and modified to give the pacbio.spec used here (download).
We have used three short read depths (Raw, 118X and 100X) to correct long reads.
We have used subreads of 1-4 SMRT cells for different depths of long reads.
We arbitrarily chose 1-4 SMRT cells:
Three single SMRT cells: m120208_071634, m120228_192221, m120228_210845
Two SMRT cells: m120228_210845 + m120208_122534
Three SMRT cells: m120228_115504 + m120228_152936 + m120228_100807
Four SMRT cells: m120228_171636 + m120228_223624 + m120228_100807 + m120228_190630
Name | m120208_071634 | m120228_192221 | m120228_210845 | Two SMRT cells | Three SMRT cells | Four SMRT cells |
Uncorrected subreads | ||||||
Number of sequences | 37077 | 38542 | 44794 | 77117 | 113284 | 136333 |
Average length (bp) | 2023.338161 | 2322.679985 | 2334.414140 | 2184.208709 | 2333.977711 | 2386.664674 |
Total (Mb) | 75.02 | 89.52 | 104.57 | 168.44 | 264.40 | 325.38 |
Depth (X) | 16.13 | 19.25 | 22.49 | 36.22 | 56.86 | 69.97 |
Corrected by raw data | ||||||
Number of sequences | 26492 | 34981 | 40666 | 63760 | 98165 | 118901 |
Average length (bp) | 2352.489884 | 2133.783826 | 2124.597010 | 2199.845561 | 2286.482249 | 2320.548322 |
Total (Mb) | 62.32 | 74.64 | 86.40 | 140.26 | 224.45 | 275.92 |
Depth (X) | 13.40 | 16.05 | 18.58 | 30.16 | 48.27 | 59.34 |
Corrected by 118X | ||||||
Number of sequences | 26666 | 35199 | 40811 | 64201 | 99285 | 120296 |
Average length (bp) | 2309.110290 | 2095.143186 | 2086.568670 | 2150.165184 | 2221.782394 | 2252.656963 |
Total (Mb) | 61.57 | 73.75 | 85.15 | 138.04 | 220.59 | 270.99 |
Depth (X) | 13.24 | 15.86 | 18.31 | 29.69 | 47.44 | 58.28 |
Corrected by 100X | ||||||
Number of sequences | 25618 | | | 61415 | 95240 | 115080 |
Average length (bp) | 2315.355024 | | | 2165.060164 | 2247.193879 | 2283.976060 |
Total (Mb) | 59.31 | | | 132.97 | 214.02 | 262.84 |
Depth (X) | 12.76 | | | 28.60 | 46.03 | 56.52 |
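The depth values in the table follow from the number of sequences, their average length, and the genome size. The sketch below reproduces that arithmetic. The genome size of 4.65 Mb is an assumption inferred from the table itself (total bases divided by depth); it is not stated in the text, and the function names are illustrative only.

```python
# Assumed genome size in bp, inferred from the table (total / depth); not given in the text.
ASSUMED_GENOME_SIZE = 4_650_000


def sequencing_depth(num_reads, mean_read_length, genome_size=ASSUMED_GENOME_SIZE):
    """Depth (X) = total sequenced bases / genome size."""
    total_bases = num_reads * mean_read_length
    return total_bases / genome_size


def retained_fraction(bases_before, bases_after):
    """Fraction of bases that survive error correction."""
    return bases_after / bases_before


# m120208_071634 before correction: 37077 subreads, average length 2023.338161 bp
depth = sequencing_depth(37077, 2023.338161)
print(f"depth: {depth:.2f}X")

# Same cell, total bases before (75.02 Mb) and after correction by raw data (62.32 Mb)
print(f"retained: {retained_fraction(75.02, 62.32):.1%}")
```

With these inputs, roughly 83% of the bases of the first SMRT cell remain after correction by the raw short reads.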
After read correction, PBcR can be used to de novo assemble the genome using runCA or Mira3.
We assembled the genome with runCA, using either all PBcR or the PBcR filtered down to 25X (using gatekeeper).
runCA -p asm -d asm -s asm.spec PacBio_Illumia.frg > asm.out 2>&1
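The text does not give the gatekeeper command used for the 25X filtering. As a hypothetical illustration of the idea only, the sketch below keeps the longest reads until a target depth is reached. This is one common downsampling strategy and is an assumption here, not a description of what gatekeeper does internally; `select_longest_to_depth` is an invented helper name.

```python
def select_longest_to_depth(read_lengths, genome_size, target_depth):
    """Keep the longest reads until their cumulative bases reach target_depth * genome_size.

    Hypothetical helper for illustration; not gatekeeper's implementation.
    """
    budget = genome_size * target_depth
    kept = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept


# Toy example: a 1 kb "genome" downsampled to 5X
lengths = [500, 3000, 1200, 2500, 800]
kept = select_longest_to_depth(lengths, genome_size=1000, target_depth=5)
print(kept, sum(kept) / 1000)
```

Selecting by length favors long reads, which generally help assembly contiguity, but other selection criteria are possible.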
We evaluated the assemblies with QUAST 2.2 (reference genome and genes: download).
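N50 is one of the contiguity statistics reported in the tables below. As a reminder of what it measures, here is a minimal sketch of the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total. The optional `reference_length` argument is an illustrative addition that gives the NG50 variant; NGA50 additionally requires contigs to be split into blocks aligned to the reference, which is not shown here.

```python
def n50(contig_lengths, reference_length=None):
    """Return N50, or NG50 when reference_length is given.

    Returns 0 if the contigs never cover half of the chosen total length.
    """
    total = reference_length if reference_length is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contigs))
```

Because N50 depends on the total assembly length, a fragmented or duplicated assembly can change it without any change in the largest contigs, which is why the reference-based NGA50 is listed alongside it.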
Single SMRT cell reads were corrected with raw, 100X and 118X short reads.
Statistics without reference | 071634_raw_asm.ctg | 192221_raw_asm.ctg | 210845_raw_asm.ctg | 071634_100X_asm.ctg | 071634_118X_asm.ctg | 192221_118X_asm.ctg | 210845_118X_asm.ctg |
# contigs | 80 | 93 | 83 | 61 | 69 | 68 | 66 |
Largest contig | 745120 | 664876 | 562203 | 663399 | 434084 | 345313 | 437164 |
Total length | 4975695 | 5031560 | 5043217 | 4804004 | 4805579 | 4801310 | 4733683 |
N50 | 356974 | 221472 | 324225 | 295449 | 179662 | 207976 | 186993 |
Misassemblies | |||||||
# misassemblies | 11 | 17 | 21 | 10 | 13 | 20 | 15 |
Misassembled contigs length | 1552524 | 976207 | 2108892 | 1222277 | 782726 | 1156917 | 527873 |
Mismatches | |||||||
# mismatches per 100 kbp | 3.32 | 2.91 | 3.06 | 7.08 | 6.4 | 10.33 | 4.13 |
# indels per 100 kbp | 2.98 | 1.38 | 1.01 | 13.15 | 5.2 | 5.54 | 2.69 |
# N's per 100 kbp | 0.38 | 0.12 | 0.22 | 0.4 | 0.37 | 0.4 | 0.23 |
Genome statistics | |||||||
Genome fraction (%) | 99.97 | 100 | 100 | 99.304 | 99.424 | 99.522 | 98.712 |
Duplication ratio | 1.074 | 1.086 | 1.090 | 1.043 | 1.047 | 1.04 | 1.033 |
# genes | 4489 + 7 part | 4490 + 7 part | 4495 + 2 part | 4461 + 25 part | 4451 + 31 part | 4459 + 28 part | 4412 + 32 part |
NGA50 | 357183 | 221098 | 279423 | 226118 | 179662 | 194634 | 191457 |
Two SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.
Statistics without reference | 2_raw_asm.ctg | 2_raw_25X_asm.ctg | 2_100X_asm.ctg | 2_100X_25X_asm.ctg | 2_118X_asm.ctg | 2_118X_25X_asm.ctg |
# contigs | 106 | 80 | 81 | 71 | 81 | 54 |
Largest contig | 762045 | 757702 | 767781 | 405645 | 520095 | 570062 |
Total length | 5168000 | 5080370 | 4918962 | 4799524 | 4832961 | 4725680 |
N50 | 419161 | 405539 | 331262 | 193986 | 186504 | 210927 |
Misassemblies | ||||||
# misassemblies | 13 | 18 | 15 | 9 | 16 | 16 |
Misassembled contigs length | 1591856 | 1751860 | 1703983 | 165469 | 616747 | 1468075 |
Mismatches | ||||||
# mismatches per 100 kbp | 2.44 | 1.83 | 5.08 | 4.34 | 6.28 | 6.08 |
# indels per 100 kbp | 0.88 | 0.82 | 6.34 | 2.14 | 5.61 | 2.93 |
# N's per 100 kbp | 0.48 | 0.04 | 0.94 | 0.02 | 0.43 | 0.08 |
Genome statistics | ||||||
Genome fraction (%) | 100 | 100 | 99.652 | 98.76 | 99.567 | 99.194 |
Duplication ratio | 1.116 | 1.098 | 1.065 | 1.048 | 1.047 | 1.028 |
# genes | 4495 + 2 part | 4495 + 2 part | 4475 + 16 part | 4432 + 31 part | 4458 + 30 part | 4434 + 43 part |
NGA50 | 418393 | 405538 | 235822 | 193833 | 194196 | 199657 |
Statistics without reference | 2_raw_asm.ctg | 2_raw_25X_asm.ctg | 2_100X_asm.ctg | 2_100X_25X_asm.ctg | 2_118X_asm.ctg | 2_118X_25X_asm.ctg |
# contigs | 16 | 17 | 22 | 33 | 35 | 32 |
Largest contig | 762045 | 757702 | 767781 | 405645 | 520095 | 570062 |
Total length | 4650035 | 4675233 | 4651814 | 4574523 | 4648591 | 4588060 |
N50 | 514903 | 405539 | 331262 | 193986 | 194625 | 223426 |
Misassemblies | ||||||
# misassemblies | 4 | 6 | 10 | 4 | 5 | 8 |
Misassembled contigs length | 1564680 | 1697372 | 1683620 | 141677 | 569893 | 1424613 |
Mismatches | ||||||
# mismatches per 100 kbp | 2.43 | 2.23 | 4.94 | 1.8 | 5.56 | 5.97 |
# indels per 100 kbp | 1.76 | 0.84 | 6.19 | 1.8 | 4.48 | 2.86 |
# N's per 100 kbp | 0.06 | 0 | 0.26 | 0 | 0.13 | 0.04 |
Genome statistics | ||||||
Genome fraction (%) | 99.371 | 99.633 | 99.536 | 98.364 | 99.163 | 98.603 |
Duplication ratio | 1.006 | 1.013 | 1.009 | 1.003 | 1.011 | 1.004 |
# genes | 4458 + 13 part | 4466 + 8 part | 4462 + 22 part | 4404 + 39 part | 4438 + 32 part | 4405 + 42 part |
NGA50 | 418393 | 405538 | 235822 | 193833 | 194196 | 199657 |
Three SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.
Statistics without reference | 3_raw_asm.ctg | 3_raw_25X_asm.ctg | 3_100X_asm.ctg | 3_100X_25X_asm.ctg | 3_118X_asm.ctg | 3_118X_25X_asm.ctg |
# contigs | 219 | 74 | 98 | 32 | 86 | 39 |
Largest contig | 771076 | 1426293 | 981874 | 822480 | 1091515 | 520962 |
Total length | 5873961 | 5171438 | 5051244 | 4730819 | 4906749 | 4668968 |
N50 | 247798 | 317846 | 413464 | 600008 | 286035 | 218547 |
Misassemblies | ||||||
# misassemblies | 25 | 10 | 22 | 8 | 25 | 11 |
Misassembled contigs length | 1361077 | 1372143 | 2201123 | 1855654 | 1800186 | 1350243 |
Mismatches | ||||||
# mismatches per 100 kbp | 4.03 | 2.5 | 2.030 | 1.5 | 4.71 | 5.45 |
# indels per 100 kbp | 1.68 | 0.97 | 5.64 | 3.3 | 4.13 | 3.86 |
# N's per 100 kbp | 0.34 | 0.140 | 0.46 | 0.11 | 0.18 | 0.02 |
Genome statistics | ||||||
Genome fraction (%) | 100 | 100 | 99.733 | 99.197 | 99.69 | 98.93 |
Duplication ratio | 1.268 | 1.116 | 1.092 | 1.028 | 1.063 | 1.018 |
# genes | 4494 + 3 part | 4495 + 2 part | 4484 + 9 part | 4460 + 19 part | 4468 + 20 part | 4427 + 35 part |
NGA50 | 286997 | 323732 | 348693 | 599239 | 286035 | 193471 |
Four SMRT cell reads were corrected with raw, 100X, and 118X short reads. The PBcR were then filtered to 25X or directly assembled by runCA.
Statistics without reference | 4_raw_asm.ctg | 4_raw_25X_asm.ctg | 4_100X_asm.ctg | 4_100X_25X_asm.ctg | 4_118X_asm.ctg | 4_118X_25X_asm.ctg |
# contigs | 286 | 51 | 123 | 23 | 71 | 40 |
Largest contig | 532128 | 1812746 | 688723 | 1257198 | 983533 | 621920 |
Total length | 6162978 | 5045811 | 5144868 | 4693193 | 4862387 | 4665855 |
N50 | 147254 | 834736 | 398131 | 694380 | 412226 | 285200 |
Misassemblies | ||||||
# misassemblies | 24 | 13 | 26 | 8 | 31 | 13 |
Misassembled contigs length | 800651 | 3633076 | 2341550 | 2708632 | 2412628 | 1302367 |
Mismatches | ||||||
# mismatches per 100 kbp | 3.41 | 2.240 | 5.060 | 1.93 | 4.45 | 7.3 |
# indels per 100 kbp | 1.36 | 0.97 | 5.21 | 3.82 | 4.08 | 5.71 |
# N's per 100 kbp | 1.2 | 0.16 | 0.39 | 0.04 | 0.41 | 0 |
Genome statistics | ||||||
Genome fraction (%) | 100 | 100 | 99.74 | 99.337 | 99.798 | 98.559 |
Duplication ratio | 1.331 | 1.089 | 1.112 | 1.019 | 1.052 | 1.022 |
# genes | 4495 + 2 part | 4494 + 3 part | 4482 + 10 part | 4470 + 9 part | 4481 + 12 part | 4413 + 36 part |
NGA50 | 182232 | 476726 | 317322 | 694380 | 286770 | 214828 |