Read Depths

Revision as of 13 August 2013 20:50 by admin

As described in the paper Hybrid error correction, second-generation (short-read) sequencing data can be used to correct PacBio reads, and the resulting PacBio corrected reads (PBcR) can then be assembled de novo. Here, we discuss the effect of read depth on (1) hybrid error correction and (2) assembly.

Hybrid error correction

pacBioToCA -length 500 -partitions 200 -l PacBio_Illumia -s pacbio.spec

The pacbio.spec file was downloaded from PacBioToCA and then modified; the modified version is pacbio.spec (download).

Short read depth

We corrected the long reads with short reads at three depths: Raw, 118X and 100X.

Long read depth

We used subreads from one to four SMRT cells to obtain different long-read depths. The cells were chosen arbitrarily:
Three single SMRT cells: m120208_071634, m120228_192221, m120228_210845
Two SMRT cells: m120228_210845 + m120208_122534
Three SMRT cells: m120228_115504 + m120228_152936 + m120228_100807
Four SMRT cells: m120228_171636 + m120228_223624 + m120228_100807 + m120228_190630

Performance

Read set            m120208_071634    m120228_192221    m120228_210845    Two SMRT cells    Three SMRT cells  Four SMRT cells
Uncorrected subreads
  Sequences         37077             38542             44794             77117             113284            136333
  Avg length (bp)   2023.338161       2322.679985       2334.414140       2184.208709       2333.977711       2386.664674
  Total (Mb)        75.02             89.52             104.57            168.44            264.40            325.38
  Depth (X)         16.13             19.25             22.49             36.22             56.86             69.97
Corrected by raw data
  Sequences         26492             34981             40666             63760             98165             118901
  Avg length (bp)   2352.489884       2133.783826       2124.597010       2199.845561       2286.482249       2320.548322
  Total (Mb)        62.32             74.64             86.40             140.26            224.45            275.92
  Depth (X)         13.40             16.05             18.58             30.16             48.27             59.34
Corrected by 118X
  Sequences         26666             -                 -                 64201             99285             120296
  Avg length (bp)   2309.110290       -                 -                 2150.165184       2221.782394       2252.656963
  Total (Mb)        61.57             -                 -                 138.04            220.59            270.99
  Depth (X)         13.24             -                 -                 29.69             47.44             58.28
Corrected by 100X
  Sequences         25618             -                 -                 61415             95240             115080
  Avg length (bp)   2315.355024       -                 -                 2165.060164       2247.193879       2283.976060
  Total (Mb)        59.31             -                 -                 132.97            214.02            262.84
  Depth (X)         12.76             -                 -                 28.60             46.03             56.52
A dash marks a combination for which no values are given.
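
The depth figures in the table above are consistent with dividing the total number of bases by the genome size. The following is a minimal sketch of that arithmetic in Python, not the script used to produce the table. The genome size of 4.65 Mb is an assumption: it is not stated on this page and was inferred from the total/depth ratios in the table. The function names are hypothetical.

```python
# Assumed genome size in bases. NOT stated in the text; inferred from the
# total/depth ratios in the table (e.g. 75.02 Mb / 16.13X is about 4.65 Mb).
ASSUMED_GENOME_SIZE = 4_650_000


def depth(total_bases, genome_size=ASSUMED_GENOME_SIZE):
    """Return the average depth of coverage (in X) of a read set."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def retained_fraction(corrected_bases, uncorrected_bases):
    """Return the fraction of bases that survive error correction."""
    if uncorrected_bases <= 0:
        raise ValueError("uncorrected_bases must be positive")
    return corrected_bases / uncorrected_bases


if __name__ == "__main__":
    # Totals for m120208_071634, taken from the table (Mb converted to bases).
    uncorrected = 75.02e6
    corrected_by_raw = 62.32e6
    print(f"uncorrected depth: {depth(uncorrected):.2f}X")
    print(f"corrected depth:   {depth(corrected_by_raw):.2f}X")
    print(f"bases retained:    {retained_fraction(corrected_by_raw, uncorrected):.1%}")
```

Under this assumption the computed depths match the tabulated values to two decimal places, and the retained fraction gives a quick way to compare how much sequence each short-read depth preserves.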

Assembly

After read correction, the PBcR can be assembled de novo with runCA or Mira3.

We assembled the genome with runCA using either all PBcR or PBcR filtered to 25X with gatekeeper.

runCA -p asm -d asm -s asm.spec PacBio_Illumia.frg > asm.out 2>&1
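
The filtering to 25X mentioned above was done with gatekeeper, and its exact options are not given here. As an illustration only, one common way to reduce a read set to a target depth is to keep the longest reads first until the target number of bases is reached. The sketch below assumes that strategy; it is not a description of what gatekeeper did in this analysis, and the function name and toy numbers are hypothetical.

```python
def select_longest_to_depth(read_lengths, genome_size, target_depth):
    """Keep the longest reads until their summed length reaches
    target_depth * genome_size. Returns the kept lengths, longest first.

    If the whole read set is shallower than the target, all reads are kept.
    """
    if genome_size <= 0 or target_depth <= 0:
        raise ValueError("genome_size and target_depth must be positive")
    target_bases = target_depth * genome_size
    kept = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break  # target depth already reached
        kept.append(length)
        total += length
    return kept


if __name__ == "__main__":
    # Toy example: a 100 bp "genome" and a 10X target, i.e. 1000 bases.
    lengths = [100, 500, 300, 200, 400]
    kept = select_longest_to_depth(lengths, genome_size=100, target_depth=10)
    print(kept, sum(kept) / 100)
```

Because reads are added whole, the resulting depth slightly overshoots the target rather than hitting it exactly.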

Evaluation

We evaluated the assemblies with QUAST 2.2 (reference genome and genes: download).

Assemblies of single SMRT cell reads corrected with raw, 100X and 118X short reads:

Assembly                      071634_raw_asm.ctg   192221_raw_asm.ctg   210845_raw_asm.ctg   071634_100X_asm.ctg  071634_118X_asm.ctg
Statistics without reference
# contigs                     80                   93                   83                   61                   69
Largest contig                745120               664876               562203               663399               434084
Total length                  4975695              5031560              5043217              4804004              4805579
N50                           356974               221472               324225               295449               179662
Misassemblies
# misassemblies               11                   17                   21                   10                   13
Misassembled contigs length   1552524              976207               2108892              1222277              782726
Mismatches
# mismatches per 100 kbp      3.32                 2.91                 3.06                 7.08                 6.4
# indels per 100 kbp          2.98                 1.38                 1.01                 13.15                5.2
# N's per 100 kbp             0.38                 0.12                 0.22                 0.4                  0.37
Genome statistics
Genome fraction (%)           99.97                100                  100                  99.304               99.424
Duplication ratio             1.074                1.086                1.090                1.043                1.047
# genes                       4489 + 7 part        4490 + 7 part        4495 + 2 part        4461 + 25 part       4451 + 31 part
NGA50                         357183               221098               279423               226118               179662
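
For readers unfamiliar with the N50 row, the sketch below shows the usual definition: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembly length. This is an illustrative Python implementation with a hypothetical function name, not QUAST's own code. NGA50 is a related statistic that QUAST computes from aligned blocks and the reference genome length, and it is not reproduced here.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running sum first reaches half of the total length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig set: total length 54, so half is 27; 10 + 9 + 8 reaches it.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why it is read alongside the contig count and the misassembly rows rather than on its own.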