We re-ran correction and assembly using the data provided by the PBcR closure project.
We corrected the long-read sequence data (200X) with Illumina short reads (100X), both with and without specifying the genome size.
pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 miseq.100X.frg.bz2
pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 genomeSize=4650000 miseq.100X.frg.bz2
Metric | 200X filtered long reads | Without genomeSize | genomeSize=4650000 |
Number of sequences | 383482 | 332880 | 37879 |
Average sequence length (bp) | 2422.877720 | 2260.68262 | 4927.683492 |
Total (Mb) | 929.13 | 752.54 | 186.66 |
Depth | 199.81X | 161.84X | 40.14X |
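The total and depth rows follow from the sequence count and the average length. The sketch below reproduces them, assuming depth is total bases divided by a genome size of 4,650,000 bp (the genomeSize value used above); the function name `depth` is ours, not part of any tool.

```python
GENOME_SIZE = 4_650_000  # bp; assumed to be the denominator used for depth

def depth(num_seqs, avg_len, genome_size=GENOME_SIZE):
    """Sequencing depth: total bases divided by genome size."""
    return num_seqs * avg_len / genome_size

# (number of sequences, average length in bp), copied from the table
datasets = {
    "200X filtered long reads": (383482, 2422.877720),
    "without genomeSize": (332880, 2260.68262),
    "genomeSize=4650000": (37879, 4927.683492),
}

for name, (n, avg) in datasets.items():
    total_mb = n * avg / 1e6
    print(f"{name}: total {total_mb:.2f} Mb, depth {depth(n, avg):.2f}X")
```

Specifying genomeSize leaves far fewer but longer corrected reads (about 40X instead of about 162X).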
In addition to either filtering the corrected reads (PBcR) down to 25X or using all of them for assembly, we ran Celera Assembler with two different parameter sets, as described in ref.
runCA1: runCA -p asm -d asm -s asm.spec viaMiseq.frg
runCA2: runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg
We therefore had eight assemblies for comparison.
PacbioToCA | Gatekeeper | RunCA | Name of assembly |
without genomeSize | all PBcR | runCA1 | wo_all_runCA1 |
without genomeSize | all PBcR | runCA2 | wo_all_runCA2 |
without genomeSize | 25X PBcR | runCA1 | wo_25X_runCA1 |
without genomeSize | 25X PBcR | runCA2 | wo_25X_runCA2 |
genomeSize=4650000 | all PBcR | runCA1 | w_all_runCA1 |
genomeSize=4650000 | all PBcR | runCA2 | w_all_runCA2 |
genomeSize=4650000 | 25X PBcR | runCA1 | w_25X_runCA1 |
genomeSize=4650000 | 25X PBcR | runCA2 | w_25X_runCA2 |
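The eight names are the Cartesian product of the three two-way choices (correction with or without genomeSize, all or 25X corrected reads, runCA1 or runCA2). A minimal sketch that enumerates them in table order; the variable names are illustrative only.

```python
from itertools import product

# Short prefixes used in the assembly names, mapped to what they stand for
pacbiotoca = {"wo": "without genomeSize", "w": "genomeSize=4650000"}
gatekeeper = {"all": "all PBcR", "25X": "25X PBcR"}
runca = ["runCA1", "runCA2"]

# product() varies the last factor fastest, matching the table order
assemblies = [f"{p}_{g}_{r}" for p, g, r in product(pacbiotoca, gatekeeper, runca)]

for name in assemblies:
    print(name)
```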
We evaluated the assemblies with QUAST 2.3 (reference genome NC_000913 and Ec_gene_list).
Statistics without reference | wo_all_runCA1 | wo_all_runCA2 | wo_25X_runCA1 | wo_25X_runCA2 | w_all_runCA1 | w_all_runCA2 | w_25X_runCA1 | w_25X_runCA2 |
# contigs | 248 | 302 | 15 | 36 | 36 | 76 | 17 | 37 |
Largest contig | 762617 | 578194 | 1341765 | 694845 | 2069213 | 1204169 | 2021284 | 1754178 |
Total length | 5539907 | 6343154 | 4656826 | 4814122 | 4943194 | 5322102 | 4800259 | 5002926 |
N50 | 371897 | 142793 | 1155479 | 460790 | 1478089 | 496729 | 1215597 | 1253429 |
Misassemblies | ||||||||
# misassemblies | 36 | 38 | 8 | 14 | 12 | 17 | 11 | 15 |
Misassembled contigs length | 1443585 | 1419385 | 2506591 | 1302083 | 3703087 | 1183356 | 2022749 | 2142054 |
Mismatches | ||||||||
# mismatches per 100 kbp | 2.81 | 1.150 | 0.7 | 0.76 | 3.66 | 3.88 | 4.310 | 4.33 |
# indels per 100 kbp | 12.97 | 8.02 | 2.19 | 2.54 | 0.52 | 0.86 | 1.06 | 0.65 |
# N's per 100 kbp | 1.32 | 0.38 | 0 | 0 | 0 | 0.15 | 0.02 | 0.2 |
Genome statistics | ||||||||
Genome fraction (%) | 99.858 | 99.651 | 99.189 | 99.196 | 100 | 100 | 100 | 100 |
Duplication ratio | 1.195 | 1.373 | 1.012 | 1.046 | 1.066 | 1.148 | 1.035 | 1.079 |
# genes | 4484 + 10 part | 4481 + 12 part | 4466 + 11 part | 4466 + 11 part | 4494 + 3 part | 4494 + 3 part | 4494 + 3 part | 4493 + 4 part |
NGA50 | 411646 | 230980 | 694825 | 366166 | 670145 | 677677 | 877912 | 834940 |
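For readers unfamiliar with the N50 row: it is the contig length L such that contigs of length L or longer cover at least half of the total assembly length. The sketch below uses that standard definition with made-up contig lengths; it is not QUAST's code. Passing a reference length as `base` gives the NG50 variant. NGA50 additionally uses aligned blocks rather than whole contigs, which this sketch does not attempt.

```python
def n50(lengths, base=None):
    """Return the N50 of `lengths`.

    If `base` is given (e.g. a reference genome length), half of `base`
    is used as the target instead of half of the assembly length (NG50).
    Returns None if the contigs never reach the target.
    """
    target = sum(lengths) if base is None else base
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target:
            return length
    return None

# Hypothetical contig lengths, for illustration only
toy = [10, 8, 5, 4, 3]
print(n50(toy))           # half of 30 is reached at the second contig
print(n50(toy, base=40))  # half of 40 is reached at the third contig
```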
We then discarded the contigs to which fewer than 100 reads aligned.
Statistics without reference | wo_all_runCA1 | wo_all_runCA2 | wo_25X_runCA1 | wo_25X_runCA2 | w_all_runCA1 | w_all_runCA2 | w_25X_runCA1 | w_25X_runCA2 |
# contigs | 34 | 56 | 11 | 19 | 7 | 16 | 5 | 7 |
Largest contig | 762617 | 578194 | 1341765 | 694845 | 2069213 | 1204169 | 2021284 | 1754178 |
Total length | 4798555 | 4933914 | 4612517 | 4629273 | 4635310 | 4744173 | 4666475 | 4690899 |
N50 | 412414 | 318728 | 1155479 | 498347 | 1478089 | 678445 | 1215597 | 1253429 |
Misassemblies | ||||||||
# misassemblies | 9 | 8 | 8 | 12 | 10 | 8 | 8 | 8 |
Misassembled contigs length | 1287728 | 1257103 | 2497244 | 1278777 | 3693737 | 1111214 | 1994551 | 2084097 |
Mismatches | ||||||||
# mismatches per 100 kbp | 1.49 | 1.06 | 0.72 | 0.81 | 3.38 | 3.9 | 4.29 | 4.18 |
# indels per 100 kbp | 11.04 | 7.93 | 2.17 | 2.35 | 0.39 | 0.86 | 0.5 | 0.52 |
# N's per 100 kbp | 0.15 | 0.04 | 0 | 0 | 0 | 0.08 | 0.02 | 0 |
Genome statistics | ||||||||
Genome fraction (%) | 99.769 | 99.515 | 99.189 | 98.957 | 99.561 | 99.985 | 99.944 | 100 |
Duplication ratio | 1.037 | 1.07 | 1.002 | 1.008 | 1.003 | 1.023 | 1.006 | 1.011 |
# genes | 4479 + 13 part | 4469 + 14 part | 4465 + 12 part | 4454 + 15 part | 4467 + 12 part | 4491 + 6 part | 4487 + 6 part | 4492 + 5 part |
NGA50 | 368011 | 230980 | 694825 | 340107 | 1392456 | 496729 | 768685 | 834919 |
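The filtering step behind this second table can be sketched as follows. The text does not say how the per-contig read counts were obtained, so the sketch assumes they are already available as a mapping from contig name to number of aligned reads; the contig names and counts are hypothetical.

```python
MIN_READS = 100  # threshold stated in the text

def contigs_to_keep(read_counts, min_reads=MIN_READS):
    """Return the names of contigs with at least `min_reads` aligned reads."""
    return [name for name, n in read_counts.items() if n >= min_reads]

# Hypothetical read counts, for illustration only
read_counts = {"ctg1": 15230, "ctg2": 99, "ctg3": 100, "ctg4": 3}

kept = contigs_to_keep(read_counts)
print(kept)
```

A contig with exactly 100 aligned reads is kept, since only contigs with fewer than 100 are discarded.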