RunCA

We re-ran correction and assembly with the data provided by the PBcR closure project.

We corrected the long-read sequence data (200X) with Illumina short reads (100X), once without and once with the genome size specified.

pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 miseq.100X.frg.bz2
pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 genomeSize=4650000 miseq.100X.frg.bz2

Read statistics for the input (200X filtered long reads) and for the output of each correction run:

Statistic             200X filtered long reads   Without genomeSize   genomeSize=4650000
Number of sequences   383482                     332880               37879
Average length (bp)   2422.877720                2260.68262           4927.683492
Total bases           929.13 Mb                  752.54 Mb            186.66 Mb
Depth                 199.81X                    161.84X              40.14X
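
The depth values are consistent with dividing the total number of bases by a genome size of 4.65 Mb. The sketch below reproduces that arithmetic. The function name `coverage_depth` is our own, and the 4,650,000 bp denominator is an assumption inferred from the numbers and from the genomeSize value used above, not something the source states explicitly.

```python
def coverage_depth(n_reads, mean_len, genome_size=4_650_000):
    """Return (total bases, depth) for a read set.

    genome_size defaults to 4,650,000 bp, the genomeSize value used above
    (assumed here to be the denominator behind the reported depth).
    """
    total_bases = n_reads * mean_len
    return total_bases, total_bases / genome_size


# (number of sequences, average length) taken from the statistics above
READ_SETS = {
    "200X filtered long reads": (383482, 2422.877720),
    "Without genomeSize": (332880, 2260.68262),
    "genomeSize=4650000": (37879, 4927.683492),
}

if __name__ == "__main__":
    for name, (n_reads, mean_len) in READ_SETS.items():
        total, depth = coverage_depth(n_reads, mean_len)
        print(f"{name}: {total / 1e6:.2f} Mb, {depth:.2f}X")
```

Run as a script, this prints the total bases and depth for each of the three read sets, which can be checked against the reported values.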

For assembly, we either used all corrected reads (PBcR) or filtered them to 25X, and we ran Celera Assembler with two different parameter sets, as described in ref.

runCA1: runCA -p asm -d asm -s asm.spec viaMiseq.frg
runCA2: runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg

We therefore had eight assemblies for comparison (two correction settings × two read sets × two runCA parameter sets).

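PacbioToCA           Gatekeeper   RunCA    Name of assembly
without genomeSize   all PBcR     runCA1   wo_all_runCA1
without genomeSize   all PBcR     runCA2   wo_all_runCA2
without genomeSize   25X PBcR     runCA1   wo_25X_runCA1
without genomeSize   25X PBcR     runCA2   wo_25X_runCA2
genomeSize=4650000   all PBcR     runCA1   w_all_runCA1
genomeSize=4650000   all PBcR     runCA2   w_all_runCA2
genomeSize=4650000   25X PBcR     runCA1   w_25X_runCA1
genomeSize=4650000   25X PBcR     runCA2   w_25X_runCA2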
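
The naming scheme above is the Cartesian product of the three choices. A minimal sketch that enumerates the eight names follows. The dictionary and function names are our own, and the prefixes `wo`/`w` are simply read off the assembly names in the table.

```python
from itertools import product

# Prefix -> meaning, in the order used in the table above
CORRECTION = {"wo": "without genomeSize", "w": "genomeSize=4650000"}
READ_SET = {"all": "all PBcR", "25X": "25X PBcR"}
RUNCA = ("runCA1", "runCA2")


def assembly_names():
    """Return the eight assembly names in table order."""
    return [
        f"{corr}_{reads}_{run}"
        for corr, reads, run in product(CORRECTION, READ_SET, RUNCA)
    ]


if __name__ == "__main__":
    for name in assembly_names():
        print(name)
```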


Evaluation

We evaluated the assemblies with QUAST 2.3 (reference genome NC_000913 and gene list Ec_gene_list).

Statistics without reference  wo_all_runCA1   wo_all_runCA2   wo_25X_runCA1   wo_25X_runCA2   w_all_runCA1    w_all_runCA2    w_25X_runCA1    w_25X_runCA2
# contigs                     248             302             15              36              36              76              17              37
Largest contig                762617          578194          1341765         694845          2069213         1204169         2021284         1754178
Total length                  5539907         6343154         4656826         4814122         4943194         5322102         4800259         5002926
N50                           371897          142793          1155479         460790          1478089         496729          1215597         1253429
Misassemblies
# misassemblies               36              38              8               14              12              17              11              15
Misassembled contigs length   1443585         1419385         2506591         1302083         3703087         1183356         2022749         2142054
Mismatches
# mismatches per 100 kbp      2.81            1.15            0.7             0.76            3.66            3.88            4.31            4.33
# indels per 100 kbp          12.97           8.02            2.19            2.54            0.52            0.86            1.06            0.65
# N's per 100 kbp             1.32            0.38            0               0               0               0.15            0.02            0.2
Genome statistics
Genome fraction (%)           99.858          99.651          99.189          99.196          100             100             100             100
Duplication ratio             1.195           1.373           1.012           1.046           1.066           1.148           1.035           1.079
# genes                       4484 + 10 part  4481 + 12 part  4466 + 11 part  4466 + 11 part  4494 + 3 part   4494 + 3 part   4494 + 3 part   4493 + 4 part
NGA50                         411646          230980          694825          366166          670145          677677          877912          834940
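
For readers unfamiliar with the N50 row: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below implements that textbook definition. It is not QUAST's code, the function name `nx` is our own, and the contig lengths in the usage lines are toy values, not taken from these assemblies. Passing `reference_length` gives the NG50-style variant measured against a reference size. NGA50 additionally works on aligned blocks rather than whole contigs, which this sketch does not attempt.

```python
def nx(lengths, fraction=0.5, reference_length=None):
    """Return the Nx value of a list of contig lengths.

    With reference_length=None the target is a fraction of the assembly's
    own total length (N50 for fraction=0.5). With a reference length the
    target is a fraction of that length instead (NG50-style). Returns None
    if the contigs never reach the target.
    """
    base = sum(lengths) if reference_length is None else reference_length
    target = base * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


if __name__ == "__main__":
    toy_contigs = [10, 8, 5, 3, 2]  # toy lengths, total 28
    print(nx(toy_contigs))                       # N50 of the toy set
    print(nx(toy_contigs, reference_length=40))  # NG50 against a 40 bp toy reference
```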

We then discarded the contigs to which fewer than 100 reads aligned and evaluated the remaining contigs again.

Statistics without reference  wo_all_runCA1   wo_all_runCA2   wo_25X_runCA1   wo_25X_runCA2   w_all_runCA1    w_all_runCA2    w_25X_runCA1    w_25X_runCA2
# contigs                     34              56              11              19              7               16              5               7
Largest contig                762617          578194          1341765         694845          2069213         1204169         2021284         1754178
Total length                  4798555         4933914         4612517         4629273         4635310         4744173         4666475         4690899
N50                           412414          318728          1155479         498347          1478089         678445          1215597         1253429
Misassemblies
# misassemblies               9               8               8               12              10              8               8               8
Misassembled contigs length   1287728         1257103         2497244         1278777         3693737         1111214         1994551         2084097
Mismatches
# mismatches per 100 kbp      1.49            1.06            0.72            0.81            3.38            3.9             4.29            4.18
# indels per 100 kbp          11.04           7.93            2.17            2.35            0.39            0.86            0.5             0.52
# N's per 100 kbp             0.15            0.04            0               0               0               0.08            0.02            0
Genome statistics
Genome fraction (%)           99.769          99.515          99.189          98.957          99.561          99.985          99.944          100
Duplication ratio             1.037           1.07            1.002           1.008           1.003           1.023           1.006           1.011
# genes                       4479 + 13 part  4469 + 14 part  4465 + 12 part  4454 + 15 part  4467 + 12 part  4491 + 6 part   4487 + 6 part   4492 + 5 part
NGA50                         368011          230980          694825          340107          1392456         496729          768685          834919
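
The read-count filter applied before this second evaluation (dropping contigs with fewer than 100 aligned reads) could be sketched as follows. The source does not say how the per-contig read counts were obtained, so the input here is a hypothetical mapping from contig name to number of aligned reads. The function name and the contig names are illustrative only.

```python
def keep_supported_contigs(read_counts, min_reads=100):
    """Return the names of contigs with at least min_reads aligned reads.

    read_counts: dict mapping contig name -> number of aligned reads
    (hypothetical input; produce it with whatever aligner or assembler
    output is available).
    """
    return [name for name, n in read_counts.items() if n >= min_reads]


if __name__ == "__main__":
    # Toy counts, not taken from the assemblies above
    toy_counts = {"ctg1": 15230, "ctg2": 99, "ctg3": 100, "ctg4": 3}
    print(keep_supported_contigs(toy_counts))
```

With the toy counts, only the contigs with 100 or more aligned reads are kept; a contig with exactly 100 reads passes because the rule discards those with fewer than 100.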