We re-ran correction and assembly with the data provided by the PBcR closure project.
We corrected the long-read sequence data (200X) with Illumina short reads (100X), both with and without specifying the genome size.
pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 miseq.100X.frg.bz2
pacBioToCA -l viaMiseq -s pacbio.spec -t 10 -partitions 200 fastqFile=filtered_subreads.200X.fastq.bz2 genomeSize=4650000 miseq.100X.frg.bz2
Metric | 200X filtered long reads | Without genomeSize | genomeSize=4650000 |
Number of sequences | 383482 | 332880 | 37879 |
Average sequence length (bp) | 2422.877720 | 2260.68262 | 4927.683492 |
Total bases | 929.13 Mb | 752.54 Mb | 186.66 Mb |
Depth | 199.81X | 161.84X | 40.14X |
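The depth values in the table above are consistent with dividing the total bases by the genome size of 4650000 bp. The sketch below shows one way to compute these summary statistics. It assumes a 4-line-per-record FASTQ stream, and `fastq_stats` and `depth_from_total` are hypothetical helper names, not part of pacBioToCA. How the original numbers were actually produced is not stated here.

```bash
#!/usr/bin/env bash
# Summarise a 4-line-per-record FASTQ stream read from stdin:
# read count, mean length, total bases, and depth for a given genome size.
# fastq_stats is a hypothetical helper, not part of pacBioToCA.
fastq_stats() {
    local genome_size=$1
    awk -v g="$genome_size" '
        NR % 4 == 2 { n++; total += length($0) }   # sequence lines only
        END {
            if (n == 0) { print "no reads"; exit 1 }
            printf "seqs amount:%d\n", n
            printf "seq avg len:%.2f\n", total / n
            printf "total:%.2f Mb\n", total / 1e6
            printf "depth: %.2fX\n", total / g
        }'
}

# Depth from a reported total (in Mb) and a genome size (in bp).
depth_from_total() {
    awk -v mb="$1" -v g="$2" 'BEGIN { printf "%.2f\n", mb * 1e6 / g }'
}

# Recompute the depth row of the table from the total row.
for total_mb in 929.13 752.54 186.66; do
    depth_from_total "$total_mb" 4650000
done
# prints 199.81, 161.84 and 40.14, one per line
```

On real data the first helper could be fed a decompressed stream, for example `bzcat filtered_subreads.200X.fastq.bz2 | fastq_stats 4650000`, provided the file uses four lines per record.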
In addition to assembling either all PBcR reads or a subset filtered to 25X, we ran Celera Assembler with two different parameter sets (runCA1 and runCA2 in the table below), as described in ref.
runCA -p asm -d asm -s asm.spec viaMiseq.frg
runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0 -p asm -d asm viaMiseq.frg
We therefore had eight assemblies for comparison.
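Correction | PBcR reads used | runCA parameters | Assembly |
without genomeSize | all PBcR | runCA1 | wo_all_runCA1 |
without genomeSize | all PBcR | runCA2 | wo_all_runCA2 |
without genomeSize | 25X PBcR | runCA1 | wo_25X_runCA1 |
without genomeSize | 25X PBcR | runCA2 | wo_25X_runCA2 |
genomeSize=4650000 | all PBcR | runCA1 | w_all_runCA1 |
genomeSize=4650000 | all PBcR | runCA2 | w_all_runCA2 |
genomeSize=4650000 | 25X PBcR | runCA1 | w_25X_runCA1 |
genomeSize=4650000 | 25X PBcR | runCA2 | w_25X_runCA2 |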
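The eight assemblies form a 2 x 2 x 2 design: genomeSize setting, PBcR subset, and runCA parameter set. As a minimal sketch, the names in the table above can be enumerated with nested loops. `assembly_names` is a hypothetical helper introduced only for illustration; it prints labels and does not run any assembly.

```bash
#!/usr/bin/env bash
# Enumerate the 2 x 2 x 2 design behind the eight assembly names.
# assembly_names is a hypothetical helper; it only prints labels.
assembly_names() {
    local gs subset params
    for gs in wo w; do                      # wo = without genomeSize, w = genomeSize=4650000
        for subset in all 25X; do           # all PBcR reads or the 25X subset
            for params in runCA1 runCA2; do # the two runCA parameter sets
                printf '%s_%s_%s\n' "$gs" "$subset" "$params"
            done
        done
    done
}

assembly_names
```

A list like this can drive a loop that creates one working directory per assembly, so that every combination is covered exactly once.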