R. sphaeroides

Dataset 2: Rhodobacter sphaeroides strain 2.4.1. This dataset includes three libraries: fragment, jumping, and long reads. The R. sphaeroides 2.4.1 genome consists of two circular chromosomes of 3,188,609 bp and 943,016 bp and five plasmids of 114,045 bp, 114,178 bp, 105,284 bp, 100,828 bp, and 37,100 bp. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: rhody_data.tar.gz

Fragment library
Read length: 101 bp
Number of reads: 4354215 x 2
Insert size: 180 bp
Coverage: 191.0X

Jumping library
Read length: 101 bp
Number of reads: 1974031 x 2
Insert size: 3000 bp

PacBio reads
Average read length: 1031.19 bp
Number of reads: 1994107
Coverage: 446.4X
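
The coverage figures above can be approximated as total sequenced bases divided by genome size. The following is a minimal sketch, assuming the genome size is the sum of the seven replicon lengths listed at the top of this page (4,603,060 bp). The function name `coverage` is illustrative, and the listed values may have been computed with a slightly different genome size, so small differences are expected.

```shell
#!/usr/bin/env bash
# Estimate sequencing depth: (number of reads x mean read length) / genome size.
# GENOME_SIZE is an assumption: the sum of the two chromosomes and five plasmids.
GENOME_SIZE=$((3188609 + 943016 + 114045 + 114178 + 105284 + 100828 + 37100))

# coverage READS MEAN_LENGTH [GENOME_SIZE] -> depth rounded to one decimal place
coverage() {
    awk -v n="$1" -v len="$2" -v g="${3:-$GENOME_SIZE}" \
        'BEGIN { printf "%.1f\n", n * len / g }'
}

# Fragment library: 4354215 pairs, i.e. 2 x 4354215 reads of 101 bp.
coverage $((2 * 4354215)) 101
# PacBio reads: 1994107 reads averaging 1031.19 bp.
coverage 1994107 1031.19
```

Under this assumption the two calls give roughly 191X and 447X, close to the 191.0X and 446.4X listed for the fragment library and the PacBio reads.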

Raw data

The raw Illumina data were obtained from the Sequence Read Archive (SRA).

Fragment library
Accession: SRR125492
Read length: 101 bp
Number of reads: 11339101 x 2
Insert size: 180 bp
Coverage: 497.3X

Jumping library
Accession: SRR388672

Fractional data

Using prepare.sh, we randomly subsampled the fragment library of the raw data down to the same fraction of reads as in the website data (FRAG_FRAC=0.384).

# Prepare ALLPATHS-LG inputs, keeping a random 38.4% of the fragment reads
PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.384 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
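
The value FRAG_FRAC=0.384 is consistent with the ratio of website fragment read pairs (4354215) to raw fragment read pairs (11339101). A small sketch of that arithmetic follows; the function name `frag_frac` is illustrative, and that the fraction was derived exactly this way is an assumption.

```shell
#!/usr/bin/env bash
# frag_frac TARGET_PAIRS TOTAL_PAIRS -> fraction rounded to three decimals
frag_frac() {
    awk -v target="$1" -v total="$2" 'BEGIN { printf "%.3f\n", target / total }'
}

# Website fragment pairs divided by raw SRA fragment pairs.
frag_frac 4354215 11339101   # prints 0.384
```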

50X coverage data

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.


# Prepare ALLPATHS-LG inputs, subsampling both libraries to 50X coverage
PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  GENOME_SIZE=4610000 \
  FRAG_COVERAGE=50 \
  JUMP_COVERAGE=50 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out


100X coverage data

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.


# Prepare ALLPATHS-LG inputs, subsampling both libraries to 100X coverage
PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  GENOME_SIZE=4600000 \
  FRAG_COVERAGE=100 \
  JUMP_COVERAGE=100 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
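
For orientation, a coverage target together with GENOME_SIZE implies an approximate number of read pairs. The sketch below only does that arithmetic, assuming 101 bp paired reads; the function name `pairs_for_coverage` is illustrative, and how PrepareAllPathsInputs.pl actually performs the subsampling is not described here.

```shell
#!/usr/bin/env bash
# pairs_for_coverage COVERAGE GENOME_SIZE READ_LENGTH -> read pairs, rounded up
pairs_for_coverage() {
    awk -v c="$1" -v g="$2" -v len="$3" 'BEGIN {
        pairs = c * g / (2 * len)        # each pair contributes 2 x len bases
        whole = int(pairs)
        if (whole < pairs) whole++       # round up to a whole pair
        printf "%d\n", whole
    }'
}

pairs_for_coverage 50 4610000 101    # 50X run with GENOME_SIZE=4610000
pairs_for_coverage 100 4600000 101   # 100X run with GENOME_SIZE=4600000
```

Under these assumptions, 50X corresponds to about 1.14 million pairs per library and 100X to about 2.28 million pairs per library.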

Evaluation

  • Benchmark genome: R. sphaeroides 2.4.1
  • Evaluated by QUAST (v2.3)
Running QUAST requires a gene list and a reference genome. The gene list contains 4388 genes in total.
                              Website Data      Raw Data          Fractional Data   50X Coverage      100X Coverage
Basic statistics
# contigs                     11                13                10                NA                12
Largest contig                3188818           3188540           3188847           NA                3188773
Total length                  4601792           4588701           4609235           NA                4602493
N50                           3188818           3188540           3188847           NA                3188773
Misassemblies
# misassemblies               3                 3                 4                 NA                4
Misassembled contigs length   133056            1067239           206335            NA                196336
Mismatches
# mismatches per 100 kbp      3.04              3.69              4.15              NA                6.35
# indels per 100 kbp          2.91              3.93              3.65              NA                5.48
# N's per 100 kbp             0                 0.09              0.13              NA                45.04
Genome statistics
Genome fraction (%)           99.932            99.583            99.927            NA                99.834
Duplication ratio             1.001             1.001             1.002             NA                1.002
# genes                       4381 + 6 part     4365 + 11 part    4379 + 7 part     NA                4372 + 14 part
NGA50                         3188814           3188540           3188795           NA                3188333
Running time                  2h 12m            5h 09m            1h 49m            NA                3h 33m
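
In the table above, N50 equals the largest contig in every column because that contig (about 3.19 Mbp) alone covers more than half of the total assembly length (about 4.6 Mbp). The sketch below illustrates the standard N50 definition; it is not QUAST's own code, and the function name `n50` is illustrative. The example input is the seven reference replicon lengths rather than actual assembly contigs.

```shell
#!/usr/bin/env bash
# n50: read contig lengths (one per line) on stdin and print the N50, i.e. the
# length of the contig at which the running sum of lengths, taken from longest
# to shortest, first reaches half of the total length.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Reference replicons of R. sphaeroides 2.4.1: two chromosomes, five plasmids.
printf '%s\n' 3188609 943016 114045 114178 105284 100828 37100 | n50   # prints 3188609
```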

Misassemblies report (for Adobe Reader).


  • Score with QUAST, without PacBio long reads
                              Website Data      Raw Data          Fractional Data   50X Coverage      100X Coverage
Basic statistics
# contigs                     31                57                32                79                29
Largest contig                3188995           3186675           1674993           263863            2634704
Total length                  4592561           4583750           4620837           4096056           4628027
N50                           3188995           3186675           1492665           99916             2634704
Misassemblies
# misassemblies               6                 4                 10                249               16
Misassembled contigs length   4163443           4147900           2637662           3829308           3815869
Mismatches
# mismatches per 100 kbp      5.81              4.23              7.41              21.37             7.02
# indels per 100 kbp          3.65              3.57              4.5               4.88              4.99
# N's per 100 kbp             120.74            149.31            197.84            227311            1572.6
Genome statistics
Genome fraction (%)           99.417            98.669            99.437            64.402            98.348
Duplication ratio             1.004             1.009             1.01              1.38              1.022
# genes                       4345 + 30 part    4308 + 47 part    4341 + 31 part    2176 + 1205 part  4183 + 185 part
NGA50                         3182258           3180491           1486855           9975              511933
Running time                  41m               1h 01m            36m               19m               33m

Misassemblies report (for Adobe Reader).