S. pneumoniae

Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment reads, jumping reads, and long reads. The S. pneumoniae TIGR4 genome consists of a single circular chromosome 2,160,842 bp in length. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
Read length: 101 bp
Number of reads: 1067060 x 2 (paired)
Insert size: 180 bp
Coverage: 99.8X

Jumping library
Read length: 93 bp
Number of reads: 1161884 x 2 (paired)
Insert size: 3000 bp
Coverage: 100.0X

PacBio reads
Average read length: 1159.12 bp
Number of reads: 403745
Coverage: 216.6X
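
The coverage figures above can be reproduced from the read counts, read lengths, and the 2,160,842 bp chromosome length. The following is a minimal sketch; the helper name coverage is made up for illustration, and it assumes coverage is simply total sequenced bases divided by genome size.

```shell
# coverage = reads * mates * read_length / genome_size
# "coverage" is a hypothetical helper, not part of ALLPATHS-LG.
coverage() {
  # $1 number of reads (or read pairs)
  # $2 mates per entry (2 for paired reads, 1 for single reads)
  # $3 mean read length in bp
  # $4 genome size in bp
  awk -v n="$1" -v m="$2" -v len="$3" -v g="$4" \
    'BEGIN { printf "%.1f\n", n * m * len / g }'
}

GENOME=2160842   # TIGR4 chromosome length given in the text

coverage 1067060 2 101     "$GENOME"   # fragment library
coverage 1161884 2 93      "$GENOME"   # jumping library
coverage 403745  1 1159.12 "$GENOME"   # PacBio reads
```

The three calls print 99.8, 100.0, and 216.6, matching the listed coverages.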

Raw data

The raw Illumina data were obtained from the Sequence Read Archive (SRA).

Fragment library
Accession: SRR387335
Read length: 101 bp
Number of reads: 5706200 x 2 (paired)
Insert size: 180 bp
Coverage: 533.4X

Jumping library
Accession: SRR364158
Coverage: 179.2X

Fractional data

Using prepare.sh, we randomly subsampled the fragment and jumping libraries of the raw data to the same fraction as the website data.

PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  FRAG_FRAC=0.187 \
  JUMP_FRAC=0.558 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
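
The FRAG_FRAC and JUMP_FRAC values are consistent with dividing the website-data coverage by the raw-data coverage of each library. The sketch below shows that arithmetic; the helper name fraction is hypothetical, and the calculation is our reading of how the values were chosen rather than something the page states explicitly.

```shell
# fraction = target coverage / available raw coverage
# "fraction" is a hypothetical helper used only for this illustration.
fraction() {
  # $1 target coverage (website data), $2 raw coverage (SRA data)
  awk -v target="$1" -v raw="$2" 'BEGIN { printf "%.3f\n", target / raw }'
}

fraction 99.8  533.4   # fragment library
fraction 100.0 179.2   # jumping library
```

The two calls print 0.187 and 0.558, the values passed to PrepareAllPathsInputs.pl.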

50X coverage data

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.

PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=50 \
  JUMP_COVERAGE=50 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
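
As a rough sanity check, a coverage target can be translated into an approximate number of read pairs. This is a back-of-the-envelope estimate under the assumption that coverage equals total bases divided by GENOME_SIZE; it is not a description of how PrepareAllPathsInputs.pl selects reads internally, and the helper name pairs_for_coverage is made up.

```shell
# pairs = coverage * genome_size / (2 * read_length)
# "pairs_for_coverage" is a hypothetical helper for illustration only.
pairs_for_coverage() {
  # $1 target coverage, $2 genome size in bp, $3 read length in bp
  awk -v c="$1" -v g="$2" -v len="$3" \
    'BEGIN { printf "%d\n", c * g / (2 * len) }'
}

# Fragment library: 101 bp reads, GENOME_SIZE=2165000 as in the command above
pairs_for_coverage 50 2165000 101
```

For the 101 bp fragment library this gives about 535,891 pairs, out of the 5,706,200 pairs available in the raw data.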

We also ran a second setting that uses all jumping library reads (JUMP_FRAC=1).

PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  JUMP_FRAC=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=50 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

100X coverage data

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.

PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=100 \
  JUMP_COVERAGE=100 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

We also ran a second setting that uses all jumping library reads (no JUMP_COVERAGE limit is set).

PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  IN_GROUPS_CSV=in_groups.csv \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=100 \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out


Evaluation

  • Benchmark genome: S. pneumoniae TIGR4
  • Evaluated by QUAST (v2.3)
Running QUAST requires a gene list and the reference sequence NC_003028.fna. There are 2301 genes in total.
| Metric | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
|---|---|---|---|---|---|---|---|
| Basic statistics | | | | | | | |
| # contigs | 1 | 5 | 1 | 2 | 1 | 1 | 3 |
| Largest contig | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
| Total length | 2162245 | 2140045 | 2151421 | 2146017 | 2149064 | 2150940 | 2148370 |
| N50 | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
| Misassemblies | | | | | | | |
| # misassemblies | 3 | 2 | 2 | 0 | 1 | 2 | 3 |
| Misassembled contigs length | 2162245 | 1340620 | 2151421 | 0 | 2149064 | 2150940 | 1659817 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 5.7 | 2.15 | 2.05 | 2.05 | 2.1 | 2.05 | 1.74 |
| # indels per 100kbp | 1.53 | 1.08 | 0.93 | 0.98 | 0.93 | 0.98 | 1.93 |
| # N's per 100kbp | 0.05 | 0.14 | 0.09 | 0.14 | 171.89 | 0.09 | 1268.03 |
| Genome statistics | | | | | | | |
| Genome fraction (%) | 99.946 | 99.862 | 99.45 | 99.239 | 99.239 | 99.43 | 98.168 |
| Duplication ratio | 1.002 | 1.002 | 1.001 | 1.001 | 1.002 | 1.001 | 1.013 |
| # genes | 2297 + 4 part | 2283 + 10 part | 2297 + 4 part | 2294 + 4 part | 2294 + 4 part | 2297 + 4 part | 2257 + 22 part |
| NGA50 | 1197408 | 469127 | 1189418 | 1188680 | 1188680 | 1189418 | 1234602 |
| Running Time | 1hr 16m | 6hr 05m | 41m | 53m | 44m | 39m | 1hr 06m |

Misassemblies for Adobe Reader.
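
A QUAST run of this kind could be assembled roughly as sketched below. This is not the command the authors ran: genes.txt, assembly.fasta, and quast_out are placeholder names, and only NC_003028.fna comes from the text. The -R (reference), -G (gene list), and -o (output directory) options are taken from the QUAST 2.x command line as we understand it; check them against quast.py --help for the installed version, since option names have changed between releases. The hypothetical helper only prints the command line so it can be inspected before running.

```shell
# Build (but do not execute) a quast.py command line.
# "quast_command" is a hypothetical helper; all file names except
# NC_003028.fna are assumptions made for this illustration.
quast_command() {
  # $1 reference FASTA, $2 gene list, $3 output directory, $4 assembly FASTA
  printf 'quast.py -R %s -G %s -o %s %s\n' "$1" "$2" "$3" "$4"
}

quast_command NC_003028.fna genes.txt quast_out assembly.fasta
```

The printed line can be pasted into a shell once the placeholder paths have been replaced with real ones.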


  • Score with QUAST: Without PacBio Long Reads
| Metric | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
|---|---|---|---|---|---|---|---|
| Basic statistics | | | | | | | |
| # contigs | 4 | 6 | 4 | 4 | 7 | 7 | 3 |
| Largest contig | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
| Total length | 2161502 | 2144412 | 2160013 | 2163970 | 2166318 | 2157620 | 2148370 |
| N50 | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
| Misassemblies | | | | | | | |
| # misassemblies | 2 | 11 | 6 | 5 | 13 | 2 | 3 |
| Misassembled contigs length | 1663585 | 2138844 | 1949937 | 1675149 | 2131766 | 1812035 | 1659817 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 2.16 | 4.43 | 2.07 | 3.2 | 10.34 | 2.21 | 1.74 |
| # indels per 100kbp | 1.5 | 2.29 | 1.6 | 1.5 | 2.69 | 1.6 | 1.93 |
| # N's per 100kbp | 759.8 | 1714.74 | 1505.69 | 1609.26 | 1353.91 | 938.21 | 1268.03 |
| Genome statistics | | | | | | | |
| Genome fraction (%) | 98.777 | 97.199 | 98.241 | 98.431 | 98.033 | 98.31 | 98.168 |
| Duplication ratio | 1.013 | 1.019 | 1.018 | 1.017 | 1.023 | 1.016 | 1.013 |
| # genes | 2269 + 17 part | 2239 + 33 part | 2275 + 18 part | 2260 + 27 part | 2254 + 22 part | 2262 + 19 part | 2257 + 22 part |
| NGA50 | 468023 | 459198 | 467165 | 467048 | 286777 | 828148 | 124602 |
| Running Time | 19m | 53m | 19m | 20m | 18m | 19m | 22m |

Misassemblies for Adobe Reader.