S. pneumoniae

Revision as of 12 May 2014 21:19 by admin

Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment, jumping and long reads. The S. pneumoniae TIGR4 genome consists of a single circular chromosome of 2,160,842 bp. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
Read length: 101 bp
Number of reads: 1067060 × 2
Insert size: 180 bp
Coverage: 99.8X

Jumping library
Read length: 93 bp
Number of reads: 1161884 × 2
Insert size: 3000 bp
Coverage: 100.0X

PacBio reads
Average read length: 1159.12 bp
Number of reads: 403745
Coverage: 216.6X
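
The coverage values above are consistent with dividing the total number of sequenced bases by the chromosome length (2,160,842 bp). The following is a minimal sketch of that arithmetic; the function name coverage is hypothetical, and awk is used only because plain shell arithmetic is integer-only.

```shell
# coverage READS ENDS READ_LENGTH GENOME_SIZE
# Prints sequencing depth (total bases / genome size) with one decimal place.
# "coverage" is a hypothetical helper name, not part of ALLPATHS-LG.
coverage() {
    awk -v n="$1" -v e="$2" -v rl="$3" -v g="$4" \
        'BEGIN { printf "%.1f\n", n * e * rl / g }'
}

GENOME=2160842   # chromosome length from the dataset description

coverage 1067060 2 101     "$GENOME"   # fragment library (paired reads)
coverage 1161884 2 93      "$GENOME"   # jumping library (paired reads)
coverage 403745  1 1159.12 "$GENOME"   # PacBio reads, using the average length
```

The three calls reproduce the coverage figures listed for the fragment, jumping and PacBio libraries.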

Raw data

The raw Illumina data were obtained from the Sequence Read Archive (SRA).

Fragment library
Accession: SRR387335
Read length: 101 bp
Number of reads: 5706200 × 2
Insert size: 180 bp
Coverage: 533.4X

Jumping library
Accession: SRR364158
Coverage: 179.2X

Fractional data

Using prepare.sh, we randomly selected from the raw fragment and jumping libraries the fraction of reads that matches the amount of website data.

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    FRAG_FRAC=0.187 \
    JUMP_FRAC=0.558 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
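
The two fractions are not derived on this page. One reading that matches the numbers is a simple ratio of website data to raw data: read counts for the fragment library, and coverage for the jumping library (whose raw read count is not listed). The sketch below only checks that assumption; the function name fraction is hypothetical.

```shell
# fraction PART WHOLE
# Prints PART / WHOLE rounded to three decimal places.
# Assumption: FRAG_FRAC and JUMP_FRAC were chosen as website/raw ratios.
fraction() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

fraction 1067060 5706200   # fragment read pairs: website / raw
fraction 100.0 179.2       # jumping coverage:    website / raw
```

Under that assumption the results agree with FRAG_FRAC=0.187 and JUMP_FRAC=0.558.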

50X coverage data

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
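
As a rough guide to how much data a coverage target corresponds to, the number of read pairs can be estimated as coverage × genome size / (2 × read length). This is a back-of-the-envelope sketch, not a description of how PrepareAllPathsInputs.pl selects reads internally; the function name pairs_for_coverage is hypothetical, and the 101 bp read length is taken from the fragment library.

```shell
# pairs_for_coverage COVERAGE GENOME_SIZE READ_LENGTH
# Estimates the number of read pairs needed to reach COVERAGE,
# truncated to a whole number of pairs.
pairs_for_coverage() {
    awk -v c="$1" -v g="$2" -v rl="$3" \
        'BEGIN { printf "%d\n", c * g / (2 * rl) }'
}

pairs_for_coverage 50  2165000 101   # fragment library at 50X
pairs_for_coverage 100 2165000 101   # fragment library at 100X
```

With GENOME_SIZE=2165000, the estimate is roughly 0.54 million fragment read pairs at 50X and 1.07 million at 100X.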

We also used a second setting that keeps all jumping library reads.

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    JUMP_FRAC=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

100X coverage data

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

We also used a second setting that keeps all jumping library reads.

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    IN_GROUPS_CSV=in_groups.csv \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out


Evaluation

  • Benchmark genome: S. pneumoniae TIGR4
  • Evaluated by QUAST (v2.2)
QUAST requires a gene list and the reference sequence NC_003028.fna. There are 2301 genes in total.
| Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
|---|---|---|---|---|---|---|---|
| # contigs | 1 | 5 | 1 | 2 | 1 | 1 | 3 |
| Largest contig | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
| Total length | 2162245 | 2140045 | 2151421 | 2146017 | 2149064 | 2150940 | 2148370 |
| N50 | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
| Misassemblies | | | | | | | |
| # misassemblies | 4 | 1 | 3 | 0 | 1 | 3 | 7 |
| Misassembled contigs length | 2162245 | 1340620 | 2151421 | 0 | 2149064 | 2150940 | 1659817 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 5.7 | 2.15 | 2.05 | 2.05 | 2.1 | 2.05 | 2.22 |
| # indels per 100kbp | 3.52 | 1.08 | 0.93 | 3.08 | 0.93 | 0.98 | 13.48 |
| # N's per 100kbp | 0.05 | 0.14 | 0.09 | 0.14 | 171.89 | 0.09 | 1268.03 |
| Genome statistics | | | | | | | |
| Genome fraction (%) | 99.946 | 99.967 | 99.45 | 99.239 | 99.239 | 99.43 | 98.23 |
| Duplication ratio | 1.011 | 1.005 | 1.016 | 1.015 | 1.017 | 1.016 | 1.032 |
| # genes | 2297 + 4 part | 2299 + 2 part | 2297 + 4 part | 2294 + 4 part | 2294 + 4 part | 2297 + 4 part | 2260 + 19 part |
| NGA50 | 1198037 | 1338442 | 1189098 | 1188680 | 1188680 | 1189098 | 483405 |

Running time: 1hr16m
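
For readers unfamiliar with the N50 row: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below illustrates the definition in shell; the function name n50 and the six input lengths are hypothetical and are not taken from any assembly on this page.

```shell
# n50: reads one contig length per line on stdin and prints the N50.
# Contigs are sorted from longest to shortest, then accumulated until
# the running sum reaches half of the total length.
n50() {
    sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                sum += len[i]
                if (sum >= half) { print len[i]; exit }
            }
        }'
}

# Hypothetical contig lengths, for illustration only.
printf '%s\n' 80 70 50 40 30 20 | n50
```

When an assembly consists of a single contig, N50 equals the length of that contig, which is why N50 and Largest contig coincide in the single-contig columns.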

Misassemblies for Adobe reader.


  • Score with QUAST: without PacBio long reads

| Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
|---|---|---|---|---|---|---|---|
| # contigs | 4 | 6 | 4 | 4 | 7 | 7 | 3 |
| Largest contig | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
| Total length | 2161502 | 2144412 | 2160013 | 2163970 | 2166318 | 2157620 | 2148370 |
| N50 | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
| Misassemblies | | | | | | | |
| # misassemblies | 5 | 19 | 11 | 10 | 17 | 9 | 7 |
| Misassembled contigs length | 1663585 | 2138844 | 1949937 | 1675149 | 2131766 | 2130729 | 1659817 |
| Mismatches | | | | | | | |
| # mismatches per 100kbp | 2.62 | 4.9 | 2.59 | 3.24 | 10.34 | 2.92 | 2.22 |
| # indels per 100kbp | 1.5 | 17.33 | 11.78 | 7.38 | 4.39 | 1.69 | 13.48 |
| # N's per 100kbp | 759.8 | 1714.74 | 1505.69 | 1609.26 | 1353.91 | 938.21 | 1268.03 |
| Genome statistics | | | | | | | |
| Genome fraction (%) | 98.798 | 97.204 | 98.624 | 98.477 | 98.059 | 98.324 | 98.23 |
| Duplication ratio | 1.022 | 1.019 | 1.047 | 1.027 | 1.061 | 1.044 | 1.032 |
| # genes | 2271 + 15 part | 2239 + 33 part | 2275 + 15 part | 2263 + 25 part | 2255 + 21 part | 2262 + 19 part | 2260 + 19 part |
| NGA50 | 403409 | 224016 | 397648 | 376923 | 313633 | 463639 | 483405 |

Running time: 19m

Misassemblies for Adobe reader.