S. pneumoniae

Revision as of 28 January 2014 00:11 by admin

Streptococcus pneumoniae TIGR4. The S. pneumoniae TIGR4 genome consists of a single circular chromosome 2,160,842 bp in length. The Illumina sequencing data were available on the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Published data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
  Read length : 101 bp
  Number of reads : 1067060 x 2
  Insert size : 180 bp
  Coverage : 88.89X
Jumping library
  Read length : 93 bp
  Number of reads : 1161883 x 2
  Insert size : 3000 bp
PacBio reads
  Average read length : 1159.12 bp
  Number of reads : 403745
  Coverage : 216.58X
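
The coverage figures above follow the usual relation: coverage = number of reads x read length / genome size. The sketch below is a minimal check, not part of the original pipeline; the function name `coverage` is an illustrative assumption. It reproduces the PacBio figure from the numbers listed here. For the fragment library, the nominal 101 bp read length gives a higher value than the listed 88.89X, which instead corresponds to about 90 bp per read; this page does not say why.

```shell
#!/bin/sh
# coverage READS MEAN_LENGTH_BP GENOME_SIZE_BP
# Prints (reads x mean length) / genome size, to two decimals.
coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

GENOME_SIZE=2160842   # TIGR4 chromosome length stated on this page

# PacBio: 403745 reads with a mean length of 1159.12 bp
coverage 403745 1159.12 "$GENOME_SIZE"    # prints 216.58

# Fragment library: 1067060 pairs = 2134120 reads
coverage 2134120 101 "$GENOME_SIZE"       # nominal read length
coverage 2134120 90 "$GENOME_SIZE"        # prints 88.89, the listed value
```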

Raw data

The raw data behind the website data were obtained from the Sequence Read Archive (SRA).

Fragment library
  Accession : SRX110128
  Read length : 101 bp
  Number of reads : 5706200 x 2
  Insert size : 180 bp
  Coverage : 475.33X
Jumping library
  Accession : SRX105406
PacBio reads
  Accession : SRX109959, SRX109958

Fractional data

Using prepare.sh, we randomly selected from the raw fragment and jumping libraries the same fraction of reads as in the website data.

# FRAG_FRAC / JUMP_FRAC: fraction of the raw fragment / jumping reads to keep
PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.187 \
    JUMP_FRAC=0.558 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
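
The FRAG_FRAC value matches the ratio of the published fragment pair count (1067060) to the raw fragment pair count (5706200). The sketch below only illustrates that arithmetic; `subsample_fraction` is a hypothetical helper name, not part of ALLPATHS-LG. The raw jumping pair count is not listed on this page, so JUMP_FRAC=0.558 cannot be rechecked the same way.

```shell
#!/bin/sh
# subsample_fraction WANTED_PAIRS AVAILABLE_PAIRS
# Prints the ratio wanted / available, to three decimals.
subsample_fraction() {
    awk -v want="$1" -v have="$2" 'BEGIN { printf "%.3f\n", want / have }'
}

# Fragment library: published pairs over raw pairs
subsample_fraction 1067060 5706200   # prints 0.187
```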

50X coverage reads

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.

# FRAG_COVERAGE / JUMP_COVERAGE: target coverage, relative to GENOME_SIZE
PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
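
As a rough cross-check, a coverage target can be related to an equivalent sampling fraction by dividing the target by the coverage of the raw library. The sketch below uses the raw fragment coverage listed under Raw data (475.33X). It is a back-of-envelope estimate only and does not describe how PrepareAllPathsInputs.pl computes its subsample internally; `equivalent_fraction` is a hypothetical helper name.

```shell
#!/bin/sh
# equivalent_fraction TARGET_COVERAGE RAW_COVERAGE
# Prints target / raw, to three decimals.
equivalent_fraction() {
    awk -v target="$1" -v raw="$2" 'BEGIN { printf "%.3f\n", target / raw }'
}

RAW_FRAG_COVERAGE=475.33   # raw fragment library coverage listed on this page

equivalent_fraction 50 "$RAW_FRAG_COVERAGE"    # 50X target, prints 0.105
equivalent_fraction 100 "$RAW_FRAG_COVERAGE"   # 100X target, prints 0.210
```

By this estimate, the 50X setting keeps roughly a tenth of the raw fragment reads, somewhat fewer than the 0.187 used for the fractional data.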

We also used a second setting that keeps all jumping library reads.

# JUMP_FRAC=1 keeps every jumping read; fragment reads are subsampled to 50X
PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    JUMP_FRAC=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

100X coverage reads

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.

# FRAG_COVERAGE / JUMP_COVERAGE: target coverage, relative to GENOME_SIZE
PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

We also used a second setting that keeps all jumping library reads.

# No JUMP_* option is set; fragment reads are subsampled to 100X
PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    IN_GROUPS_CSV=in_groups.csv \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out


Evaluation

  • Benchmark genome
S. pneumoniae TIGR4
  • Evaluated by QUAST
QUAST v2.2
Running QUAST against the benchmark genome requires its reference sequence and gene annotation. There are 2301 genes in total.
  • Score with QUAST: With PacBio Long Reads

                             Published      Raw            Fractional     50X Coverage   50X fragment      100X Coverage  100X fragment
                             Data           Data           Data           Data           with all jumping  Data           with all jumping
Basic statistics
# contigs                    1              5              1              2              1                 1              1
Largest contig               2162245        1340620        2151421        1189234        2149064           2150940        2153124
Total length                 2162245        2140045        2151421        2146017        2149064           2150940        2153124
N50                          2162245        1340620        2151421        1189234        2149064           2150940        2153124
Misassemblies
# misassemblies              4              1              3              0              1                 3              1
Misassembled contigs length  2162245        1340620        2151421        0              2149064           2150940        2153124
Mismatches
# mismatches per 100kbp      5.7            2.15           2.05           2.05           2.1               2.05           2.05
# indels per 100kbp          3.52           1.08           0.93           3.08           0.93              0.98           1.07
# N's per 100kbp             0.05           0.14           0.09           0.14           171.89            0.09           172.96
Genome statistics
Genome fraction (%)          99.946         99.967         99.45          99.239         99.239            99.43          99.423
Duplication ratio            1.011          1.005          1.016          1.015          1.017             1.016          1.012
# genes                      2297 + 4 part  2299 + 2 part  2297 + 4 part  2294 + 4 part  2294 + 4 part     2297 + 4 part  2298 + 3 part
NGA50                        1198037        1338442        1189098        1188680        1188680           1189098        1590348

  • Score with QUAST: Without PacBio Long Reads

                             Raw Data            Website Data        Self-fraction Data  100X Coverage
Basic statistics
# contigs                    6                   4                   4                   4
Largest contig               2135901             1663585             1671738             1664345
Total length                 2144412             2161502             2160013             2156958
N50                          2135901             1663585             1671738             1664345
Misassemblies
# misassemblies              19                  5                   11                  9
Misassembled contigs length  2138844             1663585             1949937             2154315
Mismatches
# mismatches per 100kbp      4.9                 2.62                2.59                3.33
# indels per 100kbp          17.33               1.5                 11.78               14.39
# N's per 100kbp             1714.74             759.8               1505.69             994.27
Genome statistics
Genome fraction (%)          97.204              98.798              98.624              98.72
Duplication ratio            1.019               1.022               1.047               1.021
# genes                      2239 + 33 part      2271 + 15 part      2275 + 15 part      2266 + 17 part
NGA50                        224016              403409              397648              1533035
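
N50 in the tables above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. QUAST computes it directly; the sketch below only illustrates the definition, and `n50` is a hypothetical helper name. The two lengths come from the 50X Coverage Data column of the table with PacBio long reads: the largest contig (1189234) and the remainder of the total length (2146017 - 1189234 = 956783), assuming the two reported contigs account for the whole reported total.

```shell
#!/bin/sh
# n50: reads one contig length per line on stdin and prints the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }      # lengths arrive longest first
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # stop at the first contig that brings the sum to half the total
                if (2 * running >= total) { print len[i]; exit }
            }
        }'
}

printf '%s\n' 956783 1189234 | n50   # prints 1189234, matching the table
```

Whenever a single contig holds more than half of the assembly, N50 equals the largest contig, which is why the two rows agree in every column of both tables.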