S. pneumoniae

Revision as of 19 October 2013 02:13 by admin

Streptococcus pneumoniae TIGR4. The S. pneumoniae TIGR4 genome consists of a single circular chromosome of 2,160,842 bp. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
  • Read length: 101 bp
  • Number of reads: 1067060 × 2
  • Insert size: 180 bp
  • Coverage: 88.89X
Jumping library
  • Read length: 93 bp
  • Number of reads: 1161883 × 2
  • Insert size: 3000 bp
PacBio reads
  • Average read length: 1159.12 bp
  • Number of reads: 403745
  • Coverage: 216.58X
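
The coverage figures above follow from coverage = (number of reads × read length) / genome size. The following is a minimal sketch of that arithmetic; the function name `coverage` is illustrative and not part of ALLPATHS-LG. Note that the listed fragment coverage (88.89X) is reproduced with an effective read length of 90 bp rather than the nominal 101 bp. The 90 bp value is an inference from the numbers, not something the page states.

```python
GENOME_SIZE_BP = 2_160_842  # S. pneumoniae TIGR4 chromosome length given on this page


def coverage(num_reads: int, mean_read_len_bp: float,
             genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * mean_read_len_bp / genome_size_bp


# PacBio reads: 403745 reads with a mean length of 1159.12 bp.
pacbio_cov = coverage(403745, 1159.12)

# Fragment library: 1067060 read pairs, i.e. 2 * 1067060 reads.
# Assumption: 90 bp effective read length (inferred, not stated in the source).
frag_cov_effective = coverage(2 * 1067060, 90)

# The same reads at the nominal 101 bp read length, for comparison.
frag_cov_nominal = coverage(2 * 1067060, 101)

print(f"PacBio:            {pacbio_cov:.2f}X")
print(f"Fragment (90 bp):  {frag_cov_effective:.2f}X")
print(f"Fragment (101 bp): {frag_cov_nominal:.2f}X")
```

The same function applies to the raw fragment library once its read count is substituted.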

Raw data

The raw data underlying the website data were obtained from the Sequence Read Archive (SRA).

Fragment library
  • Accession: SRX110128
  • Read length: 101 bp
  • Number of reads: 5706200 × 2
  • Insert size: 180 bp
  • Coverage: 475.33X
Jumping library
  • Accession: SRX105406
PacBio reads
  • Accessions: SRX109959, SRX109958

Self-fraction data

Using prepare.sh, we randomly subsampled the fragment and jumping libraries of the raw data to the same fraction as the website data.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.187 \
  JUMP_FRAC=0.558 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
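
The FRAG_FRAC value in this command can be checked against the read counts listed for the two fragment libraries (1067060 pairs in the website data, 5706200 pairs in the raw data). The following is a minimal sketch of that check; `subsample_fraction` is a hypothetical helper name, not an ALLPATHS-LG function. The raw jumping library read count is not listed on this page, so JUMP_FRAC=0.558 cannot be rederived the same way.

```python
def subsample_fraction(target_pairs: int, raw_pairs: int) -> float:
    """Fraction of the raw read pairs to keep so that target_pairs remain."""
    if raw_pairs <= 0:
        raise ValueError("raw_pairs must be positive")
    if not 0 <= target_pairs <= raw_pairs:
        raise ValueError("target_pairs must lie between 0 and raw_pairs")
    return target_pairs / raw_pairs


# Website fragment library pairs / raw fragment library pairs.
frag_frac = subsample_fraction(1067060, 5706200)
print(f"FRAG_FRAC={frag_frac:.3f}")
```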

100X fragment reads

Using prepare.sh, we randomly subsampled the fragment library of the raw data to 100X coverage.

Fragment library fraction = 100 / 475.33 ≈ 0.21

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.21 \
  JUMP_FRAC=0.558 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
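
The FRAG_FRAC used for the 100X run is the target coverage divided by the raw fragment coverage listed under Raw data (475.33X). The following is a minimal sketch of that calculation; the helper name `fraction_for_target_coverage` is hypothetical, and the final line only estimates how many read pairs a fraction of 0.21 would retain.

```python
def fraction_for_target_coverage(target_cov: float, raw_cov: float) -> float:
    """Fraction of reads to keep so the subsample reaches target_cov."""
    if raw_cov <= 0:
        raise ValueError("raw_cov must be positive")
    if target_cov > raw_cov:
        raise ValueError("target coverage exceeds the raw coverage")
    return target_cov / raw_cov


RAW_FRAGMENT_COVERAGE = 475.33  # raw fragment library coverage from the Raw data section
RAW_FRAGMENT_PAIRS = 5706200    # raw fragment library read pairs from the Raw data section

frag_frac = fraction_for_target_coverage(100, RAW_FRAGMENT_COVERAGE)
frag_frac_rounded = round(frag_frac, 2)  # value passed as FRAG_FRAC

# Approximate number of read pairs retained at the rounded fraction.
expected_pairs = round(frag_frac_rounded * RAW_FRAGMENT_PAIRS)

print(f"FRAG_FRAC={frag_frac_rounded}")
print(f"expected read pairs kept: {expected_pairs}")
```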

We also ran a second configuration that uses all jumping library reads (JUMP_FRAC is omitted).

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.21 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

Evaluation

  • Benchmark genome
R. sphaeroides 2.4.1
  • Evaluated by QUAST
QUAST (QUAST v2.2)
QUAST requires the reference sequence and its gene annotations. The reference contains 4438 genes in total.
  • QUAST results: with PacBio long reads
Metric                        Raw Data        Website Data    Self-fraction Data  100X Coverage
Basic statistics
# contigs                     13              11              10                  11
Largest contig                3188540         3188818         3188847             3188802
Total length                  4588701         4601792         4609235             4601762
N50                           3188540         3188818         3188847             3188802
Misassemblies
# misassemblies               12              16              20                  19
Misassembled contigs length   4361060         4370092         4557570             4484253
Mismatches
# mismatches per 100kbp       3.77            3.48            4.8                 6.43
# indels per 100kbp           5.13            3.52            4.87                5.61
# N's per 100kbp              0.09            0               0.13                0.07
Genome statistics
Genome fraction (%)           99.683          99.932          99.948              99.945
Duplication ratio             1.005           1.011           1.009               1.007
# genes                       4369 + 10 part  4381 + 6 part   4380 + 7 part       4378 + 8 part
NGA50                         2938269         904505          2715665             3170709
  • QUAST results: without PacBio long reads
Metric                        Raw Data        Website Data    Self-fraction Data  100X Coverage
Basic statistics
# contigs                     57              31              32                  26
Largest contig                3186675         3188995         1674993             3190277
Total length                  4583750         4592561         4620837             4607723
N50                           3186675         3188995         1492665             3190277
Misassemblies
# misassemblies               6               9               17                  27
Misassembled contigs length   4147900         4205887         2637662             4422750
Mismatches
# mismatches per 100kbp       4.23            5.81            7.49                10.76
# indels per 100kbp           3.57            5.64            4.72                8.94
# N's per 100kbp              149.31          120.74          197.84              812.74
Genome statistics
Genome fraction (%)           98.789          99.45           99.468              98.896
Duplication ratio             1.022           1.018           1.015               1.02
# genes                       4313 + 47 part  4348 + 27 part  4343 + 31 part      4266 + 101 part
NGA50                         3180491         3182258         1487141             546353
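
For readers unfamiliar with the N50 rows above: N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that definition, not QUAST's own code; the function name `n50` and the toy contig lengths are illustrative only.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50 contig.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical toy assembly: total 290, half 145, reached at the second contig.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

In most columns of the tables the largest contig alone exceeds half of the total length, which is why N50 equals the largest contig there.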