S. pneumoniae

Revision as of 19 October 2013 02:13 by admin

Streptococcus pneumoniae TIGR4. The S. pneumoniae TIGR4 genome consists of a single circular chromosome 2,160,842 bp in length. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Contents
1 Website data
2 Raw data
3 Self-fraction data
4 100X fragment reads
5 Evaluation

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
  Read length: 101 bp
  Number of reads: 1067060 × 2
  Insert size: 180 bp
  Coverage: 88.89X
Jumping library
  Read length: 93 bp
  Number of reads: 1161883 × 2
  Insert size: 3000 bp
PacBio reads
  Average read length: 1159.12 bp
  Number of reads: 403745
  Coverage: 216.58X
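
As a quick sanity check on the PacBio figures above, depth can be estimated as (number of reads × average read length) / genome size. The sketch below assumes this is how the PacBio coverage was obtained; the function name estimate_coverage is illustrative and not part of ALLPATHS-LG.

```shell
#!/bin/sh
# Estimate sequencing depth as (reads x mean read length) / genome size.
# estimate_coverage is a hypothetical helper, not an ALLPATHS-LG tool.
estimate_coverage() {
    # $1 = number of reads, $2 = mean read length (bp), $3 = genome size (bp)
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

# PacBio reads listed above against the 2,160,842 bp TIGR4 chromosome
estimate_coverage 403745 1159.12 2160842
```

With the values listed on this page the estimate comes out at roughly 216.58X, in line with the PacBio coverage reported above.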

Raw data

The raw data underlying the website data were obtained from the Sequence Read Archive (SRA).

Fragment library
  Accession: SRX110128
  Read length: 101 bp
  Number of reads: 5706200 × 2
  Insert size: 180 bp
  Coverage: 475.33X
Jumping library
  Accession: SRX105406
PacBio reads
  Accessions: SRX109959, SRX109958

Self-fraction data

Using prepare.sh, we randomly subsampled the fragment and jumping libraries of the raw data to the same fractions as the website data.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.187 \
    JUMP_FRAC=0.558 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
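
The FRAG_FRAC value used here is consistent with the ratio of website to raw fragment read pairs (1067060 and 5706200). The sketch below reproduces that ratio, assuming this is how the fraction was chosen; subsample_fraction is a hypothetical helper name. The same check is not possible for JUMP_FRAC, because this page does not list the raw jumping-library read count.

```shell
#!/bin/sh
# Fraction of a raw library needed to match a target read-pair count.
# subsample_fraction is an illustrative helper, not an ALLPATHS-LG tool.
subsample_fraction() {
    # $1 = target read pairs, $2 = read pairs available in the raw library
    awk -v t="$1" -v r="$2" 'BEGIN { printf "%.3f\n", t / r }'
}

# Website fragment pairs vs. raw (SRA) fragment pairs
subsample_fraction 1067060 5706200
```

Rounded to three decimals this gives 0.187, the value passed as FRAG_FRAC.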

100X fragment reads

Using prepare.sh, we randomly subsampled the fragment library of the raw data down to 100X coverage.

Fragment library fraction = 100/475.33 ≈ 0.21

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.21 \
    JUMP_FRAC=0.558 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
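
The 100X fraction can be worked through the same way from the coverage figures: divide the target coverage by the raw fragment coverage listed under Raw data (475.33X), then multiply by the raw read-pair count for a rough idea of how many pairs are kept. This is a sketch under those assumptions; both helper names are hypothetical, and the pair count is only approximate because the selection is random.

```shell
#!/bin/sh
# Fraction of a library needed to reach a target coverage.
# Both helpers are illustrative and not part of ALLPATHS-LG.
fraction_for_coverage() {
    # $1 = target coverage (X), $2 = coverage of the full library (X)
    awk -v t="$1" -v a="$2" 'BEGIN { printf "%.2f\n", t / a }'
}

# Approximate number of read pairs retained at a given fraction.
pairs_kept() {
    # $1 = fraction, $2 = read pairs in the full library
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.0f\n", f * n }'
}

frac=$(fraction_for_coverage 100 475.33)
echo "FRAG_FRAC=$frac"
echo "approx. read pairs kept: $(pairs_kept "$frac" 5706200)"
```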

We also ran a second configuration that keeps all jumping library reads (JUMP_FRAC is omitted).

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.21 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

Evaluation

Benchmark genome

R. sphaeroides 2.4.1

Evaluated by QUAST

QUAST v2.2

QUAST requires the reference sequence and its gene annotations. The reference contains 4438 genes in total.

QUAST scores: with PacBio long reads

Basic statistics	Raw Data	Website Data	Self-fraction Data	100X Coverage
# contigs	13	11	10	11
Largest contig	3188540	3188818	3188847	3188802
Total length	4588701	4601792	4609235	4601762
N50	3188540	3188818	3188847	3188802
Misassemblies
# misassemblies	12	16	20	19
Misassembled contigs length	4361060	4370092	4557570	4484253
Mismatches
# mismatches per 100kbp	3.77	3.48	4.8	6.43
# indels per 100kbp	5.13	3.52	4.87	5.61
# N's per 100kbp	0.09	0	0.13	0.07
Genome statistics
Genome fraction (%)	99.683	99.932	99.948	99.945
Duplication ratio	1.005	1.011	1.009	1.007
# genes	4369 + 10 part	4381 + 6 part	4380 + 7 part	4378 + 8 part
NGA50	2938269	904505	2715665	3170709

QUAST scores: without PacBio long reads

Basic statistics	Raw Data	Website Data	Self-fraction Data	100X Coverage
# contigs	57	31	32	26
Largest contig	3186675	3188995	1674993	3190277
Total length	4583750	4592561	4620837	4607723
N50	3186675	3188995	1492665	3190277
Misassemblies
# misassemblies	6	9	17	27
Misassembled contigs length	4147900	4205887	2637662	4422750
Mismatches
# mismatches per 100kbp	4.23	5.81	7.49	10.76
# indels per 100kbp	3.57	5.64	4.72	8.94
# N's per 100kbp	149.31	120.74	197.84	812.74
Genome statistics
Genome fraction (%)	98.789	99.45	99.468	98.896
Duplication ratio	1.022	1.018	1.015	1.02
# genes	4313 + 47 part	4348 + 27 part	4343 + 31 part	4266 + 101 part
NGA50	3180491	3182258	1487141	546353