S. pneumoniae

Revision as of 12 May 2014 21:19 by admin


Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment, jumping and long reads (PacBio). The S. pneumoniae TIGR4 genome consists of a circular chromosome of 2,160,842 bp. For details, please refer to "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.


Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
Read length : 101 bp
Number of reads : 1067060 x 2
Insert size : 180 bp
Coverage : 99.8X

Jumping library
Read length : 93 bp
Number of reads : 1161884 x 2
Insert size : 3000 bp
Coverage : 100.0X

PacBio reads
Average read length : 1159.12 bp
Number of reads : 403745
Coverage : 216.6X
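
The coverage values listed above are consistent with the read counts, the read lengths and the 2,160,842 bp chromosome. The following is a minimal shell sketch of that arithmetic, assuming coverage means total sequenced bases divided by genome length. The helper name coverage is illustrative only and is not part of ALLPATHS-LG.

```shell
#!/bin/sh
# coverage READS READS_PER_FRAGMENT READ_LEN GENOME_SIZE
# Prints sequencing depth (total bases / genome size), rounded to one decimal.
coverage() {
    awk -v n="$1" -v m="$2" -v len="$3" -v g="$4" \
        'BEGIN { printf "%.1f\n", n * m * len / g }'
}

coverage 1067060 2 101 2160842      # fragment library -> 99.8
coverage 1161884 2 93 2160842       # jumping library  -> 100.0
coverage 403745 1 1159.12 2160842   # PacBio reads (unpaired, mean length) -> 216.6
```

The same helper can be applied to any other library for which the read count and read length are known.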

Raw data

The raw Illumina data were obtained from the Sequence Read Archive (SRA).

Fragment library
Accession : SRR387335
Read length : 101 bp
Number of reads : 5706200 x 2
Insert size : 180 bp
Coverage : 533.4X

Jumping library
Accession : SRR364158
Coverage : 179.2X

Fractional data

Using prepare.sh, we randomly subsampled the fragment and jumping libraries of the raw data down to the same fractions as the website data.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.187 \
    JUMP_FRAC=0.558 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
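
The values 0.187 and 0.558 can be reproduced from the numbers listed for the website data and the raw data. This is a sketch under the assumption that the fragment fraction was taken as a ratio of read pairs and the jumping fraction as a ratio of coverages (the raw jumping read count is not listed on this page). The helper name frac is illustrative.

```shell
#!/bin/sh
# frac SUBSET TOTAL
# Prints SUBSET / TOTAL rounded to three decimals.
frac() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

frac 1067060 5706200   # fragment: website read pairs / raw read pairs -> 0.187
frac 100.0 179.2       # jumping: website coverage / raw coverage      -> 0.558
```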

50X coverage data

Using prepare.sh, we randomly subsampled the fragment library and the jumping library to 50X coverage each.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
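
As a rough check on what a coverage target means in reads, the sketch below estimates how many 101 bp read pairs correspond to a given coverage of a 2,165,000 bp genome. It is plain arithmetic (coverage x genome size / bases per pair) and does not claim to mirror how PrepareAllPathsInputs.pl selects reads internally. The helper name pairs_for_coverage is illustrative.

```shell
#!/bin/sh
# pairs_for_coverage COVERAGE GENOME_SIZE READ_LEN
# Prints the number of read pairs (truncated to an integer) that yield
# COVERAGE over GENOME_SIZE when each pair contributes 2 * READ_LEN bases.
pairs_for_coverage() {
    awk -v c="$1" -v g="$2" -v len="$3" \
        'BEGIN { printf "%d\n", c * g / (2 * len) }'
}

pairs_for_coverage 50 2165000 101    # -> 535891
pairs_for_coverage 100 2165000 101   # -> 1071782
```

The 100X figure is close to the 1067060 fragment read pairs of the website data, which is listed at 99.8X.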

We also used a second setting that keeps all jumping library reads (JUMP_FRAC=1) and subsamples only the fragment library to 50X.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    JUMP_FRAC=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

100X coverage data

Using prepare.sh, we randomly subsampled the fragment library and the jumping library to 100X coverage each.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

We also used a second setting that keeps all jumping library reads and subsamples only the fragment library to 100X.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    IN_GROUPS_CSV=in_groups.csv \
    GENOME_SIZE=2165000 \
    FRAG_COVERAGE=100 \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

Evaluation

Benchmark genome

S. pneumoniae TIGR4

Evaluated by QUAST

QUAST v2.2

Running QUAST requires the gene list and the reference sequence NC_003028.fna. There are 2301 genes in total.

Score with QUAST: With PacBio Long Reads

Basic statistics	Website Data	Raw Data	Fractional Data	50X Coverage Data	50X fragment with all jumping	100X Coverage Data	100X fragment with all jumping
# contigs	1	5	1	2	1	1	3
Largest contig	2162245	1340620	2151421	1189234	2149064	2150940	1659817
Total length	2162245	2140045	2151421	2146017	2149064	2150940	2148370
N50	2162245	1340620	2151421	1189234	2149064	2150940	1659817
Misassemblies
# misassemblies	4	1	3	0	1	3	7
Misassembled contigs length	2162245	1340620	2151421	0	2149064	2150940	1659817
Mismatches
# mismatches per 100kbp	5.7	2.15	2.05	2.05	2.1	2.05	2.22
# indels per 100kbp	3.52	1.08	0.93	3.08	0.93	0.98	13.48
# N's per 100kbp	0.05	0.14	0.09	0.14	171.89	0.09	1268.03
Genome statistics
Genome fraction (%)	99.946	99.967	99.45	99.239	99.239	99.43	98.23
Duplication ratio	1.011	1.005	1.016	1.015	1.017	1.016	1.032
# genes	2297 + 4 part	2299 + 2 part	2297 + 4 part	2294 + 4 part	2294 + 4 part	2297 + 4 part	2260 + 19 part
NGA50	1198037	1338442	1189098	1188680	1188680	1189098	483405
Running Time	1hr16m

Misassemblies for Adobe reader.

Score with QUAST: Without PacBio Long Reads

Basic statistics	Website Data	Raw Data	Fractional Data	50X Coverage Data	50X fragment with all jumping	100X Coverage Data	100X fragment with all jumping
# contigs	4	6	4	4	7	7	3
Largest contig	1663585	2135901	1671738	1675149	1084537	1812035	1659817
Total length	2161502	2144412	2160013	2163970	2166318	2157620	2148370
N50	1663585	2135901	1671738	1675149	1084537	1812035	1659817
Misassemblies
# misassemblies	5	19	11	10	17	9	7
Misassembled contigs length	1663585	2138844	1949937	1675149	2131766	2130729	1659817
Mismatches
# mismatches per 100kbp	2.62	4.9	2.59	3.24	10.34	2.92	2.22
# indels per 100kbp	1.5	17.33	11.78	7.38	4.39	1.69	13.48
# N's per 100kbp	759.8	1714.74	1505.69	1609.26	1353.91	938.21	1268.03
Genome statistics
Genome fraction (%)	98.798	97.204	98.624	98.477	98.059	98.324	98.23
Duplication ratio	1.022	1.019	1.047	1.027	1.061	1.044	1.032
# genes	2271 + 15 part	2239 + 33 part	2275 + 15 part	2263 + 25 part	2255 + 21 part	2262 + 19 part	2260 + 19 part
NGA50	403409	224016	397648	376923	313633	463639	483405
Running Time	19m

Misassemblies for Adobe reader.
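
In both tables the N50 equals the largest contig in every column, because in each assembly the largest contig alone covers at least half of the total length. The sketch below shows the usual N50 definition (the length of the contig at which the running sum of contigs, sorted from longest to shortest, first reaches half of the total). The helper name n50 and the toy contig lengths are illustrative and are not taken from the assemblies above.

```shell
#!/bin/sh
# n50: reads one contig length per line on stdin and prints the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }      # lengths arrive longest first
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                # first contig at which the running sum reaches half the total
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Toy example: total = 290, half = 145; 80 + 70 = 150 reaches it, so N50 = 70.
printf '%s\n' 80 70 50 40 30 20 | n50
```

NGA50, as reported by QUAST, differs in that it is computed on blocks aligned to the reference and uses the reference length instead of the assembly length, which is why it can be much smaller than N50 when misassemblies are present.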