S. pneumoniae

Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment, jumping, and long reads. The S. pneumoniae TIGR4 genome consists of a single circular chromosome of 2,160,842 bp. For details, refer to "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.


Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz

Fragment library
Read length : 101 bp
Number of reads : 1067060 × 2
Insert size : 180 bp
Coverage : 99.8X

Jumping library
Read length : 93 bp
Number of reads : 1161884 × 2
Insert size : 3000 bp
Coverage : 100.0X

PacBio reads
Average read length : 1159.12 bp
Number of reads : 403745
Coverage : 216.6X
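
The coverage figures above follow from the read counts, read lengths, and the 2,160,842 bp chromosome length. The sketch below shows the arithmetic; the function names are illustrative only and are not part of ALLPATHS-LG.

```python
# Chromosome length of S. pneumoniae TIGR4, as stated at the top of this page.
GENOME_SIZE = 2_160_842  # bp


def paired_coverage(read_pairs, read_length, genome_size=GENOME_SIZE):
    """Coverage of a paired library: two reads per pair, fixed read length."""
    return 2 * read_pairs * read_length / genome_size


def long_read_coverage(reads, mean_length, genome_size=GENOME_SIZE):
    """Coverage of an unpaired library described by its mean read length."""
    return reads * mean_length / genome_size


fragment = paired_coverage(1_067_060, 101)
jumping = paired_coverage(1_161_884, 93)
pacbio = long_read_coverage(403_745, 1159.12)

print(f"fragment {fragment:.1f}X, jumping {jumping:.1f}X, PacBio {pacbio:.1f}X")
```

Rounded to one decimal place, these reproduce the 99.8X, 100.0X, and 216.6X values listed for the website data.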

Raw data

The raw Illumina data were obtained from the Sequence Read Archive (SRA).

Fragment library
Accession : SRR387335
Read length : 101 bp
Number of reads : 5706200 × 2
Insert size : 180 bp
Coverage : 533.4X

Jumping library
Accession : SRR364158
Coverage : 179.2X

Fractional data

We randomly subsampled the fragment and jumping libraries of the raw data so that their coverage matches the website data. The subsampling was done in prepare.sh:

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.187 \
  JUMP_FRAC=0.558 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
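
The FRAG_FRAC and JUMP_FRAC values are consistent with dividing the website coverage by the raw coverage of each library. A minimal sketch of that calculation, using only the coverage figures listed on this page (the helper name is hypothetical):

```python
def subsample_fraction(target_coverage, available_coverage):
    """Fraction of reads to keep so that available coverage drops to the target."""
    if not 0 < target_coverage <= available_coverage:
        raise ValueError("target coverage must be positive and not exceed available coverage")
    return target_coverage / available_coverage


# Website coverage divided by raw SRA coverage, per library.
frag_frac = subsample_fraction(99.8, 533.4)
jump_frac = subsample_fraction(100.0, 179.2)

print(f"FRAG_FRAC={frag_frac:.3f} JUMP_FRAC={jump_frac:.3f}")
```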

50X coverage data

We randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library with prepare.sh.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=50 \
  JUMP_COVERAGE=50 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
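
As a rough guide to what a coverage target means in reads, the sketch below estimates how many fragment read pairs correspond to a given coverage at GENOME_SIZE=2165000 and a read length of 101 bp. This is plain arithmetic under the assumption that coverage is counted as total read bases divided by genome size; how PrepareAllPathsInputs.pl selects reads internally is not described on this page, and the function name is hypothetical.

```python
def pairs_for_coverage(coverage, genome_size, read_length):
    """Approximate number of read pairs whose total bases give the requested coverage."""
    return round(coverage * genome_size / (2 * read_length))


GENOME_SIZE = 2_165_000  # value passed as GENOME_SIZE in the command above
FRAGMENT_READ_LENGTH = 101  # bp, fragment library read length

pairs_50x = pairs_for_coverage(50, GENOME_SIZE, FRAGMENT_READ_LENGTH)
pairs_100x = pairs_for_coverage(100, GENOME_SIZE, FRAGMENT_READ_LENGTH)

print(pairs_50x, pairs_100x)
```

Both estimates are well below the 5706200 pairs available in the raw fragment library, so either target can be met by subsampling.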

We also used a second setting that keeps all jumping library reads.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  JUMP_FRAC=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=50 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

100X coverage data

We randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library with prepare.sh.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=100 \
  JUMP_COVERAGE=100 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

We also used a second setting that keeps all jumping library reads.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  IN_GROUPS_CSV=in_groups.csv \
  GENOME_SIZE=2165000 \
  FRAG_COVERAGE=100 \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

Evaluation

Benchmark genome

S. pneumoniae TIGR4

Evaluated by QUAST

QUAST (QUAST v2.3)

Running QUAST requires a gene list and the reference sequence NC_003028.fna. The gene list contains 2301 genes in total.
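
A QUAST run with a reference and a gene list might be assembled as in the sketch below. It only builds and prints the command line. The contig file name, gene list file name, and output directory are placeholders, since the actual names used for this evaluation are not given here; -R, -G, and -o are QUAST's options for the reference, the gene list, and the output directory.

```python
import shlex


def quast_command(contigs, reference="NC_003028.fna",
                  genes="genes.txt", out_dir="quast_out"):
    """Build a QUAST argument list; genes and out_dir defaults are placeholder names."""
    return ["quast.py", contigs, "-R", reference, "-G", genes, "-o", out_dir]


# "final.contigs.fasta" is a hypothetical assembly file name.
cmd = quast_command("final.contigs.fasta")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the placeholder paths point at real files.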

Score with QUAST: with PacBio long reads

Basic statistics	Website Data	Raw Data	Fractional Data	50X Coverage Data	50X fragment with all jumping	100X Coverage Data	100X fragment with all jumping
# contigs	1	5	1	2	1	1	3
Largest contig	2162245	1340620	2151421	1189234	2149064	2150940	1659817
Total length	2162245	2140045	2151421	2146017	2149064	2150940	2148370
N50	2162245	1340620	2151421	1189234	2149064	2150940	1659817
Misassemblies
# misassemblies	3	2	2	0	1	2	3
Misassembled contigs length	2162245	1340620	2151421	0	2149064	2150940	1659817
Mismatches
# mismatches per 100kbp	5.7	2.15	2.05	2.05	2.1	2.05	1.74
# indels per 100kbp	1.53	1.08	0.93	0.98	0.93	0.98	1.93
# N's per 100kbp	0.05	0.14	0.09	0.14	171.89	0.09	1268.03
Genome statistics
Genome fraction (%)	99.946	99.862	99.45	99.239	99.239	99.43	98.168
Duplication ratio	1.002	1.002	1.001	1.001	1.002	1.001	1.013
# genes	2297 + 4 part	2283 + 10 part	2297 + 4 part	2294 + 4 part	2294 + 4 part	2297 + 4 part	2257 + 22 part
NGA50	1197408	469127	1189418	1188680	1188680	1189418	1234602
Running Time	1hr 16m	6hr 05m	41m	53m	44m	39m	1hr 06m
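
For readers unfamiliar with the N50 row: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below is a generic implementation of that definition, not QUAST's own code. The example uses the 50X Coverage Data column, where the second contig length is inferred as total length minus largest contig.

```python
def n50(lengths):
    """Length of the contig at which the running total first reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# 50X Coverage Data: 2 contigs, largest 1189234, total length 2146017.
largest = 1_189_234
second = 2_146_017 - largest  # inferred, not listed in the table
print(n50([largest, second]))
```

The result equals the largest contig, which agrees with the N50 reported for that column.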


Score with QUAST: without PacBio long reads

Basic statistics	Website Data	Raw Data	Fractional Data	50X Coverage Data	50X fragment with all jumping	100X Coverage Data	100X fragment with all jumping
# contigs	4	6	4	4	7	7	3
Largest contig	1663585	2135901	1671738	1675149	1084537	1812035	1659817
Total length	2161502	2144412	2160013	2163970	2166318	2157620	2148370
N50	1663585	2135901	1671738	1675149	1084537	1812035	1659817
Misassemblies
# misassemblies	2	11	6	5	13	2	3
Misassembled contigs length	1663585	2138844	1949937	1675149	2131766	1812035	1659817
Mismatches
# mismatches per 100kbp	2.16	4.43	2.07	3.2	10.34	2.21	1.74
# indels per 100kbp	1.5	2.29	1.6	1.5	2.69	1.6	1.93
# N's per 100kbp	759.8	1714.74	1505.69	1609.26	1353.91	938.21	1268.03
Genome statistics
Genome fraction (%)	98.777	97.199	98.241	98.431	98.033	98.31	98.168
Duplication ratio	1.013	1.019	1.018	1.017	1.023	1.016	1.013
# genes	2269 + 17 part	2239 + 33 part	2275 + 18 part	2260 + 27 part	2254 + 22 part	2262 + 19 part	2257 + 22 part
NGA50	468023	459198	467165	467048	286777	828148	124602
Running Time	19m	53m	19m	20m	18m	19m	22m
