Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment, jumping, and long (PacBio) reads. The S. pneumoniae TIGR4 genome is a single circular chromosome of 2,160,842 bp. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.
The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz
Fragment library
Read length: 101 bp
Number of reads: 1067060 × 2
Insert size: 180 bp
Coverage: 99.8X
Jumping library
Read length: 93 bp
Number of reads: 1161884 × 2
Insert size: 3000 bp
Coverage: 100.0X
PacBio reads
Average read length: 1159.12 bp
Number of reads: 403745
Coverage: 216.6X
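The coverage figures for the website data can be checked from the read counts, read lengths, and genome size given on this page. The following is a minimal sketch of that arithmetic. The function and variable names are illustrative and not part of any tool used here.

```python
# Sanity check of the listed coverage values:
# coverage = total sequenced bases / genome size.
GENOME_SIZE = 2_160_842  # bp, chromosome length stated on this page


def coverage(n_reads, mean_read_len, genome_size=GENOME_SIZE):
    """Return sequencing depth as total bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


# Paired libraries contribute two reads per pair.
fragment = coverage(1_067_060 * 2, 101)
jumping = coverage(1_161_884 * 2, 93)
pacbio = coverage(403_745, 1159.12)

for name, depth in [("fragment", fragment), ("jumping", jumping), ("PacBio", pacbio)]:
    print(f"{name}: {depth:.1f}X")
```

Rounded to one decimal place, the three values agree with the coverages listed for the website data.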
The raw Illumina data were obtained from the Sequence Read Archive (SRA).
Fragment library
Accession: SRR387335
Read length: 101 bp
Number of reads: 5706200 × 2
Insert size: 180 bp
Coverage: 533.4X
Jumping library
Accession: SRR364158
Coverage: 179.2X
Using prepare.sh, we randomly subsampled the raw fragment and jumping libraries to the same fractions of reads as in the website data (the "Fractional Data" setting):
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 FRAG_FRAC=0.187 JUMP_FRAC=0.558 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
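The FRAG_FRAC and JUMP_FRAC values in this command are consistent with the ratios of website data to raw data listed earlier. The sketch below shows one plausible derivation, which is not necessarily how the values were originally obtained. The raw jumping-library read count is not given on this page, so the jumping fraction is taken from the coverage ratio, which assumes the same read length in both sets.

```python
# Fragment library: ratio of read pairs (website / raw).
website_frag_pairs = 1_067_060
raw_frag_pairs = 5_706_200
frag_frac = website_frag_pairs / raw_frag_pairs

# Jumping library: ratio of coverages (website / raw).
# Assumption: read length is the same in both sets, so the
# coverage ratio equals the read-count ratio.
website_jump_cov = 100.0
raw_jump_cov = 179.2
jump_frac = website_jump_cov / raw_jump_cov

print(f"FRAG_FRAC ~ {frag_frac:.3f}")
print(f"JUMP_FRAC ~ {jump_frac:.3f}")
```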
Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library:
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 GENOME_SIZE=2165000 FRAG_COVERAGE=50 JUMP_COVERAGE=50 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
We also used a second setting with 50X fragment coverage and all jumping-library reads:
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 JUMP_FRAC=1 GENOME_SIZE=2165000 FRAG_COVERAGE=50 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library:
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 GENOME_SIZE=2165000 FRAG_COVERAGE=100 JUMP_COVERAGE=100 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
We also used a second setting with 100X fragment coverage and all jumping-library reads:
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 IN_GROUPS_CSV=in_groups.csv GENOME_SIZE=2165000 FRAG_COVERAGE=100 IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
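As a rough guide to what the 50X and 100X targets mean for the fragment library, the sketch below estimates how many 101 bp read pairs are needed for a given coverage with GENOME_SIZE=2165000, and what share of the raw fragment library that is. This is back-of-the-envelope arithmetic only. It does not describe how PrepareAllPathsInputs.pl selects reads internally, and the helper name is hypothetical.

```python
GENOME_SIZE = 2_165_000      # value passed as GENOME_SIZE in the commands
READ_LEN = 101               # bp, fragment library read length
RAW_FRAG_PAIRS = 5_706_200   # raw fragment library read pairs


def pairs_for_coverage(target_cov, genome_size=GENOME_SIZE, read_len=READ_LEN):
    """Estimate read pairs needed: target bases / bases per pair."""
    return target_cov * genome_size / (2 * read_len)


for cov in (50, 100):
    pairs = pairs_for_coverage(cov)
    share = pairs / RAW_FRAG_PAIRS
    print(f"{cov}X: about {pairs:,.0f} pairs ({share:.1%} of the raw fragment library)")
```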
Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
---|---|---|---|---|---|---|---|
# contigs | 1 | 5 | 1 | 2 | 1 | 1 | 3 |
Largest contig | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
Total length | 2162245 | 2140045 | 2151421 | 2146017 | 2149064 | 2150940 | 2148370 |
N50 | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
Misassemblies | |||||||
# misassemblies | 4 | 1 | 3 | 0 | 1 | 3 | 7 |
Misassembled contigs length | 2162245 | 1340620 | 2151421 | 0 | 2149064 | 2150940 | 1659817 |
Mismatches | |||||||
# mismatches per 100kbp | 5.7 | 2.15 | 2.05 | 2.05 | 2.1 | 2.05 | 2.22 |
# indels per 100kbp | 3.52 | 1.08 | 0.93 | 3.08 | 0.93 | 0.98 | 13.48 |
# N's per 100kbp | 0.05 | 0.14 | 0.09 | 0.14 | 171.89 | 0.09 | 1268.03 |
Genome statistics | |||||||
Genome fraction (%) | 99.946 | 99.967 | 99.45 | 99.239 | 99.239 | 99.43 | 98.23 |
Duplication ratio | 1.011 | 1.005 | 1.016 | 1.015 | 1.017 | 1.016 | 1.032 |
# genes | 2297 + 4 part | 2299 + 2 part | 2297 + 4 part | 2294 + 4 part | 2294 + 4 part | 2297 + 4 part | 2260 + 19 part |
NGA50 | 1198037 | 1338442 | 1189098 | 1188680 | 1188680 | 1189098 | 483405 |
Running Time | 1hr16m |
Misassemblies for Adobe reader.
Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
---|---|---|---|---|---|---|---|
# contigs | 4 | 6 | 4 | 4 | 7 | 7 | 3 |
Largest contig | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
Total length | 2161502 | 2144412 | 2160013 | 2163970 | 2166318 | 2157620 | 2148370 |
N50 | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
Misassemblies | |||||||
# misassemblies | 5 | 19 | 11 | 10 | 17 | 9 | 7 |
Misassembled contigs length | 1663585 | 2138844 | 1949937 | 1675149 | 2131766 | 2130729 | 1659817 |
Mismatches | |||||||
# mismatches per 100kbp | 2.62 | 4.9 | 2.59 | 3.24 | 10.34 | 2.92 | 2.22 |
# indels per 100kbp | 1.5 | 17.33 | 11.78 | 7.38 | 4.39 | 1.69 | 13.48 |
# N's per 100kbp | 759.8 | 1714.74 | 1505.69 | 1609.26 | 1353.91 | 938.21 | 1268.03 |
Genome statistics | |||||||
Genome fraction (%) | 98.798 | 97.204 | 98.624 | 98.477 | 98.059 | 98.324 | 98.23 |
Duplication ratio | 1.022 | 1.019 | 1.047 | 1.027 | 1.061 | 1.044 | 1.032 |
# genes | 2271 + 15 part | 2239 + 33 part | 2275 + 15 part | 2263 + 25 part | 2255 + 21 part | 2262 + 19 part | 2260 + 19 part |
NGA50 | 403409 | 224016 | 397648 | 376923 | 313633 | 463639 | 483405 |
Running Time | 19m |
Misassemblies for Adobe reader.
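In every column of the tables above, the N50 value equals the largest contig. This follows from the definition of N50: when the largest contig alone covers at least half of the total assembly length, N50 is that contig's length. The sketch below illustrates this with the Raw Data column of the first table (5 contigs, largest 1340620, total 2140045). The four smaller contig lengths are hypothetical, chosen only so that the total matches, because the individual lengths are not reported here.

```python
def n50(lengths):
    """Length of the contig at which the running sum of contig lengths,
    taken from longest to shortest, first reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Largest contig and total length are from the table; the other
# four lengths are HYPOTHETICAL placeholders that sum to the remainder.
contigs = [1_340_620, 500_000, 200_000, 90_000, 9_425]

print("total length:", sum(contigs))
print("N50:", n50(contigs))
```

NGA50 is not subject to the same shortcut, which is why it can be much smaller than N50 in the same column.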