Dataset 3: Streptococcus pneumoniae TIGR4. This dataset includes three libraries: fragment, jumping, and long (PacBio) reads. The S. pneumoniae TIGR4 genome consists of a single circular chromosome 2,160,842 bp in length. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.
The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: strep_data.tar.gz
Fragment library
Read length: 101 bp
Number of reads: 1067060 × 2
Insert size: 180 bp
Coverage: 99.8X
Jumping library
Read length: 93 bp
Number of reads: 1161884 × 2
Insert size: 3000 bp
Coverage: 100.0X
PacBio reads
Mean read length: 1159.12 bp
Number of reads: 403745
Coverage: 216.6X
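The coverage figures listed for the three libraries follow from (number of reads × reads per pair × read length) ÷ genome length. Below is a minimal shell sketch of that arithmetic, assuming the 2,160,842 bp chromosome length quoted for TIGR4. The `coverage` helper is hypothetical and is not part of ALLPATHS-LG.

```shell
# coverage = read_count * reads_per_fragment * read_length / genome_length
# (helper name and argument order are illustrative assumptions)
coverage() {
    LC_ALL=C awk -v n="$1" -v mates="$2" -v len="$3" -v g="$4" \
        'BEGIN { printf "%.1f\n", n * mates * len / g }'
}

GENOME=2160842                        # TIGR4 chromosome length in bp
coverage 1067060 2 101 "$GENOME"      # fragment library (paired, 101 bp)
coverage 1161884 2 93 "$GENOME"       # jumping library (paired, 93 bp)
coverage 403745 1 1159.12 "$GENOME"   # PacBio reads (unpaired, mean length)
```

With these inputs the three calls give 99.8, 100.0 and 216.6, matching the listed coverages.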
The raw Illumina data were obtained from the Sequence Read Archive (SRA).
Fragment library
Accession: SRR387335
Read length: 101 bp
Number of reads: 5706200 × 2
Insert size: 180 bp
Coverage: 533.4X
Jumping library
Accession: SRR364158
Coverage: 179.2X
Using prepare.sh, we randomly subsampled the raw fragment and jumping libraries down to the same amount of data as in the website dataset.
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 FRAG_FRAC=0.187 JUMP_FRAC=0.558 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
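The two fractions in this command can be reproduced from the library statistics. The sketch below assumes FRAG_FRAC is the ratio of website to raw fragment read pairs. It also assumes JUMP_FRAC is the ratio of website to raw jumping coverage, because the raw jumping read count is not listed. The `fraction` helper is hypothetical.

```shell
# fraction = amount of website data / amount of raw data
# (helper name is an illustrative assumption, not an ALLPATHS-LG tool)
fraction() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

fraction 1067060 5706200   # fragment: website read pairs / raw read pairs
fraction 100.0 179.2       # jumping: website coverage / raw coverage
```

These calls give 0.187 and 0.558. The coverage ratio used for the jumping library equals the read-count ratio only if read length is the same in both sets.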
Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 GENOME_SIZE=2165000 FRAG_COVERAGE=50 JUMP_COVERAGE=50 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
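As a rough cross-check, a coverage target can be converted into the fraction of raw fragment read pairs it implies. The sketch below uses GENOME_SIZE=2165000 from the command and the raw fragment library figures (5706200 pairs, 101 bp reads). It shows only the arithmetic implied by the target; how PrepareAllPathsInputs.pl selects reads internally is not described here, and the `target_fraction` helper is hypothetical.

```shell
# implied fraction = target_coverage * genome_size / (read_pairs * 2 * read_length)
# (helper name and argument order are illustrative assumptions)
target_fraction() {
    LC_ALL=C awk -v c="$1" -v g="$2" -v pairs="$3" -v len="$4" \
        'BEGIN { printf "%.4f\n", c * g / (pairs * 2 * len) }'
}

target_fraction 50 2165000 5706200 101   # 50X target on the raw fragment library
```

For the 50X target this gives 0.0939, or roughly 9% of the raw fragment pairs. Passing 100 as the first argument gives the corresponding figure for a 100X target.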
We also tested a setting that keeps 50X fragment coverage but uses all jumping library reads.
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 JUMP_FRAC=1 GENOME_SIZE=2165000 FRAG_COVERAGE=50 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 GENOME_SIZE=2165000 FRAG_COVERAGE=100 JUMP_COVERAGE=100 IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
We also tested a setting that keeps 100X fragment coverage but uses all jumping library reads.
PrepareAllPathsInputs.pl DATA_DIR=$PWD/test.genome/data PLOIDY=1 IN_GROUPS_CSV=in_groups.csv GENOME_SIZE=2165000 FRAG_COVERAGE=100 IN_LIBS_CSV=in_libs.csv OVERWRITE=True | tee prepare.out
Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
---|---|---|---|---|---|---|---|
# contigs | 1 | 5 | 1 | 2 | 1 | 1 | 3 |
Largest contig (bp) | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
Total length (bp) | 2162245 | 2140045 | 2151421 | 2146017 | 2149064 | 2150940 | 2148370 |
N50 (bp) | 2162245 | 1340620 | 2151421 | 1189234 | 2149064 | 2150940 | 1659817 |
Misassemblies | | | | | | | |
# misassemblies | 3 | 2 | 2 | 0 | 1 | 2 | 3 |
Misassembled contigs length (bp) | 2162245 | 1340620 | 2151421 | 0 | 2149064 | 2150940 | 1659817 |
Mismatches | | | | | | | |
# mismatches per 100kbp | 5.7 | 2.15 | 2.05 | 2.05 | 2.1 | 2.05 | 1.74 |
# indels per 100kbp | 1.53 | 1.08 | 0.93 | 0.98 | 0.93 | 0.98 | 1.93 |
# N's per 100kbp | 0.05 | 0.14 | 0.09 | 0.14 | 171.89 | 0.09 | 1268.03 |
Genome statistics | | | | | | | |
Genome fraction (%) | 99.946 | 99.862 | 99.45 | 99.239 | 99.239 | 99.43 | 98.168 |
Duplication ratio | 1.002 | 1.002 | 1.001 | 1.001 | 1.002 | 1.001 | 1.013 |
# genes | 2297 + 4 part | 2283 + 10 part | 2297 + 4 part | 2294 + 4 part | 2294 + 4 part | 2297 + 4 part | 2257 + 22 part |
NGA50 (bp) | 1197408 | 469127 | 1189418 | 1188680 | 1188680 | 1189418 | 1234602 |
Running time | 1hr 16m | 6hr 05m | 41m | 53m | 44m | 39m | 1hr 06m |
Misassemblies for Adobe reader.
Basic statistics | Website Data | Raw Data | Fractional Data | 50X Coverage Data | 50X fragment with all jumping | 100X Coverage Data | 100X fragment with all jumping |
---|---|---|---|---|---|---|---|
# contigs | 4 | 6 | 4 | 4 | 7 | 7 | 3 |
Largest contig (bp) | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
Total length (bp) | 2161502 | 2144412 | 2160013 | 2163970 | 2166318 | 2157620 | 2148370 |
N50 (bp) | 1663585 | 2135901 | 1671738 | 1675149 | 1084537 | 1812035 | 1659817 |
Misassemblies | | | | | | | |
# misassemblies | 2 | 11 | 6 | 5 | 13 | 2 | 3 |
Misassembled contigs length (bp) | 1663585 | 2138844 | 1949937 | 1675149 | 2131766 | 1812035 | 1659817 |
Mismatches | | | | | | | |
# mismatches per 100kbp | 2.16 | 4.43 | 2.07 | 3.2 | 10.34 | 2.21 | 1.74 |
# indels per 100kbp | 1.5 | 2.29 | 1.6 | 1.5 | 2.69 | 1.6 | 1.93 |
# N's per 100kbp | 759.8 | 1714.74 | 1505.69 | 1609.26 | 1353.91 | 938.21 | 1268.03 |
Genome statistics | | | | | | | |
Genome fraction (%) | 98.777 | 97.199 | 98.241 | 98.431 | 98.033 | 98.31 | 98.168 |
Duplication ratio | 1.013 | 1.019 | 1.018 | 1.017 | 1.023 | 1.016 | 1.013 |
# genes | 2269 + 17 part | 2239 + 33 part | 2275 + 18 part | 2260 + 27 part | 2254 + 22 part | 2262 + 19 part | 2257 + 22 part |
NGA50 (bp) | 468023 | 459198 | 467165 | 467048 | 286777 | 828148 | 124602 |
Running time | 19m | 53m | 19m | 20m | 18m | 19m | 22m |
Misassemblies for Adobe reader.