E. coli

Dataset 1: Escherichia coli K12 MG1655. This dataset includes three libraries: fragment, jumping, and long reads. The E. coli MG1655 genome consists of a single circular chromosome 4,639,675 bp in length. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.


Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: ecoli_data_alt.tar.gz

Fragment library
Read length : 101 bp
Number of reads : 1186191 x 2
Insert size : 180 bp
Coverage : 51.6X

Jumping library 1
Read length : 93 bp
Number of reads : 1615703 x 2
Insert size : 3000 bp

Jumping library 2
Read length : 93 bp
Number of reads : 362200 x 2
Insert size : 3000 bp

PacBio reads
Average read length : 1514.24 bp
Number of reads : 409304
Coverage : 133.58X
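
The coverage figures listed above follow from (number of reads x read length) / genome size. A minimal sketch of that arithmetic, assuming awk is available; the function name coverage is only illustrative and is not part of ALLPATHS-LG:

```shell
# coverage READS READ_LEN GENOME_SIZE
# Prints sequencing depth (reads * read length / genome size) to two decimals.
coverage() {
    LC_ALL=C awk -v n="$1" -v len="$2" -v g="$3" \
        'BEGIN { printf "%.2f\n", n * len / g }'
}

GENOME_SIZE=4639675   # E. coli K12 MG1655 chromosome length in bp

# Fragment library: 1186191 pairs, i.e. 2 x 1186191 reads of 101 bp
coverage $((1186191 * 2)) 101 "$GENOME_SIZE"    # prints 51.64

# PacBio reads: 409304 reads with a mean length of 1514.24 bp
coverage 409304 1514.24 "$GENOME_SIZE"          # prints 133.58
```

The same function applied to the raw fragment library (13457571 x 2 reads of 101 bp) gives the 585.9X quoted in the raw data section.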

Raw data

The raw Illumina data were obtained from Sequence Read Archive (SRA).

Fragment library
Accession : SRR447685
Read length : 101 bp
Number of reads : 13457571 x 2
Insert size : 180 bp
Coverage : 585.9X

Jumping library 1
Accession : SRR401827

Jumping library 2
Accession : SRR492488


Fractional data

Using prepare.sh, we randomly selected a subset of the raw fragment library that matches the size of the website fragment data (FRAG_FRAC=0.088, i.e. about 1186191 of 13457571 read pairs).

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.088 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
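
The FRAG_FRAC value used here can be reproduced from the read-pair counts of the website and raw fragment libraries. A small sketch, assuming awk is available; frag_frac is a hypothetical helper name, not an ALLPATHS-LG command:

```shell
# frag_frac WANTED_PAIRS AVAILABLE_PAIRS
# Prints the fraction of the raw library needed to obtain WANTED_PAIRS pairs.
frag_frac() {
    LC_ALL=C awk -v want="$1" -v have="$2" \
        'BEGIN { printf "%.3f\n", want / have }'
}

# Website fragment library: 1186191 pairs; raw SRR447685: 13457571 pairs
frag_frac 1186191 13457571    # prints 0.088
```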


library_name,  project_name,  organism_name,      type,  paired,  frag_size,  frag_stddev,  insert_size,  insert_stddev,  read_orientation,  genomic_start,  genomic_end
Solexa-36226,    hybrid_egg,            egg,  fragment,       1,        180,           15,             ,               ,            inward,              0,            0
Solexa-62929,    hybrid_egg,            egg,   jumping,       1,           ,             ,         3000,            600,           outward,              0,            0
      pacbio,    hybrid_egg,            egg,      long,       0,           ,             ,             ,               ,                  ,              0,            0
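
Because the library table is edited by hand, a quick sanity check before running PrepareAllPathsInputs.pl can catch malformed fields. The sketch below assumes that read_orientation is the tenth column (as in the header shown here) and that the only accepted values are inward, outward, or empty; check_orientation is a hypothetical helper, not part of ALLPATHS-LG:

```shell
# check_orientation [FILE]
# Reads an in_libs.csv-style table (from FILE or stdin) and reports any row
# whose read_orientation is not empty, "inward", or "outward".
# Returns a non-zero exit status if a bad row is found.
check_orientation() {
    awk -F',' '
        NR == 1 { next }    # skip the header row
        {
            # strip the padding spaces around every field
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($10 != "" && $10 != "inward" && $10 != "outward") {
                printf "%s: unexpected read_orientation \"%s\"\n", $1, $10
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$@"
}

# Example with two rows in the same layout as the table above
check_orientation <<'EOF'
library_name, project_name, organism_name, type, paired, frag_size, frag_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end
Solexa-36226, hybrid_egg, egg, fragment, 1, 180, 15, , , inward, 0, 0
Solexa-62929, hybrid_egg, egg, jumping, 1, , , 3000, 600, outward, 0, 0
EOF
```

In practice the function would be called as `check_orientation in_libs.csv` from the directory that holds the file.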

50X coverage data

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library.


PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
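
To get a feel for how many read pairs a coverage target corresponds to, the relation pairs = coverage x genome size / (2 x read length) can be evaluated directly. This is only a back-of-the-envelope sketch using the read lengths listed for the libraries (101 bp fragment, 93 bp jumping); it does not reproduce how PrepareAllPathsInputs.pl selects reads internally, and pairs_for_coverage is a hypothetical helper name:

```shell
# pairs_for_coverage COVERAGE GENOME_SIZE READ_LEN
# Prints the number of read pairs (rounded to the nearest integer) needed
# to reach COVERAGE on a genome of GENOME_SIZE bp with reads of READ_LEN bp.
pairs_for_coverage() {
    LC_ALL=C awk -v c="$1" -v g="$2" -v len="$3" \
        'BEGIN { printf "%d\n", c * g / (2 * len) + 0.5 }'
}

GENOME_SIZE=4640000   # same value as passed to PrepareAllPathsInputs.pl

pairs_for_coverage 50 "$GENOME_SIZE" 101   # fragment library, prints 1148515
pairs_for_coverage 50 "$GENOME_SIZE" 93    # jumping library,  prints 1247312
```

Passing 100 instead of 50 gives the corresponding estimate for a 100X target.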

100X coverage data

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library.


PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

Evaluation

  • Benchmark genome: E. coli MG1655
  • Evaluation tool: QUAST v2.3
Running QUAST requires the gene list Ec_gene_list and the reference sequence NC_000913.fna. The gene list contains 4497 genes in total.
Metric                        Website data      Raw data          Fractional data   50X coverage      100X coverage
Basic statistics
# contigs                     1                 14                1                 1                 1
Largest contig                4638970           4625005           4638970           4638970           4638970
Total length                  4638970           4652215           4638970           4638970           4638970
N50                           4638970           4625005           4638970           4638970           4638970
Misassemblies
# misassemblies               0                 5                 0                 0                 0
Misassembled contigs length   0                 4625005           0                 0                 0
Mismatches
# mismatches per 100kbp       0.11              1.06              0.06              0.09              0.06
# indels per 100kbp           0.09              0.15              0.09              0.09              0.09
# N's per 100kbp              0                 282.94            0                 0.04              0
Genome statistics
Genome fraction (%)           99.983            99.418            99.983            99.983            99.983
Duplication ratio             1                 1.009             1                 1                 1
# genes                       4494 + 1 part     4471 + 2 part     4494 + 1 part     4494 + 1 part     4494 + 1 part
NGA50                         4638970           2714032           4638970           4638970           4638970
Running time                  42m               2hr 35m           35m               35m               41m
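
For readers unfamiliar with the N50 metric reported above: it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that definition, assuming one contig length per input line; the input values are made up for illustration and n50 is not a QUAST command:

```shell
# n50: reads contig lengths (one per line) on stdin and prints the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            # walk from longest to shortest until half the total is covered
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Hypothetical contig lengths: total is 100, and 40 + 30 covers half of it
printf '%s\n' 10 40 20 30 | n50    # prints 30
```

With a single contig the N50 equals the contig length, which is why the one-contig assemblies above report the same value for largest contig, total length, and N50.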

Misassembly report (for Adobe Reader).


  • Score with QUAST: without PacBio long reads
Metric                        Website data      Raw data          Fractional data   50X coverage      100X coverage
Basic statistics
# contigs                     2                 1                 5                 3                 2
Largest contig                4631220           4633080           4575759           4629108           4638312
Total length                  4633146           4633080           4698903           4633082           4640072
N50                           4631220           4633080           4575759           4629108           4638312
Misassemblies
# misassemblies               3                 7                 8                 8                 5
Misassembled contigs length   4631220           4633080           4577746           4631095           4638312
Mismatches
# mismatches per 100kbp       1.19              1.42              2.84              2.52              2.26
# indels per 100kbp           0.41              0.53              0.72              0.48              0.52
# N's per 100kbp              533.22            1545.02           698.87            703.96            760.38
Genome statistics
Genome fraction (%)           99.345            98.343            99.265            99.136            99.272
Duplication ratio             1.005             1.016             1.021             1.008             1.008
# genes                       4465 + 11 part    4395 + 31 part    4460 + 14 part    4451 + 13 part    4455 + 14 part
NGA50                         3180483           687701            654008            2675325           694154
Running time                  28m               2hr 01m           31m               28m               32m

Misassembly report (for Adobe Reader).