E. coli

Revision as of 21 January 2014 21:01 by admin

Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a single circular chromosome of 4,639,675 bp. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Published data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: ecoli_data_alt.tar.gz

Fragment library
  Read length: 101 bp
  Number of reads: 1186190 x 2
  Insert size: 180 bp
  Coverage: 46.02X
Jumping library 1
  Read length: 93 bp
  Number of reads: 1615702 x 2
  Insert size: 3000 bp
Jumping library 2
  Read length: 93 bp
  Number of reads: 362199 x 2
  Insert size: 3000 bp
PacBio reads
  Average read length: 1514.24 bp
  Number of reads: 409304
  Coverage: 133.58X
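
The coverage figures in this list can be checked from the read counts and read lengths. The following is a minimal sketch, assuming coverage is simply total sequenced bases divided by the chromosome length of 4,639,675 bp. The `coverage` helper is a hypothetical name used for illustration and is not part of ALLPATHS-LG.

```shell
#!/bin/sh
# Hypothetical helper: depth of coverage = reads * mean read length / genome size.
# Arguments: number of reads, mean read length (bp), genome size (bp).
coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

GENOME_SIZE=4639675   # MG1655 chromosome length stated on this page

# PacBio reads: 409304 reads with a mean length of 1514.24 bp
coverage 409304 1514.24 "$GENOME_SIZE"

# Fragment library: 1186190 pairs = 2372380 reads of 101 bp
coverage 2372380 101 "$GENOME_SIZE"
```

The PacBio call reproduces the listed 133.58X. For the fragment library the same formula gives about 51.64X with 101 bp reads. The listed 46.02X matches 90 bp per read under this formula, and the page does not state which effective read length was used.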

Raw data

The raw data from which the website data set was drawn are available from the Sequence Read Archive (SRA).

Fragment library
  Accession: SRX131033
  Read length: 101 bp
  Number of reads: 13457571 x 2
  Insert size: 180 bp
  Coverage: 522.1X
Jumping library 1
  Accession: SRX117481
Jumping library 2
  Accession: SRR492488
PacBio reads
  Accessions: SRX109917, SRX109901 (SRR386913, SRR387092, SRR386907, SRR387035), SRX109936

Fractional data

Using prepare.sh, we randomly selected a fraction of the raw fragment library so that its size matches the fragment library of the website data.

PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    FRAG_FRAC=0.088 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
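
The value FRAG_FRAC=0.088 is consistent with the ratio of the website fragment read pairs (1186190) to the raw fragment read pairs (13457571). Below is a small sketch of that arithmetic, assuming this is how the fraction was chosen; the page does not say so explicitly, and `frag_frac` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: fraction of a raw library needed to reach a target pair count.
# Arguments: target number of pairs, number of pairs in the raw library.
frag_frac() {
    awk -v target="$1" -v raw="$2" 'BEGIN { printf "%.3f\n", target / raw }'
}

# Website fragment pairs vs. raw fragment pairs (SRX131033)
frag_frac 1186190 13457571
```

Rounded to three decimals, the ratio is 0.088, which is the value passed to PrepareAllPathsInputs.pl.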

50X fragment reads

Using prepare.sh, we randomly selected 50X coverage from the fragment library and 50X coverage from the jumping library data.


PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

100X fragment reads

Using prepare.sh, we randomly selected 100X coverage from the fragment library and 100X coverage from the jumping library data.


PrepareAllPathsInputs.pl \
    DATA_DIR=$PWD/test.genome/data \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
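
To get a feel for the 50X and 100X targets, the number of read pairs each one implies can be estimated. This is a rough sketch only. It assumes coverage means total bases divided by GENOME_SIZE and that reads count at their full length (101 bp for the fragment library, 93 bp for the jumping libraries). How PrepareAllPathsInputs.pl actually counts bases is not described on this page, and `pairs_for_coverage` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read pairs needed = coverage * genome size / (2 * read length),
# truncated to a whole number of pairs.
# Arguments: target coverage, genome size (bp), read length (bp).
pairs_for_coverage() {
    awk -v cov="$1" -v g="$2" -v len="$3" 'BEGIN { printf "%d\n", cov * g / (2 * len) }'
}

GENOME_SIZE=4640000   # same value as passed to PrepareAllPathsInputs.pl

for cov in 50 100; do
    echo "${cov}X fragment pairs: $(pairs_for_coverage "$cov" "$GENOME_SIZE" 101)"
    echo "${cov}X jumping pairs:  $(pairs_for_coverage "$cov" "$GENOME_SIZE" 93)"
done
```

Under these assumptions, 50X corresponds to roughly 1.15 million fragment pairs and 1.25 million jumping pairs, and 100X to about twice that.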

Evaluation

  • Benchmark genome
E. coli MG1655
  • Evaluated by QUAST
QUAST v2.2
For this evaluation, QUAST needs the reference sequence and its gene annotation. The reference has 4497 genes in total.
  • Score with QUAST: with PacBio long reads
Basic statistics              Published Data   Raw Data         Fractional Data  50X coverage     100X coverage
# contigs                     1                14               1                1                1
Largest contig                4638970          4625005          4638970          4638970          4638970
Total length                  4638970          4652215          4638970          4638970          4638970
N50                           4638970          4625005          4638970          4638970          4638970
Misassemblies
# misassemblies               1                5                1                1                1
Misassembled contigs length   4638970          4625005          4638970          4638970          4638970
Mismatches
# mismatches per 100kbp       0.11             1.06             0.06             0.09             0.06
# indels per 100kbp           0.09             0.61             0.09             0.09             0.09
# N's per 100kbp              0                282.94           0                0.04             0
Genome statistics
Genome fraction (%)           99.983           99.418           99.983           99.983           99.983
Duplication ratio             1                1.013            1                1                1
# genes                       4494 + 1 part    4471 + 2 part    4494 + 1 part    4494 + 1 part    4494 + 1 part
NGA50                         4638970          2714032          3763133          4209920          3762305
  • Score with QUAST: without PacBio long reads
Basic statistics              Published Data   Raw Data         Fractional Data  50X coverage     100X coverage
# contigs                     2                1                5                3                2
Largest contig                4631220          4633080          4575759          4629108          4638312
Total length                  4633146          4633080          4698903          4633082          4640072
N50                           4631220          4633080          4575759          4629108          4638312
Misassemblies
# misassemblies               3                7                8                8                5
Misassembled contigs length   4631220          4633080          4577746          4631095          4638312
Mismatches
# mismatches per 100kbp       1.19             1.42             2.84             2.52             2.26
# indels per 100kbp           1.13             0.83             3.26             1.24             1.85
# N's per 100kbp              533.22           1545.02          698.87           703.96           760.38
Genome statistics
Genome fraction (%)           99.345           98.343           99.265           99.136           99.272
Duplication ratio             1.012            1.016            1.021            1.008            1.008
# genes                       4465 + 11 part   4395 + 31 part   4460 + 14 part   4451 + 13 part   4455 + 14 part
NGA50                         3180483          687701           654008           2675325          694154