E. coli

Revision as of 19 October 2013 00:21 by admin

Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a single circular chromosome 4,639,675 bp in length. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: ecoli_data_alt.tar.gz

Fragment library
  Read length: 101 bp
  Number of reads: 1186190 × 2
  Insert size: 180 bp
  Coverage: 46.02X
Jumping library 1
  Read length: 93 bp
  Number of reads: 1615702 × 2
  Insert size: 3000 bp
Jumping library 2
  Read length: 93 bp
  Number of reads: 362199 × 2
  Insert size: 3000 bp
PacBio reads
  Average read length: 1514.24 bp
  Number of reads: 409304
  Coverage: 133.58X
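
The coverage figures in the list above can be checked with the usual depth formula (total sequenced bases divided by genome size). The following is a minimal sketch, not the calculation the page authors necessarily used; the function name `coverage` is hypothetical, and the 90 bp read length in the last call is an assumption introduced only for comparison.

```python
# Genome size of E. coli K12 MG1655, as stated on this page.
GENOME_SIZE_BP = 4_639_675


def coverage(n_reads, mean_read_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Expected depth of coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# PacBio reads from the website data (read count and mean length from the list).
pacbio_cov = coverage(409_304, 1514.24)

# Fragment library: 1186190 pairs, i.e. 2 * 1186190 reads of 101 bp.
frag_cov_101 = coverage(2 * 1_186_190, 101)

# Same library with an ASSUMED effective read length of 90 bp
# (this value is not stated anywhere on the page).
frag_cov_90 = coverage(2 * 1_186_190, 90)

print(f"PacBio:            {pacbio_cov:.2f}X")
print(f"Fragment (101 bp): {frag_cov_101:.2f}X")
print(f"Fragment (90 bp):  {frag_cov_90:.2f}X")
```

With these inputs the PacBio value comes out at about 133.58X, matching the list. For the fragment library, 101 bp reads give roughly 51.6X rather than the listed 46.02X. The listed figure is reproduced with 90 bp per read, which suggests, though the page does not say so, that a shorter effective read length was used.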

Raw data

The raw reads underlying the website data are available from the Sequence Read Archive (SRA).

Fragment library
  Accession: SRX131033
  Read length: 101 bp
  Number of reads: 13457571 × 2
  Insert size: 180 bp
  Coverage: 522.1X
Jumping library 1
  Accession: SRX117481
Jumping library 2
  Accession: SRR492488
PacBio reads
  Accessions: SRX109917, SRX109901 (SRR386913, SRR387092, SRR386907, SRR387035), SRX109936

Self-fraction data

We used prepare.sh to randomly sample the fragment library of the raw data down to the same fraction as the website data.

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.088 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
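
The FRAG_FRAC value of 0.088 is consistent with the ratio of the fragment read-pair counts listed for the website data and the raw data. A minimal sketch of that arithmetic follows; the variable names are hypothetical, and this is an illustration of where the number plausibly comes from rather than a record of how it was derived.

```python
# Fragment-library read pairs, taken from the "Website data" and "Raw data" lists.
WEBSITE_FRAGMENT_PAIRS = 1_186_190
RAW_FRAGMENT_PAIRS = 13_457_571

# Fraction of the raw fragment library needed to match the website data size.
frag_frac = WEBSITE_FRAGMENT_PAIRS / RAW_FRAGMENT_PAIRS

# Format it the way it is passed on the command line.
frag_frac_arg = f"FRAG_FRAC={frag_frac:.3f}"
print(frag_frac_arg)
```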

100X fragment reads

We used prepare.sh to randomly sample 100X coverage from the fragment library of the raw data.

Fraction = 100 / 522.1 ≈ 0.192

PrepareAllPathsInputs.pl \
  DATA_DIR=$PWD/test.genome/data \
  PLOIDY=1 \
  FRAG_FRAC=0.192 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
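
The same calculation generalizes to any target coverage. The helper below is a hypothetical sketch (the name `fraction_for_target` does not come from ALLPATHS-LG); it only reproduces the division shown above and rejects targets that exceed the available raw coverage.

```python
# Coverage of the raw fragment library, as listed under "Raw data".
RAW_FRAGMENT_COVERAGE = 522.1


def fraction_for_target(target_cov, raw_cov=RAW_FRAGMENT_COVERAGE):
    """Fraction of reads to keep so that raw_cov is reduced to target_cov."""
    if not 0 < target_cov <= raw_cov:
        raise ValueError("target coverage must be in (0, raw coverage]")
    return target_cov / raw_cov


frac_100x = fraction_for_target(100)
print(f"FRAG_FRAC={frac_100x:.3f}")
```

Because the sampling is random, the resulting coverage is only approximately the target value.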

Evaluation

  • Benchmark genome
E. coli MG1655
  • Evaluated by QUAST
QUAST v2.1
To report gene statistics, QUAST needs the gene annotation and the reference sequence. There are 4388 genes in total.
  • Scores from QUAST
                              Raw Data                          Website Data                      Self-fraction Data                100X Coverage
Basic statistics
# contigs                     14 1 1 1
Largest contig                71578            177098           241348           44874            204500           739647           236829           740054
Total length                  4503182          3953489          4247061          4091078          4549335          4781613          4526809          4950301
NG50                          21441            40287            144812           7971             45133            518052           85272            523557
Misassemblies
# misassemblies               2                6                5                0                3                9                17               24
Misassembled contigs length   24651            135726           40523            0                24631            1180527          564277           1850338
Genome statistics
Genome fraction (%)           97.529           85.608           91.915           87.77            98.068           99.37            97.437           99.474
# genes                       4030 + 275 part  3685 + 128 part  4036 + 38 part   3348 + 592 part  4046 + 307 part  4347 + 23 part   4068 + 265 part  4358 + 19 part
# mismatches per 100 kbp      4.83             10.51            12.48            3.22             5.98             5.82             13.11            6.42
# indels per 100 kbp          3.43             7.97             5.84             3.34             3.9              3.63             8.34             3.69
# N's per 100 kbp             0                20.77            4.83             0                139.23           0                739.09           3.68
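
The "# genes" entries report complete genes plus partially covered genes. To compare assemblies, it can help to express them as a percentage of the 4388 reference genes. The sketch below does that for one cell; the function name `gene_recovery` is hypothetical, and the cell format is assumed to be exactly "<complete> + <partial> part" as it appears in the table.

```python
import re

# Total number of reference genes, as stated in the Evaluation section.
TOTAL_GENES = 4388


def gene_recovery(cell, total=TOTAL_GENES):
    """Parse a '# genes' cell and return (complete %, partial %) of all genes."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*part\s*", cell)
    if match is None:
        raise ValueError(f"unexpected '# genes' cell: {cell!r}")
    complete, partial = int(match.group(1)), int(match.group(2))
    return 100 * complete / total, 100 * partial / total


# First and last '# genes' values from the table.
first_complete, first_partial = gene_recovery("4030 + 275 part")
last_complete, last_partial = gene_recovery("4358 + 19 part")

print(f"{first_complete:.2f}% complete, {first_partial:.2f}% partial")
print(f"{last_complete:.2f}% complete, {last_partial:.2f}% partial")
```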