E. coli

Revision as of 19 October 2013 00:15 by admin

Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a circular chromosome 4,639,675 bp in length. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, refer to "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Website data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: ecoli_data_alt.tar.gz

Fragment library
  Read length : 101 bp
  Number of reads : 1186190 x 2
  Insert size : 180 bp
  Coverage : 46.02X
Jumping library 1
  Read length : 93 bp
  Number of reads : 1615702 x 2
  Insert size : 3000 bp
Jumping library 2
  Read length : 93 bp
  Number of reads : 362199 x 2
  Insert size : 3000 bp
PacBio reads
  Average read length : 1514.24 bp
  Number of reads : 409304
  Coverage : 133.58X
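
The coverage figures above can be checked as total sequenced bases divided by genome size. The following is a minimal sketch, not the procedure used to produce the page. The PacBio value is reproduced directly from the listed read count and average length. The fragment-library value of 46.02X is reproduced only when 90 bp per read is assumed instead of the nominal 101 bp; that 90 bp effective length is an inference from the arithmetic and is not stated on this page.

```python
# Genome size, read counts and read lengths are taken from this page.
GENOME_SIZE_BP = 4_639_675


def coverage(n_reads: int, mean_read_len: float,
             genome_size: int = GENOME_SIZE_BP) -> float:
    """Return sequencing depth: total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


# PacBio: 409304 reads with an average length of 1514.24 bp.
pacbio_cov = coverage(409_304, 1514.24)

# Fragment library: 1186190 read pairs, nominal read length 101 bp.
frag_cov_nominal = coverage(2 * 1_186_190, 101)

# ASSUMPTION: an effective read length of 90 bp (inferred, not documented).
frag_cov_90bp = coverage(2 * 1_186_190, 90)

print(f"PacBio coverage:            {pacbio_cov:.2f}X")
print(f"Fragment coverage (101 bp): {frag_cov_nominal:.2f}X")
print(f"Fragment coverage (90 bp):  {frag_cov_90bp:.2f}X")
```

The same function applied to the raw fragment library (13457571 read pairs) at 90 bp gives the 522.1X listed under Raw data, which suggests both coverage values were computed with the same effective length.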

Raw data

The raw data underlying the website data were obtained from the Sequence Read Archive (SRA).

Fragment library
  Accession : SRX131033
  Read length : 101 bp
  Number of reads : 13457571 x 2
  Insert size : 180 bp
  Coverage : 522.1X
Jumping library 1
  Accession : SRX117481
Jumping library 2
  Accession : SRR492488
PacBio reads
  Accession : SRX109917, SRX109901 (SRR386913, SRR387092, SRR386907, SRR387035), SRX109936

Self-fraction data

We randomly selected the same fraction of the raw fragment library as is present in the website data, using prepare.sh:

# prepare.sh: FRAG_FRAC keeps a random 8.8% of the fragment reads
PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  FRAG_FRAC=0.088 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out

100X fragment reads

We randomly selected 100X coverage from the raw fragment library, using prepare.sh.

Fraction = 100 / 522.1 = 0.192

# prepare.sh: FRAG_FRAC keeps a random 19.2% of the fragment reads
PrepareAllPathsInputs.pl \
  DATA_DIR="$PWD/test.genome/data" \
  PLOIDY=1 \
  FRAG_FRAC=0.192 \
  IN_GROUPS_CSV=in_groups.csv \
  IN_LIBS_CSV=in_libs.csv \
  OVERWRITE=True \
  | tee prepare.out
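
Both FRAG_FRAC values on this page follow from dividing a target coverage by the 522.1X coverage of the raw fragment library. A minimal sketch of that calculation is given below; the helper name frag_frac is hypothetical and is not part of ALLPATHS-LG.

```python
# Coverage of the raw fragment library (SRX131033), as listed on this page.
RAW_FRAG_COVERAGE = 522.1
RAW_FRAG_PAIRS = 13_457_571


def frag_frac(target_coverage: float,
              raw_coverage: float = RAW_FRAG_COVERAGE) -> float:
    """Return the fraction of reads to keep to reach target_coverage.

    Hypothetical helper for illustration; the result is what gets passed
    to PrepareAllPathsInputs.pl as FRAG_FRAC.
    """
    if not 0 < target_coverage <= raw_coverage:
        raise ValueError("target coverage must be in (0, raw coverage]")
    return target_coverage / raw_coverage


frac_100x = frag_frac(100)        # 100X fragment reads
frac_website = frag_frac(46.02)   # same fraction as the website data

# Selection is random, so the retained pair count is only approximate.
approx_pairs_100x = round(frac_100x * RAW_FRAG_PAIRS)

print(f"FRAG_FRAC for 100X:         {frac_100x:.3f}")
print(f"FRAG_FRAC for website data: {frac_website:.3f}")
print(f"Approx. read pairs at 100X: {approx_pairs_100x}")
```

The website-data fraction can also be cross-checked from the read counts alone: 1186190 / 13457571 rounds to the same 0.088.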

Evaluation

  • Benchmark genome
E. coli K12 MG1655
  • Evaluated by QUAST
QUAST v2.1
This QUAST evaluation uses the reference sequence and its gene annotation, which contains 4388 genes in total.
Metric                            ABySS       CABOD       MaSuRCA     SGA         SOAPdenovo  SPAdes 2.5  Velvet      CISA
Basic statistics
# contigs                         382         131         52          733         185         40          143         35
Largest contig (bp)               71578       177098      241348      44874       204500      739647      236829      740054
Total length (bp)                 4503182     3953489     4247061     4091078     4549335     4781613     4526809     4950301
NG50 (bp)                         21441       40287       144812      7971        45133       518052      85272       523557
Misassemblies
# misassemblies                   2           6           5           0           3           9           17          24
Misassembled contigs length (bp)  24651       135726      40523       0           24631       1180527     564277      1850338
Genome statistics
Genome fraction (%)               97.529      85.608      91.915      87.77       98.068      99.37       97.437      99.474
# genes (complete + partial)      4030 + 275  3685 + 128  4036 + 38   3348 + 592  4046 + 307  4347 + 23   4068 + 265  4358 + 19
# mismatches per 100 kbp          4.83        10.51       12.48       3.22        5.98        5.82        13.11       6.42
# indels per 100 kbp              3.43        7.97        5.84        3.34        3.9         3.63        8.34        3.69
# N's per 100 kbp                 0           20.77       4.83        0           139.23      0           739.09      3.68
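
The gene counts in the table can be expressed as the percentage of the 4388 annotated genes that each assembly recovers completely. The sketch below is only a convenience for reading the table and is not part of the QUAST output. It assumes that the 4388-gene total quoted above is the correct denominator and that the first number in each "# genes" cell is the count of completely covered genes.

```python
# Total annotated genes, as stated on this page.
TOTAL_GENES = 4388

# (complete, partial) gene counts copied from the "# genes" row of the table.
genes = {
    "ABySS": (4030, 275),
    "CABOD": (3685, 128),
    "MaSuRCA": (4036, 38),
    "SGA": (3348, 592),
    "SOAPdenovo": (4046, 307),
    "SPAdes 2.5": (4347, 23),
    "Velvet": (4068, 265),
    "CISA": (4358, 19),
}

# Percentage of annotated genes fully covered by each assembly.
complete_pct = {
    name: 100 * complete / TOTAL_GENES
    for name, (complete, _partial) in genes.items()
}

# Print assemblers from highest to lowest complete-gene recovery.
for name, pct in sorted(complete_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{pct:6.2f}% complete genes")

best = max(complete_pct, key=complete_pct.get)
```

This ranking looks at complete genes only; it should be read together with the misassembly rows, since a higher gene count does not by itself indicate a more accurate assembly.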