E. coli

Revision as of 22 January 2014 02:02 by admin

Escherichia coli K12 MG1655. The E. coli MG1655 genome consists of a single circular chromosome 4,639,675 bp in length. The Illumina sequencing data are available from the ALLPATHS-LG website. For details, see "Finished bacterial genomes from shotgun sequence data", Genome Research, 2012.

Contents
1 Published data
2 Raw data
3 Fractional data
4 50X coverage reads
5 100X coverage data
6 Evaluation

Published data

The Illumina and PacBio data were downloaded from the ALLPATHS-LG website: ecoli_data_alt.tar.gz

Fragment library
    Read length : 101 bp
    Read count : 1186190 x 2
    Insert size : 180 bp
    Coverage : 46.02X
Jumping library 1
    Read length : 93 bp
    Read count : 1615702 x 2
    Insert size : 3000 bp
Jumping library 2
    Read length : 93 bp
    Read count : 362199 x 2
    Insert size : 3000 bp
PacBio reads
    Average read length : 1514.24 bp
    Read count : 409304
    Coverage : 133.58X
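
The coverage figures in this list can be sanity-checked with the usual estimate, coverage = number of reads x read length / genome size. The sketch below is only an illustration: the helper name `coverage` is hypothetical and not part of any tool used on this page. With the PacBio numbers it reproduces the listed 133.58X. With the fragment library numbers and full 101 bp reads it gives a somewhat higher value than the listed 46.02X, so that figure may have been computed from a shorter effective read length; this is a guess, not something the page states.

```shell
#!/bin/sh
# Naive coverage estimate: reads * read_length / genome_size.
# "coverage" is a hypothetical helper written for this illustration.
coverage() {
    # $1 = number of reads, $2 = (mean) read length in bp, $3 = genome size in bp
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

# PacBio reads: 409304 reads, mean length 1514.24 bp, genome 4,639,675 bp
coverage 409304 1514.24 4639675        # prints 133.58

# Fragment library: 1186190 pairs = 2372380 reads, assuming full 101 bp reads
coverage 2372380 101 4639675           # prints 51.64
```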

Raw data

The raw data from which the website data were derived are available from the Sequence Read Archive (SRA).

Fragment library
    Accession : SRX131033
    Read length : 101 bp
    Read count : 13457571 x 2
    Insert size : 180 bp
    Coverage : 522.1X
Jumping library 1
    Accession : SRX117481
Jumping library 2
    Accession : SRR492488
PacBio reads
    Accession : SRX109917, SRX109901 (SRR386913, SRR387092, SRR386907, SRR387035), SRX109936

Fractional data

Using prepare.sh, we randomly sampled the fragment library of the raw data down to the same fraction as the website data (1186190 of 13457571 read pairs, about 0.088):

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    FRAG_FRAC=0.088 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
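
As a quick check of where FRAG_FRAC=0.088 comes from, the sketch below divides the number of fragment read pairs in the published data by the number in the raw SRA data. The helper name `frag_frac` is hypothetical and used only for this illustration.

```shell
#!/bin/sh
# Fraction of the raw fragment library needed to match the published data.
# "frag_frac" is a hypothetical helper, not an ALLPATHS-LG command.
frag_frac() {
    # $1 = read pairs in the subset, $2 = read pairs in the full library
    awk -v sub_pairs="$1" -v all_pairs="$2" 'BEGIN { printf "%.3f\n", sub_pairs / all_pairs }'
}

frag_frac 1186190 13457571    # prints 0.088
```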

50X coverage reads

Using prepare.sh, we randomly sampled reads to 50X coverage from the fragment library and to 50X coverage from the jumping library:

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=50 \
    JUMP_COVERAGE=50 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out

100X coverage data

Using prepare.sh, we randomly sampled reads to 100X coverage from the fragment library and to 100X coverage from the jumping library:

PrepareAllPathsInputs.pl \
    DATA_DIR="$PWD/test.genome/data" \
    PLOIDY=1 \
    GENOME_SIZE=4640000 \
    FRAG_COVERAGE=100 \
    JUMP_COVERAGE=100 \
    IN_GROUPS_CSV=in_groups.csv \
    IN_LIBS_CSV=in_libs.csv \
    OVERWRITE=True \
    | tee prepare.out
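
For a rough sense of scale, the sketch below estimates how many fragment read pairs a coverage target corresponds to for GENOME_SIZE=4640000. It assumes every read contributes its full 101 bp, which is an assumption of this example; how PrepareAllPathsInputs.pl actually counts bases when subsampling is not stated on this page. The helper name `pairs_for_coverage` is hypothetical.

```shell
#!/bin/sh
# Read pairs needed for a target coverage, rounded up:
#   pairs = ceil(coverage * genome_size / (2 * read_length))
# "pairs_for_coverage" is a hypothetical helper; full-length reads are assumed.
pairs_for_coverage() {
    # $1 = target coverage, $2 = genome size in bp, $3 = read length in bp
    echo $(( ($1 * $2 + 2 * $3 - 1) / (2 * $3) ))
}

pairs_for_coverage 50 4640000 101     # prints 1148515
pairs_for_coverage 100 4640000 101    # prints 2297030
```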

Evaluation

Benchmark genome

E. coli MG1655

Evaluated by QUAST

QUAST v2.2

This QUAST evaluation uses the reference sequence and its gene annotation, which contains 4497 genes in total.

Score with QUAST: with PacBio long reads

Basic statistics	Published Data	Raw Data	Fractional Data	50X coverage	100X coverage
# contigs	1	14	1	1	1
Largest contig	4638970	4625005	4638970	4638970	4638970
Total length	4638970	4652215	4638970	4638970	4638970
N50	4638970	4625005	4638970	4638970	4638970
Misassemblies
# misassemblies	1	5	1	1	1
Misassembled contigs length	4638970	4625005	4638970	4638970	4638970
Mismatches
# mismatches per 100kbp	0.11	1.06	0.06	0.09	0.06
# indels per 100kbp	0.09	0.61	0.09	0.09	0.09
# N's per 100kbp	0	282.94	0	0.04	0
Genome statistics
Genome fraction (%)	99.983	99.418	99.983	99.983	99.983
Duplication ratio	1	1.013	1	1	1
# genes	4494 + 1 part	4471 + 2 part	4494 + 1 part	4494 + 1 part	4494 + 1 part
NGA50	4638970	2714032	3763133	4209920	3762305

Score with QUAST: without PacBio long reads

Basic statistics	Published Data	Raw Data	Fractional Data	50X coverage	100X coverage
# contigs	2	1	5	3	2
Largest contig	4631220	4633080	4575759	4629108	4638312
Total length	4633146	4633080	4698903	4633082	4640072
N50	4631220	4633080	4575759	4629108	4638312
Misassemblies
# misassemblies	3	7	8	8	5
Misassembled contigs length	4631220	4633080	4577746	4631095	4638312
Mismatches
# mismatches per 100kbp	1.19	1.42	2.84	2.52	2.26
# indels per 100kbp	1.13	0.83	3.26	1.24	1.85
# N's per 100kbp	533.22	1545.02	698.87	703.96	760.38
Genome statistics
Genome fraction (%)	99.345	98.343	99.265	99.136	99.272
Duplication ratio	1.012	1.016	1.021	1.008	1.008
# genes	4465 + 11 part	4395 + 31 part	4460 + 14 part	4451 + 13 part	4455 + 14 part
NGA50	3180483	687701	654008	2675325	694154