Escherichia coli EHEC Germany outbreak: preliminary functional annotation using the BG7 system
Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja-Tobes, Eduardo Pareja, Raquel Tobes
Oh no sequences! Group. Era7 Bioinformatics.
TY-2482 genome
Version 2
This is the automatic annotation of the second BGI assembly of the E. coli TY-2482 genome:
For this assembly BGI combined 200x coverage of Illumina single-end reads with 12x coverage of Ion Torrent reads. They performed a de novo assembly with Newbler v. 2.0.00.22, SOAPdenovo v. 1.06 and AMOS minimus2 v. 1.59, finally obtaining 513 contigs.
For the automatic annotation we have used a set of 137,063 proteins that includes:
PRELIMINARY RESULTS:
5,982 genes have been predicted:
The genome has genes annotated as Restriction-modification:
Plasmids:
There are 6 genes encoding adhesin AIDA-I and AIDA-I-like proteins
Contig 485 contains a set of genes involved in tellurium resistance
Secretion systems:
Big regions encoding genes involved in fimbria and flagella production
Several regions probably belonging to phages
Detected toxin genes:
Contig_id | Gen_id | Protein names | Organism
2 | 46657 | Toxin-antitoxin system, toxin component, PIN family | Escherichia coli MS 185-1
36 | 35289 | Toxin-antitoxin system, antitoxin component, AbrB family | Escherichia coli MS 187-1
41 | 63572 | Toxin ChpB of the ChpB-ChpS toxin-antitoxin system | Escherichia coli O26:H11 (strain 11368 / EHEC)
45 | 40450 | Toxin-antitoxin system, antitoxin component, Xre family | Escherichia coli MS 182-1
52 | 42760 | Toxin-antitoxin system, antitoxin component, Xre family | Escherichia coli MS 16-3
65 | 66703 | Small toxic membrane polypeptide | Escherichia coli O55:H7 (strain CB9615 / EPEC)
65 | 88671 | Toxin ChpA | Escherichia coli O26:H11 (strain 11368 / EHEC)
67 | 54819 | Toxin of the YafQ-DinJ toxin-antitoxin system | Escherichia coli (strain 55989 / EAEC)
67 | 66484 | Predicted antitoxin of YafQ-DinJ toxin-antitoxin system | Escherichia coli O26:H11 (strain 11368 / EHEC)
70 | 78574 | Hok/gef cell toxic protein | Escherichia coli (strain ATCC 55124 / KO11)
76 | 41767 | Toxin-antitoxin system protein | Escherichia coli MS 107-1
91 | 68828 | Antitoxin of the YoeB-YefM toxin-antitoxin system | Escherichia coli O1:K1 / APEC
91 | 69733 | Toxin of the YoeB-YefM toxin-antitoxin system | Escherichia coli O26:H11 (strain 11368 / EHEC)
108 | 107031 | Shiga toxin II subunit B | Escherichia coli O157:H7 (strain TW14359 / EHEC)
108 | 41845 | Shiga toxin subunit A (EC 3.2.2.22) | Escherichia coli O157:H7 str. EC869
441 | 46223 | Secreted autotransporter toxin Sat | Escherichia sp. 1_1_43
455 | 75733 | Vacuolating autotransporter toxin | Escherichia coli O45:K1 (strain S88 / ExPEC)
467 | 108729 | Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72) | Escherichia coli (strain 55989 / EAEC)
476 | 108728 | Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72) | Escherichia coli (strain 55989 / EAEC)
491 | 34992 | Putative acyltransferase MchD | Escherichia coli
511 | 84309 | Membrane protein | Escherichia coli O26:H11 (strain 11368 / EHEC)
You can get the annotation in Excel format:
Version 2 BGI TY-2482 annotation, Excel format
A repository and a wiki have been set up for results on the E. coli outbreak. This annotation can also be found there at:
Version 1
We have done the annotation of the genome sequenced by BGI (6-2-2011, http://www.bgisequence.com/eu/index.php?cID=194 ) and assembled with MIRA by Nick Loman (6-2-2011, http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/ ).
Our system BG7 (Bacterial Genome annotation of Era7 Bioinformatics, http://www.slideshare.net/marina_manrique/bg7-a-new-system-for-bacterial-genome-annotation-designed-for-ngs-data ) predicts ORFs and annotates them based on fragments of similarity with Uniprot proteins.
In contrast to other annotation pipelines, where finding ORFs is the first step and annotation follows, the BG7 system first searches for protein similarity and then defines each ORF by searching for start and stop signals. It is specifically designed for annotating prokaryotic genomes obtained with NGS data, since it handles the principal errors of these technologies: false indels in homopolymer regions and substitutions. Annotation systems based on initial, exact ORF detection may miss ORFs because these sequencing errors can introduce or remove stop codons and modify start signals. BG7 is also designed to work with genomes fragmented into many contigs, which addresses the problem of detecting incomplete genes at the ends of contigs. The system is especially suitable for detecting rare genes similar to proteins from taxonomically distant organisms. BG7 takes advantage of cloud computing to perform extensive computing tasks in a reasonable time: the annotation of a 3 Mb bacterial genome can be performed in less than 12 hours.
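The similarity-first idea can be illustrated with a minimal Python sketch. This is not the BG7 implementation: the function name `extend_hit_to_orf`, the choice of start codons, the forward-strand-only handling and the "nearest start codon" rule are all assumptions made for illustration. It takes an in-frame region already found by protein similarity and extends it to start and stop signals, flagging genes that run off the end of a contig.

```python
# Illustrative sketch only; names and rules are assumptions, not BG7 code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial start codons (assumed set)


def extend_hit_to_orf(contig, hit_start, hit_end):
    """Extend an in-frame similarity hit [hit_start, hit_end) on the forward strand.

    Returns (orf_start, orf_end, has_start, has_stop). A False flag means the
    signal was not found, for example because the gene is cut by a contig end.
    """
    if (hit_end - hit_start) % 3 != 0:
        raise ValueError("hit length must be a multiple of 3")

    # Downstream: walk codon by codon until the first in-frame stop codon.
    has_stop = False
    pos = hit_end
    while pos + 3 <= len(contig):
        if contig[pos:pos + 3] in STOP_CODONS:
            orf_end, has_stop = pos + 3, True
            break
        pos += 3
    else:
        orf_end = pos  # reached the contig end: gene truncated at its 3' end

    # Upstream: take the nearest in-frame start codon, without crossing a stop.
    orf_start, has_start = hit_start, False
    pos = hit_start
    while pos >= 0:
        codon = contig[pos:pos + 3]
        if codon in START_CODONS:
            orf_start, has_start = pos, True
            break
        if codon in STOP_CODONS and pos < hit_start:
            break  # an upstream stop codon bounds the ORF
        pos -= 3
    else:
        orf_start = pos + 3  # reached the contig start: gene truncated at its 5' end

    return orf_start, orf_end, has_start, has_stop


complete = extend_hit_to_orf("CCATGAAAGGGCCCTAAGG", 8, 11)
truncated = extend_hit_to_orf("AAAGGGCCC", 3, 6)
print(complete, truncated)
```

Because the similarity hit is taken as given, a frameshift or stray stop codon inside it does not prevent the gene from being reported, which is the property the paragraph above describes. A real system would also need to handle the reverse strand and frame changes within a hit.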
Our gene prediction for the MIRA assembly of the E. coli EHEC strain responsible for the latest European outbreak, sequenced by BGI, was based on a set of 137,063 proteins composed of:
- All the representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
- All Uniprot proteins from organisms including in their name the terms “EHEC” or “EAEC”
- All Uniprot proteins from bacteria including in any field the term “toxin”
- All Uniprot proteins from bacteria including in any field the term “hemolysin”
- All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae
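The selection rules listed above can be written down as a simple filter. The following Python sketch is only an illustration under stated assumptions: the `ProteinRecord` structure, its field names and the example records are hypothetical (not a UniProt API and not real accessions), and "any field" is simplified to the organism name plus one free-text description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    # Hypothetical, simplified view of a protein entry.
    accession: str
    organism: str
    description: str
    is_bacterial: bool
    is_ecoli_uniref90_representative: bool = False


ORGANISM_TERMS = ("EHEC", "EAEC")
FIELD_TERMS = ("toxin", "hemolysin")
WHOLE_ORGANISMS = ("Salmonella typhi", "Yersinia pestis", "Shigella dysenteriae")


def belongs_to_reference_set(rec):
    """Return True if the record matches any of the five selection rules."""
    if rec.is_ecoli_uniref90_representative:
        return True
    if any(term in rec.organism for term in ORGANISM_TERMS):
        return True
    text = f"{rec.organism} {rec.description}".lower()
    if rec.is_bacterial and any(term in text for term in FIELD_TERMS):
        return True
    return any(rec.organism.startswith(name) for name in WHOLE_ORGANISMS)


# Made-up example records (accessions are placeholders).
examples = [
    ProteinRecord("X1", "Escherichia coli O157:H7 (strain EC4115 / EHEC)", "Uncharacterized protein", True),
    ProteinRecord("X2", "Vibrio cholerae", "Cholera enterotoxin subunit A", True),
    ProteinRecord("X3", "Bacillus subtilis", "DNA polymerase", True),
    ProteinRecord("X4", "Yersinia pestis CO92", "DNA gyrase subunit A", True),
]
selected = [rec.accession for rec in examples if belongs_to_reference_set(rec)]
print(selected)
```

Note that a plain substring match on "toxin" also selects antitoxins and enterotoxins, which is consistent with the toxin-antitoxin entries that appear in the results.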
June 5, 2011:
The latest annotation files are:
Tagged annotation tables in Excel format:
http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls
Annotation field description (PDF)
Preliminary annotation analysis in PDF format (GenBank included):
Note: All tables and figures are included in the above PDF
PRELIMINARY RESULTS:
We have predicted 6327 genes: 6156 encoding proteins and 171 corresponding to ribosomal RNA and tRNA.
Only 1326 of the 6156 protein-encoding genes have canonical start and stop codons and contain neither frameshifts nor intragenic stop codons. 2479 protein-encoding genes (out of the 6156 predicted) include at least one frameshift or intragenic stop codon in their sequences, probably caused by errors inherent to the sequencing technology. However, our system is tolerant to the errors of massive sequencing technologies and has been able to detect a rich set of genes even with very preliminary sequencing results.
Some of the detected proteins are probably fragmented, and a protein may appear as two different predicted genes if its parts lie in different contigs.
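As a quick consistency check on these counts, the following Python sketch recomputes the totals and the corresponding percentages. The variable names are ours, and the "remaining" category is an inference from the two stated categories (genes with neither a frameshift nor an intragenic stop codon, but lacking a canonical start or stop codon); the source does not report it explicitly.

```python
# Counts as stated in the text.
total_genes = 6327
protein_coding = 6156
rna_genes = 171
intact = 1326        # canonical start and stop, no frameshift or intragenic stop
with_errors = 2479   # at least one frameshift or intragenic stop codon

# The two gene classes should add up to the total.
assert protein_coding + rna_genes == total_genes

# Inferred category (assumption): whatever is left of the protein-coding genes.
remaining = protein_coding - intact - with_errors


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


summary = {
    "intact": (intact, percent(intact, protein_coding)),
    "with frameshift or intragenic stop": (with_errors, percent(with_errors, protein_coding)),
    "remaining (inferred)": (remaining, percent(remaining, protein_coding)),
}
for label, (count, pct) in summary.items():
    print(f"{label}: {count} genes ({pct}% of protein-coding genes)")
```

Under these figures, roughly 21.5% of the protein-encoding genes are fully intact, about 40.3% carry a frameshift or intragenic stop codon, and the inferred remainder is 2351 genes (about 38.2%).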
Taxonomic origin of the proteins responsible for the prediction of the detected genes
We have analyzed the taxonomic origin of the proteins responsible for the prediction of the detected genes. See Table 1, Figure 1 and Figure 2.
The majority of the proteins responsible for the prediction of the detected genes belong to:
- Escherichia coli O26:H11 (strain 11368 / EHEC): 2810
- Escherichia coli (strain 55989 / EAEC): 1166
- Escherichia coli O44:H18 (strain 042 / EAEC): 339
- Escherichia coli O103:H2 (strain 12009 / EHEC): 296
- Escherichia coli: 221
- Escherichia coli O111:H- (strain 11128 / EHEC): 151
- Escherichia coli O157:H7 (strain EC4115 / EHEC): 148
- Escherichia coli O157:H7 (strain TW14359 / EHEC): 144
- Escherichia coli (strain K12): 51
- Salmonella typhi: 51
Fast manual annotation
Based on the preliminary results of our semi-automated annotation method, we have manually reviewed the annotation, tagging the principal genes and functions. This preliminary tagging was carried out by analyzing the annotation of each predicted protein with respect to specific points of interest: toxins, hemolysins, antibiotic resistance, pathogenicity, adhesion, plasmid, phage and other features.
We have selected and clustered genes with specific functions especially important from the human health perspective. The tagged genes are displayed in the following simplified tables. The complete annotation table is available at:
http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls
Toxins and peptidases
We have selected predicted proteins with annotations related to toxin function and some peptidases that could be related to toxin-like activity.
There are 33 predicted genes annotated as toxins and three more that could also be toxins. Table 2 also displays genes encoding proteins with protease or peptidase activity. Some of them could act as toxin-like proteins.
These are the names of the 33 proteins annotated as toxins (see complete annotation):
- Putative acyltransferase MchD
- Toxin-antitoxin system, toxin component, PIN family
- Toxin-antitoxin system, antitoxin component, Xre family
- Predicted antitoxin of YafQ-DinJ toxin-antitoxin system
- Toxin of the YafQ-DinJ toxin-antitoxin system
- Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]
- Secreted autotransporter toxin Sat
- Serine protease pic (ShMu)
- Toxin-antitoxin system protein
- Vacuolating autotransporter toxin
- Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)
- Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)
- Membrane protein
- Toxin-antitoxin system, antitoxin component, Xre family
- Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)
- Putative shET2 enterotoxin
- Toxin ChpB of the ChpB-ChpS toxin-antitoxin system
- ShET2 enterotoxin, region
- Toxin-antitoxin system, antitoxin component, HicB family
- Toxin of the YoeB-YefM toxin-antitoxin system
- Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]
- Toxin-antitoxin system, antitoxin component, AbrB family
- Serine protease pic (ShMu)
- Toxin of the YeeV-YeeU toxin-antitoxin system
- Putative antitoxin
- Serine proteAse eata (EC 3.4.21.-)
- Shiga toxin II subunit B
- Shiga toxin subunit A (EC 3.2.2.22)
- Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]
- Toxin of the YeeV-YeeU toxin-antitoxin system
- Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]
- Small toxic membrane polypeptide
- Hok/gef cell toxic protein
- Serine proteAse eata (EC 3.4.21.-)
Hemolysins and heme metabolism related proteins
We have found three hemolysins:
- Hemolysin E, chromosomal
- Putative hemolysin expression modulating protein
- Channel protein, hemolysin III family
Table 3 displays the data about these three hemolysins and some proteins related to heme transport and metabolism.
Antibiotic resistance
We have selected genes involved in specific antibiotic resistance and also genes encoding efflux pumps and multidrug resistance proteins that could be involved in some additional antibiotic resistance capabilities of this strain.
We have found 31 predicted genes encoding specific antibiotic resistance. The proteins encoded by them are:
- Aminoglycoside resistance: C8TVG3 - Aminoglycoside/multidrug efflux system protein AcrD
- Macrolide resistance: C8TLZ5 - Fused macrolide transporter subunits of ABC superfamily: ATP-binding component/membrane component
- Macrolide resistance: C8TLZ4 - Macrolide transporter subunit, membrane fusion protein component
- Penicillin resistance: B7LBI2 - Penicillin-insensitive murein endopeptidase (EC 3.4.24.-) (D-alanyl-D-alanine-endopeptidase) (DD-endopeptidase)
- Polymyxin resistance: B7LAS5 - Polymyxin resistance protein B
- Polymyxin resistance: D6I9J9 - Polymyxin resistance protein PmrM
- Beta-lactam resistance: C8UQP5 - TEM-1 beta-lactamase
- Tetracycline resistance: D3H382 - Tetracycline resistance protein
- Beta-lactam resistance: C8TJ28 - Regulator of penicillin binding proteins and beta-lactamase transcription
- Tetracycline resistance: C8TNP7 - Multidrug resistance protein mdtG
- Fosfomycin and deoxycholate resistance: C8TU05 - Multidrug resistance protein mdtA (Multidrug transporter mdtA)
- Novobiocin and deoxycholate resistance: C6V0N5 - Multidrug resistance protein MdtB (Multidrug transporter MdtB)
- Novobiocin and deoxycholate resistance: C8TU07 - Multidrug resistance protein MdtC (Multidrug transporter MdtC)
- Novobiocin and deoxycholate resistance: C8TU07 - Multidrug resistance protein MdtC (Multidrug transporter MdtC)
- Chloramphenicol resistance: B7L855 - Multidrug resistance protein mdtL
- Beta-lactam resistance: C8TNX4 - Beta-lactamase/D-alanine carboxypeptidase AmpC
- Beta-lactam resistance: C8TIW8 - Beta-lactamase/D-alanine carboxypeptidase
- Beta-lactam resistance: B7L5H7 - Beta-lactam resistance membrane protein
- Beta-lactam resistance: Q6BBP7 - Beta-lactamase (Beta-lactamase CTX-M-3) (CTX-M-3 extended-spectrum beta-lactamase) (Extended-spectrum class A beta-lactamase CTX-M-3)
- Beta-lactam resistance: B2CD48 - Beta-lactamase TEM (Fragment)
- Bicyclomycin resistance: B7LAK6 - Bicyclomycin/multidrug efflux system
- Bicyclomycin resistance: C6V2S0 - Bicyclomycin/multidrug efflux system
- Polymyxin resistance: C8TUQ2 - Bifunctional polymyxin resistance protein ArnA
- Norfloxacin and enoxacin resistance: B7LFZ9 - Multidrug resistance protein mdtH
- 6-mercaptopurine resistance: C8TLE9 - Purine ribonucleoside efflux pump nepI
Adhesion related, secretion system and pathogenicity and virulence related proteins
This strain has many genes involved in adhesion and pathogenicity. Some of them are collected in Table 5.
Mercuric resistance plasmid
This strain could bear a mercuric resistance plasmid. The predicted proteins included in Table 6 are all located in the same contig and are probably on a plasmid, forming a functional operon with a MerR family regulator in a divergent orientation to the rest of the components of the operon.
Tellurium resistance
Table 7 collects the genes involved in tellurium resistance.
In addition to genes involved in mercuric resistance and tellurium resistance, we have predicted and annotated in this genome many genes involved in resistance to other metals (see complete annotation).
Transposases
There appear to be 121 putative transposases in this genome (see Table 8). This probably implies high genomic plasticity and flexibility for adaptation to changing environments. In the next table we have collected some comments about this set of transposases of the E. coli EHEC genome.
Plasmids
Around 246 predicted proteins appear to be related to plasmids. The tagged genome annotation table is available at:
http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls
This strain has many genetic capabilities that probably give it a competitive advantage. Some important features that this strain bears in its genome:
- A restriction-modification system
- Many proteins involved in Fe transport and utilization. Siderophores: aerobactin, enterobactin.
- Lysozyme
- A general inhibitor of pancreatic serine proteases: inhibits chymotrypsin, trypsin, elastases, factor X, kallikrein as well as a variety of other proteases
- Proteins involved in anaerobic respiration
- Antimicrobial peptides
- Proteins involved in quorum-sensing and biofilm formation
- Proteins involved in Ni, Cu, Zn and Co resistance
- More than 170 phage proteins
June 3, 2011:
Annotation field description (PDF)
Disclaimer:
We understand that the assembly obtained from http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly
and the sequences obtained from ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482 are not restricted in any way.
The information we are publishing here and on other web pages and sites must be used for research activities only, and we do not guarantee the accuracy of this information.
The data published here is preliminary and may contain errors.
Era7 Information Technologies SLU provides these annotation data "as is" without any warranty, express or implied, including warranty of merchantability or fitness for a particular purpose or use.
Era7 Information Technologies SLU assumes NO legal liability or responsibility for any purpose for which the data are used.
You can use the data from this draft annotation and information provided that you properly attribute the source and authors. Copyright Era7 Information Technologies SLU 2011

E Coli genome draft annotation by Era7 Bioinformatics (Era7 Information Technologies SLU) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on a work at www.era7bioinformatics.com/en/E_Coli_EHEC_O104_STRAIN_EU_OUTBREAK_era7bioinformatics.html.