Era7

 

BG7 for Bacterial Genomics Next Generation Sequencing projects

Escherichia coli EHEC Germany outbreak preliminary functional annotation using BG7 system

Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja-Tobes, Eduardo Pareja, Raquel Tobes


Oh no sequences! Group. Era7 Bioinformatics.

TY-2482 genome

Version 2

 

This is the automatic annotation of the second BGI assembly of the E. coli TY-2482 genome:

 

https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/assemblies/BGI/Escherichia_coli_TY-2482.contig.20110606.fa.gz

 

For this assembly, BGI combined 200x coverage of Illumina single-end reads with 12x coverage of Ion Torrent reads. The de novo assembly was done with Newbler v. 2.0.00.22, SOAPdenovo v. 1.06 and AMOS minimus2 v. 1.59, yielding 513 contigs.

 

For the automatic annotation we have used a set of 137,063 proteins that includes:

 

  • The representative UniProt proteins corresponding to all UniRef90 clusters for all Escherichia coli proteins
  • All UniProt proteins from organisms whose names include the terms “EHEC” or “EAEC”
  • All UniProt proteins from bacteria that have the term “toxin” in any UniProt field
  • All UniProt proteins from bacteria that have the term “hemolysin” in any UniProt field
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae
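
The selection criteria listed above can be read as a single filter over protein records. The following is a minimal sketch of that reading, not the procedure actually used to build the 137,063-protein set: the `ProteinRecord` type, its field names and the example records are hypothetical, and the plain substring matching is an assumption.

```python
from dataclasses import dataclass

# Hypothetical record type; the field names are illustrative and are not
# taken from UniProt or from BG7.
@dataclass
class ProteinRecord:
    accession: str
    organism: str
    is_bacterial: bool
    annotation_text: str              # all annotation fields joined together
    is_ecoli_uniref90_representative: bool

TARGET_SPECIES = ("Salmonella typhi", "Yersinia pestis", "Shigella dysenteriae")


def _is_species(organism: str, species: str) -> bool:
    # Exact species name, optionally followed by strain information.
    return organism == species or organism.startswith(species + " ")


def in_reference_set(rec: ProteinRecord) -> bool:
    """Return True if the record meets any of the five listed criteria."""
    text = rec.annotation_text.lower()
    return (
        rec.is_ecoli_uniref90_representative
        or "EHEC" in rec.organism
        or "EAEC" in rec.organism
        or (rec.is_bacterial and ("toxin" in text or "hemolysin" in text))
        or any(_is_species(rec.organism, s) for s in TARGET_SPECIES)
    )


# Made-up records, for illustration only.
examples = [
    ProteinRecord("HYP001", "Escherichia coli (strain 55989 / EAEC)", True, "Serine protease", False),
    ProteinRecord("HYP002", "Bacillus subtilis", True, "Antitoxin component", False),
    ProteinRecord("HYP003", "Homo sapiens", False, "Toxin receptor", False),
    ProteinRecord("HYP004", "Yersinia pestis", True, "DNA polymerase", False),
    ProteinRecord("HYP005", "Salmonella typhimurium", True, "DNA polymerase", False),
]
selected = [rec.accession for rec in examples if in_reference_set(rec)]
```

Under this assumption a substring match on “toxin” also selects antitoxin proteins, which would be consistent with the antitoxin components that appear among the toxin-tagged genes in the results.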

 

PRELIMINARY RESULTS:

 

A total of 5,982 genes have been predicted:

 

  • 5,849 protein encoding genes
  • 133 RNA genes (rRNA and tRNA)
  • 4,797 of the 5,849 (82.01%) protein encoding genes have canonical start and stop codons and neither frameshifts nor intragenic stop codons.
  • 658 of the 5,849 (11.24%) protein encoding genes have a frameshift or an intragenic stop codon in their sequences.

 

 

The genome has genes annotated as restriction-modification systems:

  1. Type I (an operon with 3 genes: Type I restriction enzyme EcoAI specificity protein (S protein) (S.EcoAI), and the M and R subunits)
  2. Type II (an operon with 2 genes similar to proteins of Shigella dysenteriae serotype 1, strain Sd197)
  3. Type III (3 genes, one of them similar to Type III restriction-modification system StyLTI enzyme mod (EC 2.1.1.72) from Salmonella typhi)

 

Plasmids:

  1. Contig 63 seems to belong to a plasmid.
  2. Contig 74 is probably located in a plasmid, and all its genes are involved in mercury resistance.
  3. Contig 98 probably forms part of a plasmid and contains SOS inhibition proteins.
  4. Contig 503 is probably located in a plasmid and includes a gene annotated as Beta-lactamase (Beta-lactamase CTX-M-3) (CTX-M-3 extended-spectrum beta-lactamase) (Extended-spectrum class A beta-lactamase CTX-M-3). It also includes several transposons and proteins involved in conjugation and plasmid maintenance.

 

There are 6 genes encoding the adhesin AIDA-I and AIDA-I-like proteins.


Contig 485 contains a set of genes involved in tellurium resistance.


Secretion systems:

  1. Contig 106 contains a type VI secretion system
  2. A cluster of 14 genes related to a type II secretion system in contig 122
  3. 4 epr genes (type III secretion system) in contig 219

Large regions encoding genes involved in fimbria and flagella production
Several regions probably belonging to phages

 

Detected toxin genes:


Contig_id  Gen_id   Protein names                                                   Organism
---------  -------  --------------------------------------------------------------  ----------------------------------------------
2          46657    Toxin-antitoxin system, toxin component, PIN family             Escherichia coli MS 185-1
36         35289    Toxin-antitoxin system, antitoxin component, AbrB family        Escherichia coli MS 187-1
41         63572    Toxin ChpB of the ChpB-ChpS toxin-antitoxin system              Escherichia coli O26:H11 (strain 11368 / EHEC)
45         40450    Toxin-antitoxin system, antitoxin component, Xre family         Escherichia coli MS 182-1
52         42760    Toxin-antitoxin system, antitoxin component, Xre family         Escherichia coli MS 16-3
65         66703    Small toxic membrane polypeptide                                Escherichia coli O55:H7 (strain CB9615 / EPEC)
65         88671    Toxin ChpA                                                      Escherichia coli O26:H11 (strain 11368 / EHEC)
67         54819    Toxin of the YafQ-DinJ toxin-antitoxin system                   Escherichia coli (strain 55989 / EAEC)
67         66484    Predicted antitoxin of YafQ-DinJ toxin-antitoxin system         Escherichia coli O26:H11 (strain 11368 / EHEC)
70         78574    Hok/gef cell toxic protein                                      Escherichia coli (strain ATCC 55124 / KO11)
76         41767    Toxin-antitoxin system protein                                  Escherichia coli MS 107-1
91         68828    Antitoxin of the YoeB-YefM toxin-antitoxin system               Escherichia coli O1:K1 / APEC
91         69733    Toxin of the YoeB-YefM toxin-antitoxin system                   Escherichia coli O26:H11 (strain 11368 / EHEC)
108        107031   Shiga toxin II subunit B                                        Escherichia coli O157:H7 (strain TW14359 / EHEC)
108        41845    Shiga toxin subunit A (EC 3.2.2.22)                             Escherichia coli O157:H7 str. EC869
441        46223    Secreted autotransporter toxin Sat                              Escherichia sp. 1_1_43
455        75733    Vacuolating autotransporter toxin                               Escherichia coli O45:K1 (strain S88 / ExPEC)
467        108729   Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)  Escherichia coli (strain 55989 / EAEC)
476        108728   Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)  Escherichia coli (strain 55989 / EAEC)
491        34992    Putative acyltransferase MchD                                   Escherichia coli
511        84309    Membrane protein                                                Escherichia coli O26:H11 (strain 11368 / EHEC)
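
One simple way to read the detected toxin genes is to group them by contig, since toxin and antitoxin partners or the two Shiga toxin subunits are expected to sit next to each other. The sketch below does only that grouping. The contig ids, gene ids and protein names are copied from the list of detected toxin genes; the function names are illustrative.

```python
from collections import defaultdict

# (contig_id, gen_id, protein name), copied from the detected toxin genes.
TOXIN_HITS = [
    (2, 46657, "Toxin-antitoxin system, toxin component, PIN family"),
    (36, 35289, "Toxin-antitoxin system, antitoxin component, AbrB family"),
    (41, 63572, "Toxin ChpB of the ChpB-ChpS toxin-antitoxin system"),
    (45, 40450, "Toxin-antitoxin system, antitoxin component, Xre family"),
    (52, 42760, "Toxin-antitoxin system, antitoxin component, Xre family"),
    (65, 66703, "Small toxic membrane polypeptide"),
    (65, 88671, "Toxin ChpA"),
    (67, 54819, "Toxin of the YafQ-DinJ toxin-antitoxin system"),
    (67, 66484, "Predicted antitoxin of YafQ-DinJ toxin-antitoxin system"),
    (70, 78574, "Hok/gef cell toxic protein"),
    (76, 41767, "Toxin-antitoxin system protein"),
    (91, 68828, "Antitoxin of the YoeB-YefM toxin-antitoxin system"),
    (91, 69733, "Toxin of the YoeB-YefM toxin-antitoxin system"),
    (108, 107031, "Shiga toxin II subunit B"),
    (108, 41845, "Shiga toxin subunit A (EC 3.2.2.22)"),
    (441, 46223, "Secreted autotransporter toxin Sat"),
    (455, 75733, "Vacuolating autotransporter toxin"),
    (467, 108729, "Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)"),
    (476, 108728, "Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)"),
    (491, 34992, "Putative acyltransferase MchD"),
    (511, 84309, "Membrane protein"),
]


def genes_by_contig(hits):
    """Group (gen_id, name) pairs under their contig id."""
    grouped = defaultdict(list)
    for contig_id, gen_id, name in hits:
        grouped[contig_id].append((gen_id, name))
    return dict(grouped)


def contigs_with_multiple_hits(hits):
    """Keep only the contigs that carry more than one toxin-related gene."""
    return {c: g for c, g in genes_by_contig(hits).items() if len(g) > 1}


if __name__ == "__main__":
    for contig_id, genes in sorted(contigs_with_multiple_hits(TOXIN_HITS).items()):
        print(contig_id, [name for _, name in genes])
```

With these rows the contigs carrying more than one hit are 65, 67, 91 and 108; contig 108 holds both Shiga toxin subunits, and contigs 67 and 91 each hold a toxin with its antitoxin.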

 

The annotation is also available in Excel format:

 

Version 2 BGI TY-2482 annotation (Excel format)

 

A repository and a wiki have been set up for results on the E. coli outbreak. This annotation can also be found there:

 

https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/TY2482/annotations/era7bioinformatics/BGI_V2

 

 

 

 

Version 1

 

We have annotated the genome sequenced by BGI (6-2-2011, http://www.bgisequence.com/eu/index.php?cID=194 ) and assembled with MIRA by Nick Loman (6-2-2011,  http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/ ).
Our system, BG7 (Bacterial Genome annotation of Era7 Bioinformatics, http://www.slideshare.net/marina_manrique/bg7-a-new-system-for-bacterial-genome-annotation-designed-for-ngs-data ), predicts ORFs and annotates them based on fragments of similarity with UniProt proteins.

 

In contrast to other annotation pipelines, where finding ORFs is the first step and annotation follows, the BG7 system first searches for protein similarity and then defines the ORF by searching for start and stop signals. It is specifically designed for annotating prokaryotic genomes obtained from NGS data, since it tolerates the principal errors of these technologies: false indels in homopolymer regions and substitutions. Annotation systems based on initial, exact ORF detection can miss ORFs because such sequencing errors may introduce or remove stop codons and modify start signals. BG7 is also designed to work with genomes fragmented into many contigs, which addresses the problem of detecting incomplete genes at the ends of contigs. The system is especially suitable for detecting rare genes similar to proteins from taxonomically distant organisms. BG7 takes advantage of cloud computing to perform extensive computing tasks in a reasonable time: the annotation of a 3 Mb bacterial genome can be performed in less than 12 hours.
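
The similarity-first idea can be illustrated with a small sketch. This is not BG7's code. It assumes a similarity hit has already been found on the forward strand and in frame, and it only shows the second step: extending the hit codon by codon to the nearest start and stop signals, while allowing the gene to run off the end of a contig. The function name and the choice of ATG, GTG and TTG as start codons are assumptions made for the example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_hit_to_orf(contig, hit_start, hit_end):
    """Extend an in-frame similarity hit [hit_start, hit_end) to an ORF.

    Returns (start, end, start_found, stop_found). A False flag means the
    contig ended, or an in-frame stop was reached upstream, before the
    signal was found, so the gene may be incomplete.
    """
    if (hit_end - hit_start) % 3 != 0:
        raise ValueError("hit length must be a multiple of 3")
    contig = contig.upper()

    # Walk upstream one codon at a time, starting at the first hit codon.
    start, start_found = hit_start, False
    pos = hit_start
    while pos >= 0:
        codon = contig[pos:pos + 3]
        if codon in STOP_CODONS:       # the previous gene ends here
            break
        start = pos
        if codon in START_CODONS:
            start_found = True
            break
        pos -= 3

    # Walk downstream from the end of the hit until a stop codon.
    end, stop_found = hit_end, False
    pos = hit_end
    while pos + 3 <= len(contig):
        end = pos + 3
        if contig[pos:pos + 3] in STOP_CODONS:
            stop_found = True
            break
        pos += 3

    return start, end, start_found, stop_found


# Complete gene: ATG ... TAA around a hit covering "CCCGGG".
complete = extend_hit_to_orf("CCATGAAACCCGGGTTTTAAGG", 8, 14)
# Hit on a short contig with no start or stop signal: reported as incomplete.
partial = extend_hit_to_orf("AAACCCGGG", 3, 6)
```

Because the search starts from the region of similarity rather than from an exact ORF, a stop codon introduced inside the hit by a sequencing error does not prevent the gene from being reported. How BG7 itself handles frameshifts is not modeled here.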

Our gene prediction for the MIRA assembly of the E. coli EHEC strain responsible for the recent European outbreak, sequenced by BGI, was based on a set of 137,063 proteins composed of:

    - All the representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins

    - All Uniprot proteins from organisms including in their name the terms “EHEC”  or “EAEC”

    - All Uniprot proteins from bacteria including in any field the term “toxin”

    - All Uniprot proteins from bacteria including in any field the term “hemolysin”

    - All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae

     

     

    June 5, 2011:

     

    The latest annotation files are:

     

    Tagged annotation tables in Excel format:

    http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls

     

    Annotation field description (PDF)

     

    Preliminary annotation analysis in PDF format (GenBank included):

    http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.pdf

     

    Note: All tables and figures are included in the above PDF

     

    PRELIMINARY RESULTS:

     

    We have predicted 6,327 genes: 6,156 protein-encoding genes and 171 rRNA and tRNA genes.
    Only 1,326 of the 6,156 protein-encoding genes have canonical start and stop codons and neither frameshifts nor intragenic stop codons. 2,479 of the 6,156 predicted protein-encoding genes include a frameshift or an intragenic stop codon in their sequences, probably caused by errors inherent to the sequencing technology. However, our system tolerates the errors of massive sequencing technologies and has been able to detect a rich set of genes even from very preliminary sequencing results.
    Some of the detected proteins are probably fragmented, and a protein could appear as two different predicted genes if its parts lie in different contigs.
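
The gene counts reported for the two annotation versions can be put side by side with a few lines of arithmetic. The sketch below uses only the counts stated on this page. The dictionary keys and the "other" category (protein-encoding genes that fall in neither reported group) are labels introduced for the example.

```python
# Counts as reported on this page for each annotation version.
ASSEMBLIES = {
    "Version 1 (MIRA assembly)": {
        "protein_genes": 6156, "rna_genes": 171, "clean": 1326, "disrupted": 2479,
    },
    "Version 2 (second BGI assembly)": {
        "protein_genes": 5849, "rna_genes": 133, "clean": 4797, "disrupted": 658,
    },
}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarize(counts):
    protein = counts["protein_genes"]
    return {
        "total_genes": protein + counts["rna_genes"],
        # canonical start and stop, no frameshift, no intragenic stop codon
        "clean_pct": percent(counts["clean"], protein),
        # at least one frameshift or intragenic stop codon
        "disrupted_pct": percent(counts["disrupted"], protein),
        # protein-encoding genes in neither reported group
        "other": protein - counts["clean"] - counts["disrupted"],
    }


summaries = {name: summarize(counts) for name, counts in ASSEMBLIES.items()}

if __name__ == "__main__":
    for name, summary in summaries.items():
        print(name, summary)
```

On these figures the share of protein-encoding genes with clean start and stop signals rises from about 21.5% in the MIRA-based annotation to about 82.0% in the annotation of the second BGI assembly.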
    Taxonomic origin of the proteins responsible for the prediction of the detected genes
    We have analyzed the taxonomic origin of the proteins responsible for the prediction of the detected genes. See Table 1, Figure 1 and Figure 2.


    Most of the proteins responsible for the prediction of the detected genes belong to:

     

              - Escherichia coli O26:H11 (strain 11368 / EHEC): 2810

              - Escherichia coli (strain 55989 / EAEC): 1166

              - Escherichia coli O44:H18 (strain 042 / EAEC): 339

              - Escherichia coli O103:H2 (strain 12009 / EHEC): 296

              - Escherichia coli: 221

              - Escherichia coli O111:H- (strain 11128 / EHEC): 151

              - Escherichia coli O157:H7 (strain EC4115 / EHEC): 148

              - Escherichia coli O157:H7 (strain TW14359 / EHEC): 144

              - Escherichia coli (strain K12): 51

              - Salmonella typhi: 51

               

    Fast manual annotation


    Based on the preliminary results of our semi-automated annotation method, we have manually reviewed the annotation, tagging the principal genes and functions. This preliminary tagging was carried out by analyzing the annotation of each predicted protein with respect to specific points of interest: toxins, hemolysins, antibiotic resistance, pathogenicity, adhesion, plasmids, phages and other features.
    We have selected and clustered genes with specific functions especially important from the human health perspective.  The tagged genes are displayed in the following simplified tables. The complete annotation table is available at:
    http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls

     

    Toxins and peptidases

    We have selected predicted proteins with annotations related to toxin function, and some peptidases that could be related to toxin-like activity.
    There are 33 predicted genes annotated as toxins and three more that could also be toxins. Table 2 also displays genes encoding proteins with protease or peptidase activity; some of them could act as toxin-like proteins.
    These are the names of the 33 proteins annotated as toxins (see the complete annotation):

    - Putative acyltransferase MchD

    - Toxin-antitoxin system, toxin component, PIN family

    - Toxin-antitoxin system, antitoxin component, Xre family

    - Predicted antitoxin of YafQ-DinJ toxin-antitoxin system

    - Toxin of the YafQ-DinJ toxin-antitoxin system

    - Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]

    - Secreted autotransporter toxin Sat

    - Serine protease pic (ShMu)

    - Toxin-antitoxin system protein

    - Vacuolating autotransporter toxin

    - Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)

    - Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)

    - Membrane protein

    - Toxin-antitoxin system, antitoxin component, Xre family

    - Serine protease pet (Plasmid-encoded toxin pet) (EC 3.4.21.72)

    - Putative shET2 enterotoxin

    - Toxin ChpB of the ChpB-ChpS toxin-antitoxin system

    - ShET2 enterotoxin, region

    - Toxin-antitoxin system, antitoxin component, HicB family

    - Toxin of the YoeB-YefM toxin-antitoxin system

    - Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]

    - Toxin-antitoxin system, antitoxin component, AbrB family

    - Serine protease pic (ShMu)

    - Toxin of the YeeV-YeeU toxin-antitoxin system

    - Putative antitoxin

    - Serine protease eatA (EC 3.4.21.-)

    - Shiga toxin II subunit B

    - Shiga toxin subunit A (EC 3.2.2.22)

    - Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]

    - Toxin of the YeeV-YeeU toxin-antitoxin system

    - Serine protease sepA autotransporter (EC 3.4.21.-) [Cleaved into: Serine protease sepA; Serine protease sepA translocator]

    - Small toxic membrane polypeptide

    - Hok/gef cell toxic protein

    - Serine protease eatA (EC 3.4.21.-)

     

     
    Hemolysins and heme metabolism related proteins


    We have found three hemolysins:

    - Hemolysin E, chromosomal

    - Putative hemolysin expression modulating protein

    - Channel protein, hemolysin III family

    Table 3 displays the data about these three hemolysins and some proteins related to heme transport and metabolism.

     

     

    Antibiotic resistance


    We have selected genes involved in specific antibiotic resistance and also genes encoding efflux pumps and multidrug resistance proteins that could be involved in some additional antibiotic resistance capabilities of this strain.
    We have found 31 predicted genes encoding specific antibiotic resistance. The proteins encoded by them are:

    - Aminoglycoside resistance: C8TVG3 - Aminoglycoside/multidrug efflux system protein AcrD

    - Macrolide resistance: C8TLZ5 - Fused macrolide transporter subunits of ABC superfamily: ATP-binding component/membrane component

    - Macrolide resistance: C8TLZ4 - Macrolide transporter subunit, membrane fusion protein component

    - Penicillin resistance: B7LBI2 - Penicillin-insensitive murein endopeptidase (EC 3.4.24.-) (D-alanyl-D-alanine-endopeptidase) (DD-endopeptidase)

    - Polymyxin resistance: B7LAS5 - Polymyxin resistance protein B

    - Polymyxin resistance: D6I9J9 - Polymyxin resistance protein PmrM

    - Beta-lactam resistance: C8UQP5 - TEM-1 beta-lactamase

    - Tetracycline resistance: D3H382 - Tetracycline resistance protein

    - Beta-lactam resistance: C8TJ28 - Regulator of penicillin binding proteins and beta-lactamase transcription

    - Tetracycline resistance: C8TNP7 - Multidrug resistance protein mdtG

    - Fosfomycin and deoxycholate resistance: C8TU05 - Multidrug resistance protein mdtA (Multidrug transporter mdtA)

    - Novobiocin and deoxycholate resistance: C6V0N5 - Multidrug resistance protein MdtB (Multidrug transporter MdtB)

    - Novobiocin and deoxycholate resistance: C8TU07 - Multidrug resistance protein MdtC (Multidrug transporter MdtC)

    - Novobiocin and deoxycholate resistance: C8TU07 - Multidrug resistance protein MdtC (Multidrug transporter MdtC)

    - Chloramphenicol resistance: B7L855 - Multidrug resistance protein mdtL

    - Beta-lactam resistance: C8TNX4 - Beta-lactamase/D-alanine carboxypeptidase AmpC

    - Beta-lactam resistance: C8TIW8 - Beta-lactamase/D-alanine carboxypeptidase

    - Beta-lactam resistance: B7L5H7 - Beta-lactam resistance membrane protein

    - Beta-lactam resistance: Q6BBP7 - Beta-lactamase (Beta-lactamase CTX-M-3) (CTX-M-3 extended-spectrum beta-lactamase) (Extended-spectrum class A beta-lactamase CTX-M-3)

    - Beta-lactam resistance: B2CD48 - Beta-lactamase TEM (Fragment)

    - Bicyclomycin resistance: B7LAK6 - Bicyclomycin/multidrug efflux system

    - Bicyclomycin resistance: C6V2S0 - Bicyclomycin/multidrug efflux system

    - Polymyxin resistance: C8TUQ2 - Bifunctional polymyxin resistance protein ArnA

    - Norfloxacin and enoxacin resistance: B7LFZ9 - Multidrug resistance protein mdtH

    - 6-mercaptopurine resistance: C8TLE9 - Purine ribonucleoside efflux pump nepI

     
    Adhesion-related, secretion system, pathogenicity and virulence-related proteins


    This strain has many genes involved in adhesion and pathogenicity. Some of them are collected in Table 5.

     


    Mercuric resistance plasmid


    This strain could bear a mercuric resistance plasmid. The predicted proteins included in Table 6 are all located in the same contig and probably are in a plasmid forming a functional operon with a MerR family regulator in a divergent orientation to the rest of the components of the operon.

     


    Tellurium resistance


    Table 7 collects the genes involved in tellurium resistance.
    In addition to the genes involved in mercuric and tellurium resistance, we have predicted and annotated in this genome many genes involved in resistance to other metals (see the complete annotation).

     

     

    Transposases


    There appear to be 121 putative transposases in this genome (see Table 8). This probably implies high genomic plasticity and flexibility for adaptation to changing environments. In the next table we have collected some comments about this set of transposases of the E. coli EHEC genome.

     


    Plasmids


    Around 246 predicted proteins appear to be related to plasmids. The tagged genome annotation table is available at:
    http://www.era7bioinformatics.com/docs/EHEC_E_COLI_GERMANY_OUTBREAK_Annotation_Era7_Bioinformatics_v1_5_6_2011.xls

    This strain has many genetic capabilities that probably give it a competitive advantage. Some important features encoded in its genome:

    - A restriction-modification system

    - Many proteins involved in Fe transport and utilization. Siderophores: aerobactin, enterobactin.

    - Lysozyme

    - A general inhibitor of pancreatic serine proteases: inhibits chymotrypsin, trypsin, elastases, factor X, kallikrein as well as a variety of other proteases

    - Proteins involved in anaerobic respiration

    - Antimicrobial peptides

    - Proteins involved in quorum-sensing and biofilm formation

    - Proteins involved in Ni, Cu, Zn and Co resistance

    - More than 170 phage proteins

     

     

     

    June 3, 2011:

     

    Annotation in GenBank format

     

    Annotation in Excel format

     

    Annotation in TXT format

     

    Annotation in GFF format

     

    Annotation in EMBL format

     

    Annotation field description Pdf

     

     

    Disclaimer:

    We understand that the assembly obtained from http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly

    and the sequences obtained from ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482 are not restricted in any way.

     

    The information we are publishing here and on other web pages and sites must be used for research activities only, and we do not guarantee the accuracy of this information.

     

    The data published here are preliminary and may contain errors.

     

    Era7 Information Technologies SLU provides these annotation data "as is" without any warranty, express or implied, including warranty of merchantability or fitness for a particular purpose or use.

     

    Era7 Information Technologies SLU assumes NO legal liability or responsibility for any purpose for which the data are used.

     

    You can use the data and information from this draft annotation provided that you properly attribute the source and authors. Copyright Era7 Information Technologies SLU 2011

    Creative Commons License
    E Coli genome draft annotation by Era7 Bioinformatics (Era7 Information Technologies SLU) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
    Based on a work at www.era7bioinformatics.com/en/E_Coli_EHEC_O104_STRAIN_EU_OUTBREAK_era7bioinformatics.html.