Protein function prediction methods
From Proteinfunction.net
Sequence based
The fundamental idea of sequence-based protein function prediction is to detect similar protein sequences by searching sequence databases, on the assumption that similar sequences are likely to have similar functions. BLAST is the most widely used program for this purpose. However, similar sequences do not always have similar functions, and dissimilar sequences sometimes share a function. In other words, sequence space does not map directly onto function space (Figure). Therefore, finding locally important functional similarity, rather than overall sequence similarity, is becoming more important in protein function prediction.
Prediction with BLAST
BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. See NCBI-BLAST.
BLAST for protein function prediction
Many protein function prediction methods use BLAST as their basic sequence database search tool. The first step of protein function prediction is to narrow the very large protein sequence space down to homologous sequences.
Craig E Jones et al. reported protein function prediction using only BLAST. In their article, they found that using more than the top five BLAST hits does not further increase function annotation accuracy, and that simply transferring the annotation (here, GO terms) of the best-matching BLAST hit was the most accurate approach they tested.
- Craig E Jones et al, Automated methods of predicting the function of biological sequences using GO and BLAST, BMC Bioinformatics 2005, 6:272
- Debnath Pal et al, Inference of protein function from protein structure, Structure, 2005, 13, 121
- Richard A. George et al, Effective function annotation through catalytic residue conservation, PNAS, 10,1073
- James C. Whisstock et al, Prediction of protein function from protein sequence and structure, Quarterly Reviews of Biophysics, 36, 307
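The best-hit annotation transfer described above can be sketched as follows. This is a minimal illustration, not the pipeline of Jones et al.: it assumes BLAST was run with tabular output (-outfmt 6, whose default columns end with evalue and bitscore), and the function name, the E-value cutoff, the subject identifiers and the subject-to-GO dictionary are all hypothetical.

```python
def best_hit_go_terms(blast_tabular, subject_go, max_evalue=1e-5):
    """Return {query_id: GO terms copied from its best annotated hit}.

    blast_tabular -- text in BLAST tabular format (12 tab-separated columns:
                     qseqid sseqid pident length mismatch gapopen
                     qstart qend sstart send evalue bitscore)
    subject_go    -- hypothetical mapping {subject_id: set of GO terms}
    max_evalue    -- assumed cutoff; hits with a larger E-value are ignored
    """
    best = {}  # query_id -> (bitscore, subject_id)
    for line in blast_tabular.strip().splitlines():
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Skip weak hits and hits that carry no annotation to transfer.
        if evalue > max_evalue or subject not in subject_go:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {q: set(subject_go[s]) for q, (_, s) in best.items()}


# Toy data: identifiers and scores are made up for illustration.
example_rows = [
    ["q1", "subjA", "78.0", "200", "44", "0", "1", "200", "1", "200", "1e-90", "310"],
    ["q1", "subjB", "45.0", "190", "104", "2", "5", "194", "3", "190", "1e-40", "150"],
    ["q2", "subjC", "28.0", "60", "43", "1", "10", "69", "4", "63", "0.5", "30"],
]
example_tabular = "\n".join("\t".join(row) for row in example_rows)
example_go = {
    "subjA": {"GO:0004672"},
    "subjB": {"GO:0005524"},
    "subjC": {"GO:0003677"},
}
print(best_hit_go_terms(example_tabular, example_go))
```

In this toy input, q1 receives the GO term of its highest-scoring hit, while q2 stays unannotated because its only hit fails the E-value cutoff.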
Motif or domain information
The following section is taken from the PROSITE documentation: http://au.expasy.org/prosite/prosuser.html
In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we can say that "some regions of a protein sequence are more equal than others" !
The use of protein sequence patterns (or motifs) to determine the function(s) of proteins is becoming very rapidly one of the essential tools of sequence analysis. This reality has been recognized by many authors, as it can be illustrated from the following citations from two of the most well known experts of protein sequence analysis, R.F. Doolittle and A.M. Lesk:
"There are many short sequences that are often (but not always) diagnostics of certain binding properties or active sites. These can be set into a small subcollection and searched against your sequence".
"In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residues types classified as a motifs. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence."
Regular Expression
In computing, a regular expression (abbreviated as regexp or regex, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. (From Wikipedia) For more detail, see http://en.wikipedia.org/wiki/Regular_expression
See also PROSITE.
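Because PROSITE patterns are close relatives of regular expressions, a pattern can be translated into a regex and searched against a sequence. The sketch below handles only a subset of the PROSITE pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors at the ends of the pattern); the function name and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported subset: 'x' (any residue), '[ABC]' (allowed residues),
    '{ABC}' (forbidden residues), '(n)' / '(n,m)' repeats, and the
    '<' / '>' anchors for the N- and C-terminus.
    """
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        repeat = ""
        if "(" in element:
            # Split 'x(2,4)' into the residue part and the repeat count.
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a single residue or a bracketed set
        regex += ("^" if anchor_start else "") + core + repeat
        regex += "$" if anchor_end else ""
    return regex


# The ATP/GTP-binding site motif A (P-loop) pattern, searched against a
# made-up sequence.
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
match = re.search(p_loop, "MLGESGVGKSAL")
print(p_loop, match.group(0) if match else None)
```

A match only indicates that the motif is present; as the quotations above note, such motifs are often but not always diagnostic of function.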
HMM model
Pfam is a collection of protein families and domains. Pfam contains multiple protein alignments and profile-HMMs of these families. Pfam is a semi-automatic protein family database, which aims to be comprehensive as well as accurate. http://www.sanger.ac.uk/Software/Pfam/
Pfam domain
These are regions of proteins that are predicted by the Pfam collection of hidden Markov models (HMMs) to belong to a family. These are strongly trusted matches to the family and are very unlikely to be false matches.
Important residue information
Catalytic residue conservation
Richard A. George et al. proposed a function annotation method based on catalytic residue conservation. Even for protein pairs with high sequence similarity, it is difficult to be sure that the two proteins have the same function, because function can be changed by a few function-determining residues. Therefore, George et al. first constructed a database that defines catalytic residues in protein sequences, the Catalytic Site Atlas (CSA). Based on the catalytic residues defined for each protein sequence family, only proteins with conserved catalytic residues are annotated with the proposed function.
- Effective function annotation through catalytic residue conservation, Richard A. George et al, PNAS, 02,35,12299
- The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data, Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton (2004) Nucl. Acids. Res. 32: D129-D133
- Analysis of Catalytic Residues in Enzyme Active Sites, Gail J. Bartlett, Craig T. Porter, Neera Borkakoti, and Janet M. Thornton (2002) J Mol Biol 324:105-121
- Using a Library of Structural Templates to Recognise Catalytic Sites and Explore their Evolution in Homologous Families, James W. Torrance, Gail J. Bartlett, Craig T. Porter, Janet M. Thornton (2005) J Mol Biol. 347:565-81
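The conservation check described above can be sketched on a pairwise alignment. This is a simplified illustration under stated assumptions, not the published procedure: it requires strict identity at every catalytic position, and the function name, the input format and the toy sequences are hypothetical.

```python
def catalytic_residues_conserved(aligned_ref, aligned_query, catalytic_positions):
    """Return True if the query keeps every catalytic residue of the reference.

    aligned_ref, aligned_query -- two rows of a pairwise alignment ('-' = gap)
    catalytic_positions        -- 1-based positions in the ungapped reference
    Assumption: a catalytic residue counts as conserved only if identical.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(catalytic_positions)
    ref_pos = 0   # position in the ungapped reference sequence
    checked = 0   # number of catalytic positions found in the alignment
    for ref_res, query_res in zip(aligned_ref, aligned_query):
        if ref_res == "-":
            continue  # insertion in the query; no reference position here
        ref_pos += 1
        if ref_pos in wanted:
            checked += 1
            if query_res != ref_res:
                return False
    # Every requested position must actually exist in the reference.
    return checked == len(wanted)


# Toy alignment: catalytic residues are reference positions 3, 4 and 5 (S, H, D).
print(catalytic_residues_conserved("MKS-HDE", "MRSAHDE", [3, 4, 5]))
print(catalytic_residues_conserved("MKS-HDE", "MRSAHNE", [3, 4, 5]))
```

Only a query that passes this check would inherit the function of the reference family; a query with high overall similarity but a substituted catalytic residue would be left unannotated.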
Feature extraction from sequence
While sequence-based approaches usually make use of relationships with other protein sequences through sequence alignment, the method presented here focuses on the protein sequence itself rather than on an alignment.
Jensen et al. reasoned that information about function should be contained in a spectrum of features of proteins, including secondary structure, post-translational modifications, protein sorting, and general properties of the amino-acid composition such as the isoelectric point. Using neural networks, they predicted such features from protein sequences and correlated the results with functional classes.
Jensen, L. J. et al., Prediction of human protein function from post-translational modifications and localization features, Journal of Molecular Biology 319, 1257-1265 (2002)
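As a small illustration of alignment-free features, the sketch below computes a few composition-based properties directly from a sequence. It is far simpler than the approach of Jensen et al., who used dedicated predictors and neural networks; the feature names, the hydrophobic residue set and the crude net-charge estimate are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed residue groupings for this illustration.
HYDROPHOBIC = set("AVILMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence):
    """Return a dict of simple alignment-free features for one sequence."""
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    features = {
        "length": n,
        "hydrophobic_fraction": sum(counts[a] for a in HYDROPHOBIC) / n,
        # Crude charge estimate: count of K/R minus count of D/E.
        "net_charge": sum(counts[a] for a in POSITIVE)
        - sum(counts[a] for a in NEGATIVE),
    }
    # Fractional amino-acid composition, one feature per residue type.
    for aa in AMINO_ACIDS:
        features["frac_" + aa] = counts[aa] / n
    return features


features = sequence_features("MKKDEAVL")
print(features["hydrophobic_fraction"], features["net_charge"])
```

A feature vector like this could serve as input to a classifier that is trained against functional classes.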
Structure based
Protein function prediction using structure information is similar to sequence-based prediction. The basic assumption is that proteins with similar structures are likely to have similar functions. Protein function is strongly related to structure, because a protein functions by interacting with other proteins or chemicals, and its structure limits its possible interaction modes. Moreover, structural similarity can fill the gap left by sequence-based methods, which detect only close sequence similarity, because structural similarity can be detected even between proteins with low sequence similarity.
Comparison of overall three-dimensional shape
Given the assumption that proteins with similar structures are likely to have similar functions, the basic way to infer function from structural information is to compare overall three-dimensional structures. Structure comparison tools such as CE, and well-classified protein structure databases such as SCOP and CATH, make this comparison possible.
Identification of functional local region
Identification of function-associated loop motifs
Protein loops (regions of irregular structure that connect regular secondary structure elements such as helices and sheets) are known to play important roles in protein function. Previously, there had been no approach that predicted function from loop information alone. The method described here uses structurally classified protein loops for this purpose.
First, the authors used a pre-constructed loop database (ArchDB), in which loop regions are classified according to their length and structural properties.
Once a protein loop family was determined, they identified GO terms representative of that family by estimating the frequency of each GO term among the family members. In this way, they assigned specific GO terms to all loop families. When used together with BLAST, these associations could help find putative proteins with the same function.
Jordi Espadaler et al., Identification of function-associated loop motifs and application to protein function prediction, BIOINFORMATICS, 2006
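The frequency-based assignment of GO terms to a loop family can be sketched as below. The statistic used in the paper may differ; here a term is kept when it annotates at least a given fraction of the family members, and the function name, the threshold and the toy data are assumptions.

```python
from collections import Counter


def representative_go_terms(family_members, protein_go, min_fraction=0.5):
    """Return {GO term: fraction of family members annotated with it}.

    family_members -- proteins whose loops fall into one loop family
    protein_go     -- hypothetical mapping {protein: set of GO terms}
    min_fraction   -- assumed cutoff for calling a term representative
    """
    n = len(family_members)
    if n == 0:
        return {}
    counts = Counter()
    for protein in family_members:
        # Count each term once per protein.
        counts.update(set(protein_go.get(protein, ())))
    return {term: c / n for term, c in counts.items() if c / n >= min_fraction}


# Toy loop family with placeholder term names.
toy_go = {
    "p1": {"termA", "termB"},
    "p2": {"termA"},
    "p3": {"termA", "termC"},
    "p4": {"termB"},
}
print(representative_go_terms(["p1", "p2", "p3", "p4"], toy_go))
```

A query protein whose loop is classified into this family would then become a candidate for the retained terms.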
Identification of functional sites
Scheme
Various algorithms to compare local structures
Prediction without structure alignment
As the pace of structure determination accelerates, the number of proteins with solved structures but unknown function is also increasing. Conventionally, most function prediction methods that use structural information rely on structural alignment. However, such methods lose reliability when no similar structure is found in existing structure databases.
Paul D. Dobson et al. reported a prediction method that uses structural information without alignment. They predict enzyme class (EC number) with a support vector machine (SVM) trained on structure-derived features such as secondary structure content, amino acid properties, surface properties and bound ligands.
Predicting enzyme class from protein structure without alignments, Paul D. Dobson et al, JMB,345,187
Protein-protein interaction (PPI) information
Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. These protein-protein interaction data can be used to predict protein function, on the assumption that if proteins A and B interact, they are functionally related. Currently, the main issue is how well the interactions between proteins are defined.
About PPI Maps
To predict protein function from a protein-protein interaction network, we must first define how to construct the PPI map. We use both genetic and physical interaction information for the PPI map. From it, we discover relationships between proteins and infer unknown protein functions from neighboring proteins or from the topology of the PPI map.
For more details, see About PPI Maps.
Neighborhood counting
The simplest way to predict function from a PPI map is to assign to a protein the function that is most prevalent among its interacting proteins (neighbors). A more advanced version attaches a statistical score to the prediction, for example a chi-square-like score $(N_f - E_f)^2 / E_f$, where $N_f$ is the number of neighboring proteins with function $f$ and $E_f$ is the expected number of neighboring proteins with function $f$, based on the frequency of $f$ in the whole protein network.
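The chi-square-like neighborhood score can be computed as in the following sketch. The expected count is taken as the number of neighbors multiplied by the fraction of annotated proteins carrying the function; the function name, the input format and the toy network are assumptions made for illustration.

```python
from collections import Counter


def chi_square_scores(protein, adjacency, annotations):
    """Return {function: (N_f - E_f)^2 / E_f} for one query protein.

    adjacency   -- {protein: set of interacting proteins}
    annotations -- {protein: set of functions} for annotated proteins only
    """
    neighbors = adjacency[protein]
    total = len(annotations)
    # Network-wide frequency of each function among annotated proteins.
    network_counts = Counter(f for funcs in annotations.values() for f in funcs)
    # N_f: number of neighbors annotated with function f.
    observed = Counter(f for n in neighbors for f in annotations.get(n, ()))
    scores = {}
    for function, count in network_counts.items():
        expected = len(neighbors) * count / total  # E_f
        scores[function] = (observed[function] - expected) ** 2 / expected
    return scores


# Toy network: 2 of 8 annotated proteins are kinases, 6 are transporters.
toy_annotations = {"K1": {"kinase"}, "K2": {"kinase"}}
toy_annotations.update({"T%d" % i: {"transport"} for i in range(1, 7)})
toy_adjacency = {"Q": {"K1", "K2", "T1", "T2"}}

scores = chi_square_scores("Q", toy_adjacency, toy_annotations)
print(max(scores, key=scores.get), scores)
```

Because the deviation is squared, a function that is strongly under-represented among the neighbors can also score highly, so in practice the sign of $N_f - E_f$ should be checked before assigning the top-scoring function.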
FunctionalFlow method
FunctionalFlow is a network-flow-based algorithm that exploits the underlying structure of protein interaction maps to predict protein function. It improves performance when predicting the function of proteins with few (or no) annotated neighbors, because it takes advantage of both network topology and some measure of locality. (Figure: diagram of the FunctionalFlow algorithm as described in the paper.)
Elena Nabieva et al., Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps, BIOINFORMATICS, 2005
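The flow idea can be illustrated with a simplified simulation for a single function. In this sketch, each annotated protein is a source with an unlimited reservoir, flow moves only "downhill" from fuller to emptier reservoirs, flow along an edge is capped by the edge weight, and a protein's score is the total flow it receives. The function name, the input format and the exact update rule are assumptions, so it should be read as an approximation of the published algorithm rather than a reimplementation.

```python
def functional_flow(edges, sources, steps):
    """Return {protein: total inflow} after simulating `steps` time steps.

    edges   -- {(u, v): positive weight} for undirected interactions
    sources -- proteins already annotated with the function of interest
    steps   -- number of discrete time steps (limits how far flow spreads)
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, {})[v] = weight
        adjacency.setdefault(v, {})[u] = weight
    for s in sources:
        adjacency.setdefault(s, {})

    reservoir = {n: float("inf") if n in sources else 0.0 for n in adjacency}
    score = {n: 0.0 for n in adjacency}

    for _ in range(steps):
        flows = []
        # Compute all flows from the current state before applying any.
        for u, neighbors in adjacency.items():
            total_weight = sum(neighbors.values())
            if reservoir[u] == 0.0 or total_weight == 0:
                continue
            for v, weight in neighbors.items():
                if reservoir[u] > reservoir[v]:
                    # Share of u's reservoir, capped by the edge capacity.
                    share = reservoir[u] * weight / total_weight
                    flows.append((u, v, min(weight, share)))
        for u, v, amount in flows:
            reservoir[u] -= amount
            reservoir[v] += amount
            score[v] += amount
    return score


# Toy chain A - B - C with unit weights; only A is annotated.
toy_scores = functional_flow({("A", "B"): 1.0, ("B", "C"): 1.0}, {"A"}, steps=2)
print(toy_scores)
```

In the toy chain, protein C has no annotated neighbor yet still receives a non-zero score after two steps, which is the behavior that plain neighborhood counting cannot provide.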
Global mapping of unknown proteins
This method uses the Gene Ontology Index to relate diverse sources of experimental data by creating an abstraction layer of evidence data. This abstraction layer is used as input to a neural network which, once trained, can be used to predict function from the evidence data of unannotated proteins. The method allows almost any experimental data set that is related to protein function and incorporates the Gene Ontology to be added to the evidence data, in order to seek relationships between the different sets.
Xiong Jianghui et al., Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration, Directory of Open Access Journals(DOAJ), 2006
Finding cliques in protein interaction networks
Finding protein functional modules in protein interaction networks amounts to finding densely connected subgraphs. The authors propose a method to identify cliques on weighted graphs. Using a protein network from TAP-MS experiments on yeast, they discovered a large number of cliques, that is, densely connected protein modules with clear biological meaning as shown by Gene Ontology analysis.
Chris Ding et al., Finding Cliques in Protein Interaction Networks via Transitive Closure of a Weighted Graph
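To make the notion of a clique concrete, the sketch below enumerates maximal cliques in a small unweighted interaction graph with the basic Bron-Kerbosch algorithm. This is a swapped-in textbook technique, not the transitive-closure method for weighted graphs proposed by Ding et al.; the function name and the toy graph are hypothetical.

```python
def maximal_cliques(adjacency):
    """Return all maximal cliques of an undirected graph as sorted lists.

    adjacency -- {node: set of neighboring nodes}, symmetric, no self-loops
    Uses the basic Bron-Kerbosch recursion without pivoting.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # nothing can extend it: maximal
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            # node has been fully explored; move it to the excluded set.
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Toy network: A, B and C all interact with each other; D interacts only with C.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(maximal_cliques(toy_graph))
```

Each clique found this way is a candidate functional module whose members could then be checked for shared Gene Ontology terms.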
Evidence integration in PPI networks
This method has two advantages: integration of evidence and propagation of the integrated evidence. A protein of unknown function in the PPI network is annotated with its most probable function, which is the most frequent function among the proteins that interact with it. The proteins whose functions are predicted in the first stage then serve as evidence for predicting the functions of further unknown proteins in the second stage. In this way, the whole PPI network can be functionally annotated even if there is initially not enough evidence around a protein of unknown function.
Ulas Karaoz et al., Whole-genome annotation by using evidence integration in functional-linkage networks, PNAS, 2004
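The staged propagation described above can be sketched as an iterative majority vote. This illustrates only the propagation step as summarized in the text, not the evidence-integration model of Karaoz et al.; the function name, the single-function-per-protein simplification and the alphabetical tie-break are assumptions.

```python
from collections import Counter


def propagate_annotations(adjacency, known):
    """Annotate unknown proteins stage by stage from their labeled neighbors.

    adjacency -- {protein: set of interacting proteins}
    known     -- {protein: function} for initially annotated proteins
    Proteins labeled in one stage act as evidence in the next stage.
    """
    labels = dict(known)
    while True:
        new_labels = {}
        for protein, neighbors in adjacency.items():
            if protein in labels:
                continue
            votes = Counter(labels[n] for n in neighbors if n in labels)
            if votes:
                # Most frequent function wins; ties broken alphabetically.
                new_labels[protein] = min(votes, key=lambda f: (-votes[f], f))
        if not new_labels:
            break  # no unlabeled protein has a labeled neighbor left
        labels.update(new_labels)
    return labels


# Toy chain P1 - P2 - P3 - P4 where only P1 is annotated at the start.
toy_network = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4"},
    "P4": {"P3"},
}
print(propagate_annotations(toy_network, {"P1": "kinase"}))
```

In the toy chain, P2 is labeled in the first stage, P3 in the second and P4 in the third, even though P3 and P4 start with no annotated neighbor.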
Annotation frequency based
One of the main problems in protein function prediction is annotation errors in source databases. A single erroneous annotation of a protein can propagate within its own database and into databases derived from it, leading to false function predictions based on those sources. To overcome this problem, there are efforts to predict protein function using every available source of function annotation. In this way, an annotation error in one data source can be corrected by correct annotations in the others.
Total information combination method
If we use diverse information simultaneously, we could expect an improvement of performance. In this section, methods that utilize various information sources rather than focus on only one or a few approaches are introduced.


