Approaches to prediction of protein structure

Protein structure prediction (PSP) is one of the most important and challenging problems in bioinformatics today. This is due to the fact that the biological function of the protein is determined by its structure. While there is a gap between the
A Novel Methodology for Protein Secondary Structure Prediction Using Physics Principles

Walaa Fathy Ahmed 1, a*, Walid Gomaa 2, b

1 College of Computing and Information Technology, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
2 Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology, currently on leave from Alexandria University, Alexandria, Egypt

Keywords: protein structure prediction, ab initio, energy function, consensus prediction.

Abstract. Protein structure prediction (PSP) is one of the most important and challenging problems in bioinformatics today, because the biological function of a protein is determined by its structure. This paper presents a novel methodology for protein secondary structure prediction using physical and chemical properties. The methodology is very effective when the consensus prediction approach fails. In addition, it gives an energy value for each secondary structure conformation and selects the one with minimum energy to build the tertiary structure.

I. INTRODUCTION

Proteins carry out a wide variety of vital functions inside living organisms. Hence, it is crucial to understand the different levels of structure, the properties, and the functionality of proteins. As structural elements, proteins are the main constituents of human bones, muscles, hair, skin, blood vessels, and antibodies. They also recognize invading elements and allow the immune system to get rid of unwanted invaders. Drug design is one of the major financial driving forces behind protein biomedical research [1].

Protein formation passes through different levels of structure [2]. The first is the primary structure, which is a necklace of amino acids. This ordered linear array of amino acids then adopts local regular conformations that constitute the secondary structure of the protein.
The chief elements of the secondary structure are α-helices, β-sheets, and irregular structures called loops or coils. The third level of protein structure is the tertiary structure, which refers to the overall three-dimensional arrangement of the various secondary structure elements. Finally, in a few cases a complex protein is formed as an arrangement of multiple polypeptide chains (that is, multiple different proteins). This kind of complex arrangement is called the quaternary structure of the protein. A protein performs its function in its native state (the tertiary structure that has the minimum energy) [3, 4].

Protein secondary structure prediction refers to the prediction of the conformational state of each amino acid residue of a protein sequence. There are three possible conformational states: helix (H), strand (E), or coil (C). Secondary structure prediction is an intermediate step in tertiary structure prediction. The prediction is based on the fact that secondary structures have a regular arrangement of amino acids that is stabilized by hydrogen bonding.

Predicting protein tertiary structure with high accuracy is not a trivial task and has been a very difficult problem for decades. There are several categories of prediction methods: (1) ab initio (de novo) techniques, which make use of the sequence information only, (2) homology methods, which make use of multiple sequence alignment information, and (3) threading techniques, which assume the preexistence of known similar (homologous) structures [5].

II. BIOLOGICAL BACKGROUND

A protein is a polypeptide chain that consists of a sequence of amino acid residues linked together by peptide bonds. This chain forms the primary structure of the protein; see Figure 1. There are twenty kinds of amino acids, which differ from each other in the side chain group R. Amino acids can be grouped into several categories based on the chemical and physical properties of the side chain.
Such properties include size and affinity to water. According to these two properties, amino acids can be categorized as small vs. large and hydrophobic (H, water disliking) vs. hydrophilic (P, water liking). They can be further divided into aliphatic vs. aromatic.

Fig. 1. The condensation of two amino acids to form a peptide bond [4].

The ordered linear array of amino acids adopts local regular conformations that constitute the secondary structure of the protein. These local structures are stabilized by hydrogen bonds. The main components of the secondary structure are α-helices, β-sheets, and irregular structures called loops or coils. The structure of the α-helix repeats itself every 5.4 angstroms, and helices have 3.6 amino acid residues per turn. In the β-sheet structure the polypeptide does not form a coil. Instead, it zigzags in a more extended conformation than the α-helix. Loops are irregular structures characterized by sharp turns or hairpin-like structures. Residues in the loop or coil regions tend to be charged, polar, and located on the surface of the protein structure.

The tertiary structure refers to the overall three-dimensional (3D) arrangement of the various secondary structure elements. It can be described as the complete 3D assembly of all amino acids of a single polypeptide chain, stabilized by the hydrophobic interactions between the amino acids and by other forces, such as disulfide bonds, that can be formed by cysteine and methionine. As a result of these stabilizing forces, residues that are far apart in the primary sequence may be close to each other in the tertiary structure; see Figure 3. Finally, the quaternary structure refers to the association of several polypeptide chains (several proteins) into a complex protein; see Figure 2.

Fig. 2. Main protein structure

III. LITERATURE REVIEW

Several software systems for protein secondary structure prediction have been developed, such as GOR II, GOR III, GOR IV, and SOPM [1]. These are second-generation prediction algorithms, which improved accuracy over the first generation by about 10%. PHD [6] is a web-based program that combines neural networks with multiple sequence alignment. It first performs a BLAST search with the query sequence. The resulting alignment, in the form of a profile, is then fed into a neural network [7]. In the Predator [1] algorithm, prediction of the secondary structure is based on local pairwise alignment of the sequence to be predicted with each related sequence, rather than on multiple alignments [8]. DSC [1] is another algorithm for protein secondary structure prediction; it uses linear regression in its implementation [8].

On the other hand, the computational paradigms for predicting protein tertiary structure are generally divided into three categories [9]: homology (comparative), threading (fold recognition), and ab initio (de novo). The ab initio paradigm is based solely on the primary sequence information, without the aid of known protein structures. Ab initio algorithms use the physical and chemical properties of the amino acids in the sequence. In particular, they rely on the assumption that the native structure of a protein molecule takes on the lowest free energy state among all its possible alternative conformations. The ab initio approach attracts many researchers because a large number of proteins have no homologous templates in the database, so the other two approaches would fail. In addition, some proteins have sequence similarity with preexisting templates, yet the template and target proteins have quite different structures. The main advantage of the ab initio approach is that it provides an understanding and an explanation of the process of protein folding.
A vast number of ab initio algorithms have been developed in recent years. Two main performance indices need to be considered when designing such algorithms: quality (accurate prediction) and computational efficiency (fast execution) [9]. Some models use very general chemical and physical features of the protein. They are computationally efficient but not very accurate. Other algorithms create a detailed simulation of the folding process (molecular dynamics). Though accurate, they have unacceptable running times. Ab initio algorithms that have been used in practice include Monte Carlo simulation [11,12,13], genetic algorithms [14,15], ant colony optimization [16,17,18,19], neural networks [19], and molecular dynamics [9]. Some protein prediction systems use the same optimization algorithm but differ in the configuration and parameters of the chosen algorithm.

Generally, three key elements affect the performance of any ab initio algorithm: the protein representation, the energy function, and the search strategy used in the conformation space. In this paper we focus on the second element, the energy function, and use it to refine protein secondary structure prediction.

IV. THE PROPOSED METHODOLOGY

The proposed methodology is divided into three phases and outputs the best secondary conformation based on the criterion of energy minimization; see Figures 3 and 4. The description of each phase is as follows.

Phase 1 (Formation of the energy database): The first step in the proposed methodology is a preparation phase, which is the calculation of the energy between each couple of amino acids. Since there are 20 kinds of amino acids, calculating all couples gives 400 results [14].
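The energy database of Phase 1 can be pictured as a lookup table keyed by ordered pairs of residues. The following Python sketch is illustrative only and is not the authors' implementation. The names `AMINO_ACIDS`, `build_energy_table`, and `pair_energy_ev` are hypothetical; `pair_energy_ev` stands in for an external energy calculation that is assumed to return values in electron volts, and the factor 23.06 is the standard conversion from eV to kcal/mol. No real energy values are included.

```python
from itertools import product
from typing import Callable, Dict, Tuple

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Standard conversion factor from electron volts to kcal/mol.
EV_TO_KCAL_PER_MOL = 23.06

PairTable = Dict[Tuple[str, str], float]


def build_energy_table(pair_energy_ev: Callable[[str, str], float]) -> PairTable:
    """Return pair energies in kcal/mol for all 20 x 20 ordered pairs.

    pair_energy_ev is a hypothetical stand-in for the external energy
    calculation; it must return the energy of the pair (a, b) in eV.
    """
    return {
        (a, b): pair_energy_ev(a, b) * EV_TO_KCAL_PER_MOL
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


# Usage with a placeholder calculator (1 eV for every pair, not real data).
table = build_energy_table(lambda a, b: 1.0)
print(len(table))  # 400 ordered pairs
```

Keying the table by ordered pairs keeps the sketch agnostic about whether the energy of (a, b) equals that of (b, a); if the calculation is symmetric, the table could be halved by sorting each key.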
The results are obtained using two software tools: Avogadro, which is used for drawing the molecules of the amino acids and simulating the bonds between them as in vivo [20], and Mopac, which is used for calculating the energy value between them [21]. The Mopac results are given in electron volts; to convert them to kcal, they are multiplied by 23.06. All these energy calculations are stored in a database file to be used later in the main algorithm.

Phase 2 (Secondary structure prediction): The purpose of this phase is to predict the secondary structure of the protein from its primary structure using different software packages: GOR III, HNNC, PHD, Predator, SOPM, and MLRC. These packages use different approaches; for example, GOR III uses ab initio prediction, while PHD uses homology with the aid of neural networks. The output from any of these packages is a line of secondary structure elements, each of which is h (helix), e (sheet), or c (coil). The result of this phase is a list of different possible conformations, or decoys.

Fig. 3. System architecture

Fig. 4. The framework of the proposed algorithm

Phase 3 (Tertiary structure prediction): Given the list of possible secondary structure conformations from the second phase and the primary structure of the unknown protein, the best secondary structure conformation is selected as the one with the minimum energy [22, 23]. The energy function is calculated using the following parameters:

A- Disulfide bond: This bond strengthens the conformation. Two amino acids have a sulfur atom that makes this bond: methionine (M) and cysteine (C). Once the bond is formed, it is very strong and has a large effect on the structure. The bond can be formed between two adjacent amino acids.

B- Side chain charges: If two adjacent amino acids have opposite charges (positive and negative), an attraction occurs between them.
If two adjacent amino acids have the same charge, a repulsion occurs between them. In both cases energy is released. Only five amino acids have charges: aspartic acid (D-), glutamic acid (E-), histidine (H+), lysine (K+), and arginine (R+).

C- α-helix: In the α-helix structure there are 3.6 residues per turn, and amino acids located on the same side have the chance of interacting and releasing energy. That is, an interaction in an α-helix should occur between amino acids that are three or four positions apart in the sequence.

D- β-sheet: For β-sheets, the interaction should occur between amino acids that belong to different β-strands. This is because a β-sheet does not bend much; instead it zigzags, forming a sheet from strands. The interaction therefore occurs between any sulfur-containing amino acids (M, C) located vertically, or between any vertically aligned charged amino acids.

The following is an outline of the algorithm that summarizes the proposed approach, followed by a more detailed description of the procedures involved.

Algorithm
Input: (1) a target protein primary sequence, (2) a list of possible secondary structure conformations of the protein in (1), obtained by other tools or programs.
Output: the best conformation, that is, the one that gives the tertiary structure with minimum energy value.
For each secondary structure conformation in the given input list, do the following:
Step 1: find pairs of amino acids containing a combination of the C or M amino acids (disulfide bond).
Step 2: find all pairs of adjacent amino acids from the following list: (D-, E-, R+, H+, K+).
Step 3: for each α-helix, find the corresponding pairs of amino acids with either (i) distance three or (ii) distance four in the primary sequence, choose the minimum value of (i) and (ii), and discard the other.
Step 4: for β-strands, find vertical pairs that have (C or M) (as in step 1) and/or (D, E, R, H, K) (as in step 2).
Step 5: get the energy value for all pairs collected in the previous four steps using the energy database obtained in Phase 1 (see above).
Step 6: sum the energy values for each possible conformation and select the one with minimum energy.
Step 7: evaluate the result and compare it with the real protein.

Pseudo code:

Input:  i) an amino acid sequence (primary structure) M
        ii) a list of secondary structures S
Output: energy of each secondary structure
Body:
    # find any amino acid pairs that contain a combination of the C or M amino acids (disulfide bond)
    energyTotal = 0;
    for i = 0 : M.length do
        if (search for combination of two ('M', 'C')) then
            energyTotal += get energy combination of two ('M', 'C');
        end if
    next i;

    # find all pairs that contain two of the following amino acids (D-, E-, R+, H+, K+)
    for i = 0 : M.length do
        if (search for combination of two ('D', 'E', 'R', 'H', 'K')) then
            energyTotal += get energy combination of two ('D', 'E', 'R', 'H', 'K');
        end if
    next i;

    # for each α-helix, find pairs at a distance of three or four amino acids in the sequence,
    # choose the minimum of the two possible interactions and discard the other
    # (one helix turn equals 3.6 residues)
    for each structure S[K] do
        energy3 = 0; energy4 = 0;
        determine each α-helix in S[K] and the corresponding residues from the primary sequence;
        let L be the list of the initial locations (in M) of such residues;
        for i = 0 : L.length - 3 do
            energy3 += getEnergy(L[i], L[i+3]);
        next i;
        for i = 0 : L.length - 4 do
            energy4 += getEnergy(L[i], L[i+4]);
        next i;
        if (energy3 > energy4) then
            energyTotal += energy4;
        else
            energyTotal += energy3;
        end if
    next K;

    # for β-strands, find vertical pairs that have C or M (step 1) and/or D, E, R, H, K (step 2)
    for each structure S[K] do
        determine each β-sheet in S[K] and the corresponding residues from the primary sequence;
        let L be the list of the initial locations (in M) of such residues;
        for i = 0 : L.length do
            apply step 1 and step 2 for each pair in the strands of the β-sheet starting at M[L[i]];
            energyTotal += energyStep1() + energyStep2();
        next i;
    next K;

The main idea of the algorithm is to iterate through the primary structure, finding the amino acid segments that correspond to a given secondary structure conformation. This iteration is done once. All four steps of the algorithm depend on the length n of the primary structure (the number of amino acids) without nesting, so the complexity of the program is O(n).

In step 1, we search for C or M only, because the C and M amino acids have a sulfur atom, which means that they will interact with each other. In step 2, we search for D, E, R, H, or K because they are the only amino acids that have a charge, whether positive or negative, which means that they will interact with each other. In step 3, we suppose an interaction in the α-helix every 3 or 4 amino acids in the sequence, because the α-helix turns every 3.6 amino acids. In step 4, we search for C or M or one of the five charged amino acids. These would interact vertically because β-sheets do not bend much; instead they zigzag, forming a sheet from strands.

V. RESULTS AND DISCUSSION

The most commonly used measure of prediction accuracy (similarity measure) is the Q3 score, which is based on the three-state classification: helix, sheet, and coil. The score is the percentage of the residues of a protein that are correctly predicted [24, 25, 26]. The proposed approach is tested using three benchmark data sets.
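The Q3 score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported results: the function name `q3_score` is hypothetical, both inputs are assumed to be equal-length strings over the states h, e, and c, and the two example strings are made up for demonstration.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed state.

    Both arguments are strings of per-residue states: h (helix),
    e (sheet), or c (coil). Comparison is case-insensitive.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted.lower(), observed.lower()))
    return 100.0 * matches / len(observed)


# Made-up example: 8 of the 10 positions agree.
print(q3_score("hhhhcceeec", "hhhccceeee"))  # 80.0
```

Because every residue counts equally, Q3 does not distinguish between errors in helices, sheets, and coils; per-state accuracies would need separate counts.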
Table 1. Model results showing protein categories and similarity percentage

Protein name   Protein category   No. of residues   Similarity percentage   Energy value (kcal)
1ED0           α/β                46                100%                    -806112.41216
1IXA           β                  39                100%                    -794025.66003
1A92           α                  50                100%                    -2131843.31061
1ENH           α                  54                70.3%                   1718576.51906
1A02           α                  56                91.07%                  -2871857.94304
1B3A           α/β                67                68.66%                  -1240158.0333
1APQ           β                  53                62.26%                  -1245623.16152
1AAL           α/β                58                67.24%                  -997862.90942
1A1U           α                  35                74%                     -1203898.88659
6PTI           α/β                58                70.68%                  -1006876.66564
2KP8           α                  72                93.05%                  -1678644.53885
2X04           α                  80                81.25%                  -2043271.51389
1RPO           α                  65                93.84%                  -2199390.45182
1AQ5           α                  47                76.59%                  -1591393.17714
1A0A           α                  63                79.36%                  -1893945.29122
1A3P           β                  45                62.22%                  -872368.90779
1KSR           β                  100               77%                     -776118.90707

Table 1 shows the model results using the most common proteins used by other researchers in the evaluation of protein tertiary structure prediction techniques. The fourth column, 'Similarity percentage', indicates the performance of the proposed algorithm by comparing between the prediction of the proposed