APORC Document Center: StruComp


Evaluating protein similarity from coarse structures

Supplementary Programs and Materials

Reference#

Yong Wang, Ling-Yun Wu, Ji-Hong Zhang, Zhong-Wei Zhan, Xiang-Sun Zhang, and Luonan Chen. Evaluating protein similarity from coarse structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press.

Introduction#

After a brief review of the importance, rise, status quo, and trends of the protein structure comparison problem, this paper develops a new method for the quantitative evaluation of protein structure similarity. The method combines two ingredients: a representation of a protein structure by its global shape, as the set of faces of its convex hull, and the FDOD score scheme, which measures the similarity between two distributions. The resulting approach is simple to implement and computationally efficient. Numerical experiments on four protein datasets measure the structural similarity between their proteins. Compared against the standard SCOP classification, the new approach provides automatic, fast, and almost correct similarity measurement and classification of the proteins in the four datasets.

Extracting the input data from PDB file#

In our numerical tests, a PDB file containing the Cα coordinate information must be processed carefully when a single coordinate entry contains multiple chains or multiple structures. For a multiple-chain entry, we manually extract the wanted chain from the PDB file and name the protein by its PDB entry code plus the chain identifier. For a structure determined by NMR, the file contains many models; we keep only MODEL 1 and still name it by the original PDB entry code. All protein identifiers in this paper follow these rules. By default, our SE program, which extracts the spherical expression information, takes a PDB file as input and uses only the coordinate data of chain A when there are multiple chains and of MODEL 1 when there are multiple models.

For example:

pdb1amk.ent
This protein has a single chain, so the file can be passed directly to our program without any preprocessing.
pdb1qmp.ent
This protein has four chains: A, B, C, and D. In our numerical tests, the file was manually split into four files: 1qmpA.ent, 1qmpB.ent, 1qmpC.ent, and 1qmpD.ent. If you pass the original file directly to our SE program, you get the spherical expression of chain A by default.
pdb1nin.ent
This protein has several models. Only MODEL 1 is needed in the numerical experiments. If you pass this file directly to our SE program, you get the spherical expression of MODEL 1 by default.
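
The manual extraction described above can also be scripted. The following is a minimal Python sketch, not part of the SE program: the function names `extract_ca` and `protein_id` are hypothetical, and the choice to fall back to the first chain encountered when no chain is given is an assumption. It relies only on the fixed-column layout of PDB ATOM records.

```python
def extract_ca(pdb_lines, chain=None):
    """Collect C-alpha coordinates of one chain from the first model.

    pdb_lines: iterable of text lines from a PDB (.pdb or .ent) file.
    chain: one-letter chain identifier; if None, the first chain that
           appears is used (an assumption made for this sketch).
    """
    coords = []
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # everything after the first model is ignored
        if record != "ATOM":
            continue  # skips HETATM, so calcium ions named CA are excluded
        if line[12:16].strip() != "CA":
            continue  # keep only C-alpha atoms
        if line[16] not in (" ", "A"):
            continue  # keep only the first alternate location
        chain_id = line[21]
        if chain is None:
            chain = chain_id
        if chain_id != chain:
            continue
        # x, y, z occupy columns 31-38, 39-46, and 47-54 of an ATOM record
        coords.append((float(line[30:38]),
                       float(line[38:46]),
                       float(line[46:54])))
    return coords


def protein_id(entry_code, chain=""):
    """Name a protein as PDB entry code plus chain identifier."""
    return entry_code + chain
```

For instance, `protein_id("1qmp", "A")` gives `1qmpA`, which matches the naming rule used for the split files above.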

Building SE platform for the comparison#

The first step of protein structure comparison is to build the SE database. The SE program extracts the face information of the smallest convex hull built from the protein backbone and, at the same time, records the energy corresponding to every face. Running the SE program produces a new file with the .se extension, which contains the basic protein information, the normal vector of every face, and the energy sequence. The program is written in mixed languages: the I/O operations are programmed in C++ and the convex hull generation algorithm is coded in Fortran 90. The program currently runs on the Windows platform.

SE program #

Input: A PDB file in .pdb or .ent format

Output: The smallest convex hull of the protein backbone and the energy information, written to a new file with the .se extension

Platform: Windows

Usage: se.exe pdbfilename With one parameter, the PDB filename, the program processes that file.
se.exe With no parameter, the program prints instructions on the screen that guide the user through an interactive session.

Example#

For example, for 1acj.ent, on the command line:

>se.exe 1acj.ent
>
>Database Code: 1ACJ
>Molecule name: ACETYLCHOLINESTERASE (E.C.3.1.1.7) COMPLEXED WITH TACRINE
>Classification: HYDROLASE(CARBOXYLIC ESTERASE)
>Number of Atoms: 4095
>Number of Ca Atoms: 528
>Get support planes ...: 19.844000 seconds
>Number of SP planes: 120
>
>Please input the PDB filename:    (then you can go on with the next protein)

After this procedure, a new file named 1acj.se is generated.

The smallest convex hull is illustrated below using the information in the SE file.

The format of the SE file is as follows:

Database Code: 1ACJ                                                         
Molecule name: ACETYLCHOLINESTERASE (E.C.3.1.1.7) COMPLEXED WITH TACRINE    
Classification: HYDROLASE(CARBOXYLIC ESTERASE)                              
Number of Atoms: 4095                                                       
Number of Ca Atoms: 528                                                    
Number of SP planes: 120                                                      
( -0.472654, -0.061812, -0.879078 ) | -90.351741 | 18072.042804
( -0.449403, 0.672920, -0.587551 ) | -23.450722 | 16917.721193
( 0.546279, -0.283686, 0.788100 ) | -3.322047 | 16905.139282
( -0.677200, 0.674699, -0.293568 ) | -7.333534 | 16744.166155
( -0.340059, 0.684439, -0.644906 ) | -24.991328 | 16669.514355
( 0.924270, 0.368883, -0.098234 ) | -8.827974 | 16658.432922
...

The first part is basic information about protein 1acj. Each data line then gives, for one face of the smallest convex hull, the normal vector (the three numbers in parentheses), the intercept of the face, and the energy corresponding to the face.

For every protein, the SE program needs to be run only once, so an SE database should be built in advance for later comparisons and whole-database scans. The motivation for representing a protein by its smallest convex hull is data compression. The figure below illustrates this effect clearly; the larger the protein, the more pronounced the effect.
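
To make the layout concrete, here is a minimal Python sketch of a reader for this format. It is not part of the SE distribution: the name `parse_se` and the returned data layout are hypothetical, and the sketch assumes that every header line has the form `key: value` and every face line has the form `( nx, ny, nz ) | intercept | energy`, as in the listing above.

```python
import re

# One face line: "( nx, ny, nz ) | intercept | energy"
_FACE = re.compile(
    r"^\(\s*([^,]+),\s*([^,]+),\s*([^)]+?)\s*\)"
    r"\s*\|\s*(\S+)\s*\|\s*(\S+)\s*$"
)


def parse_se(lines):
    """Split SE file lines into a header dict and a list of faces.

    Each face is returned as ((nx, ny, nz), intercept, energy).
    Lines that are neither a face nor "key: value" are ignored.
    """
    header = {}
    faces = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = _FACE.match(line)
        if match:
            nx, ny, nz, intercept, energy = (float(g) for g in match.groups())
            faces.append(((nx, ny, nz), intercept, energy))
        elif ":" in line:
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
    return header, faces
```

Reading a file is then `header, faces = parse_se(open("1acj.se"))`. Header values are kept as strings, so a count such as `Number of SP planes` needs an explicit `int(...)` before it is compared with `len(faces)`.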

Protein structure pairwise comparison#

Once the SE database has been built, the FSSComp program compares the face information of the two smallest convex hulls built from the protein backbones and outputs the similarity score. Running FSSComp produces a result file named proteinIdA-proteinIdB.txt, in which the comparison results are summarized. The main steps of pairwise protein structure comparison are illustrated below. The program is written in mixed languages: the I/O operations are programmed in C++ and the convex hull generation algorithm is coded in Fortran 90. The program currently runs on the Windows platform.

FSS comparison program #

Input: The PDB file A and B in .pdb format or .ent format

Output: The similarity score between the two proteins, obtained by the FDOD function. The comparison result is summarized in a new file named IDA-IDB.txt

Platform: Windows

Usage: FSSCom.exe pdbfilenameA pdbfilenameB With two parameters, the filenames of the proteins to be compared, the program compares those two files.
FSSCom.exe With no parameter, the program prints instructions on the screen that guide the user through an interactive session.

Example#

For example, take 1bj5.ent and 1uor.ent. Running the SE program on them yields two SE files, 1bj5.se and 1uor.se, which serve as the inputs of the comparison program.

On the command line:

>FSSComp.exe 1bj5.se 1uor.se
>
>The computation result is summed as follows:
>The static protein is 1BJ5, the number of CA is  582, the number of plane is 130:
>The rotate protein is 1UOR, the number of CA is  580, the number of plane is 103:
>The gap between the two protein's residue number is 2 
>The gap between the two protein's plane number is 27 
>The FSS distribution distance is 0.439332

After this procedure, a new file named 1bj5-1uor.txt is generated.
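
For a whole-database scan, the score has to be collected from many such runs. The Python sketch below shows one way to do that. It assumes, without confirmation from the program itself, that the captured screen output or the result file contains the line `The FSS distribution distance is ...` exactly as printed above; the function names are hypothetical.

```python
import re

# Matches the score line shown in the example output above.
_DISTANCE = re.compile(r"FSS distribution distance is\s+([-+0-9.eE]+)")


def read_fss_distance(text):
    """Return the FSS distribution distance found in a comparison report."""
    match = _DISTANCE.search(text)
    if match is None:
        raise ValueError("no FSS distribution distance line found")
    return float(match.group(1))


def result_filename(id_a, id_b):
    """Build the result file name, following the 1bj5-1uor.txt pattern."""
    return "%s-%s.txt" % (id_a, id_b)
```

With the example above, `result_filename("1bj5", "1uor")` gives `1bj5-1uor.txt`, and passing the printed report to `read_fss_distance` yields the value on its last line.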

The comparison steps are illustrated below:

Advantages: As mentioned above, the FSSCom program, which measures the similarity between the FSSs of the SE expressions of two proteins, is easy to implement and runs very fast. With the two SE files as input, computing the final score takes almost no time. This speed makes fast similarity searches over today's vast databases feasible.

The protein datasets and clustering results#

With the SE program and the FSSCom program for pairwise comparison in place, the last thing to discuss is the clustering method, which lets us visually inspect the distance matrix that pairwise comparison with FSSCom produces for a dataset. In our paper, the PHYLIP (Phylogeny Inference Package) package is used to analyze the distance matrices. PHYLIP supplies three methods for clustering; we chose the KITSCH (Fitch-Margoliash and Least Squares Methods with Evolutionary Clock) algorithm. All phylogenetic tree figures are plotted with the TREEVIEW software.

The clustering steps#

Step 1 Prepare data: Choose a protein dataset and extract the wanted chain from each PDB file.
Step 2 Build SE database: Produce the spherical expression for each PDB file in the dataset. The information for every face of the smallest convex hull, including the normal vector, intercept, and FSS, is stored in a plane file in the designed format.
Step 3 For each pair of proteins PA, PB, compute FDOD(PA, PB) using the FDOD function to obtain the similarity distance between them from the FSS sequence information stored in their SE files.
Step 4 Record all the pairwise scores in a distance matrix and, at the same time, label every protein in the dataset with its biological properties.
Step 5 Use the distance matrix as input to the clustering software, which generates the output file and the tree file.
Step 6 To view the tree clearly, apply a phylogenetic tree plotting program.
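
Steps 3 to 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the paper: the `distance` argument is a placeholder for whatever returns the pairwise score (for example a lookup of scores already computed by the comparison program), the score is assumed to be symmetric, and the helper names are hypothetical. The output follows the PHYLIP square distance matrix layout: the number of proteins on the first line, then one row per protein with the name padded to ten characters.

```python
def distance_matrix(names, distance):
    """Fill a symmetric matrix by calling distance(a, b) once per pair.

    distance is any callable returning the similarity distance of two
    protein identifiers; symmetry is an assumption of this sketch.
    """
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(names[i], names[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def to_phylip(names, matrix):
    """Format a square distance matrix as PHYLIP distance input text."""
    lines = ["%5d" % len(names)]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP names occupy ten characters
        lines.append(label + " " + " ".join("%.6f" % d for d in row))
    return "\n".join(lines) + "\n"
```

The resulting text can be saved to a file and given to KITSCH as its distance input. Identifiers such as 1LVOA fit comfortably within the ten-character name field.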

Some tools used#

PHYLIP#

Here is a copy of the PHYLIP program; you can also get it from the following URL:

http://evolution.genetics.washington.edu/phylip/software.html

A copy of the PHYLIP FAQ is also provided.

In practice, mainly the KITSCH program is used here.

TreeView#

Here is a copy of the TreeView program, which draws the phylogenetic tree from the tree files generated by the clustering methods. This is the Windows version; other versions can be found at the following site:

http://taxonomy.zoology.gla.ac.uk/rod/treeview/

The protein datasets and computational results#

The following four data sets were used in the paper:

Leluk-Konieczny-Roterman data set #

@article{LelKonRot03,
  author = {J.Leluk and L.Konieczny and I.Roterman},
  year = 2003,
  title = {Search for Structural Similarity in Proteins},
  journal = {Bioinformatics},
  volume = 19,
  number = 1,
  pages = {117-124}
}

@article{Krasnogor04a,
author={N. Krasnogor and D. A. Pelta },
title={Measuring the similarity of protein structures by means of the Universal
Similarity Metric },
journal={Bioinformatics},
volume={20},
year={2004},
number={7}
}

David-Iosif data set #

@article{Bostick03a,
author={David Bostick and Iosif I. Vaisman},
title={A new topological method to measure protein structure similarity},
journal={Biochemical and Biophysical Research Communications},
volume={304},
year={2003},
pages={320-325}
}

Chew-Kedem data set #

@INPROCEEDINGS{CheKed02,
author="L.P. Chew and K.Kedem",
title="Finding Consensus Shape for a Protein Family",
booktitle="18th ACM Symp. on Computational Geometry. Barcelona, Spain",
year="2002"
}

@article{Krasnogor04a,
author={N. Krasnogor and D. A. Pelta },
title={Measuring the similarity of protein structures by means of the Universal
Similarity Metric },
journal={Bioinformatics},
volume={20},
year={2004},
number={7}
}

(reduced) Skolnick data set #

@INPROCEEDINGS{CapLan2002,
author="A.Caprara and G.Lancia",
title="Structural Alignment of Large-Size Proteins via Lagrangian Relaxation",
booktitle="Proceedings of RECOMB 2002",
year="2002",
pages="100-108",
publisher="ACM"
}

@article{Krasnogor04a,
author={N. Krasnogor and D. A. Pelta },
title={Measuring the similarity of protein structures by means of the Universal
Similarity Metric },
journal={Bioinformatics},
volume={20},
year={2004},
number={7}
}

Exploring the structure classification of SARS coronavirus#

The introduction of the SARS Dataset#

Severe acute respiratory syndrome (SARS) is an atypical, highly contagious pneumonia. Paper YangHaitao03a reports that the SARS coronavirus affected 32 countries in the period from February to June 2003. In total, almost 8,500 people were infected and more than 900 died from the disease. Paper Marra03a states that the mortality rate of the disease appears to be 3 to 6%, and can be as high as 43 to 55% in people older than 60 years.

After the World Health Organization and member laboratories publicly named the SARS coronavirus the "SARS virus", a number of researchers worldwide undertook the identification of the causative agent. The important results appeared in papers Marra03a and Rota03a, which showed that the genome organization of the SARS coronavirus is similar to that of other coronaviruses, while phylogenetic analyses and sequence comparisons showed that SARS-CoV is not closely related to any of the previously characterized coronaviruses.

Because protein structure is more conserved than protein sequence and relates more closely to protein function, especially where anti-SARS drug design is concerned, many laboratories around the world have tried to determine the three-dimensional structure of the SARS main protein. In paper Anand03a, a predicted structure of the SARS coronavirus protease is constructed by homology modeling and deposited in the PDB as entry 1p9t. Later, the crystal structures of the SARS-CoV main protease at different pH values and in complex with a specific inhibitor are reported in YangHaitao03a. In this section we apply our new protein structure comparison technique to explore the structure classification of the SARS coronavirus. To the best of our knowledge, this is the first study of the classification of the SARS main protein from the viewpoint of structure comparison.

To collect the structure data of coronaviruses, we searched the whole Protein Data Bank with the keyword "coronaviruses". Our query found 14 structures in the current PDB release. Nine of these entries, including the theoretical entry 1p9t, relate to the hydrolase function, as listed in the following table. Note that all of these proteins except 1p9t have multiple chains. As described above in the section on extracting the input data, we constructed a SARS coronavirus dataset by extracting every chain from the PDB data files and naming each one by its PDB ID plus its chain ID. The dataset therefore consists of the following 25 entries: 1LVOA, 1LVOB, 1LVOC, 1LVOD, 1LVOE, 1LVOF, 1P9SA, 1P9SB, 1P9UA, 1P9UB, 1P9UC, 1P9UD, 1P9UE, 1P9UF, 1Q2WA, 1Q2WB, 1P9T, 1UJ1A, 1UJ1B, 1UK2A, 1UK2B, 1UK3A, 1UK3B, 1UK4A, 1UK4B.

Materials#

Software#

The clustering methods used are:
The TREEVIEW program to plot the trees

Other materials#

In our numerical tests, we give some analysis of the new similarity measure between two protein structures. These results include the effect of the data compression, the relationship between the new similarity score and the RMSD score in FSSP obtained by the Dali method, and the relationship between the similarity score and the protein size. The materials supporting these analyses are listed in this part.

The data compression of the Spherical Expression #

Excel file recording the data on protein size and SE number
