Category Archives: Publications

Ko, S. and Lee, H. (2009) Integrative approaches to the prediction of protein functions based on the feature selection. BMC Bioinformatics, 10:455 (IF: 3.78)

Integrative approaches to the prediction of protein functions based on the feature selection

  • Author : Seokha Ko and Hyunju Lee
  • Published Date : 2009
  • Category : 
  • Place of publication : BMC Bioinformatics

Abstract

Background

Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue.

Results

We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy.

Conclusions

In contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.
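The second approach in the abstract above ranks data sources by their estimated coefficients under an L1 penalty. A minimal sketch of that idea follows, assuming plain (non-kernel) logistic regression fitted by proximal gradient descent, with each input column standing in for one data source. The function names, the toy data, and the parameter values (`lam`, `step`, `iterations`) are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam=0.05, step=0.5, iterations=300):
    """Fit logistic regression with an L1 penalty on the weights.

    Each iteration takes a gradient step on the mean logistic loss and
    then soft-thresholds the weights. The bias is not penalized.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * d
        grad_b = 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - label
            for j in range(d):
                grad_w[j] += err * row[j] / n
            grad_b += err / n
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


def make_toy_data(seed=0, n=200):
    """Hypothetical data: column 0 is informative, columns 1 and 2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        signal = (1.0 if label else -1.0) + rng.gauss(0.0, 1.0)
        X.append([signal, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    weights, bias = fit_l1_logistic(X, y)
    # Larger absolute weights suggest sources that contribute more.
    for index, weight in enumerate(weights):
        print(f"source {index}: coefficient {weight:.3f}")
```

In this sketch the penalty pushes the coefficients of uninformative columns toward zero, so the magnitude of each remaining coefficient serves as a rough contribution score for its source.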

Kim, S.S. and Lee, H. (2009) An Enhanced Dimension Reduction Approach for Microarray Gene Expression Data. Interdisciplinary Bio Central, 1(4):13.

An Enhanced Dimension Reduction Approach for Microarray Gene Expression Data

  • Author : Sung-suk Kim and Hyunju Lee
  • Published Date : 2009
  • Category : 
  • Place of publication : Interdisciplinary Bio Central

Abstract

Introduction: To achieve high classification accuracy on microarray gene expression data sets, various classification methods that use dimension reduction have been studied extensively. However, these methods still need improvement. For example, many dimension reduction methods give different results depending on the data and conditions.

Materials and Methods: Here, we introduce an enhanced dimension reduction method that joins two dimension reduction methods, Partial Least Squares (PLS) and Minimum Average Variance Estimation (MAVE). The PLS method generates new transformed genes that contain compressed information. The MAVE method then clusters samples of the same class and separates samples of different classes.

Results and Discussion: When this enhanced dimension reduction approach is applied with two classification methods, Adaptive Network-based Fuzzy Inference System (ANFIS) and Support Vector Machine (SVM), the classification accuracies on nine cancer data sets improve compared with using the PLS dimension reduction approach alone.

Conclusion and Prospects: This study shows that the enhanced dimension reduction approach can be used with any classification method to obtain high classification quality on data sets that have a large number of features relative to the number of samples.
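As a rough illustration of the PLS step in the abstract above, the sketch below computes only the first PLS component for a single response, using the standard result that the first weight vector is proportional to the covariance between each centered feature and the centered response. The MAVE step and the classifiers are not shown. The function name and the toy matrix are assumptions made for illustration.

```python
import math


def first_pls_component(X, y):
    """Return (weights, scores) of the first PLS component.

    X is a list of samples (rows) by features (columns); y holds one
    numeric response per sample, for example a 0/1 class label.
    """
    n, d = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    # Center features and response so the weights reflect covariance.
    Xc = [[row[j] - col_means[j] for j in range(d)] for row in X]
    yc = [value - y_mean for value in y]
    weights = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = math.sqrt(sum(v * v for v in weights))
    if norm == 0.0:
        raise ValueError("no feature covaries with the response")
    weights = [v / norm for v in weights]
    # Scores are the projection of each sample onto the weight vector.
    scores = [sum(Xc[i][j] * weights[j] for j in range(d)) for i in range(n)]
    return weights, scores


if __name__ == "__main__":
    # Hypothetical expression matrix: 4 samples, 3 genes.
    X = [[1.0, 5.0, 0.0],
         [2.0, 5.0, 1.0],
         [3.0, 5.0, 0.0],
         [4.0, 5.0, 1.0]]
    y = [0, 0, 1, 1]
    weights, scores = first_pls_component(X, y)
    print(weights)
    print(scores)
```

The scores form a single new feature that compresses the genes most associated with the class label, which is the sense in which PLS produces "transformed genes".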

Pena-Castillo, L., Tasan, M., Myers, C. L., Lee, H., et al. (2008) A critical assessment of M. musculus gene function prediction using integrated genome evidence. Genome Biology, 9(Suppl 1):S2 (IF: 7.17)

A critical assessment of M. musculus gene function prediction using integrated genome evidence

  • Author : Lourdes Pena-Castillo, Murat Tasan, Chad L Myers, Hyunju Lee, et al.
  • Published Date : 2008
  • Category : 
  • Place of publication : Genome Biology

Abstract

Background:

Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.

Results:

In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.

Conclusion:

We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
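The Results above report precision at a fixed recall of 20%. A minimal sketch of how such a figure can be computed for one GO term is given below, assuming each gene has a prediction score and a binary true label. The function name and the example scores are hypothetical, and the assessment's exact evaluation protocol may differ.

```python
def precision_at_recall(scores, labels, target_recall=0.2):
    """Precision at the shortest ranked list that reaches target_recall.

    scores: prediction scores, higher means more confident.
    labels: 1 if the gene truly has the function, else 0.
    Tied scores keep their input order.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / rank
    return true_positives / len(ranked)


if __name__ == "__main__":
    # Hypothetical predictions for ten genes and one GO term.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
    print(precision_at_recall(scores, labels, target_recall=0.2))
    print(precision_at_recall(scores, labels, target_recall=0.5))
```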

Lee, H., Kong, S., and Park, P. J. (2008) Integrative analysis reveals the direct and indirect interactions between DNA copy number aberrations and gene expression changes. Bioinformatics, 24(7):889-896. (IF: 4.894)

Integrative analysis reveals the direct and indirect interactions between DNA copy number aberrations and gene expression changes

  • Author : Hyunju Lee, Sek Won Kong, and Peter J. Park
  • Published Date : 2008
  • Category : 
  • Place of publication : Bioinformatics

Abstract

Motivation: DNA copy number aberrations (CNAs) and gene expression (GE) changes provide valuable information for studying chromosomal instability and its consequences in cancer. While it is clear that the structural aberrations and the transcript levels are intertwined, their relationship is more complex and subtle than initially suspected. Most studies so far have focused on how a CNA affects the expression levels of those genes contained within that CNA.

Results: To better understand the impact of CNAs on expression, we investigated the correlation of each CNA to all other genes in the genome. The correlations are computed over multiple patients that have both expression and copy number measurements in brain, bladder and breast cancer data sets. We find that a CNA has a direct impact on the gene amplified or deleted, but it also has a broad, indirect impact elsewhere. To identify a set of CNAs that is coordinately associated with the expression changes of a set of genes, we used a biclustering algorithm on the correlation matrix. For each of the three cancer types examined, the aberrations in several loci are associated with cancer-type specific biological pathways that have been described in the literature: CNAs of chromosome (chr) 7p13 were significantly correlated with epidermal growth factor receptor signaling pathway in glioblastoma multiforme, chr 13q with NF-kappaB cascades in bladder cancer, and chr 11p with Reck pathway in breast cancer. In all three data sets, gene sets related to cell cycle/division such as M phase, DNA replication and cell division were also associated with CNAs. Our results suggest that CNAs are both directly and indirectly correlated with changes in expression and that it is beneficial to examine the indirect effects of CNAs.
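The correlation matrix described in the Results above can be sketched as follows, assuming Pearson correlation computed across patients (the abstract does not name the correlation measure). Rows are copy number loci and columns are genes. The function names and the toy measurements are illustrative assumptions, and the biclustering step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(copy_number, expression):
    """Correlate every copy number locus with every gene across patients.

    copy_number: one row per locus, one value per patient.
    expression:  one row per gene, one value per patient (same patient order).
    """
    return [[pearson(locus, gene) for gene in expression]
            for locus in copy_number]


if __name__ == "__main__":
    # Hypothetical data for five patients.
    copy_number = [[2, 3, 4, 2, 1],
                   [2, 2, 2, 3, 3]]
    expression = [[5, 7, 9, 5, 3],     # rises with locus 0
                  [1, 1, 2, 2, 1],
                  [9, 7, 5, 9, 11]]    # falls with locus 0
    for row in correlation_matrix(copy_number, expression):
        print([round(value, 3) for value in row])
```

Entries for genes that lie inside a locus correspond to the direct effect discussed above, while strong entries for genes elsewhere in the genome are candidates for the indirect effect.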

Ma, X.*, Lee, H.*, Wang, L., and Sun, F. (2007) CGI: a new approach for prioritizing Genes by Combining Gene Expression and Protein-Protein Interactions. Bioinformatics, 23(2): 215-221. (* These two authors contributed equally.) (IF: 4.894)

CGI: a new approach for prioritizing Genes by Combining Gene Expression and Protein-Protein Interactions

  • Author : Xiaotu Ma, Hyunju Lee, Li Wang, and Fengzhu Sun
  • Published Date : 2007
  • Category : 
  • Place of publication : Bioinformatics

Abstract

Motivation: Identifying candidate genes associated with a given phenotype or trait is an important problem in biological and biomedical studies. Prioritizing genes based on the accumulated information from several data sources is of fundamental importance. Several integrative methods have been developed when a set of candidate genes for the phenotype is available. However, how to prioritize genes for phenotypes when no candidates are available is still a challenging problem.

Results: We develop a new method for prioritizing genes associated with a phenotype by Combining Gene expression and protein Interaction data (CGI). The method is applied to yeast gene expression data sets in combination with protein interaction data sets of varying reliability. We found that our method outperforms the intuitive prioritizing method of using either gene expression data or protein interaction data only and a recent gene ranking algorithm GeneRank. We then apply our method to prioritize genes for Alzheimer’s disease.

Lee, H., Deng, M., Sun, F., and Chen, T. (2006) An Integrated Approach to the Prediction of Domain-Domain Interactions. BMC Bioinformatics, 7:269. (IF:3.62)

An Integrated Approach to the Prediction of Domain-Domain Interactions

  • Author : Hyunju Lee, Minghua Deng, Fengzhu Sun, Ting Chen
  • Published Date : 2006
  • Category : 
  • Place of publication : BMC Bioinformatics

Abstract

Background

The development of high-throughput technologies has produced several large scale protein interaction data sets for multiple species, and significant efforts have been made to analyze the data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain interactions, it is crucial to understand protein interactions at the level of the domains. The availability of many diverse biological data sets provides an opportunity to discover the underlying domain interactions within protein interactions through an integration of these biological data sets.

Results

We combine protein interaction data sets from multiple species, molecular sequences, and gene ontology to construct a set of high-confidence domain-domain interactions. First, we propose a new measure, the expected number of interactions for each pair of domains, to score domain interactions based on protein interaction data in one species, and we show that its performance is similar to that of the E-value defined by Riley et al. [1]. Our new measure is applied to the protein interaction data sets from yeast, worm, fruitfly and humans. Second, information on pairs of domains that coexist in known proteins and on pairs of domains with the same gene ontology function annotations is incorporated to construct a high-confidence set of domain-domain interactions using a Bayesian approach. Finally, we evaluate the set of domain-domain interactions by comparing predicted domain interactions with those defined in the iPfam database [2,3], which were derived from protein structures. The accuracy of the predicted domain interactions is also confirmed by comparison with experimentally obtained domain interactions from H. pylori [4]. As a result, a total of 2,391 high-confidence domain interactions are obtained, and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes.

Conclusion

Our study shows that integration of multiple biological data sets based on the Bayesian approach provides a reliable framework to predict domain interactions. By integrating multiple data sources, the coverage and accuracy of predicted domain interactions can be significantly increased.

Lee, H., Tu, Z., Deng, M., Sun, F., and Chen, T. (2006) Diffusion Kernel-Based Logistic Regression Models for Protein Function Prediction. OMICS: A Journal of Integrative Biology, 10(1): 40-55. (IF: 2.056)

Diffusion Kernel-Based Logistic Regression Models for Protein Function Prediction

  • Author : Hyunju Lee, Zhidong Tu, Minghua Deng, Fengzhu Sun, Ting Chen
  • Published Date : 2006
  • Category : 
  • Place of publication : OMICS: A Journal of Integrative Biology

Abstract

Assigning functions to unknown proteins is one of the most important problems in proteomics. Several approaches have used protein–protein interaction data to predict protein functions. We previously developed a Markov random fields (MRF) based method to infer a protein’s functions using protein-protein interaction data and the functional annotations of its protein interaction partners. In the original model, only direct interactions were considered and each function was considered separately. In this study, we develop a new model which extends direct interactions to all neighboring proteins, and one function to multiple functions. The goal is to understand a protein’s function based on information on all the neighboring proteins in the interaction network. We first developed a novel kernel logistic regression (KLR) method based on diffusion kernels for protein interaction networks. The diffusion kernels provide means to incorporate all neighbors of proteins in the network. Second, we identified a set of functions that are highly correlated with the function of interest, referred to as the correlated functions, using the chi-square test. Third, the correlated functions were incorporated into our new KLR model. Fourth, we extended our model by incorporating multiple biological data sources such as protein domains, protein complexes, and gene expressions by converting them into networks. We showed that the KLR approach of incorporating all protein neighbors significantly improved the accuracy of protein function predictions over the MRF model. The incorporation of multiple data sets also improved prediction accuracy. The prediction accuracy is comparable to another protein function classifier based on the support vector machine (SVM), using a diffusion kernel. The advantages of the KLR model include its simplicity as well as its ability to explore the contribution of neighbors to the functions of proteins of interest.
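The abstract above relies on diffusion kernels to let every protein in the network, not only direct interaction partners, contribute to a prediction. A minimal sketch of a diffusion kernel on a small interaction graph follows, assuming the common definition K = exp(-tau * L) with L the graph Laplacian, and evaluating the matrix exponential by a truncated Taylor series that is suitable only for small graphs. The function names, the value of `tau`, and the four-protein path graph are illustrative assumptions.

```python
def graph_laplacian(n, edges):
    """Laplacian L = D - A of an undirected graph on nodes 0..n-1."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L


def matmul(A, B):
    """Product of two dense matrices given as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def diffusion_kernel(n, edges, tau=1.0, terms=40):
    """Approximate K = exp(-tau * L) with a truncated Taylor series."""
    L = graph_laplacian(n, edges)
    M = [[-tau * value for value in row] for row in L]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    K = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term now holds M^k / k!
        term = [[value / k for value in row] for row in matmul(term, M)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K


if __name__ == "__main__":
    # Hypothetical path of four proteins: 0 - 1 - 2 - 3.
    K = diffusion_kernel(4, [(0, 1), (1, 2), (2, 3)], tau=1.0)
    for row in K:
        print([round(value, 4) for value in row])
```

Kernel entries decay with network distance, so proteins 0 and 3 still receive a nonzero similarity even though they do not interact directly. For real interaction networks a library routine for the matrix exponential would be a better choice than this series.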

Lee, H., Deng, M., Sun, F., and Chen, T. (2005) Assessment of the Reliability of Protein-Protein Interactions Using Protein Localization and Gene Expression Data. BIOINFO 2005, Busan, Korea.

Assessment of the Reliability of Protein-Protein Interactions Using Protein Localization and Gene Expression Data

  • Author : Hyunju Lee, Minghua Deng, Fengzhu Sun, Ting Chen
  • Published Date : 2005
  • Category : 
  • Place of publication : BIOINFO 2005, Busan, Korea

Abstract

Estimating the reliability of protein-protein interaction data sets obtained by high-throughput technologies such as yeast two hybrid assays and mass spectrometry is of great importance. We have developed a maximum likelihood estimation (MLE) method that integrates both protein localization data and gene expression data to estimate the reliability of protein interaction data sets. Through the integration, we obtain more accurate estimates of the reliability of various protein interaction data sets.

Lee, H., Lee, S., and Kim, H. (2005) Design and implementation of an extended relationship semantics in an ODMG-compliant OODBMS. Journal of Systems and Software, 76(2): 171-180. (IF:0.562)

Design and implementation of an extended relationship semantics in an ODMG-compliant OODBMS

  • Author : Hyunju Lee, Sangwon Lee, Hyoungjoo Kim
  • Published Date : 2005
  • Category : ODMG
  • Place of publication : Journal of Systems and Software

Abstract

Relationships, in addition to entities, are important in real-world database modeling. In particular, many object oriented database applications, including CAD/CAM, CASE and multi-media, need to model various complex relationships, especially the ‘part–whole’ relationship. Without built-in relationship support from the DBMS, managing relationships incurs a large overhead from application development through maintenance, because the relationships must be hard coded within the application program itself. In this paper, we propose a powerful ‘part–whole’ relationship model that naturally extends the ODMG-3.0 object database standard. The proposed relationship model can support almost all of the relationship functionality found in the contemporary relational database model and the object oriented data model. To design and implement this relationship model, we seamlessly extend the ODMG-3.0 relationship using the inheritance concept. We also identify several possible run-time anomalies that arise in implementing the relationship and provide solutions to them.
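To make the cost of hand-coded relationships concrete, the sketch below shows the kind of bookkeeping an application has to write when the DBMS offers no built-in support: keeping both directions of a part-whole link consistent, enforcing exclusive ownership, and detaching parts when the whole is deleted. It is a hypothetical in-memory illustration in Python. The class and method names are assumptions, and it is neither the ODMG-3.0 API nor the model proposed in the paper.

```python
class Part:
    """A component that belongs to at most one whole at a time."""

    def __init__(self, name):
        self.name = name
        self.whole = None  # inverse side of the relationship


class Whole:
    """An aggregate that keeps both sides of the relationship in sync."""

    def __init__(self, name):
        self.name = name
        self.parts = []

    def add_part(self, part):
        if part.whole is self:
            return
        # Exclusive ownership: detach the part from its previous whole first.
        if part.whole is not None:
            part.whole.remove_part(part)
        self.parts.append(part)
        part.whole = self

    def remove_part(self, part):
        self.parts.remove(part)
        part.whole = None

    def delete(self):
        """Detach every part so none keeps a dangling inverse reference."""
        detached = list(self.parts)
        for part in detached:
            part.whole = None
        self.parts.clear()
        return detached


if __name__ == "__main__":
    car_a, car_b = Whole("car A"), Whole("car B")
    engine = Part("engine")
    car_a.add_part(engine)
    car_b.add_part(engine)  # moves the engine and updates both sides
    print(engine.whole.name, len(car_a.parts), len(car_b.parts))
```

Every application that models such relationships has to repeat this logic, which is the overhead that built-in relationship semantics in the DBMS are meant to remove.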