Author Archives: Admin

Graduation in February 2017



2014.10 Social Event

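The theme of the October 2014 social event was tasting vegetarian food at a vegetarian buffet in the Pungam district.

We had an enjoyable time savoring a variety of vegetarian dishes that we had not been able to try before, such as soy meat and wheat cutlets.

The group photo is below!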


Social Event on May 7, 2013

Dear Lab members,

The April social event will be held on May 7.
It should have been held sometime last month, but it was delayed because of the midterm exam schedule and members' busy workloads.
I am now pleased to announce that the social event will take place on May 7.

The detailed plan is listed below:
2013-5-7 (Tue)

16:00 – 17:50
 A soccer game to improve members' health and sociability.
(Location: the big football field next to the main gate of the school)
(We will play basketball instead if there are not enough participants.)
18:00 – 20:00  We will eat Sam-gyup-sal (pork belly, called ‘삼겹살’ in Korean) for dinner.
(Those who cannot eat pork can choose any other food.)

Participation is not mandatory, so feel free not to join, but I hope all of you will come because it is going to be fun.
If you want to participate, please bring suitable clothes and 10,000 won.
One more thing: if you cannot join, please let me know by e-mail in advance.

Thank you very much for reading.
I am looking forward to seeing you on the ground soon.




Song, B. and Lee, H. (2012) Prioritizing Disease Genes by Integrating Domain Interactions and Disease Mutations in a Protein-Protein Interaction Network. IJICIC, 8(2), 1327-1338.

Prioritizing Disease Genes by Integrating Domain Interactions and Disease Mutations in a Protein-Protein Interaction Network


Complex diseases such as cancer involve interrelationships among several genes, and protein-protein interaction networks have been studied extensively in attempts to reveal the relationship between genes and diseases. Although these studies have shown promising results for identifying disease genes, it has not been systematically studied how a protein functions differently depending on its interaction partners in the network, even though a protein can have multiple functions. In this study, domains are considered as functional units of proteins and we investigate how disease-related mutations in domains can be used to identify other disease genes in a domain-domain interaction network. We subsequently propose a computational method to predict disease genes based on the following two assumptions. The first assumption is that proteins closely interacting with known disease proteins in a protein interaction network are likely to be involved in the same disease. Second, even when two proteins are at the same distance from known disease genes in a protein interaction network, the protein interacting with known disease genes through a domain with a mutation is more likely to be related to the disease than other proteins that interact through domains with no mutation. When the proposed approach is applied to five diseases, it ranks disease-related genes more highly than a model using only a protein interaction data set.
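The two assumptions can be illustrated with a small sketch. This is not the scoring scheme from the paper: the 1/distance score, the `bonus` value, the function names, and the toy protein names are all assumptions chosen only to show how network proximity and a mutated interacting domain might be combined into one ranking.

```python
from collections import deque

def shortest_distances(graph, sources):
    """Breadth-first search: hop distance from the nearest source node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def rank_candidates(graph, known_disease, mutated_links, bonus=0.5):
    """Rank non-disease proteins by an illustrative score.

    Assumption 1: closer to a known disease protein -> higher score (1/distance).
    Assumption 2: interacting with a known disease protein through a mutated
    domain adds a fixed bonus (the value 0.5 is arbitrary).
    """
    dist = shortest_distances(graph, known_disease)
    scores = {}
    for protein, d in dist.items():
        if protein in known_disease:
            continue
        score = 1.0 / d
        if any((protein, k) in mutated_links or (k, protein) in mutated_links
               for k in known_disease):
            score += bonus
        scores[protein] = score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))

# Toy network with hypothetical protein names; D1 is the known disease protein.
graph = {
    "D1": ["A", "B"],
    "A": ["D1", "C"],
    "B": ["D1"],
    "C": ["A"],
}
mutated_links = {("B", "D1")}  # B binds D1 through a domain carrying a mutation
ranking = rank_candidates(graph, {"D1"}, mutated_links)
print(ranking)
```

Here A and B are both one hop from D1, but B is ranked first because its interaction goes through a mutated domain.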

Azad, A., Shahid, S., Noman, N., and Lee, H. (2011) Prediction of Plant Promoters Based on hexamers and Random Triplet Pair Analysis. Algorithms for Molecular Biology, 6:19 (IF: 2.80)

Prediction of Plant Promoters Based on hexamers and Random Triplet Pair Analysis

  • Author : A K M Azad, Saima Shahid, Nasimul Noman, Hyunju Lee
  • Published Date : 2011
  • Category : 
  • Place of publication : Algorithms for Molecular Biology



With an increasing number of plant genome sequences, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, prediction accuracy of these still requires further improvement. The limitations of these methods can be addressed by selecting appropriate features for distinguishing promoters and non-promoters.


In this study, we proposed two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet-pairs (hexamer sequences) were selected based on the difference in the frequency of hexamers between promoters and non-promoters. In RTPFSGA, random triplet-pairs (RTPs) were selected by exploiting a genetic algorithm that distinguishes frequencies of non-adjacent triplet pairs between promoters and non-promoters. Then, a support vector machine (SVM), a nonlinear machine-learning algorithm, was used to classify promoters and non-promoters by combining these two feature selection approaches. We referred to this novel algorithm as PromoBot.
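The idea behind frequency-based hexamer selection can be sketched as follows. This is only an illustration of ranking hexamers by the difference in their relative frequency between two sequence sets; it is not the FDAFSA implementation, and the function names, `top_k` parameter, and toy sequences are assumptions.

```python
from collections import Counter

def hexamer_frequencies(sequences):
    """Relative frequency of every overlapping 6-mer across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}

def select_hexamers(promoters, non_promoters, top_k):
    """Keep the top_k hexamers whose frequency differs most between the sets."""
    p = hexamer_frequencies(promoters)
    n = hexamer_frequencies(non_promoters)
    diff = {h: abs(p.get(h, 0.0) - n.get(h, 0.0)) for h in set(p) | set(n)}
    # Largest difference first; ties broken alphabetically.
    return sorted(diff, key=lambda h: (-diff[h], h))[:top_k]

# Toy sequences (made up for illustration, not taken from PlantProm).
promoters = ["TATAAATATAAA", "GGTATAAATCC"]
non_promoters = ["GCGCGCGCGCGC", "ACGTGCGCGCAT"]
selected = select_hexamers(promoters, non_promoters, top_k=3)
print(selected)
```

The selected hexamers would then serve as features, for example as counts per sequence, for a classifier such as an SVM.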


Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA of PlantGDB and plant miRNA of miRBase. Then, in order to validate the proposed algorithm, we applied a 5-fold cross validation test. Training data sets were used to select features based on FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity.
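For readers unfamiliar with the evaluation terms, the sketch below shows one way to form five test folds and to compute sensitivity and specificity from predicted labels. The interleaved fold assignment, the function names, and the toy labels are assumptions for illustration; the numbers are not the results reported above.

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices 0..n_samples-1 into k interleaved test folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]

def sensitivity_specificity(y_true, y_pred):
    """Labels: 1 = promoter, 0 = non-promoter."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true promoters recovered
    specificity = tn / (tn + fp)  # fraction of non-promoters correctly rejected
    return sensitivity, specificity

folds = k_fold_indices(10, k=5)

# Toy labels and predictions for a single test set.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(folds, sens, spec)
```

In a full cross validation, each fold in turn is held out for testing while features are selected and the classifier is trained on the remaining folds.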


We compared our PromoBot algorithm to five other algorithms. The sensitivity and specificity of PromoBot were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet pairs, can be successfully incorporated into a supervised machine learning method for the promoter classification problem. As such, we expect that PromoBot can be used to help identify new plant promoters. Source code and analysis results of this work can be provided upon request.

Hur, Y. and Lee, H. (2011) Wavelet-based identification of DNA focal genomic aberrations from single nucleotide polymorphism arrays. BMC Bioinformatics, 12:146 (IF: 3.43)

Wavelet-based identification of DNA focal genomic aberrations from single nucleotide polymorphism arrays

  • Author : Youngmi Hur and Hyunju Lee
  • Published Date : 2011
  • Category : 
  • Place of publication : BMC Bioinformatics



Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs.


In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). Because wavelet analysis is a multi-resolution approach, it makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it can detect aberrations that are consistent across samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, applying this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, SMAD7 (chr18q21.1) and FGF10 (chr5p12), although these regions are not reported by GISTIC, a recent CNA detection algorithm.
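To show why a multi-resolution view helps, here is a minimal sketch using an unnormalized Haar transform. The choice of the Haar wavelet, the function names, and the toy copy-number profile are assumptions for illustration; the text above does not say which wavelet WIFA uses, and this is not the WIFA algorithm.

```python
def haar_step(signal):
    """One Haar level: pairwise means (approximation) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal):
    """Return (overall mean, detail coefficients per level, finest level first).

    The signal length must be a power of two.
    """
    levels = []
    current = list(signal)
    while len(current) > 1:
        current, detail = haar_step(current)
        levels.append(detail)
    return current[0], levels

# Hypothetical log2 copy-number ratios: a broad gain (0.5) with one focal peak (2.0).
profile = [0.0, 0.0, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5]
mean, levels = haar_decompose(profile)
print(levels[0])  # finest scale: only the focal peak produces a nonzero coefficient
print(levels[1])  # coarser scale: reflects the edges of the broad gain
```

At the finest scale the flat part of the broad gain contributes nothing, so the focal peak stands out, while the broad gain itself shows up only at coarser scales.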


Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets.

Oh, M., Song, B., and Lee, H. (2010) CAM: web tool for combining arrayCGH and microarray gene expression data from multiple samples. Computers in Biology and Medicine, 40(9):781-785. (IF: 1.269)

CAM: web tool for combining arrayCGH and microarray gene expression data from multiple samples

  • Author : Mira Oh, Bongjun Song, Hyunju Lee
  • Published Date : 2010
  • Category : 
  • Place of publication : Computers in Biology and Medicine


We develop a web-based tool for Combining Array CGH copy number aberration data and Microarray gene expression data (CAM). This tool analyzes these two data sets from multiple samples to detect genes having both DNA copy number aberrations (CNAs) and gene expression changes. CAM provides several statistical methods for identifying CNAs, which are consistent across multiple samples. Identified CNAs and their correlated gene expression changes are then visualized along the chromosomes. As a result, CAM is a useful tool for identifying disease related genes when these two types of data sets are available. To illustrate the various analysis outputs of CAM, we subsequently provide ten sets of example data from seven cancer types.
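The kind of analysis described above can be sketched in a few lines. This is not CAM's statistical method: the thresholds (`cna_cutoff`, `min_fraction`, `min_corr`), the use of Pearson correlation, the function names, and the toy data are all assumptions meant only to illustrate "consistent CNA plus correlated expression change".

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def candidate_genes(copy_number, expression,
                    cna_cutoff=0.3, min_fraction=0.5, min_corr=0.7):
    """Genes aberrant in enough samples whose expression tracks copy number."""
    hits = []
    for gene, cn in copy_number.items():
        # Fraction of samples whose log ratio exceeds the (assumed) CNA cutoff.
        fraction = sum(1 for v in cn if abs(v) >= cna_cutoff) / len(cn)
        r = pearson(cn, expression[gene])
        if fraction >= min_fraction and r >= min_corr:
            hits.append(gene)
    return sorted(hits)

# Toy log ratios for four samples per gene (hypothetical gene names).
copy_number = {
    "GENE_A": [0.8, 0.9, 0.1, 0.7],
    "GENE_B": [0.0, 0.1, -0.1, 0.0],
    "GENE_C": [0.6, 0.7, 0.8, 0.5],
}
expression = {
    "GENE_A": [1.5, 1.8, 0.2, 1.4],
    "GENE_B": [0.3, -0.2, 0.1, 0.0],
    "GENE_C": [0.9, 0.1, -0.5, 1.2],
}
hits = candidate_genes(copy_number, expression)
print(hits)
```

GENE_B has no consistent aberration and GENE_C is amplified but its expression does not follow copy number, so only GENE_A is reported.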

Ko, S. and Lee, H. (2009) Integrative approaches to the prediction of protein functions based on the feature selection. BMC Bioinformatics, 10:455 (IF: 3.78)

Integrative approaches to the prediction of protein functions based on the feature selection

  • Author : Seokha Ko and Hyunju Lee
  • Published Date : 2009
  • Category : 
  • Place of publication : BMC Bioinformatics



Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue.


We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy.
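The exhaustive search approach can be sketched as a loop over every non-empty subset of data sources. In the study the score for a subset would come from a trained kernel logistic regression; here `toy_quality` and its `gain` table are hypothetical stand-ins so that the example runs on its own.

```python
from itertools import combinations

def exhaustive_selection(sources, evaluate):
    """Try every non-empty subset of sources and keep the best-scoring one."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(sources) + 1):
        for subset in combinations(sources, size):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical per-source contributions to prediction quality for one GO term.
# A negative value models a source that adds noise for this particular function.
gain = {"domains": 0.20, "ppi": 0.15, "expression": 0.05, "phylogenetic": -0.08}

def toy_quality(subset):
    """Stand-in for the cross-validated quality of a model trained on subset."""
    return 0.5 + sum(gain[s] for s in subset)

sources = ["domains", "ppi", "expression", "phylogenetic"]
best_subset, best_score = exhaustive_selection(sources, toy_quality)
print(best_subset, round(best_score, 2))
```

With n data sources there are 2^n - 1 subsets to evaluate, which is feasible for ten sources (1,023 subsets) but grows quickly, one reason a regularized alternative such as the L1-norm approach is attractive.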


In contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.

Kim, S.S. and Lee, H. (2009) An Enhanced Dimension Reduction Approach for Microarray Gene Expression Data. Interdisciplinary Bio Central, 1(4):13.

An Enhanced Dimension Reduction Approach for Microarray Gene Expression Data

  • Author : Sung-suk Kim and Hyunju Lee
  • Published Date : 2009
  • Category : 
  • Place of publication : Interdisciplinary Bio Central


Introduction: To achieve high classification accuracy on microarray gene expression data sets, various classification methods using dimension reduction have been extensively studied. However, these methods still need improvement. For example, many dimension reduction methods give different results depending on the data and conditions.

Materials and Methods: Here, we introduce an enhancement concept that joins two dimension reduction methods, Partial Least Squares (PLS) and Minimum Average Variance Estimation (MAVE), which we call the enhanced dimension reduction method. The PLS method generates new transformed genes that contain compressed information. Then, the MAVE method clusters samples of the same class and separates samples of different classes.

Results and Discussion: When this enhanced dimension reduction approach is applied to two classification methods, Adaptive Network based on Fuzzy Inference System (ANFIS) and Support Vector Machine (SVM), the classification accuracies on nine cancer data sets improve compared to using only the PLS dimension reduction approach.

Conclusion and Prospects: This study shows that the enhanced dimension reduction approach can be used with any classification method to obtain high classification quality on data sets with a large number of features relative to the number of samples.

Pena-Castillo, L., Tasan, M., Myers, C. L., Lee, H., et al. (2008) A critical assessment of M. musculus gene function prediction using integrated genome evidence. Genome Biology, 9 (Suppl 1):S2 (IF: 7.17)

A critical assessment of M. musculus gene function prediction using integrated genome evidence

  • Author : Lourdes Pena-Castillo, Murat Tasan, Chad L Myers, Hyunju Lee, et al.
  • Published Date : 2008
  • Category : 
  • Place of publication : Genome Biology



Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.


In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.
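The phrase "precision at a recall rate of 20%" can be made concrete with a short sketch. The function name, the convention of taking the first ranked cutoff that reaches the target recall, and the toy scores are assumptions for illustration; they are not the evaluation code or data used in the study.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first ranked cutoff whose recall reaches target_recall.

    scores: higher means more confident that the gene has the function.
    labels: 1 if the gene truly has the function, else 0.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / k
    return 0.0

# Toy predictions for one GO term over ten genes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
p20 = precision_at_recall(scores, labels, target_recall=0.2)
print(p20)
```

In this toy case one of the five true genes must be recovered to reach 20% recall, which happens at the second-ranked prediction, giving a precision of 0.5.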


We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.