Category Archives: Publications

Bayabaatar Amgalan, Ider Tseveendorj, Hyunju Lee (2018) An Integrative Model for the Identification of Key Players of Cancer Networks. Applied Mathematical Modelling, 2018 June 01, 58:65-75. (JCR 2016: 19/100, 19%, MATHEMATICS, INTERDISCIPLINARY APPLICATIONS)

An Integrative Model for the Identification of Key Players of Cancer Networks.

  • Author : Bayabaatar Amgalan, Ider Tseveendorj, and Hyunju Lee
  • Published Date : 2018
  • Category : Bioinformatics
  • Place of publication : Applied Mathematical Modelling

 

Abstract

Uncovering miscoordination in a biological network is essential for the understanding of cellular malfunctions in cancer. Integrative analysis across multiple cellular levels may provide an opportunity to elucidate the miscoordination between the regulatory mechanisms in cancer cells.

Here, we propose an integrative model for the identification of key players of the cancer-activated Multi-Type Interaction (MTI) gene network (KPOCN). To measure the functional associations between genes, using DNA copy number aberrations (CNAs) and gene expressions (GEs), we constructed three interacting weighted graphs: GEs affected by CNAs, CNAs by CNAs, and GEs by GEs. These three weighted graphs were mapped onto a single graph, in order to construct an MTI gene network by using their optimal combination. Finally, the effect of a single gene was determined by using the centrality and betweenness of node scores in the MTI network.

We first tested KPOCN using simulated datasets, and afterward, we applied this model to real breast cancer datasets. KPOCN successfully identified key regulators with their corresponding response variables (targets) when using the simulated data, and identified well-known breast cancer oncogenes. These results demonstrated that our model can be used for the efficient identification of key genes that affect cancer development. Source code is available at http://gcancer.org/KPOCN.
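As a rough illustration of the graph-combination step described in this abstract, the sketch below merges three weighted edge sets into one network and ranks nodes by a simple score. It is not the KPOCN implementation: the fixed mixing weights, the toy edge values, the function names, and the use of weighted degree in place of the paper's centrality and betweenness scores are all assumptions made for the example.

```python
from collections import defaultdict


def combine_graphs(graphs, weights):
    """Merge weighted edge sets into one graph as a weighted sum of edge scores."""
    combined = defaultdict(float)
    for edges, weight in zip(graphs, weights):
        for edge, score in edges.items():
            combined[edge] += weight * score
    return dict(combined)


def weighted_degree(edges):
    """Sum the weights of all edges touching each node (a simple centrality stand-in)."""
    degree = defaultdict(float)
    for (u, v), score in edges.items():
        degree[u] += score
        degree[v] += score
    return dict(degree)


# Hypothetical toy edge sets, one per interaction type named in the abstract.
ge_by_cna = {("A", "B"): 0.8, ("A", "C"): 0.6}
cna_by_cna = {("A", "B"): 0.2}
ge_by_ge = {("B", "C"): 0.5}

# Assumed mixing weights; the paper derives an optimal combination instead.
mti = combine_graphs([ge_by_cna, cna_by_cna, ge_by_ge], [0.5, 0.25, 0.25])
scores = weighted_degree(mti)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy network, gene A takes part in the strongest combined edges and therefore ranks first.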

Ho Jang and Hyunju Lee (2018) Identification of cancer driver genes in focal genomic aberrations from whole-exome sequencing data. Bioinformatics, 2018 Feb 1;34(3):519-521. (IF: 7.307) (JCR 2016: 2/57, 3.5%, MATHEMATICAL & COMPUTATIONAL BIOLOGY)

Identification of cancer driver genes in focal genomic aberrations from whole-exome sequencing data.

  • Author : Ho Jang and Hyunju Lee
  • Published Date : 2018
  • Category : Bioinformatics
  • Place of publication : Bioinformatics

 

Abstract

Summary:

Whole-exome sequencing (WES) data have been used for identifying copy number aberrations in cancer cells. Nonetheless, using WES to identify focal aberrant regions in multiple samples that may contain cancer driver genes remains challenging. In this study, we developed a wavelet-based method for identifying focal genomic aberrant regions in the WES data from cancer cells (WIFA-X). When we applied WIFA-X to glioblastoma multiforme and lung adenocarcinoma datasets, WIFA-X outperformed other approaches in identifying cancer driver genes.

Availability:

R source code is available at http://gcancer.org/wifax.

Hyejin Cho, Wonjun Choi and Hyunju Lee (2017) A method for named entity normalization in biomedical articles: application to diseases and plants. BMC Bioinformatics, 13 October 2017;18(1):451. (IF: 2.448) (JCR 2016: 10/57, 17.5%, MATHEMATICAL & COMPUTATIONAL BIOLOGY)

A method for named entity normalization in biomedical articles: application to diseases and plants.

 

Abstract

Background: In biomedical articles, a named entity recognition (NER) technique that identifies entity names from texts is an important element for extracting biological knowledge from articles. After NER is applied to articles, the next step is to normalize the identified names into standard concepts (i.e., disease names are mapped to the National Library of Medicine’s Medical Subject Headings disease terms). In biomedical articles, many entity normalization methods rely on domain-specific dictionaries for resolving synonyms and abbreviations. However, the dictionaries are not comprehensive except for some entities such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate a large amount of unlabeled data have shown considerable success in several natural language processing problems.

Results: In this study, we propose an approach for normalizing biological entities, such as disease names and plant names, by using word embeddings to represent semantic spaces. For diseases, training data from the National Center for Biotechnology Information (NCBI) disease corpus and unlabeled data from PubMed abstracts were used to construct word representations. For plants, a training corpus that we manually constructed and unlabeled PubMed abstracts were used to represent word vectors. We showed that the proposed approach performed better than the use of only the training corpus or only the unlabeled data and showed that the normalization accuracy was improved by using our model even when the dictionaries were not comprehensive. We obtained F-scores of 0.808 and 0.690 for normalizing the NCBI disease corpus and manually constructed plant corpus, respectively. We further evaluated our approach using a data set in the disease normalization task of the BioCreative V challenge. When only the disease corpus was used as a dictionary, our approach significantly outperformed the best system of the task.

Conclusions: The proposed approach shows robust performance for normalizing biological entities. The manually constructed plant corpus and the proposed model are available at http://gcancer.org/plant and http://gcancer.org/normalization, respectively.
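The embedding-based normalization idea in this abstract can be pictured with a minimal sketch: represent a mention and each dictionary concept as the average of their word vectors, then pick the concept with the highest cosine similarity. The three-dimensional vectors, the concept identifiers, and the function names below are hypothetical; the published model is trained on the NCBI disease corpus and PubMed abstracts and is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed(text, vectors):
    """Average the vectors of the tokens that have an embedding."""
    found = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def normalize(mention, concepts, vectors):
    """Return the concept id whose name is closest to the mention."""
    mention_vec = embed(mention, vectors)
    if mention_vec is None:
        return None
    best_id, best_sim = None, -1.0
    for concept_id, name in concepts.items():
        concept_vec = embed(name, vectors)
        if concept_vec is None:
            continue
        sim = cosine(mention_vec, concept_vec)
        if sim > best_sim:
            best_id, best_sim = concept_id, sim
    return best_id


# Hypothetical toy embeddings and concept dictionary.
vectors = {
    "breast": [1.0, 0.0, 0.0],
    "lung": [0.0, 0.0, 1.0],
    "carcinoma": [0.0, 0.9, 0.1],
    "neoplasms": [0.0, 0.95, 0.05],
}
concepts = {"CONCEPT_BREAST": "breast neoplasms", "CONCEPT_LUNG": "lung neoplasms"}

result = normalize("breast carcinoma", concepts, vectors)
print(result)
```

Because "carcinoma" and "neoplasms" sit close together in the toy space, the mention maps to the breast concept even though the dictionary never lists "breast carcinoma" as a synonym.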

Yeonghun Lee, Sehhoon Park, Se-Hoon Lee$, Hyunju Lee$ (2017) Characterization of Genetic Aberrations in a Single Case of Metastatic Thymic Adenocarcinoma. BMC Cancer, 2017 May 15;17(1):330. (IF: 3.265) (JCR 2015: 85/213, 39.9%, Oncology)

Characterization of Genetic Aberrations in a Single Case of Metastatic Thymic Adenocarcinoma.

  • Author : Yeonghun Lee, Sehhoon Park, Se-Hoon Lee$, and Hyunju Lee$
  • Published Date : 2017
  • Category : Bioinformatics and Text Mining 
  • Place of publication : BMC Cancer

 

BACKGROUND:

Thymic adenocarcinoma is an extremely rare subtype of thymic epithelial tumors. Due to its rarity, there is currently no sequencing approach for thymic adenocarcinoma.

METHODS:

We performed whole exome and transcriptome sequencing on a case of thymic adenocarcinoma and performed subsequent validation using Sanger sequencing.

RESULTS:

The case of thymic adenocarcinoma showed aggressive behavior with systemic bone metastases. We identified a high incidence of genetic aberrations, which included somatic mutations in RNASEL, PEG10, TNFSF15, TP53, TGFB2, and FAT1. Copy number analysis revealed a complex chromosomal rearrangement of chromosome 8, which resulted in gene fusion between MCM4 and SNTB1 and dramatic amplification of MYC and NDRG1. Focal deletion was detected at human leukocyte antigen (HLA) class II alleles, which was previously observed in thymic epithelial tumors. We further investigated fusion transcripts using RNA-seq data and found an intergenic splicing event between the CTBS and GNG5 transcripts. Finally, enrichment analysis using all the variants indicated immune system dysfunction in thymic adenocarcinoma.

CONCLUSION:

Thymic adenocarcinoma shows highly malignant characteristics with alterations in several cancer-related genes.

Seungchul Lee#, Jingu Lee#, Sung Hoon Sim#, Yeonghun Lee, Kyung Chul Moon, Cheol Lee, Woong-Yang Park, Nayoung K. D. Kim, Se-Hoon Lee$, and Hyunju Lee$ (2017) Comprehensive somatic genome alterations of urachal carcinoma. Journal of Medical Genetics, 2017 August 01; 54(8):572-578 (IF: 5.650) (JCR 2016: 19/166, 11.145%, GENETICS & HEREDITY)

Comprehensive somatic genome alterations of urachal carcinoma.

  • Author : Seungchul Lee#, Jingu Lee#, Sung Hoon Sim#, Yeonghun Lee, Kyung Chul Moon, Cheol Lee, Woong-Yang Park, Nayoung K. D. Kim, Se-Hoon Lee$, and Hyunju Lee$
  • Published Date : 2017
  • Category : Bioinformatics and Text Mining 
  • Place of publication : Journal of Medical Genetics

 

Abstract

Background: Urachal cancer is a rare cancer that develops in the urachus. Because of its rarity, standard therapies for urachal cancer are not established, and chemotherapeutic regimens for bladder cancer have been unsuccessful for patients with urachal cancer. Hence, we aimed to provide a systematic molecular characterization of urachal cancer.

Methods: We identified somatic single nucleotide variations (SNVs)/indels and somatic copy number aberrations (SCNAs) in the 17 patients by using whole-exome sequencing (WES) and OncoScan™ platform (Affymetrix) as follows: tumour-normal paired sequencing (WES, n = 10), tumour-only sequencing (WES, n = 1; targeted deep sequencing, n = 16), and OncoScan™ (n = 17).

Results: Our analyses identified 27 genes with somatic SNVs and indels, as well as six genes (APC, COL5A1, KIF26B, LRP1B, SMAD4, and TP53) that were recurrent in at least two patients. By analysing the SCNAs, we found that the extent of chromosomal amplification was highly associated with the patient’s cancer stage. Interestingly, 35% (6/17) of the patients had focal DNA amplifications in FGFR family genes. The integration of somatic SNVs, indels, and SCNAs revealed significant alterations in the MAPK signalling pathways.

Conclusions: Our genome wide analysis of urachal cancer suggests that molecular characteristics may be important for the treatment of urachal cancer.

Jaeyong Kang and Hyunju Lee* (2017) Modeling User Interest in Social Media using News Media and Wikipedia. Information Systems, 2017 April 01; 65:52-64 (IF: 1.832) (JCR 2016: 34/144, 23.61%, COMPUTER SCIENCE, INFORMATION SYSTEMS).

Modeling User Interest in Social Media using News Media and Wikipedia.

  • Author : Jaeyong Kang and Hyunju Lee
  • Published Date : 2017
  • Category : Social Media
  • Place of publication : Information Systems

 

Abstract

Social media has become an important source of information and a medium for following and spreading trends, news, and ideas all over the world. Although determining the subjects of individual posts is important to extract users’ interests from social media, this task is nontrivial because posts are highly contextualized and informal and have limited length. To address this problem, we propose a user modeling framework that maps the content of texts in social media to relevant categories in news media. In our framework, the semantic gaps between social media and news media are reduced by using Wikipedia as an external knowledge base. We map term-based features from a short text and a news category into Wikipedia-based features such as Wikipedia categories and article entities. A user’s microposts are thus represented in a rich feature space of words. Experimental results show that our proposed method using Wikipedia-based features outperforms other existing methods of identifying users’ interests from social media.

Jeongkyun Kim, Jung-jae Kim and Hyunju Lee* (2017) An analysis of disease-gene relationship from Medline abstracts by DigSee. Scientific Reports, 2017 January 05; 7:40154 (IF: 5.228) (JCR 2015: 7/63, 11.3%, MULTIDISCIPLINARY SCIENCES).

An analysis of disease-gene relationship from Medline abstracts by DigSee.

  • Author : Jeongkyun Kim, Jung-jae Kim and Hyunju Lee
  • Published Date : 2017
  • Category : Bioinformatics and Text Mining 
  • Place of publication : Scientific Reports

 

Abstract

Diseases develop through abnormal behavior of genes in biological events such as gene regulation, mutation, phosphorylation, and epigenetic and post-translational modification. Many text mining studies have attempted to identify the relationship between gene and disease by mining the literature, but they did not consider the biological events in which genes show abnormal behavior in response to diseases. In this study, we propose to identify disease-related genes that are involved in the development of disease through biological events from Medline abstracts. We identified associations between 13,054 genes and 4,494 disease types, which cover more disease-related genes than manually curated databases for all disease types (e.g., Online Mendelian Inheritance in Man) and also than those for specific diseases (e.g., Alzheimer’s disease and hypertension). We show that the text mining findings are reliable at PubMed scale, in that the disease-disease relationships inferred from the literature-wide findings are similar to those inferred from manually curated databases in a well-known study. In addition, the literature-wide distribution of biological events across disease types reveals different characteristics of disease types.

Jiyoun Seo, Daeyong Jin, Chan-Hun Choi and Hyunju Lee* (2017) Integration of MicroRNA, mRNA, and Protein Expression Data for the Identification of Cancer-Related MicroRNAs. PLoS One, 2017 January 5; 12(1):e0168412 (IF: 3.057) (JCR 2015: 11/63, 17.5%, MULTIDISCIPLINARY SCIENCES).

Integration of MicroRNA, mRNA, and Protein Expression Data for the Identification of Cancer-Related MicroRNAs.

  • Author : Jiyoun Seo, Daeyong Jin, Chan-Hun Choi, and Hyunju Lee
  • Published Date : 2017
  • Category : Bioinformatics and Text Mining 
  • Place of publication : PLoS One

 

Abstract

MicroRNAs (miRNAs) are responsible for the regulation of target genes involved in various biological processes, and may play oncogenic or tumor suppressive roles. Many studies have investigated the relationships between miRNAs and their target genes, using mRNA and miRNA expression data. However, mRNA expression levels do not necessarily represent the exact gene expression profiles, since protein translation may be regulated in several different ways. Despite this, large-scale protein expression data have rarely been integrated when predicting gene-miRNA relationships. This study explores two approaches for the investigation of gene-miRNA relationships by integrating mRNA expression and protein expression data. First, miRNAs were ranked according to their effects on cancer development. We calculated influence scores for each miRNA, based on the number of significant mRNA-miRNA and protein-miRNA correlations. Second, we constructed modules containing mRNAs, proteins, and miRNAs, in which these three molecular types are highly correlated. The regulatory interactions between miRNAs and genes in these modules were validated based on direct regulation, indirect regulation, and co-regulation through transcription factors. We applied our approaches to glioblastomas (GBMs), ranked miRNAs depending on their effects on GBM, and obtained 52 GBM-related modules. Compared with the miRNA rankings and modules constructed using only mRNA expression data, the rankings and modules constructed using mRNA and protein expression data were shown to have better performance. Additionally, we experimentally verified that miR-504, highly ranked and included in the identified modules, plays a suppressive role in GBM development. We demonstrated that the integration of both expression profiles allows a more precise analysis of gene-miRNA interactions and the identification of a higher number of cancer-related miRNAs and regulatory mechanisms.
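The influence score mentioned in this abstract counts significant mRNA-miRNA and protein-miRNA correlations. The sketch below illustrates that counting idea only. It assumes a fixed cutoff on the absolute Pearson correlation as a stand-in for the significance test, and the expression profiles, names, and threshold are hypothetical rather than taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def influence_scores(mirna, mrna, protein, threshold=0.8):
    """Count, per miRNA, the mRNA and protein profiles with |r| >= threshold."""
    scores = {}
    for name, profile in mirna.items():
        hits = 0
        for targets in (mrna, protein):
            hits += sum(
                abs(pearson(profile, other)) >= threshold
                for other in targets.values()
            )
        scores[name] = hits
    return scores


# Hypothetical expression profiles across five samples.
mirna = {"miR_a": [1, 2, 3, 4, 5], "miR_b": [2, 1, 2, 1, 2]}
mrna = {"g1": [5, 4, 3, 2, 1], "g2": [2, 1, 2, 1, 2]}
protein = {"p1": [2, 4, 6, 8, 10]}

scores = influence_scores(mirna, mrna, protein)
print(scores)
```

Here miR_a correlates strongly with one mRNA and one protein, while miR_b correlates with a single mRNA, so miR_a receives the higher score.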

 

Wonjun Choi, Baeksoo Kim, Hyejin Cho, Doheon Lee and Hyunju Lee* (2016) A corpus for plant-chemical relationships in the biomedical domain. BMC Bioinformatics, 2016 September 20; 17:386 (IF: 2.435) (JCR 2015: 10/56, 17.9%, MATHEMATICAL & COMPUTATIONAL BIOLOGY).

A corpus for plant-chemical relationships in the biomedical domain.

 

Abstract

Background: Plants are natural products that humans consume in various ways including food and medicine. They have a long empirical history of treating diseases with relatively few side effects. Based on these strengths, many studies have been performed to verify the effectiveness of plants in treating diseases. It is crucial to understand the chemicals contained in plants because these chemicals can regulate activities of proteins that are key factors in causing diseases. With the accumulation of a large volume of biomedical literature in various databases such as PubMed, it is possible to automatically extract relationships between plants and chemicals in a large-scale way if we apply a text mining approach. A cornerstone of achieving this task is a corpus of relationships between plants and chemicals.

Results: In this study, we first constructed a corpus for plant and chemical entities and for the relationships between them. The corpus contains 267 plant entities, 475 chemical entities, and 1,007 plant–chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Inter-annotator agreement scores for the corpus among three annotators were measured. The simple percent agreement scores for entities and trigger words for the relationships were 99.6 and 94.8 %, respectively, and the overall kappa score for the classification of positive and negative relationships was 79.8 %. We also developed a rule-based model to automatically extract such plant–chemical relationships. When we evaluated the rule-based model using the corpus and randomly selected biomedical articles, overall F-scores of 68.0 and 61.8 % were achieved, respectively.

Conclusion: We expect that the corpus for plant–chemical relationships will be a useful resource for enhancing plant research. The corpus is available at http://combio.gist.ac.kr/plantchemicalcorpus.

Corpus URL: http://combio.gist.ac.kr/herding
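For readers unfamiliar with the agreement and evaluation measures reported in this abstract, the sketch below computes Cohen's kappa for two annotators and an F-score from match counts. The abstract reports an overall kappa among three annotators, so the pairwise form used here is an assumption, and the labels and counts are invented toy values, not data from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def f_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of six plant-chemical relationships.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg"]

kappa = cohen_kappa(annotator_1, annotator_2)
f1 = f_score(true_pos=8, false_pos=2, false_neg=2)
print(round(kappa, 3), round(f1, 3))
```

Kappa falls below the raw percent agreement because it subtracts the agreement expected by chance, which is why the abstract reports both figures.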

Daeyong Jin and Hyunju Lee* (2016) Prioritizing cancer-related microRNAs by integrating microRNA and mRNA datasets. Scientific Reports, 2016 October 13; 6:35350 (IF: 5.228) (JCR 2015: 7/62, 11.3%, MULTIDISCIPLINARY SCIENCES).

Prioritizing cancer-related microRNAs by integrating microRNA and mRNA datasets.

  • Author : Daeyong Jin and Hyunju Lee
  • Published Date : 2016
  • Category : Bioinformatics and Text Mining 
  • Place of publication : Scientific Reports

 

Abstract

MicroRNAs (miRNAs) are small non-coding RNAs regulating the expression of target genes, and they are involved in cancer initiation and progression. Even though many cancer-related miRNAs have been identified, their functional impact may vary, depending on their effects on the regulation of other miRNAs and genes. In this study, we propose a novel method for the prioritization of candidate cancer-related miRNAs that may affect the expression of other miRNAs and genes across the entire biological network. For this, we propose three important features: the average expression of a miRNA in multiple cancer samples, the average of the absolute correlation values between the expression of a miRNA and expression of all genes, and the number of predicted miRNA target genes. These three features were integrated using order statistics. By applying the proposed approach to four cancer types, glioblastoma, ovarian cancer, prostate cancer, and breast cancer, we prioritized candidate cancer-related miRNAs and determined their functional roles in cancer-related pathways. The proposed approach can be used to identify miRNAs that play crucial roles in driving cancer development and to elucidate novel potential therapeutic targets for cancer treatment.
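The abstract says only that the three features were integrated using order statistics. One commonly used formulation in rank aggregation converts each feature into a rank ratio and computes a joint Q statistic through a recursive formula; the sketch below assumes that formulation, which may differ from the one used in the paper. The feature values and miRNA names are hypothetical.

```python
from math import factorial


def rank_ratios(values):
    """Map each item to rank / n, where the largest value gets rank 1."""
    ordered = sorted(values, key=values.get, reverse=True)
    n = len(ordered)
    return {item: (pos + 1) / n for pos, item in enumerate(ordered)}


def q_statistic(ratios):
    """Joint order statistic of rank ratios; smaller values mean higher priority.

    Uses the recursion V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * r_{N-k+1}^i
    with V_0 = 1 and Q = N! * V_N, where r_1 <= ... <= r_N are the sorted ratios.
    """
    r = sorted(ratios)
    n = len(r)
    v = [1.0]
    for k in range(1, n + 1):
        total = 0.0
        for i in range(1, k + 1):
            total += (-1) ** (i - 1) * v[k - i] / factorial(i) * r[n - k] ** i
        v.append(total)
    return factorial(n) * v[n]


# Hypothetical values for the three features named in the abstract.
features = [
    {"miR_x": 9.0, "miR_y": 5.0, "miR_z": 1.0},  # average expression
    {"miR_x": 0.6, "miR_y": 0.4, "miR_z": 0.1},  # mean absolute correlation
    {"miR_x": 300, "miR_y": 200, "miR_z": 50},   # number of predicted targets
]

per_feature = [rank_ratios(feature) for feature in features]
q_values = {
    name: q_statistic([ratios[name] for ratios in per_feature])
    for name in features[0]
}
prioritized = sorted(q_values, key=q_values.get)
print(prioritized)
```

A miRNA that ranks near the top in all three features gets a small Q value, so sorting in ascending order puts the strongest candidates first.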