TY - JOUR
T1 - Graph clustering-based discretization approach to microarray data
AU - Sriwanna, Kittakorn
AU - Boongoen, Tossapon
AU - Iam-On, Natthakan
N1 - Publisher Copyright: © 2018, Springer-Verlag London Ltd., part of Springer Nature.
PY - 2019/8/1
Y1 - 2019/8/1
AB - Several data mining techniques require discrete data, and learning over discrete domains often outperforms learning over continuous data. Multivariate discretization transforms continuous data into discrete data while taking correlations among attributes into account. Given the benefit of this idea, many multivariate discretization algorithms have been proposed. However, few discretization algorithms apply directly to microarray (gene expression) data, which is high-dimensional and imbalanced, and, despite the interest, no multivariate method has been put forward for microarray data analysis. According to recently published research, the graph clustering-based discretization methods based on splitting and merging (GraphS and GraphM) usually achieve superior results compared with many well-known discretization algorithms. In this paper, GraphS and GraphM are extended with an alpha parameter, the ratio between the similarity of gene expressions (distance) and the similarity of the class labels. The extensions also consider three measures, cosine similarity, Euclidean distance, and Pearson correlation, in order to determine the proper pairwise similarity measure. An evaluation on 20 real microarray datasets with 4 classifiers suggests that the two proposed methods based on cosine similarity, GraphM(C) and GraphS(C), outperform 9 state-of-the-art discretization algorithms on three classification performance measures (ACC, AUC, Kappa) and in running time.
KW - Data mining
KW - Graph clustering
KW - High-dimensional data
KW - Microarray data
KW - Multivariate discretization
UR - http://www.scopus.com/inward/record.url?scp=85053438526&partnerID=8YFLogxK
U2 - 10.1007/s10115-018-1249-z
DO - 10.1007/s10115-018-1249-z
M3 - Article
AN - SCOPUS:85053438526
SN - 0219-1377
VL - 60
SP - 879
EP - 906
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 2
ER -