TY - GEN
T1 - Clustering data with the presence of missing values by ensemble approach
AU - Pattanodom, Mullika
AU - Iam-On, Natthakan
AU - Boongoen, Tossapon
N1 - Funding Information:
This research study has been sponsored by Mae Fah Luang University, Chiang Rai, Thailand.
Publisher Copyright:
© 2016 IEEE.
PY - 2016/3/21
Y1 - 2016/3/21
N2 - The problem of missing values arises as one of the major difficulties in data mining and its downstream applications. In fact, most of the analytical techniques established in this field have been developed to handle a complete data set. Imputing, or filling in, missing values is generally regarded as a data preprocessing task, for which several methods have been introduced. These include a collection of statistical alternatives such as average and zero imputation, as well as learning-led models like nearest neighbors and regression. As for cluster analysis, various clustering algorithms, even k-means, the most well-known, are hardly designed to handle such a problem. This is also the case with cluster ensembles, where an improved decision is generated from multiple results of clustering complete data. The paper presents a new framework that allows clustering incomplete data without the usual preprocessing step. Intuitively, different versions of the original data can be created by filling in those unknown values with arbitrary ones. This random selection is simple and efficient, while it promotes diversity within an ensemble and hence its quality. In particular, the binary cluster-association matrix (BA) has been adopted to summarize ensemble information, to which k-means is applied to derive the final clustering. The proposed model is evaluated against a number of benchmark imputation methods over different datasets obtained from the UCI repository. Based on the evaluation metric of cluster accuracy (CA), the findings suggest that a more accurate outcome is usually observed with the new framework. This motivates an application of the proposed approach to problems specific to the Thai armed forces, such as the identification of attacks, which is presently in the spotlight for cyber security.
AB - The problem of missing values arises as one of the major difficulties in data mining and its downstream applications. In fact, most of the analytical techniques established in this field have been developed to handle a complete data set. Imputing, or filling in, missing values is generally regarded as a data preprocessing task, for which several methods have been introduced. These include a collection of statistical alternatives such as average and zero imputation, as well as learning-led models like nearest neighbors and regression. As for cluster analysis, various clustering algorithms, even k-means, the most well-known, are hardly designed to handle such a problem. This is also the case with cluster ensembles, where an improved decision is generated from multiple results of clustering complete data. The paper presents a new framework that allows clustering incomplete data without the usual preprocessing step. Intuitively, different versions of the original data can be created by filling in those unknown values with arbitrary ones. This random selection is simple and efficient, while it promotes diversity within an ensemble and hence its quality. In particular, the binary cluster-association matrix (BA) has been adopted to summarize ensemble information, to which k-means is applied to derive the final clustering. The proposed model is evaluated against a number of benchmark imputation methods over different datasets obtained from the UCI repository. Based on the evaluation metric of cluster accuracy (CA), the findings suggest that a more accurate outcome is usually observed with the new framework. This motivates an application of the proposed approach to problems specific to the Thai armed forces, such as the identification of attacks, which is presently in the spotlight for cyber security.
KW - Cluster ensemble
KW - Data clustering
KW - Missing value
KW - Random imputation
UR - http://www.scopus.com/inward/record.url?scp=84966601490&partnerID=8YFLogxK
U2 - 10.1109/ACDT.2016.7437660
DO - 10.1109/ACDT.2016.7437660
M3 - Conference Proceeding (Non-Journal item)
AN - SCOPUS:84966601490
T3 - 2016 2nd Asian Conference on Defence Technology, ACDT 2016
SP - 151
EP - 156
BT - 2016 2nd Asian Conference on Defence Technology, ACDT 2016
PB - IEEE Press
T2 - 2nd Asian Conference on Defence Technology, ACDT 2016
Y2 - 21 January 2016 through 26 January 2016
ER -