TY - JOUR
T1 - Clustering data with the presence of attribute noise
T2 - A study of noise completely at random and ensemble of multiple k-means clusterings
AU - Iam-On, Natthakan
N1 - Funding Information:
This work is funded by IAPP1-100077 (Newton RAE-TRF): Anomaly Traffic Identification through Artificial Intelligence, Cyber Security and Big Data Analytics Technologies. It is also partly supported by Mae Fah Luang University.
Publisher Copyright:
© 2019, Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2020/3/1
Y1 - 2020/3/1
N2 - In general practice, the perception of noise has been inevitably negative. Specific to data analytic, most of the existing techniques developed thus far comply with a noise-free assumption. Without an assistance of data pre-processing, it is hard for those models to discover reliable patterns. This is also true for k-means, one of the most well known algorithms for cluster analysis. Based on several works in the literature, they suggest that the ensemble approach can deliver accurate results from multiple clusterings of data with noise completely at random. Provided this motivation, the paper presents the study of using different consensus clustering techniques to analyze noisy data, with k-means being exploited as base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low level of noise, while some exhibit improvement over the noise-free cases. This finding is in line with the recent published work that underlines the benefit of small noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline to analyzing a new data collection of uncertain quality level.
AB - In general practice, the perception of noise has been inevitably negative. Specific to data analytic, most of the existing techniques developed thus far comply with a noise-free assumption. Without an assistance of data pre-processing, it is hard for those models to discover reliable patterns. This is also true for k-means, one of the most well known algorithms for cluster analysis. Based on several works in the literature, they suggest that the ensemble approach can deliver accurate results from multiple clusterings of data with noise completely at random. Provided this motivation, the paper presents the study of using different consensus clustering techniques to analyze noisy data, with k-means being exploited as base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low level of noise, while some exhibit improvement over the noise-free cases. This finding is in line with the recent published work that underlines the benefit of small noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline to analyzing a new data collection of uncertain quality level.
KW - Attribute noise
KW - Cluster ensemble
KW - Data clustering
KW - NCAR
KW - Robustness
UR - http://www.scopus.com/inward/record.url?scp=85079510618&partnerID=8YFLogxK
U2 - 10.1007/s13042-019-00989-4
DO - 10.1007/s13042-019-00989-4
M3 - Article
AN - SCOPUS:85079510618
SN - 1868-8071
VL - 11
SP - 491
EP - 509
JO - International Journal of Machine Learning and Cybernetics
JF - International Journal of Machine Learning and Cybernetics
IS - 3
ER -