TY - JOUR
T1 - Unsupervised fuzzy-rough set-based dimensionality reduction
AU - MacParthaláin, Neil Seosamh
AU - Jensen, Richard
N1 - MacParthaláin, N. S., Jensen, R. (2013). Unsupervised fuzzy-rough set-based dimensionality reduction. Information Sciences, 229, 106-121.
PY - 2013/4/20
Y1 - 2013/4/20
N2 - Each year, more and more data is collected worldwide. In fact, it is estimated that the amount of data collected and stored at least doubles every 2 years. A large percentage of this data is unlabelled, or has labels which are incomplete or missing. Because the data is so large, it becomes very difficult for humans to assign labels to data objects manually. Additionally, many real-world application datasets, such as those in gene expression analysis and text classification, are also of large dimensionality. This further frustrates the process of label assignment for domain experts, as not all of the features are relevant or necessary in order to assign a given label. Hence unsupervised feature selection (FS) is required. For supervised learning, feature selection algorithms attempt to maximise a given function of predictive accuracy. This function typically considers the ability of feature vectors to reflect decision class labels. However, for the unsupervised learning task, decision class labels are not provided, which poses questions such as: which features should be retained? In fact, not all features are important, and some are irrelevant, redundant or noisy. In this paper, several unsupervised FS approaches based on fuzzy-rough sets are presented. These approaches require no thresholding information, are domain-independent, and can operate on real-valued data without the need for discretisation. They offer a significant reduction in dimensionality whilst retaining the semantics of the data, and can even result in supersets of the feature subsets selected by the supervised fuzzy-rough approaches. The approaches are compared with some supervised techniques and are shown to retain useful features.
AB - Each year, more and more data is collected worldwide. In fact, it is estimated that the amount of data collected and stored at least doubles every 2 years. A large percentage of this data is unlabelled, or has labels which are incomplete or missing. Because the data is so large, it becomes very difficult for humans to assign labels to data objects manually. Additionally, many real-world application datasets, such as those in gene expression analysis and text classification, are also of large dimensionality. This further frustrates the process of label assignment for domain experts, as not all of the features are relevant or necessary in order to assign a given label. Hence unsupervised feature selection (FS) is required. For supervised learning, feature selection algorithms attempt to maximise a given function of predictive accuracy. This function typically considers the ability of feature vectors to reflect decision class labels. However, for the unsupervised learning task, decision class labels are not provided, which poses questions such as: which features should be retained? In fact, not all features are important, and some are irrelevant, redundant or noisy. In this paper, several unsupervised FS approaches based on fuzzy-rough sets are presented. These approaches require no thresholding information, are domain-independent, and can operate on real-valued data without the need for discretisation. They offer a significant reduction in dimensionality whilst retaining the semantics of the data, and can even result in supersets of the feature subsets selected by the supervised fuzzy-rough approaches. The approaches are compared with some supervised techniques and are shown to retain useful features.
KW - Unsupervised learning
KW - Unsupervised feature selection
KW - Feature selection
KW - Attribute reduction
KW - Fuzzy set
KW - Rough set
UR - http://hdl.handle.net/2160/12640
U2 - 10.1016/j.ins.2012.12.001
DO - 10.1016/j.ins.2012.12.001
M3 - Article
SN - 0020-0255
VL - 229
SP - 106
EP - 121
JO - Information Sciences
JF - Information Sciences
ER -