Abstract
A major challenge in the knowledge discovery process is big data pre-processing, specifically feature selection. To handle this challenge, Rough Set Theory (RST) has been considered one of the most powerful techniques, as it has much to offer for feature selection. To extend its applicability to big data, a distributed version of RST was developed. However, one of its key challenges is the partitioning of the feature search space in the distributed environment while guaranteeing data dependency. In this paper, we propose a new distributed version of RST based on Locality Sensitive Hashing (LSH), named LSH-dRST, for big data pre-processing. LSH-dRST uses LSH to match similar features into the same bucket and maps the generated buckets into partitions to enable the splitting of the universe in a more appropriate way. We compare LSH-dRST to the standard distributed RST technique, which is based on a random partitioning of the universe, and demonstrate that LSH-dRST is not only scalable but also more reliable for feature selection, making it more relevant to big data pre-processing. We also demonstrate that LSH-dRST partitions the high-dimensional feature search space in a more reliable way; hence, it guarantees data dependency in the distributed environment and ensures a lower computational cost.
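The bucketing-then-partitioning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which LSH family LSH-dRST uses, so this example assumes sign random projections (a common LSH scheme for cosine similarity) and a simple greedy bucket-to-partition assignment. The function names `lsh_signature`, `bucket_features`, and `buckets_to_partitions` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def lsh_signature(column, hyperplanes):
    """Sign-random-projection signature: one bit per random hyperplane.

    ASSUMPTION: this hash family is chosen for illustration only.
    """
    return tuple(
        1 if sum(w * x for w, x in zip(plane, column)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_features(columns, n_bits=8, seed=0):
    """Group feature indices whose columns share the same LSH signature."""
    rng = random.Random(seed)
    n_rows = len(columns[0])
    # One Gaussian random hyperplane per signature bit.
    hyperplanes = [
        [rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_bits)
    ]
    buckets = defaultdict(list)
    for idx, col in enumerate(columns):
        buckets[lsh_signature(col, hyperplanes)].append(idx)
    return list(buckets.values())


def buckets_to_partitions(buckets, n_partitions):
    """Assign each bucket whole (largest first) to the smallest partition.

    ASSUMPTION: a greedy balancing rule; similar features stay together
    because a bucket is never split across partitions.
    """
    partitions = [[] for _ in range(n_partitions)]
    for bucket in sorted(buckets, key=len, reverse=True):
        min(partitions, key=len).extend(bucket)
    return partitions


# Each inner list is one feature (column) observed over four objects.
columns = [
    [1, 2, 3, 4],      # feature 0
    [1, 2, 3, 4],      # feature 1: identical to feature 0
    [-1, -2, -3, -4],  # feature 2: opposite direction
    [2, 4, 6, 8],      # feature 3: positive multiple of feature 0
]
buckets = bucket_features(columns)
partitions = buckets_to_partitions(buckets, n_partitions=2)
```

In this toy input, features 0, 1, and 3 point in the same direction and therefore fall into one bucket, while feature 2 lands in another. A random partitioning, by contrast, could separate features 0 and 1 even though they carry the same information.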
Original language | English |
---|---|
Title of host publication | 2018 IEEE International Conference on BIG DATA |
Publisher | IEEE Press |
Publication status | Published - 2018 |
Event | 2018 IEEE International Conference on BIG DATA - The Westin Seattle, Seattle, United States of America. Duration: 10 Dec 2018 → 13 Dec 2018 |
Conference
Conference | 2018 IEEE International Conference on BIG DATA |
---|---|
Country/Territory | United States of America |
City | Seattle |
Period | 10 Dec 2018 → 13 Dec 2018 |
Keywords
- big data pre-processing
- feature selection
- rough set theory
- locality sensitive hashing
- distributed processing