A Distributed Rough Set Theory based Algorithm for an Efficient Big Data Pre-processing under the Spark Framework

Zaineb Chelly Dagdia, Christine Zarges, Gaël Beck, Mustapha Lebbah

Research output: Chapter in Book/Report/Conference proceeding › Conference Proceeding (Non-Journal item)

15 Citations (SciVal)
258 Downloads (Pure)


Big Data reduction is a main point of interest across a wide variety of fields. Interest in this domain grew with the difficulty of quickly acquiring the most useful information from the huge amounts of data at hand. To achieve the task of data reduction, specifically feature selection, several state-of-the-art methods have been proposed. However, most of them require additional information about the given data for thresholding, require noise levels to be specified, or even need a feature ranking procedure. Thus, it seems necessary to consider a more adequate feature selection technique that can extract features using only the information contained within the dataset. Rough Set Theory (RST) can be used as such a technique to discover data dependencies and to reduce the number of features in a dataset using the data alone, requiring no additional information. However, despite being a powerful feature selection technique, RST is computationally expensive and only practical for small datasets. Therefore, in this paper, we present a novel, efficient, distributed Rough Set Theory based algorithm for large-scale data pre-processing under the Spark framework. Our experimental results show that our RST solution can be applied efficiently to Big Data without any significant information loss.
Original language: English
Title of host publication: 2017 IEEE International Conference on Big Data (Big Data)
Editors: Jian-Yun Nie, Zoran Obradovic, Toyotaro Suzumura, Rumi Ghosh, Raghumath Nambiar, Chonggang Wang, Hui Zang, Ricardo Baeza-Yates, Xiaohua Hu, Jeremy Kepner, Alfredo Cuzzocrea, Jian Tang, Masashi Toyoda
Publisher: IEEE Press
ISBN (Electronic): 978-1-5386-2715-0
Publication status: Published - 15 Jan 2018
Event: 2017 IEEE International Conference on Big Data (BigData 2017) - Boston, United States of America
Duration: 11 Dec 2017 to 14 Dec 2017


Conference: 2017 IEEE International Conference on Big Data (BigData 2017)
Country/Territory: United States of America
Period: 11 Dec 2017 to 14 Dec 2017


  • Big data pre-processing
  • Feature selection
  • Rough set theory
  • Distributed processing
  • Scalability

