Abstract
Astronomical data analytics has expanded rapidly with advances in data handling techniques and computing systems. The race to discover new events depends on acquiring and digesting the high volume of data from sky surveys efficiently yet accurately. This holds for many modern astronomy projects, which face the issue of big data storage on the one hand and effective data analysis on the other. This research addresses the latter by focusing on the classification of potential transient events initially detected in time-domain astronomical surveys. Most of these candidate transients are false positives that result from hardware faults or from errors in data collection and/or data pre-processing. Hence, the ability to filter these out is much needed to avoid laborious manual assessment down the line. The problem investigated here is that the training data can be highly imbalanced. As a first attempt, coupling oversampling methods with several classifiers provides an improvement, but generally leads to overfitting. As a solution, this paper presents a novel application of consensus clustering to undersample majority-class instances instead. It not only helps to overcome the aforementioned drawback but also strengthens the recent approach that exploits a single clustering to guide the selection of representative samples.
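To make the idea of consensus-guided undersampling concrete, the following is a minimal sketch and not the paper's algorithm, which the abstract does not specify. It assumes k-means as the base clusterer, a co-association matrix as the consensus representation, a fixed threshold to form consensus groups, and a "most typical member" selection rule. All function names and parameters (`kmeans_labels`, `co_association`, `consensus_clusters`, `undersample_majority`, `threshold`, `n_keep`) are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed base clusterer); needs k <= len(points)."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def co_association(points, ks=(2, 3), runs_per_k=5, seed=0):
    """Fraction of base clusterings in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in ks:
        for _ in range(runs_per_k):
            labels = kmeans_labels(points, k, rng)
            total += 1
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
    return [[c / total for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Merge points whose co-association reaches the threshold (union-find)."""
    n = len(coassoc)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if coassoc[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def undersample_majority(majority, n_keep, ks=(2, 3), runs_per_k=5, seed=0):
    """Return indices of majority-class samples kept as representatives."""
    coassoc = co_association(majority, ks, runs_per_k, seed)
    kept = []
    for members in consensus_clusters(coassoc):
        # Each consensus group gets a share proportional to its size, at least 1,
        # so the total kept can differ slightly from n_keep.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        # Prefer members that co-cluster most consistently with their own group.
        ranked = sorted(members, key=lambda i: -sum(coassoc[i][j] for j in members))
        kept.extend(ranked[:quota])
    return sorted(kept)


# Toy majority class: two well-separated groups of four points each.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
            (10.0, 10.0), (10.1, 10.2), (10.2, 10.1), (10.3, 10.3)]
kept = undersample_majority(majority, n_keep=4)
print(kept)
```

The kept indices would then be combined with all minority-class samples to form a more balanced training set. Aggregating several clusterings rather than relying on a single one is intended to make the choice of representatives less sensitive to any one clustering run, which is the motivation stated in the abstract.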
| Original language | English |
|---|---|
| Article number | 37382 |
| Number of pages | 18 |
| Journal | Scientific Reports |
| Volume | 15 |
| DOIs | |
| Publication status | Published - 27 Oct 2025 |