Abstract
This paper presents an application of a Genetic Algorithm (GA) to measuring feature importance in machine learning (ML) on a large-scale database. Too many input features may cause over-fitting, so feature selection is desirable. Some ML algorithms have feature selection embedded, e.g., lasso-penalized linear regression or random forests. Others do not include such functionality and are sensitive to over-fitting, e.g., unregularized linear regression. The latter algorithms require that proper features be chosen before learning.
Therefore, we propose a novel stability selection (SS) approach using GA-based feature selection. The proposed SS approach iteratively applies a GA to a subsample of records and features. Each GA individual is a binary vector representing the features selected in the subsample. An unregularized logistic linear regression model is then trained and tested on the GA-selected features through cross-validation on the subsamples. GA fitness is measured by the area under the curve (AUC), which is optimized during a GA run.
AUC is assessed with an unregularized logistic regression model on multiple subsamples of healthcare records collected under the Healthcare Cost and Utilization Project (HCUP), using the National (Nationwide) Inpatient Sample (NIS) database.
Reported results show that averaging feature importance from top-4 SS and the SS using GA (GASS) improves these AUC results.
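The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`auc`, `fit_logreg`, `fitness`, `run_ga`, `gass`), the subsampling fractions, the population size, the truncation selection with one-point crossover, the use of plain gradient descent to fit the unregularized logistic model, the reading of AUC as the area under the ROC curve, and the use of selection frequency as the importance score are all hypothetical choices made for the example.

```python
import math
import random

def auc(scores, labels):
    """ROC AUC: chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        return 0.5  # undefined with a single class; treat as chance level
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_logreg(X, y, epochs=100, lr=0.5):
    """Unregularized logistic regression via batch gradient descent (assumed optimizer)."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row)) + w[-1]
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j, xi in enumerate(row):
                grad[j] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(w, X):
    """Return logits; they rank records the same way as probabilities."""
    return [sum(wi * xi for wi, xi in zip(w, row)) + w[-1] for row in X]

def fitness(mask, X, y, folds=2):
    """Cross-validated AUC of the model restricted to the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature set cannot be scored
    Xs = [[row[j] for j in cols] for row in X]
    total = 0.0
    for k in range(folds):
        test = [i for i in range(len(Xs)) if i % folds == k]
        train = [i for i in range(len(Xs)) if i % folds != k]
        w = fit_logreg([Xs[i] for i in train], [y[i] for i in train])
        total += auc(predict(w, [Xs[i] for i in test]), [y[i] for i in test])
    return total / folds

def run_ga(X, y, rng, pop_size=8, generations=5, p_mut=0.1):
    """Tiny GA over binary feature masks: truncation selection, one-point crossover, bit-flip mutation."""
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

def gass(X, y, rounds=5, rec_frac=0.5, feat_frac=0.75, seed=0):
    """Repeat the GA on random subsamples of records and features; score each feature
    by how often it was selected in the rounds where it was drawn (assumed importance measure)."""
    rng = random.Random(seed)
    n_rec, n_feat = len(X), len(X[0])
    selected, drawn = [0] * n_feat, [0] * n_feat
    for _ in range(rounds):
        rows = rng.sample(range(n_rec), int(rec_frac * n_rec))
        feats = rng.sample(range(n_feat), max(2, int(feat_frac * n_feat)))
        Xs = [[X[i][j] for j in feats] for i in rows]
        ys = [y[i] for i in rows]
        best = run_ga(Xs, ys, rng)
        for j, bit in zip(feats, best):
            drawn[j] += 1
            selected[j] += bit
    return [s / d if d else 0.0 for s, d in zip(selected, drawn)]
```

Calling `gass(X, y)` on a list of numeric feature rows `X` and binary labels `y` returns one score in [0, 1] per feature. The sketch is written for readability on small inputs; a dataset the size of NIS would need a vectorized or library-based model fit.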
Original language | English |
---|---|
Title of host publication | GECCO '17 |
Subtitle of host publication | Proceedings of the Genetic and Evolutionary Computation Conference Companion |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 143-144 |
Number of pages | 2 |
ISBN (Print) | 978-1-4503-4939-0 |
DOIs | |
Publication status | Published - 15 Jul 2017 |
Event | GECCO 2017: The Genetic and Evolutionary Computation Conference - Duration: 15 Jul 2017 → 19 Jul 2017 |
Conference
Conference | GECCO 2017 |
---|---|
Period | 15 Jul 2017 → 19 Jul 2017 |
Paper title: Stability Selection using a Genetic Algorithm and Logistic Linear Regression on Healthcare Records
Projects
- Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice - ImAppNIO
  Jansen, T. (PI)
  09 Mar 2016 → 08 Mar 2020
  Project: Externally funded research
  Status: Finished