Stability Selection using a Genetic Algorithm and Logistic Linear Regression on Healthcare Records

Aleš Zamuda, Christine Zarges, Gregor Stiglic, Goran Hrovat

Research output: Chapter in Book/Report/Conference proceeding › Conference Proceeding (Non-Journal item)



This paper presents an application of a Genetic Algorithm (GA) to measuring feature importance in machine learning (ML) on a large-scale database. Too many input features may cause over-fitting, so feature selection is desirable. Some ML algorithms have feature selection embedded, e.g., lasso-penalized linear regression or random forests. Others do not include such functionality and are sensitive to over-fitting, e.g., unregularized linear regression. The latter algorithms require that suitable features be chosen before learning.

Therefore, we propose a novel stability selection (SS) approach using GA-based feature selection. The proposed SS approach repeatedly applies a GA to a subsample of records and features. Each GA individual is a binary vector marking the features selected within the subsample. An unregularized logistic linear regression model is then trained and tested on the GA-selected features through cross-validation on the subsample. GA fitness is measured by the area under the curve (AUC), which is optimized during a GA run.
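The loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the GA operators (truncation selection, one-point crossover, bit-flip mutation), the population size, the number of folds, and the use of selection frequency as the importance score are all hypothetical choices that the abstract does not specify.

```python
import math
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined with one class; treat as chance level
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_logistic(X, y, epochs=50, lr=0.1):
    """Unregularized logistic regression fitted by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias term
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() in range
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def fitness(mask, X, y, k=2):
    """Mean k-fold cross-validated AUC using only the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature set cannot be scored
    Xs = [[row[j] for j in cols] for row in X]
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(Xs)) if i % k != fold]
        test = [i for i in range(len(Xs)) if i % k == fold]
        w = fit_logistic([Xs[i] for i in train], [y[i] for i in train])
        scores = [w[-1] + sum(wj * xj for wj, xj in zip(w, Xs[i])) for i in test]
        total += auc(scores, [y[i] for i in test])
    return total / k

def run_ga(X, y, rng, pop_size=8, generations=5, p_mut=0.1):
    """Evolve binary feature masks; returns the best mask of the final population."""
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection (assumption)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

def ga_stability_selection(X, y, rounds=10, seed=0):
    """Per-feature selection frequency over repeated GA runs on random subsamples."""
    rng = random.Random(seed)
    n_rows, n_feats = len(X), len(X[0])
    picked = [0] * n_feats   # times the GA selected the feature
    offered = [0] * n_feats  # times the feature was in the subsample
    for _ in range(rounds):
        rows = rng.sample(range(n_rows), n_rows // 2)
        feats = rng.sample(range(n_feats), max(2, n_feats // 2))
        Xs = [[X[i][j] for j in feats] for i in rows]
        ys = [y[i] for i in rows]
        best = run_ga(Xs, ys, rng)
        for bit, j in zip(best, feats):
            offered[j] += 1
            picked[j] += bit
    return [p / o if o else 0.0 for p, o in zip(picked, offered)]
```

Calling `ga_stability_selection(X, y)` on a list of numeric rows `X` with 0/1 labels `y` returns one score per feature in the range 0 to 1. Dividing by the number of times a feature was offered, rather than by the number of rounds, keeps features that were rarely drawn into a subsample from being penalized; that normalization is a design choice of this sketch.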

AUC is assessed with an unregularized logistic regression model on multiple-subsampled healthcare records, collected under the Healthcare Cost and Utilization Project (HCUP), utilizing the National (Nationwide) Inpatient Sample (NIS) database.

Reported results show that averaging feature importance from top-4 SS and the SS using GA (GASS) improves these AUC results.
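Averaging feature importance across several selection methods can be illustrated with a short sketch. It assumes each method yields one importance score per feature on a comparable scale and that the combination is a plain element-wise mean; the abstract does not state the exact averaging scheme, and the method names in the usage note are hypothetical.

```python
def average_importance(scores_by_method):
    """Element-wise mean of per-feature importance vectors, one vector per method."""
    vectors = list(scores_by_method.values())
    if not vectors:
        raise ValueError("at least one importance vector is required")
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all importance vectors must have the same length")
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(n)]

def rank_features(importance):
    """Feature indices ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
```

For example, `rank_features(average_importance({"ss_a": [0.2, 0.8, 0.5], "gass": [0.4, 0.6, 0.5]}))` orders the three features by their mean score across the two methods.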
Original language: English
Title of host publication: GECCO '17
Subtitle of host publication: Proceedings of the Genetic and Evolutionary Computation Conference Companion
Place of Publication: New York
Publisher: Association for Computing Machinery
Number of pages: 2
ISBN (Print): 978-1-4503-4939-0
Publication status: Published - 15 Jul 2017
Event: GECCO 2017: The Genetic and Evolutionary Computation Conference
Duration: 15 Jul 2017 to 19 Jul 2017



