Automated Feature Selection of Predictors in Electronic Medical Records Data.

Jessica Gronsbell, Jessica Minnier, Sheng Yu, Katherine Liao, Tianxi Cai

Biometrics 2018 October 25

The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold-standard labels curated via labor-intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold-standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold-standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model.
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping, thereby harnessing the automated power of EHR data and improving efficiency.
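
The two-step idea described in the abstract (cluster a few highly predictive surrogates without labels, then run a sparse regression of the resulting pseudo-outcome on the remaining candidate features) can be illustrated with a small sketch. This is not the authors' implementation. It assumes simulated data, a plain 1-D 2-means split in place of whatever clustering the paper uses, and a least-squares lasso solved by coordinate descent in place of the paper's sparse regression. All names (`two_means_1d`, `lasso`, `surrogate`, `features`) and all numeric settings are hypothetical.

```python
import random


def two_means_1d(values, iters=50):
    """Split 1-D scores into a low (0) and a high (1) cluster with 2-means."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        cut = (lo + hi) / 2.0
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    cut = (lo + hi) / 2.0
    return [1.0 if v > cut else 0.0 for v in values]


def standardize(column):
    """Center a feature column and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso(columns, response, lam, sweeps=200):
    """Coordinate descent for (1/2n)||r - X b||^2 + lam * ||b||_1.

    Assumes every column is already standardized; the response is centered here.
    """
    n = len(response)
    mean = sum(response) / n
    resid = [r - mean for r in response]
    beta = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that excludes its own fit.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


# Simulated cohort (assumption): latent disease status is never shown to the method.
random.seed(0)
n = 400
disease = [1 if random.random() < 0.3 else 0 for _ in range(n)]

# One highly predictive surrogate, e.g. a log-transformed diagnosis code count.
surrogate = [random.gauss(3.0 * d, 0.5) for d in disease]

# Six candidate features: 0 and 1 carry signal, 2 through 5 are pure noise.
features = []
for j in range(6):
    signal = 2.0 if j < 2 else 0.0
    raw = [signal * d + random.gauss(0.0, 1.0) for d in disease]
    features.append(standardize(raw))

# Step 1: unsupervised pseudo-labels from the surrogate alone.
labels = two_means_1d(surrogate)

# Step 2: sparse regression of the pseudo-labels on the candidate features.
beta = lasso(features, labels, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

No gold-standard label is used at any point; `disease` only drives the simulation. The penalty `lam=0.1` is an arbitrary choice for this toy data and would need tuning on real EHR features.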
