
A Natural Language Processing Approach to Extract Detailed Cannabis Use Patterns from Unstructured EHRs

Kimia Zandbiglari
University of Florida

Co-Authors: Shobhan Kumar, Sebastian Jugl, Masoud Rouhizadeh
University of Florida

Background: Cannabis use for medical and recreational purposes has increased significantly in recent years. Yet structured electronic health record (EHR) data often lack detailed information on usage patterns, including mode, frequency, and co-use with other substances, so these details must be recovered from unstructured clinical notes. Understanding them can improve healthcare delivery and inform public policy amid growing cannabis access and acceptance.

Objective: This study seeks to uncover detailed cannabis use patterns through natural language processing (NLP) analysis of free-text EHR notes, focusing on mode of administration, usage frequency, and co-use with other substances.

Methods: We analyzed 1162 sentences from the clinical notes of 500 unique patients with a positive cannabis use status who were seen in the University of Florida Health system, using notes from a 60-day window per patient starting at their first visit to the health system. The notes were searched using the Cannabis Use Lexicon (CULx), which our group previously developed to identify textual mentions of cannabis use. We manually reviewed the extracted sentences and categorized cannabis use mentions along three axes: mode of administration (‘By Mouth’, ‘Smoking/Vaping’, ‘Topical’, ‘Unknown’); frequency of use (‘Daily’, ‘Weekly’, ‘Occasionally’, ‘Unknown’); and co-use with other substances (‘Alcohol’, ‘Illicit Drugs’, ‘Opioids/Opiates’, ‘Tobacco/Nicotine’, ‘Other Medications’, covering selected other prescribed or over-the-counter (OTC) medications, and ‘Unknown’). To classify sentences into these categories automatically, we trained and applied several NLP models, including BERT, RoBERTa, ClinicalBERT, Decision Trees, and Logistic Regression. Performance was assessed using accuracy, sensitivity, and positive predictive value (PPV).
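As an illustration of the lexicon search step described above, the following is a minimal sketch in Python. It is not the CULx implementation: the term list, the naive sentence splitter, and the function names `split_sentences` and `extract_mentions` are all assumptions made for this example.

```python
import re

# Hypothetical stand-in terms for illustration only.
# The actual CULx lexicon is not reproduced here.
LEXICON = ["cannabis", "marijuana", "thc", "cbd"]

# One case-insensitive pattern that matches any lexicon term as a whole word.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in LEXICON) + r")\b",
    re.IGNORECASE,
)


def split_sentences(note):
    """Naively split a note on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", note)
    return [part.strip() for part in parts if part.strip()]


def extract_mentions(note):
    """Return the sentences of a note that contain at least one lexicon term."""
    return [s for s in split_sentences(note) if PATTERN.search(s)]


if __name__ == "__main__":
    example = "Patient smokes marijuana daily. Denies alcohol use. Uses CBD oil for pain."
    for sentence in extract_mentions(example):
        print(sentence)
```

Clinical notes often contain abbreviations and irregular punctuation, so a production pipeline would likely need a clinical sentence segmenter rather than this simple splitter.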

Results: Of the 500 patients, 39.4% had a documented mode of cannabis use, 44.2% a documented frequency, and 34% documented co-use with other substances. Smoking/vaping was the most common mode and daily use the most common frequency, with notable co-use of tobacco/nicotine and alcohol. The NLP classifiers, achieving up to 93% accuracy, 94% PPV, and 93% sensitivity, closely matched human annotation in detecting cannabis use patterns. RoBERTa performed best at identifying mode of use and co-use, with over 93% PPV and sensitivity, and was strongest for co-use (95% PPV, 94% sensitivity). ClinicalBERT was most accurate in identifying frequency, with 89% PPV and 90% sensitivity.
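For readers unfamiliar with the metrics reported above, the sketch below shows one way accuracy, sensitivity, and PPV can be computed for a multi-class task by comparing model predictions with manual annotations. The function name `evaluate` and the toy labels are assumptions for this example, and the abstract does not state how per-class scores were averaged, so the sketch reports them per class only.

```python
def evaluate(gold, pred):
    """Compute overall accuracy plus per-class sensitivity and PPV.

    gold: list of manually annotated labels
    pred: list of model-predicted labels (same length and order)
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    per_class = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        per_class[label] = {
            # Sensitivity (recall): share of true instances that were found.
            "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
            # PPV (precision): share of predicted instances that were correct.
            "ppv": tp / (tp + fp) if (tp + fp) else 0.0,
        }
    return accuracy, per_class


if __name__ == "__main__":
    # Toy labels for illustration only; these are not study data.
    gold = ["Daily", "Daily", "Weekly", "Unknown"]
    pred = ["Daily", "Weekly", "Weekly", "Unknown"]
    accuracy, per_class = evaluate(gold, pred)
    print(accuracy)
    print(per_class)
```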

Conclusions: Our manual annotation showed that detailed cannabis use patterns are frequently documented in unstructured EHR notes. We developed a preliminary NLP system that automatically replicates expert manual categorization of these details. This work provides a strong foundation for recognizing and categorizing these patterns in unstructured EHR notes, and ultimately for facilitating research into cannabis use trends.