Ruba Sajdeya
University of Florida
Co-Authors: Chen Bai¹, Sebastian Jugl¹, Ronald L. Ison¹, Mamoun T. Mardini¹, Patrick J. Tighe¹, Kimia Zandbiglari¹, Hanzhi Gao¹, Almut G. Winterstein¹, Thomas A. Pearson¹, Robert L. Cook¹, Masoud Rouhizadeh¹
¹University of Florida
Background
The use of electronic health record (EHR) data in clinical cannabis research is limited and, when it relies on structured data types, prone to significant bias, because most cannabis-related information is documented in unstructured narrative clinical notes. We aimed to develop a natural language processing (NLP) algorithm using machine learning (ML) techniques to extract documentation of preoperative cannabis use status from unstructured narrative clinical notes.
Methods
We created a comprehensive lexicon of 3,630 concepts corresponding to cannabis use that may appear in clinical notes. We applied a keyword search strategy to identify note snippets containing matching cannabis use status concepts within 60 days of surgery among 1,500 randomly selected surgery patients aged ≥ 65 years at Shands Hospital in 2018-2020. We manually annotated 1,928 matching snippets from 463 unique notes of 173 patients and classified them into eight categories based on the context, time, and certainty of the cannabis use documentation. Using the labeled snippets, we trained two conventional ML classifiers and three deep learning classifiers to replicate human annotation.
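The keyword search step described above can be illustrated with a minimal sketch. It is not the study's implementation: the four-term lexicon, the function names, and the 40-character context window are all assumptions made for illustration, and the real lexicon contains 3,630 concepts.

```python
import re

# Hypothetical mini-lexicon for illustration only; the study's lexicon
# contains 3,630 concepts and is not reproduced here.
LEXICON = ["marijuana", "cannabis", "CBD", "THC"]


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern for all terms."""
    # Longest terms first so multi-word concepts win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def extract_snippets(note_text, pattern, window=40):
    """Return one snippet per keyword match, with `window` characters
    of context on each side (an assumed snippet definition)."""
    snippets = []
    for match in pattern.finditer(note_text):
        start = max(0, match.start() - window)
        end = min(len(note_text), match.end() + window)
        snippets.append(
            {"keyword": match.group(1).lower(), "snippet": note_text[start:end]}
        )
    return snippets


pattern = build_pattern(LEXICON)
note = "Patient denies marijuana use. Uses CBD oil nightly for knee pain."
found = extract_snippets(note, pattern)
```

Each returned snippet would then be passed to a human annotator or to a trained classifier. Restricting notes to those dated within 60 days of surgery is a separate filtering step that this sketch does not show.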
Results
Of 1,928 annotated snippets, 56.28% were classified as “Positive current use,” 22.93% as “Not a true cannabis mention,” 10.79% as “Positive past use,” 6.38% as “True mention not reporting use status,” and 3.63% as “Negative current use.” There was no documentation representing “Uncertain past use,” “Uncertain current use,” or “Negative past use.” The top two matching keywords were “marijuana” (n = 1,086; 56.33%) and “CBD” (n = 538; 27.90%). The tested classifiers achieved classification results close to human performance, with up to 93% precision and 95% recall for preoperative cannabis use status documentation.
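For readers unfamiliar with the reported metrics, the sketch below shows how precision and recall for a single category can be computed by comparing classifier output against human annotation. The labels and predictions are made-up toy values, not study data, and the abstract does not state how per-category scores were aggregated, so no averaging scheme is shown.

```python
def precision_recall(gold, pred, label):
    """Precision and recall for one category, treating the human
    annotation (`gold`) as the reference standard."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example with invented labels (not study data).
gold = ["current", "current", "past", "current"]
pred = ["current", "past", "past", "current"]
p, r = precision_recall(gold, pred, "current")
```

Here precision is the share of snippets the classifier assigned to a category that the annotators also assigned to it, and recall is the share of annotator-assigned snippets that the classifier recovered.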
Conclusion
Our NLP model successfully replicated human annotation of preoperative cannabis use documentation, providing a baseline framework for identifying and classifying documentation of cannabis use. Our model can be employed to identify clinical cohorts with reliable cannabis exposure data and appropriate control groups from EHR data. This will support high-quality cannabis-related clinical outcomes research aiming to address the existing knowledge gaps and guide clinical practices and policies. Moreover, our systematically developed lexicon provides a comprehensive knowledge-based resource covering a wide range of cannabis-related concepts for future NLP applications.