TY - JOUR
T1 - Examining the Gateway Hypothesis and Mapping Substance Use Pathways on Social Media: A Machine Learning Approach
AU - Yuan, Yunhao
AU - Kasson, Erin
AU - Taylor, Jordan
AU - Cavazos-Rehg, Patricia
AU - De Choudhury, Munmun
AU - Aledavood, Talayeh
N1 - Publisher Copyright: © 2023 JMIR Publications Inc. All rights reserved.
PY - 2023/4/6
Y1 - 2023/4/6
N2 - Background: Substance misuse presents significant global public health challenges. Understanding transitions between substance types and the timing of shifts to polysubstance use is vital for targeted prevention, harm reduction, and recovery strategies. The longstanding gateway hypothesis suggests high-risk substance use is preceded by lower-risk substance use. However, the source of this correlation is hotly contested. While some claim that low-risk substance use causes subsequent, riskier substance use, most users of low-risk substances do not escalate to higher-risk substances. Social media data holds the potential to shed light on the factors contributing to substance use transitions. Objective: By leveraging social media data, our study aims to gain a better understanding of substance use pathways. By identifying and analyzing the transitions of individuals between different risk levels of substance use, our goal is to find specific linguistic cues in individuals' social media posts that could be indicative of escalating or de-escalating patterns in substance use. Methods: We conducted a large-scale analysis using data from Reddit, collected between 2015 and 2019, consisting of over 2.29 million posts and approximately 29.37 million comments by around 1.4 million users from substance use subreddits. This data facilitated the creation of a risk transition dataset reflecting the substance use behaviors of over 1.4 million users. We deployed deep learning and machine learning techniques, including fine-tuned BERT and RoBERTa models, to predict the escalation or de-escalation in risk levels based on initial transition phases documented in posts and comments. Additionally, we conducted an extensive linguistic analysis of the language patterns associated with transitions in substance use, emphasizing the role of n-gram features in predicting future risk trajectories.
Results: Our results showed promise in predicting the escalation or de-escalation in risk levels based on the historical data that Reddit users created during initial transition phases among drug-related subreddits, with an accuracy of 78.48% and an F1-score of 79.20%. We highlighted the vital predictive features, such as specific substance names and tools indicative of future risk escalations. Our linguistic analysis showed that terms linked with harm reduction strategies were instrumental in signaling de-escalation, whereas descriptors of frequent substance use were characteristic of escalating transitions. Conclusions: This study sheds light on the complexities surrounding the gateway hypothesis of substance use through an examination of online behavior on Reddit. While certain findings validate the hypothesis, indicating a progression from lower-risk substances like marijuana to higher-risk ones, a significant number of individuals did not show this transition. The research underscores the potential of using machine learning in conjunction with social media analysis for predicting substance use transitions. Our results emphasize the role of linguistic features as predictors and the importance of timely interventions.
UR - http://www.scopus.com/inward/record.url?scp=85190825442&partnerID=8YFLogxK
U2 - 10.2196/54433
DO - 10.2196/54433
M3 - Review Article
AN - SCOPUS:85190825442
SN - 2561-326X
VL - 8
SP - 1
EP - 19
JO - JMIR Formative Research
JF - JMIR Formative Research
M1 - e54433
ER -