Overview
The RSDD (Reddit Self-reported Depression Diagnosis) dataset consists of Reddit posts for approximately 9,000 users who have claimed to have been diagnosed with depression ("diagnosed users") and approximately 107,000 matched control users. All posts made to mental health-related subreddits or containing keywords related to depression were removed from the diagnosed users' data; control users' data do not contain such posts due to the selection process.
Further dataset construction details are available below and in Section 3.1 of the EMNLP 2017 paper "Depression and Self-Harm Risk Assessment in Online Forums".
User Selection
The group of diagnosed users consists of users who (1) have a post containing a high-precision diagnosis pattern (e.g., "I was diagnosed with") and a mention of depression, and (2) do not match any exclusion conditions.
Exclusion conditions applied at both the user level and the post level. At the user level, any user who did not have at least 100 posts in non-mental health subreddits before their self-reported diagnosis post was excluded. Two exclusion conditions applied at the post level:
- The mention of depression was required to be no more than 80 characters away from the part of the post matching a diagnosis pattern; posts that did not satisfy this constraint were ignored. A mention of depression is defined as the occurrence of one of the following strings: "depression", "depresion", "depressive disorder", "major depressive", or "mild depressive".
- Posts that matched a negative diagnosis pattern were ignored (e.g., "mother was diagnosed with").
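The two post-level conditions above can be sketched in Python as follows. This is only an illustration, not the code used to build RSDD: the pattern lists `DIAGNOSIS_PATTERNS` and `NEGATIVE_PATTERNS` are hypothetical stand-ins containing just the examples quoted above, and measuring the 80-character limit as the gap between the two matched spans is an assumption.

```python
import re

# Strings that count as a mention of depression (taken from the text above).
DEPRESSION_MENTIONS = ["depression", "depresion", "depressive disorder",
                       "major depressive", "mild depressive"]

# Hypothetical stand-ins: the full pattern lists are not given in this document.
DIAGNOSIS_PATTERNS = [r"\bI was diagnosed with\b"]
NEGATIVE_PATTERNS = [r"\bmother was diagnosed with\b"]

MAX_GAP = 80  # maximum number of characters between pattern match and mention


def span_gap(a_start, a_end, b_start, b_end):
    """Characters separating two spans (0 if they touch or overlap).

    Assumption: "distance" is the gap between the spans, in either direction.
    """
    return max(0, b_start - a_end, a_start - b_end)


def is_diagnosis_post(text):
    """Return True if a post passes both post-level conditions."""
    # Condition 2: posts matching a negative diagnosis pattern are ignored.
    if any(re.search(p, text, re.IGNORECASE) for p in NEGATIVE_PATTERNS):
        return False
    # Collect the spans of every depression mention in the post.
    mentions = [(m.start(), m.end())
                for s in DEPRESSION_MENTIONS
                for m in re.finditer(re.escape(s), text, re.IGNORECASE)]
    # Condition 1: some mention lies within MAX_GAP of a diagnosis pattern match.
    for pattern in DIAGNOSIS_PATTERNS:
        for d in re.finditer(pattern, text, re.IGNORECASE):
            if any(span_gap(d.start(), d.end(), s, e) <= MAX_GAP
                   for s, e in mentions):
                return True
    return False
```

For example, `is_diagnosis_post("I was diagnosed with depression last year.")` passes, while a post whose only diagnosis statement is "My mother was diagnosed with depression." is rejected by the negative pattern.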
We define an MH post to be any post that was made to a subreddit related to mental health or that matched an MH pattern. All of the diagnosed users' MH posts were removed. Users with no MH posts are candidate control users.
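A minimal sketch of this filtering step is shown below. The subreddit set, the keyword pattern, and the post schema (a dict with `subreddit` and `text` keys) are all hypothetical placeholders; the actual lists used for RSDD are not reproduced in this document.

```python
import re

# Hypothetical placeholders, not the lists used to build the dataset.
MH_SUBREDDITS = {"depression", "anxiety"}
MH_PATTERNS = [re.compile(r"\bdepress", re.IGNORECASE)]


def is_mh_post(post):
    """A post is an MH post if it was made to an MH subreddit or matches an MH pattern."""
    if post["subreddit"].lower() in MH_SUBREDDITS:
        return True
    return any(p.search(post["text"]) for p in MH_PATTERNS)


def remove_mh_posts(posts):
    """Applied to diagnosed users: drop every MH post, keep the rest."""
    return [post for post in posts if not is_mh_post(post)]


def is_candidate_control(posts):
    """A user is a candidate control only if none of their posts is an MH post."""
    return not any(is_mh_post(post) for post in posts)
```

Note the asymmetry: diagnosed users keep their remaining posts after filtering, whereas a single MH post disqualifies a user from the control pool entirely.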
Control users were chosen by matching candidate control users with diagnosed users. Each diagnosed user was greedily matched with the 12 control users who had the smallest Hellinger distance between the diagnosed user's and the control user's subreddit post probability distributions, excluding control users whose number of posts differed from the diagnosed user's by more than 10%.
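The matching procedure can be sketched as follows, assuming each user is represented by a list of subreddit names (one entry per post). The function names, the input format, and the choice to remove a control from the pool once it has been matched are assumptions made for illustration; the description above does not specify them. The Hellinger distance between discrete distributions P and Q is computed as sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2).

```python
import math
from collections import Counter


def subreddit_distribution(subreddits):
    """Probability of posting in each subreddit, estimated from post counts."""
    counts = Counter(subreddits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    squared = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                  for k in keys)
    return math.sqrt(squared / 2.0)


def match_controls(diagnosed, candidates, k=12, tolerance=0.10):
    """Greedily assign up to k controls to each diagnosed user.

    diagnosed, candidates: dict mapping user id -> list of subreddit names.
    Assumption: a control matched to one user is not reused for another.
    """
    available = dict(candidates)
    matches = {}
    for user, subs in diagnosed.items():
        p = subreddit_distribution(subs)
        n_posts = len(subs)
        # Keep only controls whose post count is within the tolerance.
        scored = [(hellinger(p, subreddit_distribution(c_subs)), cid)
                  for cid, c_subs in available.items()
                  if abs(len(c_subs) - n_posts) <= tolerance * n_posts]
        scored.sort()
        chosen = [cid for _, cid in scored[:k]]
        for cid in chosen:
            del available[cid]
        matches[user] = chosen
    return matches
```

Because the matching is greedy, the result depends on the order in which diagnosed users are processed, and a diagnosed user may end up with fewer than `k` controls if too few eligible candidates remain.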
Statistics
This process yielded 9,210 diagnosed users, who were split into training, validation, and testing sets along with their matched controls.
| | Diagnosed Users | Control Users |
|---|---|---|
| Training | 3,070 | 35,753 |
| Validation | 3,070 | 35,746 |
| Testing | 3,070 | 35,775 |
| Total | 9,210 | 107,274 |
Both the number of posts made by each user and the length of each post vary widely, as shown below.
Obtaining the data
The RSDD dataset contains only publicly available Reddit posts. Posts may contain information related to users' health, however, and are thus sensitive. To protect users' privacy, researchers who wish to obtain the dataset must sign a data usage agreement.
Succinctly, the agreement requires that researchers
- make no attempt to contact any user in the dataset
- make no attempt to deanonymize or learn the identity of any user in the dataset
- make no attempt to link users in the dataset with any external information (e.g., an account on another website)
- do not share any portion of the data, including example posts or excerpts from posts, with any other party
Researchers interested in obtaining the RSDD dataset may submit a data request form to be provided with the data usage agreement and further information on obtaining the data.