Overview

The RSDD (Reddit Self-reported Depression Diagnosis) dataset consists of Reddit posts for approximately 9,000 users who have claimed to have been diagnosed with depression ("diagnosed users") and approximately 107,000 matched control users. All posts made to mental health-related subreddits or containing keywords related to depression were removed from the diagnosed users' data; control users' data do not contain such posts due to the selection process.

Further dataset construction details are available below and in Section 3.1 of the EMNLP 2017 paper "Depression and Self-Harm Risk Assessment in Online Forums".

User Selection

The group of diagnosed users consists of users who (1) have a post containing a high-precision diagnosis pattern (e.g., "I was diagnosed with") and a mention of depression, and (2) do not match any exclusion conditions.

Exclusion conditions were applied at both the user level and the post level. At the user level, users with fewer than 100 posts in non-mental health subreddits before their self-reported diagnosis post were excluded. Two exclusion conditions were applied at the post level.

  1. The mention of depression was required to be no more than 80 characters away from the part of the post matching a diagnosis pattern; posts that did not satisfy this constraint were ignored.
    A mention of depression is defined as the occurrence of one of the following strings: "depression", "depresion", "depressive disorder", "major depressive", or "mild depressive".
  2. Posts that matched a negative diagnosis pattern were ignored (e.g., "mother was diagnosed with").

Users who both matched a diagnosis pattern and did not match any exclusion condition were considered for inclusion. Crowdsourced annotators were asked to view the posts matching a diagnosis pattern and indicate whether the user was truly claiming to have been diagnosed with depression.
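
The two post-level conditions can be sketched in Python as follows. This is only an illustrative sketch, not the code used to build RSDD: the diagnosis and negative patterns shown are hypothetical stand-ins (only the quoted examples and the list of depression strings come from the description above), and measuring the 80-character distance as the gap between the two matched spans is an assumption.

```python
import re

# Hypothetical stand-ins for the high-precision and negative diagnosis patterns;
# the full pattern lists are not given in this document.
DIAGNOSIS_PATTERNS = [re.compile(r"\bi was diagnosed with\b", re.IGNORECASE)]
NEGATIVE_PATTERNS = [
    re.compile(r"\b(?:mother|father|brother|sister) was diagnosed with\b", re.IGNORECASE)
]

# Depression mention strings as listed in the description above.
DEPRESSION_MENTIONS = [
    "depression", "depresion", "depressive disorder",
    "major depressive", "mild depressive",
]
MENTION_RE = re.compile(
    "|".join(re.escape(m) for m in DEPRESSION_MENTIONS), re.IGNORECASE
)

MAX_DISTANCE = 80  # characters


def char_gap(a_start, a_end, b_start, b_end):
    """Characters between two spans; 0 if they touch or overlap (assumed metric)."""
    if b_start >= a_end:
        return b_start - a_end
    if a_start >= b_end:
        return a_start - b_end
    return 0


def is_candidate_post(text):
    """Return True if the post passes both post-level conditions."""
    # Condition 2: ignore posts matching a negative diagnosis pattern.
    if any(p.search(text) for p in NEGATIVE_PATTERNS):
        return False
    mention_spans = [m.span() for m in MENTION_RE.finditer(text)]
    # Condition 1: a depression mention within 80 characters of a diagnosis match.
    for pattern in DIAGNOSIS_PATTERNS:
        for d in pattern.finditer(text):
            for start, end in mention_spans:
                if char_gap(d.start(), d.end(), start, end) <= MAX_DISTANCE:
                    return True
    return False
```

Posts passing this check would still go to the crowdsourced annotators described above; the sketch covers only the automatic filtering step.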

We define an MH (mental health) post to be any post that was made to a subreddit related to mental health or that matched an MH pattern. All of the diagnosed users' MH posts were removed. Users with no MH posts are candidate control users.

Control users were chosen by matching candidate control users with diagnosed users. Each diagnosed user was greedily matched with the 12 control users who had the smallest Hellinger distance between the diagnosed user's and the control user's subreddit post probability distributions. Control users whose post count was more than 10% higher or lower than the diagnosed user's were excluded from matching.
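
A minimal sketch of this matching step, in Python, is given below. It is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, each user is represented as a non-empty list of subreddit names (one per post), the 10% rule is read as a post-count difference of at most 10% of the diagnosed user's count, and a control user is assumed to be matched to at most one diagnosed user.

```python
import math
from collections import Counter


def subreddit_distribution(subreddits):
    """Probability of posting in each subreddit, from a non-empty list of names."""
    counts = Counter(subreddits)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def match_controls(diagnosed, candidates, k=12, tolerance=0.10):
    """Greedily assign up to k control users to each diagnosed user.

    Both arguments map a user id to that user's list of subreddit names.
    """
    available = set(candidates)
    cand_dist = {u: subreddit_distribution(s) for u, s in candidates.items()}
    matches = {}
    for user, subs in diagnosed.items():
        dist = subreddit_distribution(subs)
        n_posts = len(subs)
        # Assumed reading of the 10% rule: keep candidates within +/-10% posts.
        eligible = [
            c for c in available
            if abs(len(candidates[c]) - n_posts) <= tolerance * n_posts
        ]
        # Smallest distance first; ties broken by user id for determinism.
        eligible.sort(key=lambda c: (hellinger(dist, cand_dist[c]), c))
        chosen = eligible[:k]
        matches[user] = chosen
        # Assumption: a control user is not reused for another diagnosed user.
        available.difference_update(chosen)
    return matches
```

The Hellinger distance is 0 for identical distributions and 1 for distributions with no subreddit in common, so smaller values indicate users who post in similar communities.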

Statistics

This process yielded 9,210 diagnosed users, who were split into training, validation, and testing sets along with their matched controls.

             Diagnosed Users   Control Users
Training               3,070          35,753
Validation             3,070          35,746
Testing                3,070          35,775
Total                  9,210         107,274

Both the number of posts made by each user and the length of each post vary widely, as shown below.

Figure: Empirical CDF of the number of posts per user
Figure: Empirical CDF of the length of posts in tokens

Obtaining the data

The RSDD dataset contains only publicly available Reddit posts. Posts may contain information related to users' health, however, and are thus sensitive. To protect users' privacy, researchers who wish to obtain the dataset must sign a data usage agreement.

In brief, the agreement requires that researchers

  • make no attempt to contact any user in the dataset
  • make no attempt to deanonymize or learn the identity of any user in the dataset
  • make no attempt to link users in the dataset with any external information (e.g., an account on another website)
  • do not share any portion of the data, including example posts or excerpts from posts, with any other party

Researchers interested in obtaining the RSDD dataset may submit a data request form to be provided with the data usage agreement and further information on obtaining the data.