How to scrape Reddit with Python


Last month, Storybench editor Aleszu Bajak and I decided to explore user data on nootropics, the brain-boosting pills that have become popular for their productivity-enhancing properties. Many of the substances are also banned at the Olympics, which is why we were able to pitch and publish the piece at Smithsonian magazine during the 2018 Winter Olympics. For the story and visualization, we decided to scrape Reddit to better understand the chatter surrounding drugs like modafinil, noopept and piracetam.

In this Python tutorial, I will walk you through how to access the Reddit API to download data for your own project.

This is what you will need to get started:

  • Python 3.x: I recommend the Anaconda distribution for its simple package management. You can also download Python from the project’s website. When following the script, pay special attention to indentation, which is a vital part of Python.
  • An IDE (Integrated Development Environment) or a text editor: I personally use Jupyter Notebooks for projects like this (Jupyter is already included in the Anaconda distribution), but use whatever you are most comfortable with. You can also run scripts from the command line.
  • These two Python packages installed: PRAW, to connect to the Reddit API, and pandas, which we will use to handle, format, and export the data (the install command is sketched right after this list).
  • A Reddit account. You can create one at reddit.com.
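
If you don’t already have the two packages, they can be installed from the command line with pip (or with conda, if you went with the Anaconda distribution):

pip install praw pandas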

The Reddit API

The very first thing you’ll need to do is “Create an App” within Reddit to get the OAuth2 keys to access the API. It is easier than you think.

Go to this page (reddit.com/prefs/apps) and click the “create app” or “create another app” button at the bottom left.

This form will open up.

Pick a name for your application and add a description for reference. Also make sure you select the “script” option and don’t forget to put http://localhost:8080 in the redirect uri field. If you have any doubts, refer to the PRAW documentation.

Hit “create app” and you are ready to use OAuth2 authorization to connect to the API and start scraping. Copy your 14-character personal use script and 27-character secret key somewhere safe; both appear on your application’s entry on that page.


The “shebang line” and importing packages and modules

We will be using only one of Python’s built-in modules, datetime, and two third-party packages, pandas and PRAW. The best practice is to put your imports at the top of the script, right after the shebang line, which starts with #!. It should look like this:

#!/usr/bin/env python3
import praw
import pandas as pd
import datetime as dt

The “shebang line” is what you see on the very first line of the script, #!/usr/bin/env python3. You only need to worry about it if you plan to run the script from the command line. The shebang line simply tells the operating system which interpreter to use to run the file. It varies a little bit from Windows to Macs to Linux, so replace the first line accordingly:

On Windows, the shebang line is #! python3.

On Linux, the shebang line is #! /usr/bin/python3.

Getting Reddit and subreddit instances

PRAW stands for Python Reddit API Wrapper, and it makes it very easy for us to access Reddit data. First we connect to Reddit by calling the praw.Reddit function and storing the result in a variable. I’m calling mine reddit. You should pass the following arguments to that function:

reddit = praw.Reddit(client_id='PERSONAL_USE_SCRIPT_14_CHARS',
                     client_secret='SECRET_KEY_27_CHARS',
                     user_agent='YOUR_APP_NAME',
                     username='YOUR_REDDIT_USER_NAME',
                     password='YOUR_REDDIT_LOGIN_PASSWORD')

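To make sure those credentials actually work, a quick optional check is to ask PRAW which account you are authenticated as; a minimal sketch:

print(reddit.user.me())  # should print YOUR_REDDIT_USER_NAME if the connection succeeded
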
From that, we use the same logic to get to the subreddit we want: call the .subreddit instance from reddit and pass it the name of the subreddit we want to access. The name can be found after “r/” in the subreddit’s URL. I’m going to use r/Nootropics, one of the subreddits we used in the story.

Also, remember to assign that to a new variable, like this:

subreddit = reddit.subreddit('Nootropics')

Accessing the threads

Each subreddit has five different ways of organizing the topics created by redditors: .hot, .new, .controversial, .top, and .gilded. You can also use .search("SEARCH_KEYWORDS") to get only results matching a search query.
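
For instance, here is a short sketch of what a few of those calls look like; the keyword and limits below are arbitrary examples, not values from the story:

# each of these returns a lazy generator of submissions
new_posts = subreddit.new(limit=10)
controversial_posts = subreddit.controversial(limit=10)
search_results = subreddit.search("modafinil", limit=10)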

Let’s just grab the most up-voted topics of all time with:

top_subreddit = subreddit.top()

That will return a list-like object with the top-100 submissions in r/Nootropics. You can control the size of the sample by passing a limit to .top(), but be aware that Reddit’s request limit* is 1000, like this:

top_subreddit = subreddit.top(limit=500)

*PRAW had a fairly easy work-around for this by querying the subreddits by date, but the endpoint that allowed it is soon to be deprecated by Reddit. We will try to update this tutorial as soon as PRAW’s next update is released.
There is also a way of requesting a refresh token for those who are more advanced Python developers.

Parsing and downloading the data

We are now very close to getting the data in our hands. Our top_subreddit object has methods to return all kinds of information from each submission. You can check it for yourself with these two simple lines:

for submission in subreddit.top(limit=1):
    print(submission.title, submission.id)

For the project, Aleszu and I decided to scrape this information about the topics: title, score, url, id, number of comments, date of creation, and body text. This can be done very easily with a for loop just like the one above, but first we need to create a place to store the data. In Python, that is usually done with a dictionary. Let’s create it with the following code:

topics_dict = { "title":[], \
                "score":[], \
                "id":[], "url":[], \ 
                "comms_num": [], \
                "created": [], \

Now we are ready to start scraping the data from the Reddit API. We will iterate through our top_subreddit object and append the information to our dictionary.

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

Python dictionaries, however, are not very easy for us humans to read. This is where the pandas module comes in handy. We’ll finally use it to put the data into something that looks like a spreadsheet; in pandas, those are called DataFrames.

topics_data = pd.DataFrame(topics_dict)

The data now looks like a spreadsheet-style table.
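
If you want to take a quick look yourself, pandas’ head method prints the first few rows; this is an optional check, not a required step:

print(topics_data.head())  # first five rows of the scraped submissions
print(len(topics_data))    # number of submissions collected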

Fixing the date column

Reddit uses UNIX timestamps to format date and time. Instead of manually converting all those entries, or relying on an online converter, we can easily write up a function in Python to automate that process. We define it, call it, and join the new column to the dataset with the following code:

def get_date(created):
    return dt.datetime.fromtimestamp(created)
_timestamp = topics_data["created"].apply(get_date)
topics_data = topics_data.assign(timestamp = _timestamp)
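
As a side note, pandas can also do this conversion in a single vectorized call; the line below is an alternative to the get_date approach, though it yields timestamps in UTC rather than local time:

topics_data = topics_data.assign(timestamp=pd.to_datetime(topics_data["created"], unit="s"))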

The dataset now has a new column that we can understand and is ready to be exported.

Exporting a CSV

Pandas makes it very easy for us to create data files in various formats, including CSVs and Excel workbooks. To finish up the script, add the following to the end.

topics_data.to_csv('FILENAME.csv', index=False)
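
Since pandas can also write Excel workbooks, a variant worth knowing is to_excel; it assumes you have an Excel writer such as openpyxl installed, and FILENAME is whatever name you choose:

topics_data.to_excel('FILENAME.xlsx', index=False)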

That is it. You scraped a subreddit for the first time. Now, let’s go run that cool data analysis and write that story.

If you have any questions, ideas, thoughts or contributions, you can reach me at @fsorodrigues or fsorodrigues [ at ] gmail [ dot ] com.

Felippe Rodrigues
Felippe is a former law student turned sports writer and a big fan of the Olympics. He is currently a graduate student in Northeastern’s Media Innovation program.
