
How to get census data in 5 minutes using R and tidycensus

This article was originally published on Medium.

Do you get tired of grabbing data directly off of data.census.gov? Or has the Census API been throwing errors in your code? Maybe you are just like the rest of us, wanting to streamline your workflows as much as possible.

Well, Dr. Kyle Walker had all of us census data users in mind when developing tidycensus, an R package that makes obtaining census data unbelievably easy.

Let’s walk through how to obtain census data in under five minutes, using tidycensus.

Acquiring an API key from the U.S. Census Bureau

Before you can begin unlocking the hidden superpowers of tidycensus, you must first acquire a free API key from the U.S. Census Bureau. You can request one on the Census Bureau key signup page (https://api.census.gov/data/key_signup.html).

You should receive an email within a few minutes that includes your new key. This is crucial because the tidycensus package is built on the Census API, meaning that none of the data-retrieval functions in the package will work without a key.

Luckily, tidycensus has a neat little function to quickly install the key onto your computer. But first, we need to get the package installed. Here’s how:

Install tidycensus

As with any other package in R, the first step to begin using its functions is to install it. Run the following from the console of your IDE of choice (my personal preference is RStudio).

install.packages("tidycensus")

Initializing your API key

Now that we have the package installed, we can load it and make use of that nifty function for installing the API key that I mentioned earlier.

library(tidycensus)
census_api_key("YOUR KEY GOES HERE", install = TRUE)

Notice how there are two pieces to this function: the API key itself (make sure to put it inside of quotes) and the install argument, which in this case is set to TRUE. You will only need to use this line of code once, which is when you load the key for the first time.

The install = TRUE argument tells tidycensus to save the key to your .Renviron file so that R can use it every time you make an API call. You will not have to repeat this step on this device, so long as your key remains valid.
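To confirm that the key was stored, a minimal check along these lines can help. It assumes the key lives in your .Renviron file under the name CENSUS_API_KEY, which is where install = TRUE writes it. The helper name has_census_key is made up for this sketch and is not part of tidycensus.

```r
# Hypothetical helper: TRUE when a Census API key is visible to this R session.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

# A freshly installed key is only picked up by new R sessions.
# Re-reading .Renviron makes it available right away (if the file exists).
renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
if (file.exists(renviron)) readRenviron(renviron)

if (has_census_key()) {
  message("Census API key found.")
} else {
  message("No key found; run census_api_key() first.")
}
```

If the check reports no key right after installation, restarting R is usually the simplest fix.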

The core functions

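There are two core functions that will be the basis of working with tidycensus:

  • get_decennial(): retrieves data from the Decennial Census.
  • get_acs(): retrieves data from the American Community Survey (ACS).

Both operate very similarly and utilize the following arguments to execute the proper API requests: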

  • geography: the geographic level at which you would like your data to be broken out. The tidycensus documentation lists the available geographies for each survey.
  • state: the state you are selecting data from. Note: if you set geography = "state", you can leave the state argument out of the call entirely, resulting in data at the state level for the entire U.S. Similarly, setting geography = "us" returns data at the national level.
  • variables: here is where you enter the variables you would like to select. Hold tight for a crash course on how to make easy use of this argument.
  • year: the year of data you would like to obtain. The ACS happens every year, while the Decennial Census happens only once every 10 years.
  • sumfile: unique to the get_decennial() function, this argument tells the API which summary file to ask for.
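As a rough sketch of how these arguments combine, the calls below request the same variable at three geographic levels. They assume an API key is already installed and an internet connection is available. B01003_001 is the ACS total population variable, and the object names are only illustrative.

```r
library(tidycensus)

# One row per state: no 'state' argument is needed.
state_pop <- get_acs(
  geography = "state",
  variables = "B01003_001",
  year = 2020
)

# One row per county in North Carolina.
nc_county_pop <- get_acs(
  geography = "county",
  state = "NC",
  variables = "B01003_001",
  year = 2020
)

# A single row for the entire country.
us_pop <- get_acs(
  geography = "us",
  variables = "B01003_001",
  year = 2020
)
```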

Variable selection

One of the best parts about this package is how easy it is to identify your desired variable names with the load_variables() function.

This eliminates the need to go manually searching online for every variable name you want. With a few simple lines of code, you can have a searchable table full of variable names (along with their more detailed labels for reference).

Use the following code to create objects containing the list of variable names from a few different surveys.

# 2020 Decennial Census Variables
decennial_2020_vars <- load_variables(
  year = 2020,
  dataset = "pl",
  cache = TRUE
)

# 2010 Decennial Census Variables
decennial_2010_vars <- load_variables(
  year = 2010,
  dataset = "pl",
  cache = TRUE
)

# 2016 - 2020 5 Year American Community Survey (ACS) Variables
acs_20_vars <- load_variables(
  year = 2020,
  dataset = "acs5",
  cache = TRUE
)

You can now open these tables (for example, with View(acs_20_vars)) and use the search box in the RStudio data viewer to quickly identify the variable names you want.

Once you have your variables picked out, you can save them all in a named vector and pass that to tidycensus to retrieve their corresponding values in a tidy data frame (and even rename them in the process). Let’s take a look:
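If you would rather search in code than in the viewer, a small filter does the job. The sketch below uses a hand-built stand-in for the table that load_variables() returns (columns name, label, and concept), so it runs without an API call. The rows are simplified for illustration, and find_vars is a made-up helper name.

```r
# Simplified stand-in for a load_variables() result.
vars <- data.frame(
  name = c("P1_001N", "P2_001N", "P2_002N", "H1_001N"),
  label = c("Total", "Total", "Total: Hispanic or Latino", "Total"),
  concept = c(
    "RACE",
    "HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE",
    "HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE",
    "OCCUPANCY STATUS"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose label or concept matches a pattern.
find_vars <- function(vars, pattern) {
  hit <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  vars[hit, , drop = FALSE]
}

hispanic_vars <- find_vars(vars, "hispanic")
print(hispanic_vars$name)
```

The same helper should work on a real table such as decennial_2020_vars, since it only relies on the label and concept columns.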

desired_vars = c(
all = "P2_001N",
hisp = "P2_002N",
white = "P2_005N",
baa = "P2_006N",
amin = "P2_007N",
asian = "P2_008N",
nhopi = "P2_009N",
other = "P2_010N",
multi = "P2_011N"
)

Passing them through the get_decennial() function:

census_data = get_decennial(
  geography = "county",
  state = "NC",
  variables = desired_vars, # here is where I am using the variables defined above
  summary_var = "P2_001N", # creates a column w/'total' variable
  year = 2020,
  sumfile = "pl"
)

The above code would return a data table containing all of the variables defined in desired_vars. In addition to that, there will be a new column, summary_value, created by the summary_var argument. It holds the summary variable, or the total number of all sub-variables combined.


In other words, if you total up all of the race and ethnicity subgroups, the result equals the summary variable for Race & Ethnicity data.

(This comes in handy when you want to show composition, since it lets you quickly calculate each group as a percentage of the total.)
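Here is a minimal sketch of that percentage calculation. It uses a small made-up data frame shaped like the long-format output of get_decennial() when summary_var is set (columns GEOID, NAME, variable, value, summary_value). The counties and counts are fictional.

```r
# Fictional data in the shape get_decennial() returns with summary_var set.
census_data <- data.frame(
  GEOID = c("00001", "00001", "00002", "00002"),
  NAME = c("County A", "County A", "County B", "County B"),
  variable = c("hisp", "white", "hisp", "white"),
  value = c(250, 750, 100, 300),
  summary_value = c(1000, 1000, 400, 400),
  stringsAsFactors = FALSE
)

# Each group as a percentage of the total in the same row.
census_data$pct <- round(100 * census_data$value / census_data$summary_value, 1)

print(census_data[, c("NAME", "variable", "pct")])
```

Because every row carries its own total, no join or group-wise sum is needed to get the percentages.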

ACS tables

When searching for ACS data, there is another neat trick up tidycensus’ sleeve — the table argument.

Using table = "enter table name here", one can easily acquire an entire table from the ACS, rather than typing out a list of variable names one by one.

# Income Data by County for North Carolina
nc_county_income = get_acs(
  geography = "county",
  state = "NC",
  table = "B19001"
)

# Note that leaving the 'year' argument blank returns the default year of the package. As of writing this, that is 2020 for both the ACS and Decennial Census.
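If you want the same result no matter which version of the package is installed, one option is to name the year and survey explicitly. The sketch below assumes an installed API key and an internet connection, and the object name is only illustrative.

```r
library(tidycensus)

# Naming the year and survey keeps the request stable even if the
# package default changes in a later release.
nc_county_income_2020 <- get_acs(
  geography = "county",
  state = "NC",
  table = "B19001",
  year = 2020,
  survey = "acs5"
)

# The long-format result has one row per county and variable, with the
# estimate and its margin of error in separate columns.
head(nc_county_income_2020[, c("NAME", "variable", "estimate", "moe")])
```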

Putting it all together

Now that we have covered the basics of tidycensus, let’s gather some data with it. Here is an example, from start to finish, of how to gather race and ethnicity data for every county in New York State:

# Load Libraries
library(tidycensus)
library(tidyverse)

# Load Your API Key (if needed; skip this line once the key is installed)
census_api_key("YOUR KEY GOES HERE", install = TRUE)

# Select Variables
desired_vars = c(
  all = "P2_001N", # All Residents
  hisp = "P2_002N", # Hispanic
  white = "P2_005N", # White
  baa = "P2_006N", # Black or African American
  amin = "P2_007N", # Native American (American Indian in data)
  asian = "P2_008N", # Asian
  nhopi = "P2_009N", # Native Hawaiian or Pacific Islander
  other = "P2_010N", # Some Other Race
  multi = "P2_011N" # Two or More Races
)

# Retrieve the Data
reth_NY_20 = get_decennial(
  geography = "county",
  state = "NY",
  variables = desired_vars,
  summary_var = "P2_001N", # Same as 'All'
  year = 2020,
  sumfile = "pl"
)

And there you have it, folks! Census data — easily acquired — in less than 5 minutes with tidycensus.

Here is a link to the GitHub repository containing all of the code from this post.

Thomas Gomes