How to access APIs in R

Data Journalism in R, How to, Insights
Share on FacebookShare on Google+Tweet about this on TwitterPin on PinterestShare on LinkedInEmail this to someone

APIs, or application program interfaces, are a way for people to access data in a plain text format using multiple programming languages. Many websites, organizations and services offer APIs for accessing their data, like Twitter, Wikipedia, Reddit and OpenSecrets.

This tutorial will walk you through accessing APIs from DataUSA and OpenSecrets in R Studio with the help of R’s httr and jsonlite packages.

A quick note: When a website has an API, it’s very important to read all the documentation. Most APIs require you to provide an email address for a key and some even require a justification for requesting and using their data. It’s also good manners to make sure you’re not overloading the server with requests.

To access data using an API we need to 1) send the HTTP data request with a specific set of instructions for the web server, and 2) receive and parse the response, which typically contains the data in either JSON or XML. We’ll use the httr package to access APIs and interact with data requests and we’ll use jasonlite package to interpret those requests.

Using the DataUSA.io API

We will start with an example from the DataUSA.io API. Read the documentation here. According to the docs, there are three endpoints in this API – Data, Attributes, and Search. The first table we will be accessing is the Attributes table, which stores information on the variables in the data table, “such as their name and profile image.”

Let’s look into the Degrees (degree) attributes by building an API request as a string vector and placing it inside the httr::GET() function.

NOTE: The httr::GET() function takes a url argument (not a uri argument), but it’s important to know the difference because of specific purposes they serve. More on this later.

Putting together your GET request

We’ll start by combining the http standard request for APIs with some specific information from the DataUSA website (http://api.datausa.io) into a single object, dusa_root_api.

dusa_root_api <- "http://api.datausa.io"

We then specify two additional components in the API request path:

  1. The attributes (attrs) resource for the data we'll be requesting
  2. The degrees (degree) path or sub-resource from attributes API

These are all separated by a forward slash, "/".

These two portions make up the domain and directory URI.

Now we'll combine these using stringr::str_c() into a dusa_api_attr_geo.

# combine these elements
dusa_api_attrs_degree <- stringr::str_c(dusa_root_api,
  # add attributes
  "attrs",
  # add degree data
  "degree",
  # separate by slash (like web URL)
  sep = "/"
)
dusa_api_attrs_degree

    ## [1] "http://api.datausa.io/attrs/degree"

By passing the API GET request (dusa_api_attrs_degree) to httr::GET() and assigning the output to another object (dusa_api_attrs_degree_GET), we can see the following information gets provided about our request:

dusa_api_attrs_degree_GET <- httr::GET(url = dusa_api_attrs_degree)
dusa_api_attrs_degree_GET

    ## Response [https://api.datausa.io/attrs/degree/]
    ##   Date: 2019-04-19 19:55
    ##   Status: 200
    ##   Content-Type: application/json
    ##   Size: 428 B

What is in the httr::GET() results

The following elements are included in this htrr::GET() result:

  • Date = time and date for the GET request
  • Status = Numerical code: 200 is a 'successful request', 404 is 'file not found', 403 is 'permission denied'
  • Content-Type = tells us the data file type from the API. In this case JSON.
  • Size = the size of the data file

This is a high-level overview of what is contained in the request, but we will have to dig deeper to understand more about extracting the data.

Check the status of the GET request with httr::http_status()

It's also good practice to check the status of an API request with the httr::http_status() function.

httr::http_status(dusa_api_attrs_degree_GET)

    ## $category
    ## [1] "Success"
    ## 
    ## $reason
    ## [1] "OK"
    ## 
    ## $message
    ## [1] "Success: (200) OK"

Now that we can see the GET request was successful, we should double-check the response using httr::content() to make sure the data are what we’re expecting. I want to specify as = text to make the output easier to read.

httr::content(dusa_api_attrs_degree_GET,
  as = "text"
)

  ## No encoding supplied: defaulting to UTF-8.

  ## [1] "{\"data\": [[\"19\", \"Other Doctorate\"], [\"18\",
 \"Professional Doctorate\"], [\"6\", \"Postbaccalaureate 
Certificate\"], [\"3\", \"Associate's Degree\"], [\"8\", \"Post-
Master's Certificate\"], [\"17\", \"Research Doctorate\"], 
[\"2\", \"1 to 2 Year Postsecondary Certificate\"], [\"4\", \"2
 to 4 Year Postsecondary Certificate\"], [\"5\", \"Bachelor's 
Degree\"], [\"7\", \"Master's Degree\"], [\"1\", \"< 1 Year 
Postsecondary Certificate\"]], \"headers\": [\"id\", \"name\"]}"

The message tells me the response is using the default text encoding (UTF-8). The good news is that the data look like college degrees, so we can be confident the API request is working!

How it works: The httr::content() function uses the Content-Type data from httr::GET() to determine the best way to parse the incoming data. In this case, it will call the jsonlite::fromJSON() because we learned the data being returned are application/json.

What is 'JavaScript Object Notation' (JSON) data?

Most APIs return data in a JavaScript Object Notation (JSON) format (pronounced "Jay-son"). The JSON format is beneficial because 1) it's a plain text file, and 2) it doesn't need to be structured in a tabular data frame (i.e. is can store 'non-rectangular' or unstructured text easily).

Below is a schematic of a basic JSON object.

Reading JSON data into R

We will need to pass dusa_api_attrs_degree directly to jsonlite::fromJSON() and check it's class, we see it is read into R as a list()

library(jsonlite)
degree_attributes <- jsonlite::fromJSON(dusa_api_attrs_degree)
base::class(degree_attributes)

    ## [1] "list"

The degree_attributes list has two elements, data and headers.

degree_attributes %>% str()

List of 2
 $ data   : chr [1:11, 1:2] "19" "18" "6" "3" ...
 $ headers: chr [1:2] "id" "name"

We will pass the data to tibble::as_tibble() and assign the headers to the column names with magrittr::set_colnames().


DegreeAttr <- tibble::as_tibble(degree_attributes$data) %>%
  magrittr::set_colnames(value = degree_attributes$headers)
DegreeAttr %>% glimpse(78)

Observations: 11
Variables: 2
$ id    "19", "18", "6", "3", "8", "17", "2", "4", "5", "7", "1"
$ name  "Other Doctorate", "Professional Doctorate", "Postbaccalaureat

Now we've converted the contents of the API request to a data frame! Let's repeat this process, but with a slightly more complicated request from a different API.

Getting data from the OpenSecrets API

You can send more specific requests using API requests, too. To demonstrate this, we'll be using the opensecrets.org API. This requires you to sign up for an access key here.

API queries

After you've signed up and have an API access key, you'll need to read up on the documentation for the available data. For this example, I'll be downloading the data from the candContrib table, which contains information on the "top contributors to specified candidate for a House or Senate seat or member of Congress."

In the documentation, an example API query is presented and I've represented each component in the figure below:

API queries follow a general syntax (called query parameters) for accessing various resources on the web server. We will build a new request using the same method as above, but with a few additional specifications.

The base domain (API)

The first portion of this should be familiar from the previous request we built--it contains the http and domain information.

opensec_root <- stringr::str_c("https://", "www.opensecrets.org/api/")
opensec_root
    ## [1] "https://www.opensecrets.org/api/"

Our query will start at the end of the domain api (opensec_root in our case) with a question mark (?). Next we will add the data source we are interested in candContrib.

opensec_candContrib <- stringr::str_c(opensec_root, "?method=candContrib")
opensec_candContrib

## [1] "https://www.opensecrets.org/api/?method=candContrib"

We will add the cycle information (cycle) to limit the amount of data to the year 2018.

opensec_candContrib18 <- stringr::str_c(opensec_candContrib, "&cycle=2018")
opensec_candContrib18

## [1] "https://www.opensecrets.org/api/?method=candContrib&cycle=2018"

Next we need to specify the kind of data we want the request to return by including an output=json parameter.

opensec_candContrib18JSON <- stringr::str_c(
  opensec_candContrib18,
  "&output=json"
)
opensec_candContrib18JSON

## [1] "https://www.opensecrets.org/api/?method=candContrib&cycle=2018&output=json"

Now I will include the API key, which was given to me when I signed up on opensecrets. This API key should be stored in a separate file, so it doesn't get unintentionally shared or distributed.

source("code/api-key.R")

The api-key.R file contains my API key in a string opensecrets_api_key that I can combine with opensec_candContrib18JSON

opensec_candContrib18JSONapi <- stringr::str_c(
  opensec_candContrib18JSON,
  "&apikey=",
  opensecrets_api_key
)

Finally, I'll include the cid which is the unique identifier for candidates. These are available for download here in the data documentation. We'll import the 2018 candidates sheet from this file below.

library(readxl)
CRPID2018 <- read_excel("meta/CRP_IDs.xls",
  sheet = "Candidate Ids - 2018",
  range = "B14:F3676"
)
CRPID2018 %>% glimpse(78)

    ## Observations: 3,662
    ## Variables: 5
    ## $ CID           "N00034296", "N00040877", "N00040419", "N00041028", "N…
    ## $ CRPName       "Aalders, Tim", "Aarestad, David", "Abatecola, Bill", …
    ## $ Party         "3", "D", "R", "D", "D", "D", "R", "R", "D", "R", "D",…
    ## $ DistIDRunFor  "UTS1", "CO06", "AZ03", "AZS2", "MN05", "GA06", "ND01"…
    ## $ FECCandID     "S2UT00229", "H8CO06237", "H6AZ03245", "S8AZ00122", "H…

I am interested in Cory Booker, so we'll use this lookup table to try and find his cid.

CRPID2018 %>%
  dplyr::filter(stringr::str_detect(
    string = CRPName,
    pattern = "Booker"
  ))

    ## # A tibble: 1 x 5
    ##   CID       CRPName        Party DistIDRunFor FECCandID
    ##                               
    ## 1 N00041440 Booker, Tykiem D     DES1         S8DE00111

This shows a Tykiem Booker, but this isn't who I am looking for. Fortunately, I just learned a bit about how url's get built. I will return to the opensecrets website and search for Cory Booker.

This gives us the following search results, and I want to choose the second result down titled, "Sen. Cory Booker - Campaign Finance Summary • OpenSecrets".

I chose this option because I can see cid is listed in the url. After clicking on the link, I can see that the url contains syntax that looks like the query parameters I've been building.

I include the cid=N00035267 in my string vector.

opensecrets_candContribCall <- stringr::str_c(
  opensec_candContrib18JSONapi,
  "&cid=N00035267"
)

Check the API GET request

Now I can use the httr::http_status() function on my opensecrets_candContribCall string.

httr::http_status(httr::GET(opensecrets_candContribCall))
    ## $category
    ## [1] "Success"
    ## 
    ## $reason
    ## [1] "OK"
    ## 
    ## $message
    ## [1] "Success: (200) OK"

Check the status of the GET request with httr::http_status() and for any errors with httr::http_error().

httr::http_error(httr::GET(opensecrets_candContribCall))
    ## [1] FALSE

Pass GET requests directly to fromJSON()

I also have the option to pass my API path to the jsonlite::fromJSON() function. I will store this in the opensec_json_query object.

opensec_json_query <- jsonlite::fromJSON(txt = opensecrets_candContribCall)
utils::str(opensec_json_query)

## List of 1
    ##  $ response:List of 1
    ##   ..$ contributors:List of 2
    ##   .. ..$ @attributes:List of 6
    ##   .. .. ..$ cand_name: chr "Cory Booker (D)"
    ##   .. .. ..$ cid      : chr "N00035267"
    ##   .. .. ..$ cycle    : chr "2018"
    ##   .. .. ..$ origin   : chr "Center for Responsive Politics"
    ##   .. .. ..$ source   : chr "https://www.opensecrets.org/members-of-congress/contributors?cid=N00035267&cycle=2018"
    ##   .. .. ..$ notice   : chr "The organizations themselves did not donate, rather the money came from the organization's PAC, its individual "| __truncated__
    ##   .. ..$ contributor:'data.frame':   10 obs. of  1 variable:
    ##   .. .. ..$ @attributes:'data.frame':    10 obs. of  4 variables:
    ##   .. .. .. ..$ org_name: chr [1:10] "Paul, Weiss et al" "NorPAC" "Sullivan & Cromwell" "Gibbons PC" ...
    ##   .. .. .. ..$ total   : chr [1:10] "177020" "158871" "138150" "134375" ...
    ##   .. .. .. ..$ pacs    : chr [1:10] "0" "4827" "0" "10000" ...
    ##   .. .. .. ..$ indivs  : chr [1:10] "177020" "154044" "138150" "124375" 

We can see this is a list of one, and each object inside the list has data on Cory Booker. If I start investigating the contents of this list, I can see the actual data are embedded inside a few layers.

This is where RStudio comes in handy. I can use dplyr::glimpse() in an Rmarkdown code chunk to quickly view the multiple objects inside the opensec_json_query list.

As you may have seen, there are two objects with @attributes in the opensec_json_query list. The first contains information on our candidate and is stored in the object below.

# note the special notation for objects starting with @
opensec_json_query$response$contributors$`@attributes`

    ## $cand_name
    ## [1] "Cory Booker (D)"
    ## 
    ## $cid
    ## [1] "N00035267"
    ## 
    ## $cycle
    ## [1] "2018"
    ## 
    ## $origin
    ## [1] "Center for Responsive Politics"
    ## 
    ## $source
    ## [1] "https://www.opensecrets.org/members-of-congress/contributors?cid=N00035267&cycle=2018"
    ## 
    ## $notice
    ## [1] "The organizations themselves did not donate, rather the money came from the organization's PAC, its individual members or employees or owners, and those individuals' immediate families."

We can store this in a tibble using tibble::as_tibble()

CandAttributes <- tibble::as_tibble(opensec_json_query$response$contributors$`@attributes`)
CandAttributes %>% dplyr::glimpse(78)

    ## Observations: 1
    ## Variables: 6
    ## $ cand_name  "Cory Booker (D)"
    ## $ cid        "N00035267"
    ## $ cycle      "2018"
    ## $ origin     "Center for Responsive Politics"
    ## $ source     "https://www.opensecrets.org/members-of-congress/contribu…
    ## $ notice     "The organizations themselves did not donate, rather the …

The other @attributes object is a data frame with four variables: org_name, total, pacs, indivs. I will use the same function to create a tibble from donors and call it BookContribs.

BookContribs <- tibble::as_tibble(opensec_json_query$response$contributors$contributor$`@attributes`)
BookContribs

    ## # A tibble: 10 x 4
    ##    org_name                 total  pacs  indivs
    ##                            
    ##  1 Paul, Weiss et al        177020 0     177020
    ##  2 NorPAC                   158871 4827  154044
    ##  3 Sullivan & Cromwell      138150 0     138150
    ##  4 Gibbons PC               134375 10000 124375
    ##  5 Greenberg Traurig LLP    90710  9000  81710 
    ##  6 Apollo Global Management 83500  2000  81500 
    ##  7 Goldman Sachs            80400  5000  75400 
    ##  8 Cole, Schotz et al       79198  0     79198 
    ##  9 Walt Disney Co           78662  1000  77662 
    ## 10 Morgan Stanley           78120  7000  71120

I want to add a few elements from BookAttributes to BookContribs and call it BookerCont2018

BookerCont2018 <- BookContribs %>%
  add_column(
    cand_name = as_vector(CandAttributes$cand_name),
    origin = as_vector(CandAttributes$origin),
    source = as_vector(CandAttributes$source)
  )
BookerCont2018 %>% dplyr::glimpse(78)

    ## Observations: 10
    ## Variables: 7
    ## $ org_name   "Paul, Weiss et al", "NorPAC", "Sullivan & Cromwell", "Gi…
    ## $ total      "177020", "158871", "138150", "134375", "90710", "83500",…
    ## $ pacs       "0", "4827", "0", "10000", "9000", "2000", "5000", "0", "…
    ## $ indivs     "177020", "154044", "138150", "124375", "81710", "81500",…
    ## $ cand_name  "Cory Booker (D)", "Cory Booker (D)", "Cory Booker (D)", …
    ## $ origin     "Center for Responsive Politics", "Center for Responsive …
    ## $ source     "https://www.opensecrets.org/members-of-congress/contribu…

Now we can do a little tidying and visualizing to create something that looks like data on the opensecrets page.

library(scales)
BookerCont2018 %>%
  # tidy these data
  tidyr::pivot_longer(
    cols = total:indivs,
    names_to = c("cont_type"),
    values_to = "cont_amount"
  ) %>%
  # convert to numeric
  dplyr::mutate(
      # recode missing
    cont_amount = if_else(cont_amount == "0",
      true = NA_character_,
      false = cont_amount),
    # convert to numeric
    cont_amount = as.numeric(cont_amount),
    # factor org_name
    org_name_fct = factor(org_name)
    ) %>% 
  dplyr::filter(cont_type != "total") %>%
  ggplot(data = .) +
  geom_col(aes(x = org_name_fct,
               y = cont_amount,
               fill = cont_type), position = "dodge") + 
    labs(
        title = "2018 Contributions to Cory Booker (D)",
        y = "Contribution amount",
        x = "Contributor",
        caption = "opensecrets.org",
        fill = "Contributor types") +
    theme(axis.text.x = element_text(angle = -40, 
                                     vjust = 0.7, 
                                     hjust = 0)) +
    scale_y_continuous(labels = scales::dollar_format(prefix = "$"))

There we have it! This looks similar to the data we see on the Opensecrets Cory Booker profile page.

Recap

We've walked through how to access API data using httr and jsonlite packages. This tutorial only scratched the surface, so be sure to check out the links below for more details.

Martin Frigaard
Martin Frigaard is a tidyverse/R trainer in Oakland, CA. Find him on Twitter.

Leave a Reply