Working with The New York Times API in R

Data Journalism in R

Have you ever come across a resource that you didn't know existed, but once you find it you wonder how you ever got along without it? I had this feeling earlier this week when I came across the New York Times API. That's right, the paper of record allows you, with a little programming skill, to query their entire archive and work with the data. It's important to note that we don't get the full text of articles, but we do get a lot of metadata and a URL for each article, which means getting the full text is not impossible. Still, this is pretty cool.

So, let's get started! You're going to want to head over to http://developer.nytimes.com to get an API Key. While you're there, check out the selection of APIs on offer–there are over 10, including Article Search, Archive, Books, Comments, Movie Reviews, Top Stories, and more. I'm still digging into each of these myself, so today we'll focus on Article Search, and I suspect I'll revisit the NYT API in this space many times going forward. Also at NYT's developer site, you can use their API Tool feature to try out some queries without writing code. I found this helpful for wrapping my head around the APIs.

Okay, did you get your API Key? There are a couple of ways to store the key as a variable. The first, and more straightforward, is just to declare it like so:

NYTIMES_KEY="YOUR_API_KEY_HERE"

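Or, you can set it as an environment variable and read it back into a variable:

Sys.setenv(NYTIMES_KEY="YOUR_API_KEY_HERE")
NYTIMES_KEY <- Sys.getenv("NYTIMES_KEY") # the query code below expects a variable with this name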

Obviously, this is not my real API Key, or yours. You'll replace “YOUR_API_KEY_HERE” with the actual API Key you get from NYT. One thing to note: when you create a new API Key at the Times' developer website, it will ask you which API you want access to. If you're like me, you want access to everything, so you might be tempted to create a new key for each API, but it turns out this is not necessary. To be honest, it took me far too long to figure this out, but no matter which API you select from the dropdown, the key that NYT emails you is the same. So, you just need to create a key once.
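If you go the environment-variable route, it can help to fail early when the key is missing rather than sending a query with an empty key. Here is a minimal sketch of that idea; get_nyt_key is a hypothetical helper name of my own, not something provided by NYT or any package.

```r
# Hypothetical helper: read the key from the environment and stop if it is missing.
get_nyt_key <- function(var = "NYTIMES_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found in environment variable ", var, call. = FALSE)
  }
  key
}

Sys.setenv(NYTIMES_KEY = "YOUR_API_KEY_HERE")  # placeholder, not a real key
NYTIMES_KEY <- get_nyt_key()
```

R also reads a .Renviron file at startup, so a line like NYTIMES_KEY=... placed there keeps the key out of your scripts entirely.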

My first thought when I discovered the NYT API was that there must be an R package or two for interfacing with the API. And, of course, because the #rstats community is so wonderful, there are at least two. The first I came across is called rtimes and its main function is to “search and retrieve data from the New York Times congress API,” according to the vignette. The only problem is, as far as I can tell, the congress API doesn't exist anymore. It's not listed on NYT's developer site, and in a blog post announcing its creation back in 2009, there are links that now just lead to the main developer site (somebody correct me if I'm wrong about this). But, no worries, rtimes also has article search functionality.

Another package for interacting with the NYT API is nytimes by Mike Kearney, who you might remember from the excellent rtweet package, which I wrote last time is replacing the twitteR package. Kearney's nytimes package is not on CRAN, but instead is available via GitHub, and can be installed using devtools, like so:

install.packages("devtools")
devtools::install_github("mkearney/nytimes")

With nytimes, you get access to the Article Search API, Most Popular API, and the Times Newswire API. This is a good way to get started, and Kearney provides some basic documentation on his GitHub site.

But, after trying out both of these packages, I ultimately decided I wanted to interact more directly with the NYT API. We can accomplish this using the mighty jsonlite package, which, according to the vignette, “is a JSON parser/generator optimized for the web. Its main strength is that it implements a bidirectional mapping between JSON data and the most important R data types.” In short, it makes working with APIs really easy. So, let's get that installed and loaded:

install.packages("jsonlite")
library(jsonlite)

What jsonlite lets us do is query the NYT APIs and convert the results into R-friendly formats like dataframes and lists. Let's start with a simple example. Robert Mueller is on the verge of becoming my favorite person in the news, so let's search his name using the Article Search API:

x <- fromJSON("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=mueller&api-key=YOUR_API_KEY_HERE")

So, the code is simple–we're creating a list called x with the results of our query “mueller”–but the results of the query are not so simple. It turns out that x is a list of 3 items, with a ton of nested lists and dataframes beneath it. It's worthwhile to spend some time peeking through this data to get a sense of how the NYT Article Search API formats the results, but in its current format, it's not very easy to work with. Fortunately, jsonlite allows us to flatten the list and convert it to a dataframe, like so:

library(dplyr) # provides the %>% pipe; run install.packages("dplyr") first if needed
x <- fromJSON("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=mueller&api-key=YOUR_API_KEY_HERE", flatten = TRUE) %>% data.frame()

Now, you should have a dataframe with 10 observations and 30 variables. If you view the dataframe, you'll see it's still not pretty. The variable names, for example, are things like response.docs.snippet and response.docs.pub_date. These names reflect the hierarchy of the list we just flattened, and while they're cumbersome to work with, we'll let them be for now. The next thing you'll notice about the x dataframe is that some of the variables, or columns, contain lists. For example, response.docs.keywords looks something like this:

list(name = c("persons", "organizations", "subject"), value = c("Mueller, Robert S III", "Federal Bureau of Investigation", "United States Politics and Government"), rank = 1:3, major = c("N", "N", "N"))
## $name
## [1] "persons"       "organizations" "subject"      
## 
## $value
## [1] "Mueller, Robert S III"                
## [2] "Federal Bureau of Investigation"      
## [3] "United States Politics and Government"
## 
## $rank
## [1] 1 2 3
## 
## $major
## [1] "N" "N" "N"

Again, it's not pretty, and we could unlist these as well, but again, for the purposes of getting to the fun stuff, I'll leave it alone for now. The next thing to notice is that the query only returned 10 results. Surely, there are more than 10 articles that mention “mueller.” It turns out, the NYT API returns 10 results at a time. Think of this like searching on Google; your query turns up thousands of results, but the first page only shows 10 or so, and to view the rest you need to click through to the next page. This is how the NYT API works as well. There are more results, but they're on subsequent pages.
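For anyone who does want to tame a list-column like response.docs.keywords, one option is to collapse each article's keyword values into a single string. The sketch below runs on a small mock of that column (the values are copied from the output above, and the empty second entry is my assumption about how an article without keywords might look); collapse_keywords is a hypothetical helper, not a jsonlite function.

```r
# Mock of a keywords list-column for two articles; real data would come from the API.
keywords <- list(
  data.frame(name = c("persons", "organizations"),
             value = c("Mueller, Robert S III", "Federal Bureau of Investigation"),
             stringsAsFactors = FALSE),
  data.frame(name = character(0), value = character(0), stringsAsFactors = FALSE)
)

# Hypothetical helper: join each article's keyword values into one string,
# returning NA for articles that have no keywords.
collapse_keywords <- function(kw_list, sep = "; ") {
  vapply(kw_list, function(kw) {
    if (is.null(kw) || length(kw$value) == 0) return(NA_character_)
    paste(kw$value, collapse = sep)
  }, character(1))
}

keyword_text <- collapse_keywords(keywords)
```

With real results, something like x$keyword_text <- collapse_keywords(x$response.docs.keywords) would add a plain character column alongside the list-column.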

It's possible to manually query each page (you add a page= parameter to the query), but that would be impractical. Once again, jsonlite to the rescue! In this vignette we learn how to write a loop to query multiple pages and combine the results into one dataframe. Before we get there, however, let's create a more robust workflow for querying the API.

To start, let's declare some variables that we can use to piece together our query. I'm currently writing a dissertation chapter on media coverage of the Central Park Jogger Case, so that's the example I'll be using below.

While there are a number of parameters we can use in our query, let's start with a search term, a begin date, and an end date. To keep the results small for now, I'll limit the search to the date the crime took place, April 19, 1989, through the month of August:

# Let's set some parameters
term <- "central+park+jogger" # Need to use + to string together separate words
begin_date <- "19890419"
end_date <- "19890901"

Note the comment in the code above: to search for a phrase of several words, join the words with +. Dates are formatted as YYYYMMDD.
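If you'd rather not type the + signs and compact dates by hand, a couple of small helpers can produce them. This is only a sketch; format_term and format_nyt_date are hypothetical names of my own, and they handle just the two formatting rules described above.

```r
# Hypothetical helpers, not part of jsonlite or the NYT API.
# Replace runs of whitespace with "+" so multi-word terms can go in the query string.
format_term <- function(term) gsub("\\s+", "+", trimws(term))

# Convert a date (or an ISO "YYYY-MM-DD" string) to the YYYYMMDD form the query uses.
format_nyt_date <- function(date) format(as.Date(date), "%Y%m%d")

term <- format_term("central park jogger")
begin_date <- format_nyt_date("1989-04-19")
end_date <- format_nyt_date("1989-09-01")
```

These produce the same three values as the hand-written version.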

With these parameters set, we can then paste together a query like so:

# paste0() already joins with no separator, so no sep argument is needed
baseurl <- paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",term,
                  "&begin_date=",begin_date,"&end_date=",end_date,
                  "&facet_filter=true&api-key=",NYTIMES_KEY)

Here we create a value called baseurl that we'll use in our query. But there's one more step before we can actually conduct our search. Remember the paging issue? Like I said, jsonlite gives us a function to overcome this by looping through and collecting the data for each page, but in order to use it, we need to know how many pages to loop through. If we don't use the exact right number of pages, the function will fail.

So here's a bit of hackery to make this work–as always, I'm completely open to the very real possibility that there is a better way to do this, but here's what I came up with. Using the baseurl, we'll query the API. Remember, this will only return the first 10 results and, as we're not flattening, it will be in the nested list format we saw at the beginning. But, for our current purposes, that's okay. Even though we only get 10 results at a time, the results tell us how many total hits there are. We can use this and a little bit of basic math to figure out how many pages we need. In this example, there are 97 hits. So 97 hits, divided by 10 (hits per page), equals 9.7 pages–9 full pages of 10 hits, plus a 10th page with 7 hits. Ten pages, easy. But, not so fast. The first page in the results is not 1, but 0. So the last page we need to query will be 10 – 1, or 9. We'll save that as a value.

initialQuery <- fromJSON(baseurl)
maxPages <- ceiling(initialQuery$response$meta$hits[1] / 10) - 1 # last zero-based page; ceiling() keeps a partial final page
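
The page arithmetic is easy to check on its own. In this sketch, last_page is a hypothetical helper (not part of jsonlite) that rounds up with ceiling(), because plain rounding happens to work for 97 hits but would drop the final partial page for a total such as 92.

```r
# Hypothetical helper: index of the last zero-based page for a given hit count.
last_page <- function(hits, per_page = 10) {
  max(ceiling(hits / per_page) - 1, 0)
}

last_page(97)       # 9: pages 0 through 9
last_page(92)       # 9: the tenth page holds the last 2 hits
last_page(100)      # 9: exactly ten full pages
round(92 / 10 - 1)  # 8: rounding would miss the last page here
```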

Now we can use the loop provided in the jsonlite documentation with a few minor tweaks. The for loop is set to loop through pages 0 to maxPages, as determined above. Then, for each page a dataframe is created by querying the baseurl and pasting on the page number. A list of the dataframes is kept so that, at the end, we can bind them all together. One other thing that I added to the loop is Sys.sleep(1), which basically tells your computer to take a break between queries. This is necessary because otherwise the NYT API shuts you down for attempting “too many requests.”

pages <- list()
for(i in 0:maxPages){
  nytSearch <- fromJSON(paste0(baseurl, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages[[i+1]] <- nytSearch 
  Sys.sleep(1) 
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## Retrieving page 3
## Retrieving page 4
## Retrieving page 5
## Retrieving page 6
## Retrieving page 7
## Retrieving page 8
## Retrieving page 9

Sit back and relax while the loop retrieves each page. When it is complete, we can combine them all into one dataframe:

allNYTSearch <- rbind_pages(pages)
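
If a query still fails partway through (a “too many requests” response, say), the whole loop stops. One way around that is to wrap the loop in a function that retries a page a few times before giving up. What follows is a sketch under my own assumptions, not the approach from the jsonlite documentation: fetch_pages and its arguments are hypothetical, and the fetch argument exists only so the page-fetching step can be swapped out.

```r
# Hypothetical wrapper around the paging loop, with simple retries.
# By default each page is fetched the same way as in the loop above.
fetch_pages <- function(baseurl, max_page, pause = 1, tries = 3,
                        fetch = function(url) {
                          data.frame(jsonlite::fromJSON(url, flatten = TRUE))
                        }) {
  pages <- vector("list", max_page + 1)
  for (i in 0:max_page) {
    url <- paste0(baseurl, "&page=", i)
    result <- NULL
    for (attempt in seq_len(tries)) {
      # Turn an error into NULL so we can try the same page again
      result <- tryCatch(fetch(url), error = function(e) NULL)
      if (!is.null(result)) break
      Sys.sleep(pause * attempt)  # wait a little longer after each failure
    }
    if (is.null(result)) stop("Giving up on page ", i, call. = FALSE)
    pages[[i + 1]] <- result
    Sys.sleep(pause)  # be polite between successful requests
  }
  pages
}
```

With this in place, rbind_pages(fetch_pages(baseurl, maxPages)) should give the same combined dataframe as the loop.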

As expected, we get 97 observations. I encourage you to dig in there and see what variables might be of interest to you. But, as we're already running kind of long here, I'll just show off a couple of basic visualizations before wrapping up. I'll definitely be returning to this in future posts.

So, in what sections did coverage of the Central Park Jogger case appear?

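library(dplyr)   # group_by(), summarize(), mutate(), and the %>% pipe
library(ggplot2) # ggplot(), geom_bar(), coord_flip()

# Visualize coverage by section (the API's type_of_material field)
allNYTSearch %>% 
  group_by(response.docs.type_of_material) %>%
  summarize(count=n()) %>%
  mutate(percent = (count / sum(count))*100) %>%
  ggplot() +
  geom_bar(aes(y=percent, x=response.docs.type_of_material, fill=response.docs.type_of_material), stat = "identity") + coord_flip()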

[Figure: bar chart of the percentage of articles by type of material]

Not surprisingly, most of the coverage was in the News section, but it's also worth noting the amount of Op-Ed, Editorial, and Letters dedicated to the Central Park Jogger case. This is interesting to me as I consider not just what happened, but how the media discussed it.

What about the dates with the most coverage?

allNYTSearch %>%
  mutate(pubDay=gsub("T.*","",response.docs.pub_date)) %>%
  group_by(pubDay) %>%
  summarise(count=n()) %>%
  #filter(count >= 2) %>%
  ggplot() +
  geom_bar(aes(x=reorder(pubDay, count), y=count), stat="identity") + coord_flip()

[Figure: bar chart of article counts by publication day]

On May 2, six articles mentioned the Central Park Jogger, including two in the Metropolitan section and one each in Science, Editorial, Arts, and News Summary. This is interesting to me because it had been two weeks since the crime and nothing of particular importance (as far as I can tell) relating to the case happened on or just before that date. Hmm.
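
The chart above ranks days by count. To look at the same counts in calendar order instead, the publication dates can be converted to proper Date values first. Here's a minimal base R sketch on mock timestamps; the timestamp format is my assumption, based on the gsub("T.*", ...) call above, and the values are made up for illustration.

```r
# Mock publication timestamps; real ones would come from response.docs.pub_date.
pub_date <- c("1989-05-02T00:00:00Z", "1989-05-02T00:00:00Z", "1989-04-21T00:00:00Z")

# Drop the time portion and convert to Date so days sort chronologically
pub_day <- as.Date(sub("T.*", "", pub_date))

# Count articles per day, then order the rows by date
daily <- as.data.frame(table(pubDay = pub_day), stringsAsFactors = FALSE)
daily$pubDay <- as.Date(daily$pubDay)
daily <- daily[order(daily$pubDay), ]
```

The resulting daily dataframe has one row per day with its count in the Freq column, ready to plot as a timeline.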

While there is so much more to explore, I will leave this here for now. I'd be interested to hear about any readers' experiments digging into this data, and/or what you'd like to see me look into more next time. Thanks for reading!

Jonathan D. Fitzgerald
Fitz is a PhD candidate in the English department at Northeastern University as well as a freelance journalist.
