Twitter + R

Data Journalism in R, How to

Introducing twitteR

If you've worked with R in any capacity, you've probably noticed by now that developers who create R packages love playing with the letter “R” in their naming schemes. Well, twitteR is no different. The twitteR package was created by Jeff Gentry to “provide access to the Twitter API within R, allowing users to grab interesting subsets of Twitter data for their analyses,” according to Gentry's user vignette, which you can read here. There are some good tutorials online for using twitteR (I learned, initially, by walking through this one), but here I'll try to focus our efforts on ways journalists might want to interact with Twitter data in R.

Getting Started

Let's start in RStudio by downloading and installing twitteR.

(This post is going to assume you've downloaded and installed RStudio as well as set up a new project. If you haven't, I recommend this post.)

Right, so to install twitteR, you'll need to do this (note the quotation marks):

install.packages("twitteR")

Once you've successfully installed the package, you won't need to do it again, but you will need to load the package whenever you start a new R session, and you can do that like this (note there are no quotation marks this time):

library(twitteR)

Great, now you're ready to go from the R side of things, but to get up and running, you'll need to head over to Twitter's Apps page. Of course, you'll have to already be signed up for Twitter (you are, right? If not, do that first). Once there, click the “Create New App” button and fill in the details for “Name”, “Description”, and “Website”. These are required, and the name has to be unique, which can be tricky. It took me a few tries to find one that works. If you don't have a website, feel free to put StoryBench's URL in there. Or Google's. Or anything. Twitter notes: “If you don't have a URL yet, just put a placeholder here but remember to change it later.” Then, of course, you have to click the checkbox that says you agree to the user agreement.

Once you're in, click over to the “Keys and Access Tokens” page and click the “Create my access token” button on the bottom. Now you should have the four bits of information you're going to need to connect R to Twitter: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret.

Now, head back to RStudio, and enter the following (substituting your actual keys and tokens for the placeholder text below, of course):

consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Upon running this, you'll get a question in your Console asking:

Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions?

1: Yes
2: No

I recommend answering Yes as this will make it so you don't have to enter your Twitter credentials every time you start a new session.

## [1] "Using direct authentication"
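Hard-coding keys in a script makes it easy to publish them by accident. Here is a minimal sketch of one alternative, assuming you have stored the four values in environment variables. The variable names below are made up for illustration; twitteR does not require them.

```r
# Hypothetical variable names: define them in ~/.Renviron or your shell first.
key_names <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
               "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Returns a named character vector; variables that are not set come back as "".
keys <- Sys.getenv(key_names)

not_set <- key_names[keys == ""]
if (length(not_set) > 0) {
  message("Not set: ", paste(not_set, collapse = ", "))
} else {
  # Same call as above, with the values read from the environment
  twitteR::setup_twitter_oauth(keys[[1]], keys[[2]], keys[[3]], keys[[4]])
}
```

If any variable is missing, the sketch only prints which ones and does not try to connect.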

Searching Twitter

Okay, let's see if it's up and running as expected by trying to search for some tweets. At the risk of getting overly political, I'm going to use an example I was playing with shortly after Trump's election in November. If you recall, there was a burst of hate-related incidents after the election and I was working with a team to see if we could collect and archive these. One hashtag that people were using on Twitter to mark these incidents was #trumpsamerica. Let's do a search on that and see what we can find.

listTA <- searchTwitter('#trumpsamerica')

This line searches Twitter for #trumpsamerica and stores the results in a list, which I've named listTA. I find lists difficult to read, so let's convert that to a data.frame using the twListToDF function from the twitteR package.

dfTA <- twListToDF(listTA)

Now you should have a data.frame with a number of columns that tell you basically everything you want to know about the tweets that the search returned, including the text, the date it was created, the screen name of the user who tweeted it, and much more. The first ten rows of the data.frame look like this:

[Image: the first ten rows of the dfTA data.frame]
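If you'd like to poke at the structure without a live connection, here is a rough sketch using a tiny stand-in data.frame. The rows are invented, and only a handful of the columns that twListToDF produces (text, created, screenName, isRetweet) are reproduced; your real dfTA will have more.

```r
# Invented rows standing in for dfTA; only a few of its columns are reproduced.
dfTA_demo <- data.frame(
  text = c("Example tweet one #trumpsamerica",
           "RT @user_a: Example tweet one #trumpsamerica",
           "Example tweet three #trumpsamerica"),
  created = as.POSIXct(c("2017-11-09 10:05:00", "2017-11-09 10:20:00",
                         "2017-11-09 11:40:00"), tz = "UTC"),
  screenName = c("user_a", "user_b", "user_a"),
  isRetweet = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Column names and types
str(dfTA_demo)

# A readable subset of columns for the first rows
head(dfTA_demo[, c("created", "screenName", "text")], 10)

# Who tweeted most often in this sample?
tweets_per_user <- sort(table(dfTA_demo$screenName), decreasing = TRUE)
tweets_per_user
```

The same three calls should work on the real dfTA, since they rely only on column names.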

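Also note that by default the search grabs 25 recent tweets. You can specify how many tweets you want with the n argument, though the search only reaches back about a week and is subject to Twitter's rate limits. (The 3,200 figure that often gets quoted is the cap on a single user's timeline, not on searches.) Here's how to specify the number of tweets to retrieve: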

listTA <- searchTwitter('#trumpsamerica', n = 500, since = '2017-11-08')
dfTA <- twListToDF(listTA)

I also added in the since argument, just for fun. You should now have a dataframe that contains the 500 most recent tweets that use the hashtag #trumpsamerica. (Note that the since argument didn't do much here since there were more than 500 tweets since November 8th.)
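One way to check whether n or since was the limiting factor is to compare the number of rows and the earliest timestamp against what you asked for. This is a sketch on invented timestamps; with real results you would use dfTA$created and the n you passed to searchTwitter.

```r
# Invented timestamps standing in for dfTA$created
created <- as.POSIXct(c("2017-11-14 09:00:00", "2017-11-14 18:30:00",
                        "2017-11-15 07:45:00"), tz = "UTC")

n_requested <- 3                                    # stand-in for n = 500
since_date <- as.POSIXct("2017-11-08", tz = "UTC")  # the since argument

# Did the search return as many tweets as we asked for?
hit_cap <- length(created) >= n_requested

# Is the oldest tweet newer than the since date?
earliest_after_since <- min(created) > since_date

range(created)
hit_cap && earliest_after_since
```

When both checks are TRUE, the search stopped because it reached n, not because it ran back to the since date.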

The twitteR package can do a lot more than just search for tweets, so I encourage you to check out the documentation here. But, just to wrap this up, let's try one more thing…

Fun with Visualizations

Let's create a visualization that shows when these tweets were posted. You'll need to install and load the ggplot2 package if you haven't already.

library(ggplot2)

ggplot(dfTA, aes(created)) + 
  geom_density(aes(fill = isRetweet), alpha = .5) +
  theme(legend.justification = c(1, 1), legend.position = c(1, 1)) +
  xlab('All tweets')

[Plot: density of #trumpsamerica tweets over time, shaded by whether each tweet is a retweet]

So, here we created a density plot showing the density of tweets over time and shaded to reflect whether the tweets were original or retweets. Pretty cool, right?
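If you want numbers to go with the picture, a plain table of tweets per hour is a reasonable companion. This sketch uses invented timestamps and base R only; with real data you would substitute dfTA$created and dfTA$isRetweet.

```r
# Invented values standing in for dfTA$created and dfTA$isRetweet
created <- as.POSIXct(c("2017-11-09 10:05:00", "2017-11-09 10:20:00",
                        "2017-11-09 10:50:00", "2017-11-09 11:40:00"),
                      tz = "UTC")
is_retweet <- c(FALSE, TRUE, TRUE, FALSE)

# Round each timestamp down to the hour it falls in
hour <- format(created, "%Y-%m-%d %H:00", tz = "UTC")

# Cross-tabulate: one row per hour, one column each for originals and retweets
counts <- table(hour, is_retweet)
counts
```

Each row is an hour, and the FALSE and TRUE columns count original tweets and retweets respectively.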

What if we wanted to see the most common words in these #trumpsamerica tweets? With a little help from the tidytext package, we can tokenize the tweets, remove stopwords (using a list from the tokenizers package) and total up the words, like so:

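library(dplyr)      # provides %>%, group_by(), mutate(), n()
library(tidytext)   # provides unnest_tokens()
library(tokenizers) # older releases export stopwords(); newer ones may need stopwords::stopwords("en")
tidy_dfTA <- dfTA %>% 
  unnest_tokens(word, text, stopwords = stopwords()) %>% 
  group_by(word) %>% 
  mutate(count = n())  # keeps one row per word occurrence, tagged with that word's total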

Let's visualize that:

tidy_dfTA %>% 
  filter(count >= 20, word != "https", word != "t.co", word != "rt") %>% 
  ggplot() +
  geom_bar(aes(x = reorder(word, count), fill = word)) +
  coord_flip() +
  guides(fill = "none")  # hide the legend; fill = FALSE is deprecated in recent ggplot2

[Plot: horizontal bar chart of the words that appear at least 20 times in the #trumpsamerica tweets]
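For a quick table of the most common words without any plotting, here is a base-R sketch of the same idea. To be clear, this is a stand-in and not the tidytext approach above: the tweets are invented, the tokenizing is a crude split on non-word characters, and the stopword list is a tiny illustrative one.

```r
# Invented tweets standing in for dfTA$text
tweets <- c("Another incident reported downtown #trumpsamerica",
            "RT @user_a: Another incident reported downtown #trumpsamerica https://t.co/abc123",
            "Collecting incident reports for an archive #trumpsamerica")

clean <- tolower(tweets)
clean <- gsub("https?://\\S+", " ", clean, perl = TRUE)  # drop URLs

# Split on anything that is not a letter, digit, #, @ or underscore
words <- unlist(strsplit(clean, "[^a-z0-9#@_]+"))
words <- words[nchar(words) > 0]

# Tiny illustrative stopword list, plus the "rt" retweet marker
drop_words <- c("rt", "a", "an", "the", "for")
words <- words[!words %in% drop_words]

# Tally and sort, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 10)
```

Unlike the tidytext version, this keeps hashtags and mentions intact as single tokens, which may or may not be what you want.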

Go ahead and experiment with other searches and let us know what you find in the comments below. And stay tuned for more from Data Journalism in R.

Photo by Benjamin Balázs on Unsplash

Jonathan D. Fitzgerald

Fitz is a PhD candidate in the English department at Northeastern University as well as a freelance journalist.
