How to explore and manipulate a dataset from the fivethirtyeight package in R

The fivethirtyeight R package, released by Albert Y. Kim, Chester Ismay, and Jennifer Chunn last March, contains dozens of datasets used in FiveThirtyEight news articles like “A Handful Of Cities Are Driving 2016’s Rise In Murders,” “The Best MLB All-Star Teams Ever,” and “The Dallas Shooting Was Among The Deadliest For Police In U.S. History.” This tutorial explores the murder_2015_final dataset using tools covered in our tidyverse tutorial, such as tibble, gather, arrange and separate.

Install and load the package

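Using RStudio, we’ll install and then load the fivethirtyeight package, along with tidyr, tibble and dplyr, and then print murder_2015_final.


# Install once; comment this line out after the first run
install.packages(c("fivethirtyeight", "tidyr", "tibble", "dplyr"))
library(fivethirtyeight)
library(tidyr)
library(tibble)
library(dplyr)
murder_2015_final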

Look at the dataset’s column names

Use names(murder_2015_final) to list out the dataset’s column names.
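
As a quick illustration, here is a minimal sketch on a made-up stand-in. The tibble toy and its values are hypothetical; only the column names mirror the ones this tutorial uses. glimpse() from dplyr is an optional extra that also shows each column’s type.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for murder_2015_final (values are made up)
toy <- tibble(
  city = c("City A", "City B"),
  state = c("State X", "State Y"),
  murders_2014 = c(10L, 20L),
  murders_2015 = c(15L, 18L),
  change = c(5L, -2L)
)

names(toy)    # character vector of the column names
glimpse(toy)  # column names, types and the first few values
```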

Gather variables into a single column

Let’s gather the two year variables, murders_2014 and murders_2015, into a single column we’ll name murder_year. We’ll store the number of murders in a column titled murders and call the new object murders_gathered.

murders_gathered <- murder_2015_final %>% 
    gather(
        murder_year,
        murders,
        murders_2014:murders_2015,
        na.rm = TRUE)
murders_gathered
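
To see what gather() does on something small, here is a sketch using a made-up tibble (toy and its values are hypothetical). It also shows pivot_longer(), which tidyr 1.0.0 and later offer for the same reshaping with named arguments. The two return rows in a different order, so compare them after sorting.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical data: two cities, one murder count per year (values are made up)
toy <- tibble(
  city = c("City A", "City B"),
  state = c("State X", "State Y"),
  murders_2014 = c(10L, 20L),
  murders_2015 = c(15L, 18L)
)

# gather(): key column name, value column name, then the columns to stack
toy_gathered <- toy %>%
  gather(murder_year, murders, murders_2014:murders_2015)

# pivot_longer(): the same reshaping, spelled out with named arguments
toy_longer <- toy %>%
  pivot_longer(
    cols = murders_2014:murders_2015,
    names_to = "murder_year",
    values_to = "murders"
  )

toy_gathered
```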

Arrange data alphabetically by state and city

Now let’s arrange this data alphabetically by state and city. We can do this with arrange() from the dplyr package. (We’ll learn more about dplyr in the next tutorial!)

murders_arranged <- murders_gathered %>% 
    arrange(
        state, 
        city)
murders_arranged

So now we have the two years in a single column (murder_year), but the repeated murders_ prefix is redundant. I want the year in a column by itself.

Separate “murder_year” column into “text” and “year”

First I want to split murder_year into two columns, one holding the text and one holding the year (2014 or 2015). I’ll do this with separate().

The separate() command takes the name of the existing column we want to separate (murder_year) and the names of the columns that will hold the separated values (c("text", "year")). By default it splits wherever it finds a non-alphanumeric character, which here is the underscore.

murders_separate <- murders_arranged %>%
    separate(
        murder_year,
        into = c("text", "year"))
murders_separate
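
separate() leaves the new year column as text. A small sketch on made-up data (toy_gathered is hypothetical) shows two optional arguments that I’m adding here and that the tutorial code does not use: sep names the delimiter explicitly, and convert = TRUE turns the year into an integer.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical gathered data (values are made up)
toy_gathered <- tibble(
  city = c("City A", "City A"),
  murder_year = c("murders_2014", "murders_2015"),
  murders = c(10L, 15L)
)

# Split on the underscore and let separate() convert "2014" to an integer
toy_separated <- toy_gathered %>%
  separate(murder_year, into = c("text", "year"), sep = "_", convert = TRUE)

toy_separated
```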

Great. Now I can use spread() to put the years back into two different columns, 2014 and 2015. I’ll combine this with arrange() so the output is easier to read.

murders_spread <- murders_separate %>% 
    spread(
        year,
        murders) %>% 
    arrange(
        state,
        city)
murders_spread
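
Here is the same step sketched on made-up data (toy_separated is hypothetical), alongside pivot_wider(), the counterpart that tidyr 1.0.0 and later provide. Note that the new columns are named 2014 and 2015, so they need backticks when you refer to them.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical separated data (values are made up)
toy_separated <- tibble(
  city = c("City A", "City A", "City B", "City B"),
  text = "murders",
  year = c("2014", "2015", "2014", "2015"),
  murders = c(10L, 15L, 20L, 18L)
)

# spread(): key column first, value column second
toy_spread <- toy_separated %>%
  spread(year, murders)

# pivot_wider(): the same reshaping with named arguments
toy_wider <- toy_separated %>%
  pivot_wider(names_from = year, values_from = murders)

toy_spread$`2015`  # backticks because the column name starts with a digit
```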

What if I want to combine city and state into a single column city_state?

Using unite to paste one column into another

The final command, unite(), lets me paste the contents of several columns together into one. It requires the name of the new column (city_state) and the columns I want to combine (city and state). I also want to sort this new tibble alphabetically by city_state and remove the text variable.

I can combine all of these together with the pipe (%>%).

murders_final <- murders_spread %>%
    unite(
        city_state, 
        city, 
        state) %>% 
    arrange(
        city_state) %>% 
    select(
        -text)
murders_final
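
By default unite() joins the values with an underscore. If you would rather have something like “City, State”, the sep argument controls the separator. This is a sketch on made-up data (toy_spread is hypothetical), and sep = ", " is my own choice, not what the tutorial code uses.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical spread data (values are made up)
toy_spread <- tibble(
  city = c("City B", "City A"),
  state = c("State Y", "State X"),
  text = "murders",
  change = c(-2L, 5L)
)

# Paste city and state together with a comma, sort, then drop the text column
toy_final <- toy_spread %>%
  unite(city_state, city, state, sep = ", ") %>%
  arrange(city_state) %>%
  select(-text)

toy_final
```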

Output the new table as a csv

Use write.csv(murders_final, file = "murders_final.csv", row.names = FALSE, na = "") and voilà, you have a CSV.
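
To check what those arguments do, here is a small round trip on a made-up data frame (toy_final and csv_path are hypothetical names). It writes to a temporary file so nothing lands in your working directory, then reads the file back.

```r
# Hypothetical table with one missing value (values are made up)
toy_final <- data.frame(
  city_state = c("City A_State X", "City B_State Y"),
  change = c(5L, NA),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE drops the row-number column; na = "" writes NA as an empty field
write.csv(toy_final, file = csv_path, row.names = FALSE, na = "")

readLines(csv_path)  # inspect the raw text of the file

# Empty fields in a numeric column are read back in as NA
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
round_trip
```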

Full script
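
What follows is a condensed sketch of the steps above, chained into a single pipe rather than saved as intermediate objects. It assumes the fivethirtyeight, tidyr and dplyr packages are already installed, and it writes murders_final.csv to the current working directory.

```r
library(fivethirtyeight)
library(tidyr)
library(dplyr)

murders_final <- murder_2015_final %>%
  # stack the two year columns into murder_year / murders
  gather(murder_year, murders, murders_2014:murders_2015, na.rm = TRUE) %>%
  arrange(state, city) %>%
  # split "murders_2014" into text = "murders" and year = "2014"
  separate(murder_year, into = c("text", "year")) %>%
  # put each year back into its own column
  spread(year, murders) %>%
  arrange(state, city) %>%
  # paste city and state together, sort, and drop the leftover text column
  unite(city_state, city, state) %>%
  arrange(city_state) %>%
  select(-text)

write.csv(murders_final, file = "murders_final.csv", row.names = FALSE, na = "")
```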

A recap of what we learned

We used the pipe operator to string together various tidyr and dplyr functions for structuring our data (in tibbles). Remember that:

  1. gather() collects data spread across columns and stacks it into rows
  2. arrange() sorts the rows by the values in one or more columns
  3. separate() splits the contents of one column into several new columns
  4. spread() distributes data from rows into columns
  5. unite() pastes the contents of several columns together into one column

 

 

A quick barplot

By typing barplot(murders_final$change), you can create a quick barplot of the change in murders by city.

The bars are not sorted by change from low to high, because murders_final is ordered by city_state. To fix that, try your hand with arrange().

murders_final_sort <- murders_final %>% 
  arrange(
    change)
murders_final_sort
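
arrange() sorts from low to high by default. Wrapping the column in desc() reverses that, as in this sketch on made-up data (toy_final is hypothetical).

```r
library(tibble)
library(dplyr)

# Hypothetical final table (values are made up)
toy_final <- tibble(
  city_state = c("City A_State X", "City B_State Y", "City C_State Z"),
  change = c(5L, -2L, 12L)
)

toy_sorted <- toy_final %>% arrange(change)             # low to high
toy_sorted_desc <- toy_final %>% arrange(desc(change))  # high to low

toy_sorted
toy_sorted_desc
```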

Then, plotting barplot(murders_final_sort$change) will draw the bars in ascending order of change.

To extend the y-axis down to -20, set ylim:

barplot(murders_final_sort$change,
        ylim = c(-20, 120))

Finally, we’ll add some labels. There are many different ways to add labels in R. Here’s one way:

midpts 
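
One common base R pattern, sketched here on made-up values, assumes midpts is meant to hold the bar midpoints that barplot() returns; text() can then place a label at each midpoint. The vectors change and bar_labels are hypothetical, and the srt, adj and cex settings are my own guesses at something readable, not the settings from the original post.

```r
# Hypothetical sorted values and labels (values are made up)
change <- c(-2, 5, 12)
bar_labels <- c("City B", "City A", "City C")

# barplot() draws the bars and returns the x position of each bar's midpoint
midpts <- barplot(change, ylim = c(-20, 120))

# Place a rotated label under each bar; xpd = TRUE lets text spill past the plot area
text(
  x = midpts,
  y = -20,
  labels = bar_labels,
  srt = 60,
  adj = 1,
  xpd = TRUE,
  cex = 0.7
)
```

With the real data, the same calls would take murders_final_sort$change and murders_final_sort$city_state in place of the two toy vectors.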

 

Pro tip: RStudio publishes super helpful cheatsheets. Here’s one for dplyr:

Martin Frigaard is a graduate student at UCSF. Find him on Twitter.
