How to explore correlations in R
This post will cover how to measure the relationship between two numeric variables with the
corrr package. We will look at how to assess a variable’s distribution using skewness, kurtosis, and normality. Then we’ll examine the relationship between two variables by looking at the covariance and the correlation coefficient.
Load the packages
The packages we’ll be using in this tutorial are the following:
library(egg) # for ggarrange library(inspectdf) # check entire data.frame for variable types, etc. library(visdat) # visualize missing data library(skimr) # eda for entire data.frame library(corrr) # correlations library(skimr) # summaries library(hrbrthemes) # themes for graphs library(tidyverse) # all tidyverse packages library(tidymodels) # meta package for modeling library(psych) # for skewness and kurtosis library(socviz) # for %nin%
Set a graph theme
Graph theme’s give us a little customization for the graphs we’ll be producing. If you want to learn more about
ggplot2, check out our tutorial here.
# set the theme to theme_light ggplot2::theme_set(theme_ipsum_tw(base_family = "Titillium Web", base_size = 9, plot_title_family = "Poppins", axis_title_size = 13))
Import the data
If you’d like to see the script for how we downloaded and imported these data, they’re in a Gist here.
# fs::dir_tree("data", regex = "DailyShow") DailyShowYouTube <- readr::read_csv("data/2019-09-30-DailyShowYouTube.csv") DailyShowYouTube %>% dplyr::glimpse(78)
#> Observations: 251 #> Variables: 9 #> $ published_at <dttm> 2015-10-05 23:57:09, 2014-12-05 17:01:33, 2014-09-23… #> $ dislike_count <dbl> 11184, 1065, 722, 2469, 654, 917, 547, 2846, 1449, 22… #> $ url <chr> "https://i.ytimg.com/vi/yEPSJF7BYOo/default.jpg", "ht… #> $ id <chr> "yEPSJF7BYOo", "AHO1a1kvZGo", "lPgZfhnCAdI", "9pOiOhx… #> $ comment_count <dbl> 8130, 2486, 3616, 17629, 2689, 2704, 1887, 3422, 6229… #> $ title <chr> "The Daily Show - Admiral General Aladeen", "The Dail… #> $ view_count <dbl> 19159013, 5624319, 5223787, 5134669, 4920942, 4347723… #> $ like_count <dbl> 161732, 38442, 39755, 47906, 25499, 30986, 25640, 492… #> $ host <chr> "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon…
DailyShowYouTube contains 9 variables. Some of these are meta data for the videos in the playlist (
published_at), others contain information on the video related to viewership (
like_count). For more information on these variables, check out the YouTube API documentation.
In this section, we’re going to use visualizations to help us understand how much two numeric variables are related, or how much they are correlated. We’ll start by visualizing variables by themselves, then move into bivariate (two-variable) graphs. The
inspectdf packages allow us to take a quick look at an entire data frame or sets of variables.
Below is a
skimr::skim() of the
|Number of rows||251|
|Number of columns||9|
|Column type frequency:|
Variable type: character
Variable type: numeric
Variable type: POSIXct
|published_at||0||1||2013-04-24 10:02:53||2016-11-09 02:32:36||2015-10-23 15:39:54||250|
The amount a variable varies represents the amount of uncertainty we have in a particular phenomena or measurement. We use numbers like variance, standard deviation, and interquartile range to represent the ‘spread’ or the dispertion of values for a particular variable. In contrast to the ‘spread’, a variable’s ‘middle’ is represented using numbers like the mean, median, and mode. These measure the central tendency (i.e. the
middle) of a variable’s distribution. We’ll explore these topics further below in visualizations.
Skewness & Kurtosis
We want to use specific terms to describe what a variable distribution looks like because this will give us some precision in what we’re seeing. The
skimr::skim() output above shows us four numeric variables (
like_count), and the
hist variable tells us these variables are skewed.
Skewness refers to the distribution of a variable that is not symmetrical. It’s hard to see in the
hist column above because it’s small, so we’ll use the
inspectdf::inspect_num() to look at the skewness and kurtosis of these
# define labs first so we are thinking about what to expect on the graph! hist_labs <- ggplot2::labs( title = "Comments, Dislikes, Likes, & Views of Daily Show videos", x = "Counts", y = "Probability distribution", subtitle = "Numeric variables in Daily Show YouTube meta data", caption = "Includes both hosts (Jon Stewart & Trevor Noah)" ) # Now we plot! inspectdf::inspect_num(DailyShowYouTube) %>% inspectdf::show_plot(x = ., plot_layout = c(2, 2)) + # this gives # us a 2x2 hist_labs
The graphs above display variables that are ‘positively skewed’, which means the bulk of the data are piled up near the lower values. The compliment to skewness is kurtosis, which is used to measure how the data is distributed in the tail of a distribution. Both skewness and kurtosis can be calculated using the
psych::describe() function. I limit the output below to the
knitr::kable( DailyShowYouTube %>% dplyr::select(like_count, view_count) %>% psych::describe(x = .) %>% dplyr::select(mean, sd, min, median, max, skew, kurtosis) )
These numbers tell us the skewness and kurtosis are both positive, but that doesn’t mean much until we discuss normality.
Normality is another tool we can use to help describe a variable’s distribution. Normal in this case refers to how bell-shaped the distribution looks. Normally distributed variables have a skewness and kurtosis of
0. As we can imagine, variables with high values for skewness and kurtosis won’t be normal because those values are telling us the distribution isn’t symmetrical.
Why are we doing this?
The more we understand about each variable’s distribution, the more equipped we’ll be in predicting how the values in one variable correspond to the values in another variable.
Let’s narrow our dataset down to just two variables: YouTube video views and likes. Examine the graphs below.
# define labs first! hist_labs_v2 <- ggplot2::labs( title = "Likes and Views of Daily Show YouTube videos", x = "Counts", y = "Probability distribution", subtitle = "Numeric variables in Daily Show YouTube meta data", caption = "Includes both hosts (Jon Stewart & Trevor Noah)" ) # Now we plot! DailyShowYouTube %>% dplyr::select(like_count, view_count) %>% inspectdf::inspect_num(df1 = .) %>% inspectdf::show_plot(x = .) + # this gives hist_labs_v2 # us a 2x2
We know both
like_count have positive skewness and kurtosis, but to what degree are these two variables related?
The relationship between views and likes
How do we measure relationships between variables?
When two variables are related, that means changes in one variable result with similar changes in the second variable. In math terms, this is referred to as covariance. If variance is how much a single variable varies from the mean, then covariance is how much two variables vary together. I really like Wickham and Grolemund’s description of covariation in R for Data Science,
Covariation is the tendency for the values of two or more variables to vary together in a related way.Wickham and Grolemund
If you think about it, the variation of a single variable is not very helpful by itself. But if we can describe or account for ‘spread’ in one variable with the ‘spread’ from another, we can make estimations about one number when we have information on the other.
So if we want to know if the videos that have a high number of views are the same videos that also have a high number of likes? We couldn’t tell from the histograms, but we can combine the two variables on a single plot to answer this question.
Recall that most of the values for views and likes are low, but this does not necessarily mean they are similar. For example, the mean value for
view_count is much higher than the mean value for
1,454,472.85). If we compare the maximum values (i.e. the extreme values at the tails), we can see the maximum value for
19,159,013 vs. a maximum value of
The scatter plot below will show us if there is a pattern between Daily Show YouTube video views and likes.
# labels point_labs_v1 <- ggplot2::labs( title = "Likes vs. Views of Daily Show YouTube videos", x = "YouTube Video Views", y = "YouTube Video Likes", subtitle = "Daily Show videos on Comedy Central's YouTube Channel", caption = "Includes both hosts (Jon Stewart & Trevor Noah)") # scatter plot ggp_scatter_v1 <- DailyShowYouTube %>% ggplot2::ggplot(aes(x = view_count, y = like_count)) + ggplot2::geom_point(alpha = 3/5) + point_labs_v1 ggp_scatter_v1
This pattern looks like we might be able to predict some of the values of
like_count with values from
view_count, but how well? We need to explore these data more with additional visualizations. For example, let’s add labels to the two videos with more than 150,000 likes to identify the extreme values on the graph.
First we will filter these data out, then we plot them using another layer from
# first we create some data sets for the top views and likes TopDailyShowLikes <- DailyShowYouTube %>% dplyr::filter(like_count >= 150000) # then we plot the point with some text ggp_scatter_v2 <- ggp_scatter_v1 + # add labels to states geom_text_repel( data = TopDailyShowLikes, aes(label = title), # size of text size = 2.1, # opacity of the text alpha = 7 / 10, # in case there is overplotting arrow = arrow( length = unit(0.05, "inches"), type = "open", ends = "last" ), show.legend = FALSE ) ggp_scatter_v2
It looks like “The Daily Show – Admiral General Aladeen” and Confused Islamophobes Target American Sikhs: The Daily Show videos are outliers in terms of views and likes. Should we keep these videos in the data set? Well, we can be pretty sure these data haven’t been entered incorrectly. So if the values are legitimate, is there something about the videos that makes them different from the population (the other videos) we’re investigating?
Recall that we’re still including videos of the Daily Show with two different hosts (
Trevor). What happens when we facet the scatter plot about the two
# labels point_labs_v3 <- ggplot2::labs( title = "Likes vs. views of Daily Show YouTube videos by host", x = "YouTube Video Views", y = "YouTube Video Likes", subtitle = "Daily Show videos on Comedy Central's YouTube Channel", caption = "Includes both hosts (Jon Stewart & Trevor Noah)") # scatter with facet ggp_scatter_v3 <- DailyShowYouTube %>% ggplot2::ggplot(aes(x = view_count, y = like_count, group = host)) + ggplot2::geom_point(aes(color = host), alpha = 3/5, show.legend = FALSE) + # add labels to states geom_text_repel( data = TopDailyShowLikes, aes(label = title), # size of text size = 2.1, # opacity of the text alpha = 7 / 10, # in case there is overplotting arrow = arrow( length = unit(0.05, "inches"), type = "open", ends = "last")) + facet_wrap(. ~ host) + point_labs_v3 ggp_scatter_v3
Now we can see that each host has their own extreme value. We can also see that the relationship between views and likes appear similar (as one goes up, so does the other) for both hosts.
I think we can assume these two videos are different in a meaningful way from the rest of the videos in the playlists (and we can explore the details of why this might be later). But for now, we’ll remove the videos and look at the relationship between views and likes without them.
# labels point_labs_v4 <- ggplot2::labs( title = "Likes vs. views of Daily Show YouTube videos by host", x = "YouTube Video Views", y = "YouTube Video Likes", subtitle = "Daily Show videos on Comedy Central's YouTube Channel", caption = "Removed outlier videos (> 150,000 likes)") # outliers outliers <- c("yEPSJF7BYOo", "RskvZgc_s9g") # create correlation data DailyShowCorr <- DailyShowYouTube %>% dplyr::filter(id %nin% outliers) # scatter ggp_scatter_v4 <- DailyShowCorr %>% ggplot2::ggplot(aes(x = view_count, y = like_count, group = host)) + ggplot2::geom_point(aes(color = host), alpha = 3/5, show.legend = FALSE) + facet_wrap(. ~ host) + point_labs_v4 ggp_scatter_v4
This shows a pattern that indicates a positive relationship between viewing a video and liking a video. Can we show this numerically? How about if we calculate the covariance for each host?
We will calculate the level of covariation between the two variables using
mosaic::cov(). but after we’ve split the data frame by
host into two datasets.
# split these data by host daily_split_list <- DailyShowCorr %>% dplyr::group_split(host, keep = TRUE) daily_split_list[] -> JonDailyShowCorr daily_split_list[] -> TrevorDailyShowCorr # calculate covariance for Trevor mosaic::cov(x = TrevorDailyShowCorr$view_count, y = TrevorDailyShowCorr$like_count)
#>  17812574567
# calculate covariance for Jon mosaic::cov(x = JonDailyShowCorr$view_count, y = JonDailyShowCorr$like_count)
#>  14484129962
These large positive numbers tell us that as
view_count deviates from it’s mean,
like_count deviates from it’s mean in the same direction. However, using the covariance can get us into trouble because it assumes both variables are on the same scale. In practice, we typically want to know how two variables on different scales relate to one another–that’s where calculating a correlation coefficient comes in handy.
We will cover two types of correlation coefficients (there are more), but both of these values lie between
− 1 and
+ 1. The first measure we’ll cover, the Pearson correlation coefficient, is the most common, but also has some assumptions that need to be met in order for the results to be reliable.
If the assumptions of normality are violated (as they are in our data) the Spearman’s correlation coefficient can be used because it doesn’t have the same assumptions for both variables.
corrr package from
tidymodels lets us explore correlations using
tidyverse principles. We’ll start with the
JonDailyShowCorr data, which is all the videos in the playlist (minus the “The Daily Show – Admiral General Aladeen” video).
JonDailyShowCorr %>% glimpse(78)
#> Observations: 110 #> Variables: 9 #> $ published_at <dttm> 2014-12-05 17:01:33, 2014-09-23 22:12:10, 2013-04-24… #> $ dislike_count <dbl> 1065, 722, 2469, 654, 917, 547, 2846, 1449, 2295, 102… #> $ url <chr> "https://i.ytimg.com/vi/AHO1a1kvZGo/default.jpg", "ht… #> $ id <chr> "AHO1a1kvZGo", "lPgZfhnCAdI", "9pOiOhxujsE", "aqzzWr3… #> $ comment_count <dbl> 2486, 3616, 17629, 2689, 2704, 1887, 3422, 6229, 4522… #> $ title <chr> "The Daily Show - Spot the Africa", "The Daily Show -… #> $ view_count <dbl> 5624319, 5223787, 5134669, 4920942, 4347723, 4160610,… #> $ like_count <dbl> 38442, 39755, 47906, 25499, 30986, 25640, 49200, 4794… #> $ host <chr> "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon…
corrr::correlate(), we select only those two variables we want to calculate a correlation coefficient for and stored the output in
JonPearsonCorDf <- JonDailyShowCorr %>% dplyr::select(c(view_count, like_count)) %>% corrr::correlate(x = .)
#> #> Correlation method: 'pearson' #> Missing treated using: 'pairwise.complete.obs'
The message we get from the output tells us the following:
Correlation method: 'pearson'= The reason we spent all that time above looking at the variable distributions is because the Pearson correlation coefficient assumes normality.
Missing treated using: 'pairwise.complete.obs'= This is telling us how the missing values are going to dealt with. If we recall the
skimr::skim()output, we have no missing data, so this doesn’t apply.
Pearson’s correlation assumptions
The most commonly used correlation coefficient is the ‘Pearson product-moment correlation coefficient’ or ‘Pearson correlation coefficient’ or just r. One of the assumptions of this test is that the sampling distribution is normally distributed (or that our sample is normally distributed). Neither of these are true. Fortunately, we have another option for calculating a correlation.
One option is to log transform both variables using
dplyr::mutate(). Log transforming will make positively skewed variables more normally distributed, as we demonstrate in descriptive statistics below.
DailyShowYouTubeCorr <- DailyShowYouTube %>% dplyr::mutate(log_view_count = log(view_count), log_like_count = log(like_count))
knitr::kable( DailyShowYouTubeCorr %>% dplyr::select(log_view_count, view_count, log_like_count, like_count) %>% psych::describe(x = .) %>% dplyr::select(mean, sd, min, median, max, skew, kurtosis))
Although the numbers for
kurtosis became negative, they are closer to 0 (which represents a normally distributed variable). We can see the results of this transformation when we create a scatter plot of the transformed variables.
# labels point_labs_v5 <- ggplot2::labs( title = "Likes vs. Views of Daily Show YouTube videos", x = "Log Transformed YouTube Video Views", y = "Log Transformed YouTube Video Likes", subtitle = "Daily Show videos on Comedy Central's YouTube Channel", caption = "Log transformed variables") # scatter plot ggp_scatter_v5 <- DailyShowYouTubeCorr %>% ggplot2::ggplot(aes(x = log_view_count, y = log_like_count)) + ggplot2::geom_point(alpha = 3/5) + point_labs_v5 ggp_scatter_v5
We can see this is a much clearer pattern (the line appears more straight), and the outliers have disappeared.
Why shouldn’t you transform your variables?
First off, taking the logarithm of a variable changes the question you’re asking (i.e. “is there a relationship between
log transformed variable x and
log transformed y?) , and you’ll have to be very careful about interpreting the results. Fortunately there are other options.
The Spearman’s rank correlation coefficient or Spearman’s Rho is essentially the same calculation as the Pearson correlation coefficient, only it ranks all of the values first. Ranking removes some of the extreme distances between values, essentially removing any underlying concerns of variable distributions.
We’ll apply the
corrr::correlate() function to
like_count in the
data.frames and specify the
"method = spearman".
JonSpearmanCorDf <- JonDailyShowCorr %>% dplyr::select(c(view_count, like_count)) %>% corrr::correlate(x = ., method = "spearman")
#> #> Correlation method: 'spearman' #> Missing treated using: 'pairwise.complete.obs'
TrevorSpearmanCorDf <- TrevorDailyShowCorr %>% dplyr::select(c(view_count, like_count)) %>% corrr::correlate(x = ., method = "spearman")
#> #> Correlation method: 'spearman' #> Missing treated using: 'pairwise.complete.obs'
We can see there is a very high correlation between viewing and liking videos for both hosts. It looks like the correlation between views and likes in Jon Stewart’s videos is slightly higher (
0.9788202) than the correlation coefficient in Trevor Noah’s videos (
Correlations for > 2 variables
We just looked at bivariate correlations between views and likes, but what if we wanted to check the correlations between all four numeric variables (
We can do that by passing all the numeric variables to
corrr::correlate(), only this time we will add
quiet = TRUE to leave the message out. We’ll also add the
corrr::rearrange() and sort the table contents according to the highest correlations with (
method = "MDS" and
absolute = FALSE).
JonDailyShowCorrDfAll <- JonDailyShowCorr %>% dplyr::select_if(is.numeric) %>% corrr::correlate(x = ., method = "spearman", quiet = TRUE) %>% corrr::rearrange(x = ., method = "MDS", absolute = FALSE)
#> Registered S3 method overwritten by 'seriation': #> method from #> reorder.hclust gclus
TrevorDailyShowCorrDfAll <- TrevorDailyShowCorr %>% dplyr::select_if(is.numeric) %>% corrr::correlate(x = ., method = "spearman", quiet = TRUE) %>% corrr::rearrange(x = ., method = "MDS", absolute = FALSE)
The highest correlation for Jon Stewart’s YouTube videos was between the views and likes (
0.9788202), while the highest correlation for Trevor Noah’s videos was between dislikes and comments (
corrr package has a few graphing options for visualizing correlations. These are additional to the graphs we created with
rplot plots a correlation
ggplot2 (and we can add labels to it). This graph has x and y axes, and plots the intersection for the variables from the
cor_df (a correlation
# labels rplot_labs_v1 <- ggplot2::labs( title = "Correlations between Jon Stewart's likes, dislikes, comments, and views", subtitle = "YouTube videos of The Daily Show playlist", caption = "Data accessed from YouTube API") JonDailyShowCorrDfAll %>% corrr::rplot(rdf = ., shape = 19, colors = c("yellow", "purple")) + rplot_labs_v1
As we can see, the hue of the color and the size of the point indicate the level of the correlation. We can also remove the redundant correlations from the table with
knitr::kable( JonDailyShowCorrDfAll %>% corrr::shave(x = .) )
The second graph options from
corrr is the
corrr::network_plot() which takes a
cor_df and returns a graph where “variables that are more highly correlated appear closer together and are joined by stronger paths.”
# labels network_labs_v1 <- ggplot2::labs( title = "Correlations between Trevor Noah's likes, dislikes, comments, and views", subtitle = "YouTube videos of The Daily Show playlist", caption = "Data accessed from YouTube API") network_plot_v1 <- TrevorDailyShowCorr %>% dplyr::select_if(is.numeric) %>% corrr::correlate(x = ., method = "spearman", quiet = TRUE) %>% corrr::network_plot(rdf = ., colors = c("firebrick", "white", "dodgerblue"), min_cor = .5) + network_labs_v1 network_plot_v1
This shows us that the strongest correlations are between
like_counts, and the
corrr::fashion() function gives us more fancy print options, too.
knitr::kable( JonDailyShowCorrDfAll %>% corrr::fashion(x = ., decimals = 3, na_print = 0) )