 # How to explore correlations in R

This post will cover how to measure the relationship between two numeric variables with the `corrr` package. We will look at how to assess a variable’s distribution using skewness, kurtosis, and normality. Then we’ll examine the relationship between two variables by looking at the covariance and the correlation coefficient.

### Load the packages

The packages we’ll be using in this tutorial are the following:

``````library(egg) # for ggarrange
library(inspectdf) # check entire data.frame for variable types, etc.
library(visdat) # visualize missing data
library(skimr) # eda for entire data.frame
library(corrr) # correlations
library(skimr) # summaries
library(hrbrthemes) # themes for graphs
library(tidyverse) # all tidyverse packages
library(tidymodels) # meta package for modeling
library(psych) # for skewness and kurtosis
library(socviz) # for %nin%``````

### Set a graph theme

Graph theme’s give us a little customization for the graphs we’ll be producing. If you want to learn more about `ggplot2`, check out our tutorial here.

``````# set the theme to theme_light
ggplot2::theme_set(theme_ipsum_tw(base_family = "Titillium Web",
base_size = 9,
plot_title_family = "Poppins",
axis_title_size = 13))``````

### Import the data

We covered how to access data using the `tuber` in a previous tutorial. For this how-to, we’ll be two YouTube playlists:

If you’d like to see the script for how we downloaded and imported these data, they’re in a Gist here.

``````# fs::dir_tree("data", regex = "DailyShow")
``````#>  Observations: 251
#>  Variables: 9
#>  \$ published_at  <dttm> 2015-10-05 23:57:09, 2014-12-05 17:01:33, 2014-09-23…
#>  \$ dislike_count <dbl> 11184, 1065, 722, 2469, 654, 917, 547, 2846, 1449, 22…
#>  \$ url           <chr> "https://i.ytimg.com/vi/yEPSJF7BYOo/default.jpg", "ht…
#>  \$ id            <chr> "yEPSJF7BYOo", "AHO1a1kvZGo", "lPgZfhnCAdI", "9pOiOhx…
#>  \$ comment_count <dbl> 8130, 2486, 3616, 17629, 2689, 2704, 1887, 3422, 6229…
#>  \$ title         <chr> "The Daily Show - Admiral General Aladeen", "The Dail…
#>  \$ view_count    <dbl> 19159013, 5624319, 5223787, 5134669, 4920942, 4347723…
#>  \$ like_count    <dbl> 161732, 38442, 39755, 47906, 25499, 30986, 25640, 492…
#>  \$ host          <chr> "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon…``````

### Visualizations

The `DailyShowYouTube` contains 9 variables. Some of these are meta data for the videos in the playlist (`id`, `url`, and `published_at`), others contain information on the video related to viewership (`dislike_count`, `comment_count`, `view_count`, and `like_count`). For more information on these variables, check out the YouTube API documentation.

In this section, we’re going to use visualizations to help us understand how much two numeric variables are related, or how much they are correlated. We’ll start by visualizing variables by themselves, then move into bivariate (two-variable) graphs. The `skimr` and `inspectdf` packages allow us to take a quick look at an entire data frame or sets of variables.

Below is a `skimr::skim()` of the `DailyShowYouTube` `data.frame`.

``skimr::skim(DailyShowYouTube)``
Number of rows251
Number of columns9
_______________________
Column type frequency:
character4
numeric4
POSIXct1
________________________
Group variablesNone

Data summary

Variable type: character

skim_variablen_missingcomplete_rateminmaxemptyn_uniquewhitespace
url01464602510
id01111102510
title01229602510
host0136020

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
dislike_count01805.331705.74368.5290831.519236▇▁▁▁▁
comment_count011356.732206.232159.57061473.017919▇▁▁▁▁
view_count011454472.851830174.8035414272979.510115402011034.519159013▇▁▁▁▁
like_count0113113.5818170.411822131.5839216586.0161732▇▁▁▁▁

Variable type: POSIXct

skim_variablen_missingcomplete_rateminmaxmediann_unique
published_at012013-04-24 10:02:532016-11-09 02:32:362015-10-23 15:39:54250

### Variation

The amount a variable varies represents the amount of uncertainty we have in a particular phenomena or measurement. We use numbers like variance, standard deviation, and interquartile range to represent the ‘spread’ or the dispertion of values for a particular variable. In contrast to the ‘spread’, a variable’s ‘middle’ is represented using numbers like the mean, median, and mode. These measure the central tendency (i.e. the
middle) of a variable’s distribution. We’ll explore these topics further below in visualizations.

### Skewness & Kurtosis

We want to use specific terms to describe what a variable distribution looks like because this will give us some precision in what we’re seeing. The `skimr::skim()` output above shows us four numeric variables (`dislike_count`, `comment_count`, `view_count`, and `like_count`), and the `hist` variable tells us these variables are skewed.

Skewness refers to the distribution of a variable that is not symmetrical. It’s hard to see in the `hist` column above because it’s small, so we’ll use the `inspectdf::inspect_num()` to look at the skewness and kurtosis of these
numeric `_count` variables.

``````# define labs first so we are thinking about what to expect on the graph!
hist_labs <- ggplot2::labs(
title = "Comments, Dislikes, Likes, & Views of Daily Show videos",
x = "Counts",
y = "Probability distribution",
subtitle = "Numeric variables in Daily Show YouTube meta data",
caption = "Includes both hosts (Jon Stewart & Trevor Noah)"
)
# Now we plot!
inspectdf::show_plot(x = .,
plot_layout = c(2, 2)) + # this gives
# us a 2x2
hist_labs``````

The graphs above display variables that are ‘positively skewed’, which means the bulk of the data are piled up near the lower values. The compliment to skewness is kurtosis, which is used to measure how the data is distributed in the tail of a distribution. Both skewness and kurtosis can be calculated using the `psych::describe()` function. I limit the output below to the `mean`, `sd`, `min`, `median`,`max`, `skew`, and `kurtosis`.

``````knitr::kable(
dplyr::select(like_count,
view_count) %>%
psych::describe(x = .) %>%
dplyr::select(mean, sd, min, median,
max, skew, kurtosis)
)``````
meansdminmedianmaxskewkurtosis
like_count13113.5818170.4118283921617324.43292129.75917
view_count1454472.851830174.80354141011540191590134.35360534.17134

These numbers tell us the skewness and kurtosis are both positive, but that doesn’t mean much until we discuss normality.

### Normality

Normality is another tool we can use to help describe a variable’s distribution. Normal in this case refers to how bell-shaped the distribution looks. Normally distributed variables have a skewness and kurtosis of `0`. As we can imagine, variables with high values for skewness and kurtosis won’t be normal because those values are telling us the distribution isn’t symmetrical.

Why are we doing this?

The more we understand about each variable’s distribution, the more equipped we’ll be in predicting how the values in one variable correspond to the values in another variable.

Let’s narrow our dataset down to just two variables: YouTube video views and likes. Examine the graphs below.

``````# define labs first!
hist_labs_v2 <- ggplot2::labs(
title = "Likes and Views of Daily Show YouTube videos",
x = "Counts",
y = "Probability distribution",
subtitle = "Numeric variables in Daily Show YouTube meta data",
caption = "Includes both hosts (Jon Stewart & Trevor Noah)"
)
# Now we plot!
dplyr::select(like_count,
view_count) %>%
inspectdf::inspect_num(df1 = .) %>%
inspectdf::show_plot(x = .) + # this gives
hist_labs_v2                  # us a 2x2``````

We know both `view_count` and `like_count` have positive skewness and kurtosis, but to what degree are these two variables related?

### The relationship between views and likes

How do we measure relationships between variables?

When two variables are related, that means changes in one variable result with similar changes in the second variable. In math terms, this is referred to as covariance. If variance is how much a single variable varies from the mean, then covariance is how much two variables vary together. I really like Wickham and Grolemund’s description of covariation in R for Data Science,

Covariation is the tendency for the values of two or more variables to vary together in a related way.

Wickham and Grolemund

If you think about it, the variation of a single variable is not very helpful by itself. But if we can describe or account for ‘spread’ in one variable with the ‘spread’ from another, we can make estimations about one number when we have information on the other.

So if we want to know if the videos that have a high number of views are the same videos that also have a high number of likes? We couldn’t tell from the histograms, but we can combine the two variables on a single plot to answer this question.

### Scatter plots

Recall that most of the values for views and likes are low, but this does not necessarily mean they are similar. For example, the mean value for `view_count` is much higher than the mean value for `like_count` (`13,113.58` vs. `1,454,472.85`). If we compare the maximum values (i.e. the extreme values at the tails), we can see the maximum value for `view_count` is `19,159,013` vs. a maximum value of `161,732` for `like_count`.

The scatter plot below will show us if there is a pattern between Daily Show YouTube video views and likes.

``````# labels
point_labs_v1 <- ggplot2::labs(
title = "Likes vs. Views of Daily Show YouTube videos",
x = "YouTube Video Views",
y = "YouTube Video Likes",
subtitle = "Daily Show videos on Comedy Central's YouTube Channel",
caption = "Includes both hosts (Jon Stewart & Trevor Noah)")
# scatter plot
ggp_scatter_v1 <- DailyShowYouTube %>%
ggplot2::ggplot(aes(x = view_count,
y = like_count)) +
ggplot2::geom_point(alpha = 3/5) +
point_labs_v1
ggp_scatter_v1``````

This pattern looks like we might be able to predict some of the values of `like_count` with values from `view_count`, but how well? We need to explore these data more with additional visualizations. For example, let’s add labels to the two videos with more than 150,000 likes to identify the extreme values on the graph.

### Outliers

First we will filter these data out, then we plot them using another layer from `ggplot2`.

``````# first we create some data sets for the top views and likes
TopDailyShowLikes <- DailyShowYouTube %>%
dplyr::filter(like_count >= 150000)

# then we plot the point with some text
ggp_scatter_v2 <- ggp_scatter_v1 +
# add labels to states
geom_text_repel(
data = TopDailyShowLikes,
aes(label = title),
# size of text
size = 2.1,
# opacity of the text
alpha = 7 / 10,
# in case there is overplotting
arrow = arrow(
length = unit(0.05, "inches"),
type = "open",
ends = "last"
),
show.legend = FALSE
)
ggp_scatter_v2``````

It looks like “The Daily Show – Admiral General Aladeen” and Confused Islamophobes Target American Sikhs: The Daily Show videos are outliers in terms of views and likes. Should we keep these videos in the data set? Well, we can be pretty sure these data haven’t been entered incorrectly. So if the values are legitimate, is there something about the videos that makes them different from the population (the other videos) we’re investigating?

Recall that we’re still including videos of the Daily Show with two different hosts (`Jon` and `Trevor`). What happens when we facet the scatter plot about the two `host`s?

``````# labels
point_labs_v3 <- ggplot2::labs(
title = "Likes vs. views of Daily Show YouTube videos by host",
x = "YouTube Video Views",
y = "YouTube Video Likes",
subtitle = "Daily Show videos on Comedy Central's YouTube Channel",
caption = "Includes both hosts (Jon Stewart & Trevor Noah)")
# scatter with facet
ggp_scatter_v3 <- DailyShowYouTube %>%
ggplot2::ggplot(aes(x = view_count,
y = like_count,
group = host)) +
ggplot2::geom_point(aes(color = host),
alpha = 3/5,
show.legend = FALSE) +
# add labels to states
geom_text_repel(
data = TopDailyShowLikes,
aes(label = title),
# size of text
size = 2.1,
# opacity of the text
alpha = 7 / 10,
# in case there is overplotting
arrow = arrow(
length = unit(0.05, "inches"),
type = "open",
ends = "last")) +
facet_wrap(. ~ host) +
point_labs_v3
ggp_scatter_v3``````

Now we can see that each host has their own extreme value. We can also see that the relationship between views and likes appear similar (as one goes up, so does the other) for both hosts.

I think we can assume these two videos are different in a meaningful way from the rest of the videos in the playlists (and we can explore the details of why this might be later). But for now, we’ll remove the videos and look at the relationship between views and likes without them.

``````# labels
point_labs_v4 <- ggplot2::labs(
title = "Likes vs. views of Daily Show YouTube videos by host",
x = "YouTube Video Views",
y = "YouTube Video Likes",
subtitle = "Daily Show videos on Comedy Central's YouTube Channel",
caption = "Removed outlier videos (> 150,000 likes)")
# outliers
outliers <- c("yEPSJF7BYOo", "RskvZgc_s9g")
# create correlation data
DailyShowCorr <- DailyShowYouTube %>%
dplyr::filter(id %nin% outliers)
# scatter
ggp_scatter_v4 <- DailyShowCorr %>%
ggplot2::ggplot(aes(x = view_count,
y = like_count,
group = host)) +
ggplot2::geom_point(aes(color = host),
alpha = 3/5,
show.legend = FALSE) +
facet_wrap(. ~ host) +
point_labs_v4
ggp_scatter_v4``````

This shows a pattern that indicates a positive relationship between viewing a video and liking a video. Can we show this numerically? How about if we calculate the covariance for each host?

### Covariance

We will calculate the level of covariation between the two variables using `mosaic::cov()`. but after we’ve split the data frame by `host` into two datasets.

``````# split these data by host
daily_split_list <- DailyShowCorr %>%
dplyr::group_split(host, keep = TRUE)
daily_split_list[] -> JonDailyShowCorr
daily_split_list[] -> TrevorDailyShowCorr
# calculate covariance for Trevor
mosaic::cov(x = TrevorDailyShowCorr\$view_count,
y = TrevorDailyShowCorr\$like_count)``````
``#>   17812574567``
``````# calculate covariance for Jon
mosaic::cov(x = JonDailyShowCorr\$view_count,
y = JonDailyShowCorr\$like_count)``````
``#>   14484129962``

These large positive numbers tell us that as `view_count` deviates from it’s mean, `like_count` deviates from it’s mean in the same direction. However, using the covariance can get us into trouble because it assumes both variables are on the same scale. In practice, we typically want to know how two variables on different scales relate to one another–that’s where calculating a correlation coefficient comes in handy.

## Correlation coefficients

We will cover two types of correlation coefficients (there are more), but both of these values lie between `− 1` and `+ 1`. The first measure we’ll cover, the Pearson correlation coefficient, is the most common, but also has some assumptions that need to be met in order for the results to be reliable.

If the assumptions of normality are violated (as they are in our data) the Spearman’s correlation coefficient can be used because it doesn’t have the same assumptions for both variables.

### Using the `corrr` package

The `corrr` package from `tidymodels` lets us explore correlations using `tidyverse` principles. We’ll start with the `JonDailyShowCorr` data, which is all the videos in the playlist (minus the “The Daily Show – Admiral General Aladeen” video).

``JonDailyShowCorr %>% glimpse(78)``
``````#>  Observations: 110
#>  Variables: 9
#>  \$ published_at  <dttm> 2014-12-05 17:01:33, 2014-09-23 22:12:10, 2013-04-24…
#>  \$ dislike_count <dbl> 1065, 722, 2469, 654, 917, 547, 2846, 1449, 2295, 102…
#>  \$ url           <chr> "https://i.ytimg.com/vi/AHO1a1kvZGo/default.jpg", "ht…
#>  \$ id            <chr> "AHO1a1kvZGo", "lPgZfhnCAdI", "9pOiOhxujsE", "aqzzWr3…
#>  \$ comment_count <dbl> 2486, 3616, 17629, 2689, 2704, 1887, 3422, 6229, 4522…
#>  \$ title         <chr> "The Daily Show - Spot the Africa", "The Daily Show -…
#>  \$ view_count    <dbl> 5624319, 5223787, 5134669, 4920942, 4347723, 4160610,…
#>  \$ like_count    <dbl> 38442, 39755, 47906, 25499, 30986, 25640, 49200, 4794…
#>  \$ host          <chr> "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon", "Jon…``````

To use `corrr::correlate()`, we select only those two variables we want to calculate a correlation coefficient for and stored the output in `JonPearsonCorDf`

``````JonPearsonCorDf <- JonDailyShowCorr %>%
dplyr::select(c(view_count,
like_count)) %>%
corrr::correlate(x = .) ``````
``````#>
#>  Correlation method: 'pearson'
#>  Missing treated using: 'pairwise.complete.obs'``````

The message we get from the output tells us the following:

• `Correlation method: 'pearson'` = The reason we spent all that time above looking at the variable distributions is because the Pearson correlation coefficient assumes normality.
• `Missing treated using: 'pairwise.complete.obs'` = This is telling us how the missing values are going to dealt with. If we recall the `skimr::skim()` output, we have no missing data, so this doesn’t apply.

### Pearson’s correlation assumptions

The most commonly used correlation coefficient is the ‘Pearson product-moment correlation coefficient’ or ‘Pearson correlation coefficient’ or just r. One of the assumptions of this test is that the sampling distribution is normally distributed (or that our sample is normally distributed). Neither of these are true. Fortunately, we have another option for calculating a correlation.

One option is to log transform both variables using `dplyr::mutate()`. Log transforming will make positively skewed variables more normally distributed, as we demonstrate in descriptive statistics below.

``````DailyShowYouTubeCorr <- DailyShowYouTube %>%
dplyr::mutate(log_view_count = log(view_count),
log_like_count = log(like_count))``````
``````knitr::kable(
dplyr::select(log_view_count,
view_count,
log_like_count,
like_count) %>%
psych::describe(x = .) %>%
dplyr::select(mean, sd, min, median,
max, skew, kurtosis))``````
meansdminmedianmaxskewkurtosis
log_view_count13.4593711.38106110.47486313.82698516.76828-0.4486365-0.7596921
view_count1454472.8525901830174.80067135414.0000001011540.00000019159013.000004.353604834.1713438
log_like_count8.6730241.4649405.2040079.03503411.99370-0.5084212-0.5494696
like_count13113.57768918170.413169182.0000008392.000000161732.000004.432921029.7591664

Although the numbers for `skew` and `kurtosis` became negative, they are closer to 0 (which represents a normally distributed variable). We can see the results of this transformation when we create a scatter plot of the transformed variables.

``````# labels
point_labs_v5 <- ggplot2::labs(
title = "Likes vs. Views of Daily Show YouTube videos",
x = "Log Transformed YouTube Video Views",
y = "Log Transformed YouTube Video Likes",
subtitle = "Daily Show videos on Comedy Central's YouTube Channel",
caption = "Log transformed variables")
# scatter plot
ggp_scatter_v5 <- DailyShowYouTubeCorr %>%
ggplot2::ggplot(aes(x = log_view_count,
y = log_like_count)) +
ggplot2::geom_point(alpha = 3/5) +
point_labs_v5
ggp_scatter_v5``````

We can see this is a much clearer pattern (the line appears more straight), and the outliers have disappeared.

Why shouldn’t you transform your variables?

First off, taking the logarithm of a variable changes the question you’re asking (i.e. “is there a relationship between `log transformed variable x` and `log transformed y`?) , and you’ll have to be very careful about interpreting the results. Fortunately there are other options.

### Spearman’s Rho

The Spearman’s rank correlation coefficient or Spearman’s Rho is essentially the same calculation as the Pearson correlation coefficient, only it ranks all of the values first. Ranking removes some of the extreme distances between values, essentially removing any underlying concerns of variable distributions.

We’ll apply the `corrr::correlate()` function to `view_count` and `like_count` in the `JonDailyShowCorr` and `TrevorDailyShowCorr` `data.frame`s and specify the `"method = spearman"`.

``````JonSpearmanCorDf <- JonDailyShowCorr %>%
dplyr::select(c(view_count,
like_count)) %>%
corrr::correlate(x = .,
method = "spearman") ``````
``````#>
#>  Correlation method: 'spearman'
#>  Missing treated using: 'pairwise.complete.obs'``````
rownameview_countlike_count
view_countNA0.9788202
like_count0.9788202NA
``````TrevorSpearmanCorDf <- TrevorDailyShowCorr %>%
dplyr::select(c(view_count,
like_count)) %>%
corrr::correlate(x = .,
method = "spearman") ``````
``````#>
#>  Correlation method: 'spearman'
#>  Missing treated using: 'pairwise.complete.obs'``````
rownameview_countlike_count
view_countNA0.9554851
like_count0.9554851NA

We can see there is a very high correlation between viewing and liking videos for both hosts. It looks like the correlation between views and likes in Jon Stewart’s videos is slightly higher (`0.9788202`) than the correlation coefficient in Trevor Noah’s videos (`0.9554851`).

### Correlations for > 2 variables

We just looked at bivariate correlations between views and likes, but what if we wanted to check the correlations between all four numeric variables (`dislike_count`, `comment_count`, `view_count`, and`like_count`)?

We can do that by passing all the numeric variables to `corrr::correlate()`, only this time we will add `quiet = TRUE` to leave the message out. We’ll also add the `corrr::rearrange()` and sort the table contents according to the highest correlations with (`method = "MDS"` and `absolute = FALSE`).

``````JonDailyShowCorrDfAll <- JonDailyShowCorr %>%
dplyr::select_if(is.numeric) %>%
corrr::correlate(x = .,
method = "spearman",
quiet = TRUE) %>%
corrr::rearrange(x = .,
method = "MDS",
absolute = FALSE)``````
``````#>  Registered S3 method overwritten by 'seriation':
#>    method         from
#>    reorder.hclust gclus``````
rownameview_countlike_countcomment_countdislike_count
view_countNA0.97882020.86653940.8684701
like_count0.9788202NA0.89118610.8809044
comment_count0.86653940.8911861NA0.9477941
dislike_count0.86847010.88090440.9477941NA
``````TrevorDailyShowCorrDfAll <- TrevorDailyShowCorr %>%
dplyr::select_if(is.numeric) %>%
corrr::correlate(x = .,
method = "spearman",
quiet = TRUE) %>%
corrr::rearrange(x = .,
method = "MDS",
absolute = FALSE)``````
rownamedislike_countcomment_countlike_countview_count
dislike_countNA0.95742040.89817600.8657234
comment_count0.9574204NA0.93308860.8767787
like_count0.89817600.9330886NA0.9554851
view_count0.86572340.87677870.9554851NA

The highest correlation for Jon Stewart’s YouTube videos was between the views and likes (`0.9788202`), while the highest correlation for Trevor Noah’s videos was between dislikes and comments (`0.9574204`).

### Graphing correlations

The `corrr` package has a few graphing options for visualizing correlations. These are additional to the graphs we created with`ggplot2` above.

#### The rplot

The `rplot` plots a correlation `data.frame` using `ggplot2` (and we can add labels to it). This graph has x and y axes, and plots the intersection for the variables from the `cor_df` (a correlation `data.frame`).

``````# labels
rplot_labs_v1 <- ggplot2::labs(
title = "Correlations between Jon Stewart's likes, dislikes, comments, and views",
subtitle = "YouTube videos of The Daily Show playlist",
caption = "Data accessed from YouTube API")
JonDailyShowCorrDfAll %>%
corrr::rplot(rdf = ., shape = 19,
colors = c("yellow",
"purple")) +
rplot_labs_v1``````

As we can see, the hue of the color and the size of the point indicate the level of the correlation. We can also remove the redundant correlations from the table with `corrr::shave()`.

``````knitr::kable(
JonDailyShowCorrDfAll %>%
corrr::shave(x = .)
)``````
rownameview_countlike_countcomment_countdislike_count
view_countNANANANA
like_count0.9788202NANANA
comment_count0.86653940.8911861NANA
dislike_count0.86847010.88090440.9477941NA

#### Network plot

The second graph options from `corrr` is the `corrr::network_plot()` which takes a `cor_df` and returns a graph where “variables that are more highly correlated appear closer together and are joined by stronger paths.

``````# labels
network_labs_v1 <- ggplot2::labs(
title = "Correlations between Trevor Noah's likes, dislikes, comments, and views",
subtitle = "YouTube videos of The Daily Show playlist",
caption = "Data accessed from YouTube API")
network_plot_v1 <- TrevorDailyShowCorr %>%
dplyr::select_if(is.numeric) %>%
corrr::correlate(x = .,
method = "spearman",
quiet = TRUE) %>%
corrr::network_plot(rdf = .,
colors = c("firebrick", "white", "dodgerblue"),
min_cor = .5) +
network_labs_v1
network_plot_v1``````

This shows us that the strongest correlations are between `view_count`s and `like_count`s, and the `comment_count`s and `dislike_count`s. The `corrr::fashion()` function gives us more fancy print options, too.

``````knitr::kable(
JonDailyShowCorrDfAll %>%
corrr::fashion(x = .,
decimals = 3,
na_print = 0)
)``````
rownameview_countlike_countcomment_countdislike_count
view_count0.979.867.868
like_count.9790.891.881
comment_count.867.8910.948
dislike_count.868.881.9480