Getting started with stringr for textual analysis in R

Data Journalism in R, How to, Reinventing Local TV News
Share on FacebookShare on Google+Tweet about this on TwitterPin on PinterestShare on LinkedInEmail this to someone
Manipulating characters – a.k.a. non-numerical data – is an essential skill for anyone looking to visualize or analyze text data. This tutorial will go over a few of the base R functions for manipulating strings in R, and introduce the stringr package from the tidyverse. The datasets being used are being analyzed as part of the Reinventing Local TV News Project at Northeastern University’s School of Journalism.  The full RMarkdown file can be found here.

Load the Packages

Start by loading the packages we will be using in this tutorial.

library(dplyr) # Data wrangling, glimpse(75) and tbl_df().

library(ggplot2) # Visualise data.
library(lubridate) # Dates and time.
library(readr) # Efficient reading of CSV data.
library(stringr) # String operations.
library(tibble) # Convert row names into a column.
library(tidyr) # Prepare a tidy dataset, gather().
library(magrittr) # Pipes %>%, %T>% and equals(), extract().
library(tidyverse) # all tidyverse packages
library(mosaic) # favstats and other summary functions
library(fs) # file management functions
library(forcats) # manipulate data for visualization
library(cowplot) # visualization
library(stringi) # more strings
library(tidytext) # tidying text data for analysis
library(maps) # maps for us.cities

Load the Data

We will be using 6 months of news article data from ABC7NY in New York and KCRA in California, spanning July 18 to January 16 (headline, date-time, teaser, url). All times are Eastern Standard Time. You can find the files here: abc7ny.csv and kcra.csv.

Abc7 <- read_csv("Data/abc7ny.csv")
Kcra <- read_csv("Data/kcra.csv")

Inspect the data sets

Take a quick look at each data set with dplyr::glimpse(75)

Abc7 %>% glimpse(78)
Observations: 9,745
Variables: 6
$ datetime  <chr> "July 18, 2017 at 03:41PM", "July 18, 2017 at 03:55PM",...
$ headline  <chr> "9-year-old boy wants to thank every cop in the U.S. wi...
$ url       <chr> "http://abc7ny.com/news/donut-boy-wants-to-thank-every-...
$ teaser    <chr> "Tyler Carach, a 9-year-old boy from Florida, wants to ...
$ feed_name <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", "abc7ny.c...
$ feed_url  <chr> "http://abc7ny.com/feed", "http://abc7ny.com/feed", "ht...
Kcra %>% glimpse(78)
Observations: 13,020
Variables: 6
$ datetime  <chr> "July 18, 2017 at 03:56PM", "July 18, 2017 at 03:58PM",...
$ headline  <chr> "Woman says product intended for wasps trapped 7 birds ...
$ url       <chr> "http://www.kcra.com/article/woman-says-product-intende...
$ teaser    <chr> "The customer's Facebook post has been shared thousands...
$ feed_name <chr> "KCRA Top Stories", "KCRA Top Stories", "KCRA Top Stori...
$ feed_url  <chr> "http://www.kcra.com", "http://www.kcra.com", "http://w...

We can see these data frames contain the same variables, but if we had many, many columns we could use base::identical()and base::names() to test if the columns are the same.

base::identical(names(Abc7), names(Kcra))
[1] TRUE

Combine the data sets

We want to combine these data sets into a single data frame NewsData. As with most operations in R, there are multiple ways to approach combining data frames. I prefer to use dplyr for merging and joining because 1) the functions are fast and efficient and 2) the arguments are somewhat intuitive.

I say somewhat because dplyr::bind_rows() sticks two data frames together top-to-bottom, and dplyr::bind_cols() puts two data frames together side-by-side. If you look at the figure below, you can see what I mean. Combining two data frames together top-to-bottom means the data gets added into the matched columns (and un-matched columns get filled with NA).

But if I want to combine data frames together side-by-side, they will be matched on their row position (so I would need to make sure the same number of rows are in each data frame).

NOTE: The tiny object in the rows are chairs, because for some reason, whenever I think of rows, I think of a row of chairs. The same goes for columns (except I think about a Roman pillar).

Which one do I want to use?

I can see from the code chunks above that both data frames have identical column names, but an unequal number of rows. This makes dplyr::bind_rows() the correct option. I might want to separate each data frame again later, so including an ID (like data_id) that allows me to see what data set each headline originally came from is a good idea. Fortunately, there is an argument in dplyr::bind_rows() for including an id (.id =).

Before I start executing functions, I should think about to expect in NewsData. The Abc7 data frame has 9,745 and 6 variables, and Kcra has 13,020 observations and 6 variables. If I use dplyr::bind_rows(), I’m expecting a data frame with the sum of all observations (i.e. 22,765) and 6 variables.

# test 
dplyr::bind_rows(Abc7, Kcra, .id = "data_id") %>% 
    dplyr::count(data_id)
data_id
<chr>
n
<int>
1 9745
2 13020

NOTE: I’m going to introduce a little sequence I go through for most data manipulation tasks. It includes three steps: 1) test, 2) assign, and 3) verify. These steps are for making sure I am getting what I expect from each task. In the test step, I set up some way of verifying the function or pipeline is doing what I want. For this example, I used dplyr::count() to see if "data_id" is the new data frame and if it can distinguish the original data sets.

Now I will assign the new variables to NewsData and verify it gives the same information.

# assign
NewsData <- dplyr::bind_rows(Abc7, Kcra, .id = "data_id")
# verify
NewsData %>% 
    dplyr::count(data_id)
data_id
<chr>
n
<int>
1 9745
2 13020

Great! I just need to remember that 1 is the id for Abc7 and 2 is the id for Kcra. Now I can start with the string manipulations.

NewsData %>% 
    glimpse(75)
Observations: 22,765
Variables: 7
$ data_id   <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1...
$ datetime  <chr> "July 18, 2017 at 03:41PM", "July 18, 2017 at 03:55P...
$ headline  <chr> "9-year-old boy wants to thank every cop in the U.S....
$ url       <chr> "http://abc7ny.com/news/donut-boy-wants-to-thank-eve...
$ teaser    <chr> "Tyler Carach, a 9-year-old boy from Florida, wants ...
$ feed_name <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", "abc7n...
$ feed_url  <chr> "http://abc7ny.com/feed", "http://abc7ny.com/feed", ...

String manipulation with base R

There are a few functions in base R that come in handy when dealing with strings. These first functions we will cover are basic, but they introduce important concepts like how strings are printed, and the kinds of manipulations that are possible with R. I’ll demo these by taking the first five lines of the headline column and putting it into it’s object (headlines_var) using dplyr::select() and Matrix::head().

headlines_var <- NewsData %>% 
    dplyr::select(headline) %>% 
    Matrix::head(5) 

Testing Character Strings

The base functions in R takes a character vector, so the first thing I’ll need to do is test and see if headlines_var satisfies that condition. I can do this with base::is.character().

base::is.character(headlines_var)
[1] FALSE

FALSE!?!? But this looked like a character vector when I used dplyr::glimpse(75) above. What kind of object this isheadlines_var?

I can test that with base::typeof().

base::typeof(headlines_var)
[1] "list"

Ahhhh headlines_var is a list. This means I’ll need to use base::unlist() to convert the headlines_var list to a character vector.

headlines_var <- headlines_var %>% base::unlist() 
base::is.character(headlines_var)
[1] TRUE

NOTE: Other options for converting objects to strings include base::is.character() and base::toString().

I’ll take a look at this vector and see what the contents look like with utils::str().

headlines_var %>% utils::str()
 Named chr [1:5] "9-year-old boy wants to thank every cop in the U.S. with doughnuts" ...
 - attr(*, "names")= chr [1:5] "headline1" "headline2" "headline3" "headline4" ...

We can see the unlist() function retains the "names" attribute in the headlines_var vector.

headlines_var
                                                                     headline1 
          "9-year-old boy wants to thank every cop in the U.S. with doughnuts" 
                                                                     headline2 
"10-year-old boy critical after being struck by boat propeller on Long Island" 
                                                                     headline3 
                 "10-year-old boy critical after Long Island boating accident" 
                                                                     headline4 
      "American Airlines mechanic marks record-breaking 75 years with company" 
                                                                     headline5 
                        "12-year-old girl pulled from surf in Sandy Hook dies" 

Removing characters

I can see from the contents of headlines_var that it is a vector with all five headlines. I’ll demonstrate the base::sub()function by removing the - characters.

base::sub(pattern = "-", replacement = " ", x = headlines_var)
                                                                     headline1 
          "9 year-old boy wants to thank every cop in the U.S. with doughnuts" 
                                                                     headline2 
"10 year-old boy critical after being struck by boat propeller on Long Island" 
                                                                     headline3 
                 "10 year-old boy critical after Long Island boating accident" 
                                                                     headline4 
      "American Airlines mechanic marks record breaking 75 years with company" 
                                                                     headline5 
                        "12 year-old girl pulled from surf in Sandy Hook dies" 

The first pass with base::sub() only finds the first instance of - and replaces it with a blank space " ". If I want to remove ALL instances of -, I can use base::gsub().

base::gsub(pattern = "-", replacement = " ", x = headlines_var)
                                                                     headline1 
          "9 year old boy wants to thank every cop in the U.S. with doughnuts" 
                                                                     headline2 
"10 year old boy critical after being struck by boat propeller on Long Island" 
                                                                     headline3 
                 "10 year old boy critical after Long Island boating accident" 
                                                                     headline4 
      "American Airlines mechanic marks record breaking 75 years with company" 
                                                                     headline5 
                        "12 year old girl pulled from surf in Sandy Hook dies" 

Another option for replacing every instance of a character is base::chartr(), but this requires that the old argument is the same length of the new argument (otherwise you will see 'old' is longer than 'new').

base::chartr(old = "-", new = " ", x = headlines_var)
                                                                     headline1 
          "9 year old boy wants to thank every cop in the U.S. with doughnuts" 
                                                                     headline2 
"10 year old boy critical after being struck by boat propeller on Long Island" 
                                                                     headline3 
                 "10 year old boy critical after Long Island boating accident" 
                                                                     headline4 
      "American Airlines mechanic marks record breaking 75 years with company" 
                                                                     headline5 
                        "12 year old girl pulled from surf in Sandy Hook dies" 

Pasting

Whenever you need to combine or “stick” string elements together, you can either use base::paste() or base::paste0(). For example, if I wanted to combine all of these headlines into a single, long, character vector, I could use the following

base::paste(headlines_var, collapse = "; ")
[1] "9-year-old boy wants to thank every cop in the U.S. with doughnuts; 10-year-old boy critical after being struck by boat propeller on Long Island; 10-year-old boy critical after Long Island boating accident; American Airlines mechanic marks record-breaking 75 years with company; 12-year-old girl pulled from surf in Sandy Hook dies"

Or if I wanted to combine all the headlines but not include any whitespace between the characters, I could use base::paste0()

base::paste0(headlines_var, sep = "", collapse = "; ")
[1] "9-year-old boy wants to thank every cop in the U.S. with doughnuts; 10-year-old boy critical after being struck by boat propeller on Long Island; 10-year-old boy critical after Long Island boating accident; American Airlines mechanic marks record-breaking 75 years with company; 12-year-old girl pulled from surf in Sandy Hook dies"

Printing

I typically use RStudio Notebooks, so I like to have some flexibility with my printing options. There are a few base R printing options you should know for printing strings (or anything) to the R console.

If you want to see the results of a vector printed without quotes, use base::noquote()

base::noquote(headlines_var)
                                                                   headline1 
          9-year-old boy wants to thank every cop in the U.S. with doughnuts 
                                                                   headline2 
10-year-old boy critical after being struck by boat propeller on Long Island 
                                                                   headline3 
                 10-year-old boy critical after Long Island boating accident 
                                                                   headline4 
      American Airlines mechanic marks record-breaking 75 years with company 
                                                                   headline5 
                        12-year-old girl pulled from surf in Sandy Hook dies 

Another option is base::cat(). This also comes with a sep = argument.

base::cat(headlines_var, sep = ", ")
9-year-old boy wants to thank every cop in the U.S. with doughnuts, 10-year-old boy critical after being struck by boat propeller on Long Island, 10-year-old boy critical after Long Island boating accident, American Airlines mechanic marks record-breaking 75 years with company, 12-year-old girl pulled from surf in Sandy Hook dies

The stringr package

The tidyverse comes with stringr, a handy package for handling string manipulations. You can find it on the tidyversewebsite or you can read about here in R for Data Science.

stringr (like other packages in the tidyverse) is very convenient because all of the functions have a common prefix, str_

p

Everything we just did with base R is possible with stringr, so I’m going to go through some of this package’s lesser-known functions.

Changing case

It’s often useful to convert the entire corpus of text to lowercase so future string matching is a little easier. I can do this by changing the case of headlines and storing it in a new variable headline_low using stringr::str_to_lower() and dplyr::mutate().

NOTE: I’m going to perform the remaining functions with the NewsData data frame (because that is the preferred object in the tidyverse).

NewsData %>% 
    dplyr::mutate(headline_low = stringr::str_to_lower(headline)) %>% 
    dplyr::select(headline, headline_low) %>% 
    head(5)
headline
<chr>
9-year-old boy wants to thank every cop in the U.S. with doughnuts
10-year-old boy critical after being struck by boat propeller on Long Island
10-year-old boy critical after Long Island boating accident
American Airlines mechanic marks record-breaking 75 years with company
12-year-old girl pulled from surf in Sandy Hook dies

Then again, these are headlines, so maybe we want another variable that converts all headlines to Title Case using stringr::str_to_title().

NewsData %>% 
    dplyr::mutate(headline_title = stringr::str_to_title(headline)) %>% 
    dplyr::select(headline, headline_title) %>% 
    head(5)
headline
<chr>
9-year-old boy wants to thank every cop in the U.S. with doughnuts
10-year-old boy critical after being struck by boat propeller on Long Island
10-year-old boy critical after Long Island boating accident
American Airlines mechanic marks record-breaking 75 years with company
12-year-old girl pulled from surf in Sandy Hook dies

Finding words

I might also curious about the first few words in the teaser associated with each headline. Let’s assume I want to see the frequency of each word by news feed (or data_id). I can use the stringr::word() function to extract the first three words and store it in a new variable teaser_3_words.

# test
NewsData %>% 
    dplyr::mutate(teaser_3_words = stringr::word(NewsData$teaser, 1, 3)) %>% 
    count(teaser_3_words, sort = TRUE) %>% 
    head(10)
teaser_3_words
<chr>
n
<int>
President Donald Trump 216
Get weather conditions 132
Police say the 114
KCRA 3 Weather 107
Here and Now: 102
Good morning, Northern 87
The incident happened 77
The fire broke 65
Authorities say the 43
The New York 42

This looks like it’s working, so I’ll assign teaser_3_words to the NewsData data frame.

NewsData <- NewsData %>% 
    dplyr::mutate(teaser_3_words = stringr::word(NewsData$teaser, 1, 3))

Now I can verify this new variable is behaving the way it should by using the same dplyr::count() function above.

NewsData %>% 
    dplyr::count(teaser_3_words, sort = TRUE) %>% 
    utils::head(10)
teaser_3_words
<chr>
n
<int>
President Donald Trump 216
Get weather conditions 132
Police say the 114
KCRA 3 Weather 107
Here and Now: 102
Good morning, Northern 87
The incident happened 77
The fire broke 65
Authorities say the 43
The New York 42

Sometimes I like to verify a new variable using a visualization instead of a table. I find it helpful to think of the data I am manipulating as a distribution (or shape) instead of just a set of numbers in a table. A figure has more attributes I can think about (shape, size, color, position, etc.), where a table just has lines and text.

For example, when I created teaser_3_words with the first three words from the teaser, I was thinking about how those words might vary by station (ABC7NY or KCRA) over time. I assume there is some overlap in teaser_3_words between these stations. I can see this if I run dplyr::distinct() and base::nrow().

NewsData %>% 
    dplyr::distinct(teaser_3_words) %>% 
    base::nrow()
[1] 15415

If there are 15415 distinct teaser_3_words and 22765 observations in NewsData, I can safely assume there is some overlap between these two stations.

Create a visualization with a pipeline

If I want to visualize this, I need to add a column to the NewsData data frame that has the results from dplyr::count(). This is possible with dplyr::add_count(). Then I want to dplyr::arrange() the data so the largest numbers are on top. n isn’t a very descriptive variable name, so I will dplyr::rename() this to tease_3rd_count.

Things get a little trickier here, because I also want to add a total N by news feed and a proportion variable that will tell me how many times teaser_3_words showed up in that teaser relative to the total number of teasers in the feed (tease_3rd_prop).

This little trick can be accomplished using the following steps:

  1. Use dplyr::group_by(), dplyr::add_tally(), and dplyr::ungroup() to collapse the data frame on the two news feeds in data_id, create a summary n variable, and then expand this data frame back to it’s original shape (plus one variable).
  2. Change the name of n to newsfeed_n with dplyr::rename()
  3. Finally, use dplyr::mutate() to create the proportion variable tease_3rd_prop, a factor version of data_id, and convert the existing teaser_3_words variable to a factor.

When I test a long pipeline like this, I do it iteratively. At the end of each step, I add a %>% dplyr::glimpse(75) or another viewing function to see what I am getting.

NewsData <- NewsData %>% 
    dplyr::add_count(teaser_3_words) %>% # count this variable and add it
    dplyr::arrange(desc(n)) %>% # arrange the new n data with largest on top
    dplyr::rename(tease_3rd_count = n) %>%  # get rid of n variable
    dplyr::group_by(data_id) %>% # collapse the data frame by news feed
    dplyr::add_tally() %>% # add the total count 
    dplyr::ungroup() %>% # expand the data to all variables again
    dplyr::rename(newsfeed_n = n) %>% # rename n to newsfeed_n
    dplyr::mutate(tease_3rd_prop = tease_3rd_count/newsfeed_n, # create prop
            data_id_fct = factor(data_id, # create factor for ID
                                   levels = c(1, 2),
                                   labels = c("Abc7", 
                                          "Kcra")),
            teaser_3_words = factor(teaser_3_words)) 
NewsData %>% 
    dplyr::glimpse(75)
Observations: 22,765
Variables: 12
$ data_id         <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "...
$ datetime        <chr> "July 18, 2017 at 05:39PM", "July 19, 2017 at ...
$ headline        <chr> "President Trump blasts Congress over failure ...
$ url             <chr> "http://abc7ny.com/politics/trump-blasts-congr...
$ teaser          <chr> "President Donald Trump blasted congressional ...
$ feed_name       <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", ...
$ feed_url        <chr> "http://abc7ny.com/feed", "http://abc7ny.com/f...
$ teaser_3_words  <fct> President Donald Trump, President Donald Trump...
$ tease_3rd_count <int> 216, 216, 216, 216, 216, 216, 216, 216, 216, 2...
$ newsfeed_n      <int> 9745, 9745, 9745, 9745, 9745, 9745, 9745, 9745...
$ tease_3rd_prop  <dbl> 0.02217, 0.02217, 0.02217, 0.02217, 0.02217, 0...
$ data_id_fct     <fct> Abc7, Abc7, Abc7, Abc7, Abc7, Abc7, Abc7, Abc7...

Now I can create a Cleveland Plot that shows the proportion of the first three word occurrence by news station. Note that I also use a few dplyr functions to make the plot a little clearer.

NewsData %>% 
    dplyr::arrange(desc(tease_3rd_prop)) %>% # sort desc by the proportion
    dplyr::filter(tease_3rd_count >= 50) %>% # only keep frequecies above 50
    dplyr::filter(!is.na(teaser_3_words)) %>% # remove missing
    # Make the plot
    ggplot2::ggplot(aes(x = tease_3rd_prop, # plot the prop
                        y = fct_reorder(teaser_3_words, # reorder words
                                        tease_3rd_count), # by counts
                     fill = data_id_fct,
                    group = teaser_3_words)) + # fill by feed
    ggplot2::geom_segment(aes(yend = teaser_3_words), 
                     xend = 0, 
                    color = "grey50") +
    ggplot2::geom_point(size = 3, 
                   aes(color = data_id_fct), 
                 show.legend = FALSE) + 
    ggplot2::facet_wrap( ~ data_id_fct, # arrange in 2 columns
           nrow = 2, 
         scales = "free", # free scales
       labeller = as_labeller( # add custom lables
       c(`Abc7` = "ABC7NY News Feed", 
         `Kcra` = "KCRA Sacramento News Feed"))) +
            ggplot2::scale_x_continuous(labels = scales::percent) + # use %
            ggplot2::theme(strip.text.x = element_text(face = "bold")) + # bold type
            ggplot2::labs( # labels and captions
                  x = "Percent First Three Words Appeared", 
                  y = NULL,
            caption = "These are the first three words from headline teasers 
            appearing in ABC7NY and KCRA (from Sacramento)",
              title = "TEAS3RS - Trump, Weather, Police",
           subtitle = "The First Three Words From News Headlines Teasers") 

 

As you can see, these pipelines provide a lot of flexibility for manipulating and visualizing data.

Joins

When you want to combine data frame one data set into another, but match these data on a common variable, you need to perform a join. Broadly speaking, there are two types of joins: inner and outer.

Inner joins

Inner joining comes in handy when you need to combine two tables, but only want keep the observations that are present in both tables. Read more about these here.

Outer joins

Outer joins can come in three types: left, right , and full outer joins. The left_join() gets all the rows from the left table, including those with NA values in the join field.

The important thing about dplyr::left_join() is the order the tables are listed in the function. Inside the dplyr::left_join()function, the first table listed is the primary table, and the second table is the secondary table. The dplyr::left_join() will join the secondary table to the primary table by matching on the rows from the primary table, even if there aren’t matches with rows in the secondary table.

I’ll demonstrate a join using the data from the maps package.

Joining to maps data

UsCity <- maps::us.cities
UsCity %>% glimpse(75)
Observations: 1,005
Variables: 6
$ name        <chr> "Abilene TX", "Akron OH", "Alameda CA", "Albany GA...
$ country.etc <chr> "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "V...
$ pop         <int> 113888, 206634, 70069, 75510, 93576, 45535, 494962...
$ lat         <dbl> 32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 3...
$ long        <dbl> -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, ...
$ capital     <int> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

I want to join the name of the city in NewsData to the city in maps::us.cities (now UsCity). First I want to know how many city names are in UsCity.

UsCity %>% dplyr::distinct(name) %>% base::nrow()
[1] 1005

There are 1005 distinct cities in this data frame. The city information I want to use from NewsData is stored in the headline. In order to join these data frames, they need to joined on an ID that is the same case and string in each data frame. In order to do this, I need to talk a little about regular expressions.

Regular Expressions (regex)

Regular expressions are “pattern[s] which specifies a set of strings of characters”. They are used for pattern matching in many programming languages. If you’re going to be analyzing mostly textual data, I recommend the excellent texts from O’Reilly and the introduction from Thomas Nield.

An example of a regular expression to extract all the numerical characters in headlines_var would be:

stringr::str_extract_all(string = headlines_var, pattern = "[\\d]")
[[1]]
[1] "9"

[[2]]
[1] "1" "0"

[[3]]
[1] "1" "0"

[[4]]
[1] "7" "5"

[[5]]
[1] "1" "2"

If this looks strange to you, it’s because regular expressions are a bit of a nightmare. They were originally created in the 1940s as a way to describe languages, but now their primary purpose is to test the frustration threshold for programmers and analysts.

Why are regular expressions so awful/powerful?

Regular expressions are highly composable, which only adds to their complexity. The composability of a system or language refers to how many ways we can mix and match it’s elements together to satisfy our needs (which in this case means to ‘match a string or character’). For any given string or set of strings, there are countless ways of matching it with regular expressions.

“The composability of the language is the degree to which the various methods of composition can be applied uniformly to all the various objects of the language, both atomic and compound.” – James Clark

Regular Expressions in R

R has some funny quirks that add the fun of using regular expressions. In real life (outside of R), we would’ve used the following abstract sequence to match any digit from 0 to 9 in headlines_var.

stringr::str_extract_all(string = headlines_var, pattern = "[\d]")
# Error: '\d' is an unrecognized escape in character string starting ""[\d"

But in R we have to “escape the escape” character \, which means "[\d]" becomes "[\\d]". This is important if you are using a regular expression testing website. These websites allow you check a regular expression on some sample text, but they might match your needs within R.


Common Regular Expressions

Below is a quick list of regular expressions I use often:

Symbol = What it does

$ = Matches the end of the line

^ = Matches the beginning of the line

+ = Matches one or more of the specified characters

* = Matches 0 or more of the specified characters

\\w = Matches any word character (0-9, A-z). To match non-word characters, use \\W

\\s = Matches any whitespace character (tab, line feed, or carriage return). To match non-whitespace characters, use \\S

\\t = Matches a tab character

\\r = Matches a carriage return. Use \\n for line feed.

\\ = This is the escape character. It matches the exact character that follows, not the regex pattern character.

Create city_id in UsCity

Remove the state from the name variable using the information stored in country.etc. I can do this by putting using the state information in country.etc and stringr::str_replace_all() to create city. I will also trim any additional whitespace from this variable with stringr::str_trim() and convert it to lower case using stringr::str_to_lower().

UsCity <- UsCity %>% 
    mutate(city_id = stringr::str_replace_all(string = UsCity$name, 
                         pattern = UsCity$country.etc, 
                         replacement = ""),
           city_id = stringr::str_trim(city_id),
           city_id = stringr::str_to_lower(city_id))
UsCity %>% glimpse(75)
Observations: 1,005
Variables: 7
$ name        <chr> "Abilene TX", "Akron OH", "Alameda CA", "Albany GA...
$ country.etc <chr> "TX", "OH", "CA", "GA", "NY", "OR", "NM", "LA", "V...
$ pop         <int> 113888, 206634, 70069, 75510, 93576, 45535, 494962...
$ lat         <dbl> 32.45, 41.08, 37.77, 31.58, 42.67, 44.62, 35.12, 3...
$ long        <dbl> -99.74, -81.52, -122.26, -84.18, -73.80, -123.09, ...
$ capital     <int> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ city_id     <chr> "abilene", "akron", "alameda", "albany", "albany",...

I will need to check all the words in the headline to see how many mention the cities in city_id. To do this I need to put the UsCity$city_id into it’s own vector city_id_vec. I can do this with base::unlist() and base::paste(). I should consider how I’ll be searching for a pattern match, though. Take the example headline below:

"Trump to attend commissioning of USS Gerald R. Ford warship"

If I don’t include spaces around the collapse = " | " argument I would get a match for the city mission (for Mission, TX).

city_id_vec <- unlist(UsCity$city_id)
city_id_vec <- paste(city_id_vec, sep = "", collapse = " | ")

This new vector should look like this:

[1] "abilene | akron | alameda | albany | albany | albany | albuquerque | 
alexandria | alexandria | alhambra | aliso viejo | allen | allentown | aloha
| altadena | altamonte springs | altoona | amarillo | ames | anaheim | 
anchorage | anderson | andover | ann arbor | annandale | annapolis | antioch
| apple valley | apple valley | appleton | arcadia | arden-arcade | 
arlington heights | arlington | arlington | arlington | arvada |...

Create city_id in NewsData

Now I need to make sure the contents of headline in NewsData are all lower case.

NewsData <- NewsData %>% 
    mutate(headline = stringr::str_to_lower(headline))
NewsData %$% head(headline, 1)
[1] "president trump blasts congress over failure of gop health care bill"

The trick here is to see which headlines in NewsData contains cities from city_id. I do this by dropping stringr::str_detect() inside dplyr::filter() to only include the data with a match from city_id. The stringr::str_detect() function takes string vector (or character column) as an input and check for matches provided by the pattern argument.

MapNewsData <- NewsData %>% 
    filter(stringr::str_detect(string = headline, 
                               pattern = city_id_vec)) %>% 
    dplyr::select(headline, 
            dplyr::everything())
MapNewsData %>% dplyr::glimpse(75)
Observations: 2,322
Variables: 12
$ headline        <chr> "president trump, politicians take to twitter ...
$ data_id         <chr> "1", "1", "1", "1", "1", "2", "2", "1", "1", "...
$ datetime        <chr> "October 31, 2017 at 06:49PM", "November 27, 2...
$ url             <chr> "http://abc7ny.com/trump-tweets-after-nyc-ramp...
$ teaser          <chr> "President Donald Trump called the deadly truc...
$ feed_name       <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", ...
$ feed_url        <chr> "http://abc7ny.com/feed", "http://abc7ny.com/f...
$ teaser_3_words  <fct> President Donald Trump, President Donald Trump...
$ tease_3rd_count <int> 216, 216, 216, 216, 216, 216, 216, 114, 114, 1...
$ newsfeed_n      <int> 9745, 9745, 9745, 9745, 9745, 13020, 13020, 97...
$ tease_3rd_prop  <dbl> 0.022165, 0.022165, 0.022165, 0.022165, 0.0221...
$ data_id_fct     <fct> Abc7, Abc7, Abc7, Abc7, Abc7, Kcra, Kcra, Abc7...

Now that I’ve filtered the data set down to only those 2,322 I’ll be needing to join back to UsCity, I need to create a city_idvariable in MapNewsData. I can do this with stringr::str_extract(). This function takes a string vector, a pattern to match on (which I create with paste(city_id_vec, collapse = "|")).

MapNewsData <- MapNewsData %>% 
    mutate(city_id = str_extract(string = MapNewsData$headline, 
                     pattern = paste(city_id_vec, collapse = "|")),
           city_id = stringr::str_trim(city_id)) 
MapNewsData %>% 
    count(city_id, sort = TRUE) %>% 
    head(5)
city_id
<chr>
n
<int>
sacramento 384
new york 318
las vegas 178
stockton 76
sandy 68

This looks like it’s working. I am going to use dplyr::inner_join() to combine MapNewsData and UsCity by city_id. The result should only be observations with data in both tables (hopefully cities in both tables). I can check this with a combination of dplyr::count() and tidyr::spread().

dplyr::inner_join(MapNewsData, UsCity, by = "city_id") %>% 
    utils::head(20) %>% 
    dplyr::count(city_id, name) %>% 
    tidyr::spread(name, n)
city_id
<chr>
Bayonne NJ
<int>
Manhattan KS
<int>
Moore OK
<int>
New York NY
<int>
Newark CA
<int>
Newark NJ
<int>
Newark OH
<int>
Reno NV
<int>
bayonne 1 NA NA NA NA NA NA NA
manhattan NA 3 NA NA NA NA NA NA
moore NA NA 5 NA NA NA NA NA
new york NA NA NA 1 NA NA NA NA
newark NA NA NA NA 3 3 3 NA
reno NA NA NA NA NA NA NA 1

I can see one potential problem here: moore in MapNewsData could refer to a person (like Senator Moore) and not a city (like Moore OK). The best way to distinguish between cases like this is to know your data. Computers can’t replace brains when it comes to content knowledge (yet).

I am going to go ahead and join these data frames together and clean the dates.

MapNewsData <- dplyr::inner_join(MapNewsData, UsCity, by = "city_id")

separate, str_sub, and make_datetime

The dates (datetime) variable is not in a format I can use. I want to get the date components into a YYYY-MM-DD format. I’ll start this process with tidyr::separate()

# test
tidyr::separate(MapNewsData, 
         datetime, 
         into = c("month", "day", "year", "at", "hour", "min")) %>% 
    dplyr::select(headline,
                  month:min,
                  dplyr::everything()) %>% 
    dplyr::glimpse(75)
Observations: 2,603
Variables: 24
$ headline        <chr> "president trump, politicians take to twitter ...
$ month           <chr> "October", "November", "November", "December",...
$ day             <chr> "31", "27", "27", "10", "10", "23", "10", "19"...
$ year            <chr> "2017", "2017", "2017", "2017", "2017", "2017"...
$ at              <chr> "at", "at", "at", "at", "at", "at", "at", "at"...
$ hour            <chr> "06", "08", "08", "08", "10", "09", "05", "10"...
$ min             <chr> "49PM", "51AM", "55AM", "56AM", "56AM", "06PM"...
$ data_id         <chr> "1", "1", "1", "1", "1", "2", "2", "1", "1", "...
$ url             <chr> "http://abc7ny.com/trump-tweets-after-nyc-ramp...
$ teaser          <chr> "President Donald Trump called the deadly truc...
$ feed_name       <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", ...
$ feed_url        <chr> "http://abc7ny.com/feed", "http://abc7ny.com/f...
$ teaser_3_words  <fct> President Donald Trump, President Donald Trump...
$ tease_3rd_count <int> 216, 216, 216, 216, 216, 216, 216, 114, 114, 1...
$ newsfeed_n      <int> 9745, 9745, 9745, 9745, 9745, 13020, 13020, 97...
$ tease_3rd_prop  <dbl> 0.02217, 0.02217, 0.02217, 0.02217, 0.02217, 0...
$ data_id_fct     <fct> Abc7, Abc7, Abc7, Abc7, Abc7, Kcra, Kcra, Abc7...
$ city_id         <chr> "new york", "moore", "moore", "moore", "moore"...
$ name            <chr> "New York NY", "Moore OK", "Moore OK", "Moore ...
$ country.etc     <chr> "NY", "OK", "OK", "OK", "OK", "NV", "OK", "KS"...
$ pop             <int> 8124427, 48749, 48749, 48749, 48749, 206626, 4...
$ lat             <dbl> 40.67, 35.33, 35.33, 35.33, 35.33, 39.54, 35.3...
$ long            <dbl> -73.94, -97.47, -97.47, -97.47, -97.47, -119.8...
$ capital         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Ok now that I’ve separated the components, I need to convert them all into formats I can use. Follow the comments in the pipeline below to see how these work.

    tidyr::separate(MapNewsData, 
         datetime, 
          into = c("month", 
                  "day", 
                  "year", 
                  "at", 
                  "hour", 
                  "min")) %>% # I know this works
    dplyr::mutate( # new variables
         month = match(month, month.name), # month.name is loaded with R
           day = as.numeric(day), # make numeric day
          year = as.numeric(year), # # make numeric year
      am_or_pm = stringr::str_sub(min, 
                           start = 3, 
                             end = 4),  # break up min from AM/PM
          hour = if_else(am_or_pm %in% "PM", # condition for 24H time
                 as.numeric(hour) + 12, # 24 hour
                 as.numeric(hour)),
        minute = stringr::str_sub(min, 
                         start = 1, # get the minute elemets
                           end = 2),
           min = as.numeric(minute), # format the minutes
          date = lubridate::make_date(
                   year = year,
                  month = month,
                    day = day), 
      datetime = lubridate::make_datetime( # create the datetime
                   year = year, 
                  month = month, 
                    day = day, 
                   hour = hour, 
                    min = min)) %>% 
    dplyr::glimpse(75)
Observations: 2,603
Variables: 28
$ headline        <chr> "president trump, politicians take to twitter ...
$ data_id         <chr> "1", "1", "1", "1", "1", "2", "2", "1", "1", "...
$ month           <int> 10, 11, 11, 12, 12, 8, 11, 7, 7, 7, 7, 7, 7, 7...
$ day             <dbl> 31, 27, 27, 10, 10, 23, 10, 19, 21, 30, 30, 30...
$ year            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ at              <chr> "at", "at", "at", "at", "at", "at", "at", "at"...
$ hour            <dbl> 18, 8, 8, 8, 10, 21, 5, 22, 23, 8, 8, 8, 8, 8,...
$ min             <dbl> 49, 51, 55, 56, 56, 6, 24, 42, 7, 8, 8, 8, 15,...
$ url             <chr> "http://abc7ny.com/trump-tweets-after-nyc-ramp...
$ teaser          <chr> "President Donald Trump called the deadly truc...
$ feed_name       <chr> "abc7ny.com RSS Feed", "abc7ny.com RSS Feed", ...
$ feed_url        <chr> "http://abc7ny.com/feed", "http://abc7ny.com/f...
$ teaser_3_words  <fct> President Donald Trump, President Donald Trump...
$ tease_3rd_count <int> 216, 216, 216, 216, 216, 216, 216, 114, 114, 1...
$ newsfeed_n      <int> 9745, 9745, 9745, 9745, 9745, 13020, 13020, 97...
$ tease_3rd_prop  <dbl> 0.02217, 0.02217, 0.02217, 0.02217, 0.02217, 0...
$ data_id_fct     <fct> Abc7, Abc7, Abc7, Abc7, Abc7, Kcra, Kcra, Abc7...
$ city_id         <chr> "new york", "moore", "moore", "moore", "moore"...
$ name            <chr> "New York NY", "Moore OK", "Moore OK", "Moore ...
$ country.etc     <chr> "NY", "OK", "OK", "OK", "OK", "NV", "OK", "KS"...
$ pop             <int> 8124427, 48749, 48749, 48749, 48749, 206626, 4...
$ lat             <dbl> 40.67, 35.33, 35.33, 35.33, 35.33, 39.54, 35.3...
$ long            <dbl> -73.94, -97.47, -97.47, -97.47, -97.47, -119.8...
$ capital         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ am_or_pm        <chr> "PM", "AM", "AM", "AM", "AM", "PM", "AM", "PM"...
$ minute          <chr> "49", "51", "55", "56", "56", "06", "24", "42"...
$ date            <date> 2017-10-31, 2017-11-27, 2017-11-27, 2017-12-1...
$ datetime        <dttm> 2017-10-31 18:49:00, 2017-11-27 08:51:00, 201...

Now I will assign these new variables to MapNewsData and visualize the occurrence of our new variable teaser_3_words over time.

MapNewsData <- MapNewsData %>% 
    tidyr::separate(datetime, 
          into = c("month", 
                  "day", 
                  "year", 
                  "at", 
                  "hour", 
                  "min")) %>% # I know this works
    dplyr::mutate( # new variables
         month = match(month, month.name), # month.name is loaded with R
           day = as.numeric(day), # make numeric day
          year = as.numeric(year), # # make numeric year
      am_or_pm = str_sub(min, 
                           start = 3, 
                             end = 4),  # break up min from AM/PM
          hour = if_else(am_or_pm %in% "PM", # condition for 24H time
                 as.numeric(hour) + 12, # 24 hour
                 as.numeric(hour)),
        minute = str_sub(min, 
                         start = 1, # get the minute elemets
                           end = 2),
                    min = as.numeric(minute), # format the minutes
          date = make_date(
                   year = year,
                  month = month,
                    day = day),
      datetime = lubridate::make_datetime( # create the datetime
                   year = year, 
                  month = month, 
                    day = day, 
                   hour = hour, 
                    min = min))

First three teaser words over time

I want to look at the proportions of the first three words in teasers over time. This can serve as a crude estimate for how much the first three words of a teaser vary in proportion to the total number of teasers. I can do this by faceting like I did above, or I can use the cowplot::plot_grid() function to separate these graphs by news feed.

In order to use this function, I need to load the cowplot library. Then I want to split the data by the data_id_fct to create two plots with the tease_3rd_prop on the x axis and datetime on the y axis.

# create data subset
MapNewsData_Prop_Data <- MapNewsData %>% 
    arrange(desc(tease_3rd_prop)) %>% 
    dplyr::select(data_id_fct, 
                  tease_3rd_prop,
                  teaser_3_words,
                  datetime) 
MapNewsData_Prop_Data %>% glimpse(75)
## Observations: 1,075,720
## Variables: 4
## $ data_id_fct    <fct> Abc7, Abc7, Abc7, Abc7, Abc7, Abc7, Abc7, Abc7,...
## $ tease_3rd_prop <dbl> 0.02217, 0.02217, 0.02217, 0.02217, 0.02217, 0....
## $ teaser_3_words <fct> President Donald Trump, President Donald Trump,...
## $ datetime       <dttm> 2017-10-31 18:49:00, 2017-10-31 18:49:00, 2017...

Now I am going to create a base line plot. This allows me to set the colors for the text lables in the next geoms.

# create base line plot
MapNewsLinePlot <- MapNewsData_Prop_Data %>%
        ggplot2::ggplot(aes(x = datetime, 
                        y = tease_3rd_prop,
                        color = data_id_fct,
                        label = teaser_3_words)) + 
        ggplot2::geom_line(aes(group = data_id_fct)) 
MapNewsLinePlot

MapNewsLinePlotData-1

I want to know what the specific colors are in this plot. It turns out, all that information is stored in a data frame I can access using ggplot2::ggplot_build().

# get colors in this plot
MapNewsLinePlotData <- ggplot2::ggplot_build(MapNewsLinePlot)$data[[1]]
MapNewsLinePlotData %>% distinct(colour)
##    colour
## 1 #F8766D
## 2 #00BFC4

With my two color codes, I can set the color argument in the geom_text() for the highest occurring proportions.

MapNewsData_Prop_Plot <- MapNewsLinePlot +
        ggplot2::geom_text(data = filter(MapNewsData,
                                    tease_3rd_prop >= 0.020 &
                                    datetime <= "2017-11-15"),
                                    aes(label = teaser_3_words),
                                        vjust = 2,
                                        hjust = 1,
                                        show.legend = FALSE,
                                        color = "#F8766D") +
            ggplot2::geom_text(data = filter(MapNewsData,
                                    tease_3rd_prop >= 0.020 &
                                    datetime >= "2017-11-15" & 
                                    datetime <= "2017-12-7"),
                                    aes(label = teaser_3_words),
                                        vjust = 0.7,
                                        hjust = 0.9,
                                        show.legend = FALSE,
                                        color = "#F8766D") +
            ggplot2::geom_text(data = filter(MapNewsData,
                                    tease_3rd_prop >= 0.020 &
                                    datetime >= "2017-12-7"),
                                    aes(label = teaser_3_words),
                                        vjust = 2,
                                        hjust = 0.09,
                                        show.legend = FALSE,
                                        color = "#F8766D") +
            ggplot2::geom_text(data = filter(MapNewsData,
                                    tease_3rd_prop > 0.015 &
                                    datetime <= "2017-09-01"),
                                    aes(label = teaser_3_words),
                                        vjust = 0.9,
                                        hjust = 0.1,
                                        show.legend = FALSE,
                                        color = "#00BFC4") +
            ggplot2::geom_text(data = filter(MapNewsData,
                                    tease_3rd_prop > 0.015 &
                                    datetime >= "2017-11-01" & 
                                    data_id_fct == "Kcra"),
                                    aes(label = teaser_3_words),
                                        vjust = 0.9,
                                        hjust = 0.1,
                                        show.legend = FALSE,
                                        color = "#00BFC4") +
            ggplot2::theme(legend.position = "top") +
            ggplot2::labs(x = "Date",
                      y = NULL,
                      color = "News Feed",
                      caption = "Proportion First Three Words Appeared") + 
            ggplot2::ggtitle("ABC7 & KCRA Teaser First Three Word Occurrences July 18 to January 16")
MapNewsData_Prop_Plot

MapNewsData_Prop_Plot

ggsave("./image/MapNewsData_Prop_Plot.png", width = 7, height = 5, units = "in")

This gives us an idea for when a certain teaser or headline spiked in coverage between July and January in 2018.

Martin Frigaard is a graduate student at UCSF. Find him on Twitter.

Leave a Reply