Tutorials

How to do basic text mining using Google Sheets

September 20, 2019September 20, 2019 by Aleszu Bajak

While techniques for text mining, sentiment analysis and other natural language processing are ubiquitous on the Internet, they’re not always accessible to students. I recently ran this tutorial in my journalism class at Northeastern University to help students answer some questions they generated around the Twitter timelines of the Democratic candidates running for U.S. president in 2020.

This simple-to-use technique employs two simple formulae: =REGEXMATCH(C2, “keyword”) and =COUNTIF(J2,”TRUE”) to answer questions like “When and how do the candidates talk about race, guns or climate?”

Getting the data

I used R’s “rtweet” package and this Storybench tutorial to access the Twitter timelines of 11 Democratic candidates running in the 2020 election. You might want to use datasets like Donald Trump’s tweets or State of the Union speeches.

candidate	tweets
amyklobuchar	919
AndrewYang	782
BernieSanders	662
BetoORourke	781
CoryBooker	768
ewarren	915
JoeBiden	914
JulianCastro	496
KamalaHarris	879
PeteButtigieg	623
TulsiGabbard	864

I then cleaned and dumped these 8,603 tweets into a Google Sheet and shared it with my students: bit.ly/digisoc2020tweets.

Looking for a specific keyword

Next, I inserted a new column named “climate” – or whatever keyword you’re looking for – and used a REGEXMATCH formula to look through the text of tweet (which I is stored in column C) for the presence of a specific word. That formula, =REGEXMATCH(C2, “climate”) , returns either TRUE or FALSE. (Note: I’ve hidden columns F, G and H).

Next, we’ll apply this formula to all tweets by double-clicking the bottom-right of the J2 cell where I’ve placed the formula.

Once that’s double-clicked, the formula will copy to the entire J column and search every cell in column C for the word “climate.” The first instance of a “TRUE” is in row 5 for us with Pete Buttigieg’s tweet about the climate debate.

DON’T MISS ‘Reeling’ in your audience: Emily Schario discusses The B-Side’s social media strategy

Counting the TRUEs and the FALSEs

Next, we’ll add a column, which I’ll call “climate_num,” that adds a 1 if the tweet contains the word “climate” and a 0 if it doesn’t. We achieve this with the =COUNTIF(J2, “TRUE”) formula. In other words, this formula places a 1 in the K column if it finds a “TRUE” in the J column.

Double-click the bottom-right corner of the K2 cell to apply this formula to the entire column.

Generate a pivot table to count words by candidate

Once we have a bunch of 0s and 1s it’s time to count them by category. In this case, we want to see which of the candidates is mentioning the word “climate” the most on Twitter within the timeframe of our tweets.

To create a pivot table, go to Data –> Pivot table. Create it in a new sheet.

Next, you’ll see the following:

Click the “Add” button next to Rows and click screen_name – or whichever category you wish to summarize your keyword counts by. You’ll see your pivot table take shape:

Next, click the “Add” button next to Values. Select user_name again but notice that this time it counts up the number of tweets per candidate.

Next, let’s bring in the keyword counts. Click the “Add” button next to Values again and then click climate_num. As long as the values are “Summarized by… SUM,” this will add up all the 1’s in the climate_num column. In other words, adding up the number of times each candidate has mentioned the word “climate” in this dataset.

Raw frequency versus proportion of tweets

While Bernie Sanders has mentioned the word “climate” the most times, we want to get a ratio to make sure he’s actually said it the most as a percentage of his total tweets. We can use =C2/B2*100 to give us the percentage of each candidates’ tweets mentioning the keyword.

Double-click the bottom-right corner of that cell and apply it to all candidates. Indeed Bernie Sanders has mentioned climate the most times compared to other candidates – at least in this dataset. Put another way, he mentions “climate” in roughly 7 of every 100 tweets.

DON’T MISS How to use CartoDB instead of Adobe Illustrator for rapid mapping

Thanks!

Author
Recent Posts

Aleszu Bajak

Aleszu Bajak was the founding editor of Storybench. He is currently the director of data visualization at the Urban Institute. Previously, he was a senior data reporter on USA TODAY's data team, part of the newspaper's national investigative unit. He is a former Knight Science Journalism Fellow at M.I.T., was a founding senior writer at Undark magazine and founding editor of Esquire Classic, a project resuscitating the magazine's archives. His work has appeared in The New York Times, The Washington Post, M.I.T. Technology Review and Nature.

Storybench