Data Journalism in R Tutorials

Easily clean up messy databases with fuzzy matching in R

One of the biggest challenges working with text data is the many different ways that people can enter the exact same information. A human knows that “St. Lucie, Florida,” “Saint Lucie, FL,” and “St Lucy, Florida” are probably all the same place, but a computer doesn’t.

“Fuzzy” matching pulls similarities between the letters in words and phrases to help group them together.

To begin, we will load our required libraries to work in R. You will need to install these packages first if you have not yet done so:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)
library(stringr)
library(stringdist)
## 
## Attaching package: 'stringdist'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggplot2)

Bringing in the Data

We will be using a real data set: the schedule for the 2025 Investigative Reporters and Editors (IRE) conference.

Here we will download the IRE schedule, then save the session information as a data frame. Let’s first find all the unique session names.

IRE <- rio::import("https://schedules.ire.org/ire-2025/ire-2025-schedule.json")

sessions <- as_tibble(unlist(str_split(IRE$track, ",")))

sessions <- sessions %>% 
  group_by(value) %>% 
  summarize(total=n()) %>% 
  rename(session_track=value) %>% 
  arrange(desc(total))

unique(sessions$session_track)
##  [1] "Broadcast"                              
##  [2] "Reporting and writing strategies"       
##  [3] "Beat reporting"                         
##  [4] " Beat reporting"                        
##  [5] "Tools & Tech"                           
##  [6] " Public records"                        
##  [7] " Story ideas"                           
##  [8] "Beginner"                               
##  [9] "Story ideas"                            
## [10] " Reporting and writing strategies"      
## [11] "Educators"                              
## [12] "International"                          
## [13] " Networking"                            
## [14] ""                                       
## [15] " Criminal legal system"                 
## [16] "Managing & Editing"                     
## [17] "Public records"                         
## [18] " Data analysis"                         
## [19] "\"Equity - inclusion - accessibility\"" 
## [20] "Audio/radio"                            
## [21] "Business"                               
## [22] "Career advice"                          
## [23] " Audio/radio"                           
## [24] "50th Anniversary"                       
## [25] "AI"                                     
## [26] "Climate"                                
## [27] "Criminal legal system"                  
## [28] " Broadcast"                             
## [29] " Tools & Tech"                          
## [30] "Attacks on the press"                   
## [31] "Data analysis"                          
## [32] "Deep dive"                              
## [33] " \"Equity - inclusion - accessibility\""
## [34] " Research"                              
## [35] " Security"                              
## [36] " Students"                              
## [37] "Advanced"                               
## [38] "Freelance"                              
## [39] "Intermediate"                           
## [40] "Networking"                             
## [41] " AI"                                    
## [42] " Attacks on the press"                  
## [43] " Beginner"                              
## [44] " Career advice"                         
## [45] " Data viz"                              
## [46] " Local news"                            
## [47] "Employment"                             
## [48] "Gender & reproductive rights"           
## [49] " 50th Anniversary"                      
## [50] " Business"                              
## [51] " International"                         
## [52] "Data viz"                               
## [53] "Local news"                             
## [54] " Advanced"                              
## [55] " Employment"                            
## [56] " Gender & reproductive rights"          
## [57] " Managing & Editing"                    
## [58] "Students"

Some of these are not typed in exactly right, so we get separate categories where we only wanted one. “Beat reporting” and “ Beat reporting” were probably meant to be the same category! But an errant space after the quotation mark in the second instance makes them different to a computer.

DON’T MISS  Investigating megabits: How Leon Yin and The Markup uncovered the story of ISP inequality and digital redlining

Fuzzy matching

We can find other similar entries with mistakes by doing some fuzzy matching.

First, we’ll create a vector of the names. Then, we’ll create a function to categorize similar names based on a threshold we set. You can play with the threshold to have it be more or less strict (i.e., more or fewer letters in common).

# Sample vector of names
names_vector <- sessions$session_track
# Function to categorize similar names
categorize_names <- function(names_vector, threshold = 0.2) {
  # Create an empty vector to store the labels
  labels <- rep(NA, length(names_vector))
  
  # Initialize the label counter
  label_counter <- 1
  
  # Loop through each name in the vector
  for (i in 1:length(names_vector)) {
    if (is.na(labels[i])) {
      # Assign a new label to the current name
      labels[i] <- label_counter
      
      # Compare the current name with the rest of the names
      for (j in (i + 1):length(names_vector)) {
        if (stringdist(names_vector[i], names_vector[j], method = "jw") < threshold) {
          # Assign the same label to similar names
          labels[j] <- label_counter
        }
      }
      
      # Increment the label counter
      label_counter <- label_counter + 1
    }
  }
  
  return(labels)
}
# Categorize the names and create a new vector with labels
name_labels <- categorize_names(names_vector)
# Print the original names and their corresponding labels
categories <- data.frame(Name = names_vector, Label = name_labels)
datatable(categories)

Each category now has a label. This is just a random number we assigned. We can now see “Beat Reporting” and ” Beat Reporting” are both in the same category (3).

But here we find a mistake! “ AI” is category 30 and “AI” is in category 20. From a human point of view, these are indistinguishable. But from a computer point of view, the former differs from the latter by 33%!

So we will fix this manually:

categories <- categories %>% 
  mutate(Label=ifelse(Name %in% c(" AI","AI"),20,Label)) #We could also have chosen 30 for both

How many categories do we have versus how many labels?

length(unique(categories$Name))
## [1] 58
length(unique(categories$Label))
## [1] 33

Looking at how the computer fuzzy matched these

Pro tip: always double-check your fuzzy matching to make sure the computer did it right. This is an extremely simple example and could probably have also been done just by trimming whitespace. But this is a really handy tool especially for matching up proper names. Longer names like “Cincinnati,” “Philadelphia,” and “Schuylkill” will never all come in perfectly!

DON’T MISS  How to make your images interactive with audio, video and links

We’re just checking here to make sure that each category has one label. And also so we can eyeball this and make sure it’s okay.

categories_with_labels <- categories %>% 
  group_by(Name,Label) %>% 
  summarize(Total=n()) %>% 
  arrange(desc(Label))
## `summarise()` has grouped output by 'Name'. You can override using the
## `.groups` argument.
datatable(categories_with_labels)

Now we’re going to concatenate each similar category with the same label.

categories_with_labels <- categories_with_labels %>% 
  group_by(Label) %>% 
  summarise(across(everything(), 
                   ~ paste(na.omit(.x), collapse = ", ")))
#.x inside the lambda represents each column’s values


DT::datatable(categories_with_labels)

Now we are going to match this data back up with our session data:

sessions <- sessions %>% 
  mutate(category_number= categories$Label[match(sessions$session_track,categories$Name)]) 
sessions <- sessions %>% 
  mutate(combined_category=categories_with_labels$Name[match(sessions$category_number, categories_with_labels$Label)])
sessions <- sessions %>% 
  mutate(track=sub(",.*", "",combined_category))
sessions <- sessions %>% 
  arrange(desc(total))


DT::datatable(sessions)

Visualizing the Results

What types of sessions will there be?

sessions %>% 
  group_by(combined_category) %>% 
  summarize(total=sum(total)) %>% 
ggplot( aes(reorder(combined_category,-total),total))+
  geom_col()+
  geom_text(aes(label = after_stat(y)), stat = "summary",fun = "sum", vjust = -0.5, size = 10)+
    theme(axis.text.x = element_text(angle = 75, vjust= .5, size = 20), axis.text.y = element_text(size = 20))+
  theme(legend.position = "none")

Final Thoughts

There are also some really great AI tools out there for fuzzy matching. The method in this tutorial is a bit cheaper and more environmentally friendly, but use whatever tool works for your project. Even with — and especially with — AI, remember to always double-check the results!

Lucia Walinchus

One thought on “Easily clean up messy databases with fuzzy matching in R

Leave a Reply

Your email address will not be published. Required fields are marked *

Get the latest from Storybench

Keep up with tutorials, behind-the-scenes interviews and more.