Easily clean up messy databases with fuzzy matching in R

September 16, 2025October 10, 2025 by Lucia Walinchus

One of the biggest challenges working with text data is the many different ways that people can enter the exact same information. A human knows that “St. Lucie, Florida,” “Saint Lucie, FL,” and “St Lucy, Florida” are probably all the same place, but a computer doesn’t.

“Fuzzy” matching pulls similarities between the letters in words and phrases to help group them together.

To begin, we will load our required libraries to work in R. You will need to install these packages first if you have not yet done so:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)
library(stringr)
library(stringdist)
## 
## Attaching package: 'stringdist'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggplot2)

Bringing in the Data

We will be using a real data set: the schedule for the 2025 Investigative Reporters and Editors (IRE) conference.

Here we will download the IRE schedule, then save the session information as a data frame. Let’s first find all the unique session names.

IRE <- rio::import("https://schedules.ire.org/ire-2025/ire-2025-schedule.json")

sessions <- as_tibble(unlist(str_split(IRE$track, ",")))

sessions <- sessions %>% 
  group_by(value) %>% 
  summarize(total=n()) %>% 
  rename(session_track=value) %>% 
  arrange(desc(total))

unique(sessions$session_track)

##  [1] "Broadcast"                              
##  [2] "Reporting and writing strategies"       
##  [3] "Beat reporting"                         
##  [4] " Beat reporting"                        
##  [5] "Tools & Tech"                           
##  [6] " Public records"                        
##  [7] " Story ideas"                           
##  [8] "Beginner"                               
##  [9] "Story ideas"                            
## [10] " Reporting and writing strategies"      
## [11] "Educators"                              
## [12] "International"                          
## [13] " Networking"                            
## [14] ""                                       
## [15] " Criminal legal system"                 
## [16] "Managing & Editing"                     
## [17] "Public records"                         
## [18] " Data analysis"                         
## [19] "\"Equity - inclusion - accessibility\"" 
## [20] "Audio/radio"                            
## [21] "Business"                               
## [22] "Career advice"                          
## [23] " Audio/radio"                           
## [24] "50th Anniversary"                       
## [25] "AI"                                     
## [26] "Climate"                                
## [27] "Criminal legal system"                  
## [28] " Broadcast"                             
## [29] " Tools & Tech"                          
## [30] "Attacks on the press"                   
## [31] "Data analysis"                          
## [32] "Deep dive"                              
## [33] " \"Equity - inclusion - accessibility\""
## [34] " Research"                              
## [35] " Security"                              
## [36] " Students"                              
## [37] "Advanced"                               
## [38] "Freelance"                              
## [39] "Intermediate"                           
## [40] "Networking"                             
## [41] " AI"                                    
## [42] " Attacks on the press"                  
## [43] " Beginner"                              
## [44] " Career advice"                         
## [45] " Data viz"                              
## [46] " Local news"                            
## [47] "Employment"                             
## [48] "Gender & reproductive rights"           
## [49] " 50th Anniversary"                      
## [50] " Business"                              
## [51] " International"                         
## [52] "Data viz"                               
## [53] "Local news"                             
## [54] " Advanced"                              
## [55] " Employment"                            
## [56] " Gender & reproductive rights"          
## [57] " Managing & Editing"                    
## [58] "Students"

Some of these are not typed in exactly right, so we get separate categories where we only wanted one. “Beat reporting” and “ Beat reporting” were probably meant to be the same category! But an errant space after the quotation mark in the second instance makes them different to a computer.

DON’T MISS How to use CartoDB to visualize animal tracking data

Fuzzy matching

We can find other similar entries with mistakes by doing some fuzzy matching.

First, we’ll create a vector of the names. Then, we’ll create a function to categorize similar names based on a threshold we set. You can play with the threshold to have it be more or less strict (i.e., more or fewer letters in common).

# Sample vector of names
names_vector <- sessions$session_track
# Function to categorize similar names
categorize_names <- function(names_vector, threshold = 0.2) {
  # Create an empty vector to store the labels
  labels <- rep(NA, length(names_vector))
  
  # Initialize the label counter
  label_counter <- 1
  
  # Loop through each name in the vector
  for (i in 1:length(names_vector)) {
    if (is.na(labels[i])) {
      # Assign a new label to the current name
      labels[i] <- label_counter
      
      # Compare the current name with the rest of the names
      for (j in (i + 1):length(names_vector)) {
        if (stringdist(names_vector[i], names_vector[j], method = "jw") < threshold) {
          # Assign the same label to similar names
          labels[j] <- label_counter
        }
      }
      
      # Increment the label counter
      label_counter <- label_counter + 1
    }
  }
  
  return(labels)
}
# Categorize the names and create a new vector with labels
name_labels <- categorize_names(names_vector)
# Print the original names and their corresponding labels
categories <- data.frame(Name = names_vector, Label = name_labels)
datatable(categories)

Each category now has a label. This is just a random number we assigned. We can now see “Beat Reporting” and ” Beat Reporting” are both in the same category (3).

But here we find a mistake! “ AI” is category 30 and “AI” is in category 20. From a human point of view, these are indistinguishable. But from a computer point of view, the former differs from the latter by 33%!

So we will fix this manually:

categories <- categories %>% 
  mutate(Label=ifelse(Name %in% c(" AI","AI"),20,Label)) #We could also have chosen 30 for both

How many categories do we have versus how many labels?

length(unique(categories$Name))
## [1] 58
length(unique(categories$Label))
## [1] 33

Looking at how the computer fuzzy matched these

Pro tip: always double-check your fuzzy matching to make sure the computer did it right. This is an extremely simple example and could probably have also been done just by trimming whitespace. But this is a really handy tool especially for matching up proper names. Longer names like “Cincinnati,” “Philadelphia,” and “Schuylkill” will never all come in perfectly!

DON’T MISS How to geocode a csv of addresses in R

We’re just checking here to make sure that each category has one label. And also so we can eyeball this and make sure it’s okay.

categories_with_labels <- categories %>% 
  group_by(Name,Label) %>% 
  summarize(Total=n()) %>% 
  arrange(desc(Label))

## `summarise()` has grouped output by 'Name'. You can override using the
## `.groups` argument.

datatable(categories_with_labels)

Now we’re going to concatenate each similar category with the same label.

categories_with_labels <- categories_with_labels %>% 
  group_by(Label) %>% 
  summarise(across(everything(), 
                   ~ paste(na.omit(.x), collapse = ", ")))
#.x inside the lambda represents each column’s values


DT::datatable(categories_with_labels)

Now we are going to match this data back up with our session data:

sessions <- sessions %>% 
  mutate(category_number= categories$Label[match(sessions$session_track,categories$Name)]) 
sessions <- sessions %>% 
  mutate(combined_category=categories_with_labels$Name[match(sessions$category_number, categories_with_labels$Label)])
sessions <- sessions %>% 
  mutate(track=sub(",.*", "",combined_category))
sessions <- sessions %>% 
  arrange(desc(total))


DT::datatable(sessions)

Visualizing the Results

What types of sessions will there be?

sessions %>% 
  group_by(combined_category) %>% 
  summarize(total=sum(total)) %>% 
ggplot( aes(reorder(combined_category,-total),total))+
  geom_col()+
  geom_text(aes(label = after_stat(y)), stat = "summary",fun = "sum", vjust = -0.5, size = 10)+
    theme(axis.text.x = element_text(angle = 75, vjust= .5, size = 20), axis.text.y = element_text(size = 20))+
  theme(legend.position = "none")

Final Thoughts

There are also some really great AI tools out there for fuzzy matching. The method in this tutorial is a bit cheaper and more environmentally friendly, but use whatever tool works for your project. Even with — and especially with — AI, remember to always double-check the results!

Author
Recent Posts

Lucia Walinchus

Lucia Walinchus is an award-winning journalist, attorney and ice hockey player/coach. She is currently the Data Editor for NBC-owned stations. Walinchus has written more than 500 articles for various publications throughout her career and was named a 2016 Fulbright Berlin Capital Program Scholar. She has been featured as a guest speaker on CNN and was a contracted freelancer for the New York Times and the Washington Post. Her work has previously been recognized as the Best Investigative Reporting in Ohio, the Best Data Journalism in Ohio, and the Best Public Service Journalism. Walinchus has a degree in Journalism from American University and a Juris Doctorate from California Western School of Law.

Latest posts by Lucia Walinchus (see all)

How to Analyze bluesky Posts and Trends with R - November 4, 2025
Easily clean up messy databases with fuzzy matching in R - September 16, 2025
How to use R to analyze racial profiling at police stops - November 21, 2023

Storybench

Easily clean up messy databases with fuzzy matching in R

Bringing in the Data

Fuzzy matching

Looking at how the computer fuzzy matched these

Visualizing the Results

Final Thoughts

One thought on “Easily clean up messy databases with fuzzy matching in R”

Leave a Reply Cancel reply

Get the latest from Storybench

Easily clean up messy databases with fuzzy matching in R

Bringing in the Data

Fuzzy matching

Looking at how the computer fuzzy matched these

Visualizing the Results

Final Thoughts

One thought on “Easily clean up messy databases with fuzzy matching in R”

Leave a Reply Cancel reply

read more

How to Analyze bluesky Posts and Trends with R

How to use Python to scrape for deleted government web pages

Get the latest from Storybench