Learn Applying Fuzzy Matching to DataFrames | Fuzzy Matching and Similarity Detection

Swipe to show menu

Fuzzy matching lets you find similar but not identical entries in a pandas DataFrame, helping you handle typos, inconsistent formatting, and alternative spellings. Use this technique to detect near-duplicates, group related records, and flag inconsistencies for data cleaning tasks like deduplication, record linkage, and standardization.

Fuzzy matching compares strings with a similarity function, such as .ratio() from difflib.SequenceMatcher, which returns a score between 0 and 1. A score of 1.0 means the strings are identical; lower values show less similarity. Apply this function row-wise to two columns to create a new column of similarity scores.

The similarity threshold sets the minimum score needed for two values to count as a match. Adjusting this threshold helps you control how strict or lenient your matching is, so you can efficiently identify and address near-duplicates in your DataFrame.


              1234567891011121314151617181920
            
import pandas as pd
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

data = {
    "ProductA": ["Apple iPhone", "Samsung Galaxy", "Google Pixel", "Apple iPone"],
    "ProductB": ["Apple iPhone", "Samsung Galaxi", "Google Pixel", "Apple iPhone"]
}

df = pd.DataFrame(data)

def compare_columns(row):
    return similarity(row["ProductA"], row["ProductB"])

df["SimilarityScore"] = df.apply(compare_columns, axis=1)
df["IsSimilar"] = df["SimilarityScore"] > 0.85

print(df)

Step 1: Choose columns to compare

Select the specific DataFrame columns that contain the values you want to compare for similarity, such as product names or customer records.

Step 2: Define or select a similarity function

Use a function that measures how similar two strings are. The SequenceMatcher from the difflib library provides a convenient .ratio() method for this purpose.

Step 3: Apply the similarity function row-wise to the selected columns

Use the .apply() method to compare each pair of values in the chosen columns, generating a similarity score for each row.

Step 4: Set a similarity threshold to determine what counts as a match

Decide on a cutoff value (such as 0.85) that defines when two values are considered similar enough to be flagged as a match.

Step 5: Flag or extract rows where the similarity exceeds the threshold

Create a new column or filter your DataFrame to highlight or extract only those rows where the similarity score is above your chosen threshold.

Everything was clear?

Thanks for your feedback!

Section 1. Chapter 2

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Section 1. Chapter 2