Understanding Fuzzy Matching
Fuzzy matching is a powerful technique in data cleaning that helps you identify records that are similar but not exactly the same. In real-world datasets, you often encounter inconsistencies such as typos, abbreviations, or alternate spellings. For example, customer names like "Jon Smith" and "John Smith," or product names like "iPhone 11" and "iPhone11," may refer to the same entity but are not identical strings. Relying on exact matching would miss these subtle variations, potentially leading to duplicate records or missed connections. Fuzzy matching addresses these challenges by measuring the degree of similarity between strings, enabling more accurate data cleaning and deduplication.
Levenshtein distance: measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. The Levenshtein distance d(s₁, s₂) between strings s₁ and s₂ is computed using dynamic programming:

```python
def levenshtein_distance(s1, s2):
    # Create a matrix to store distances
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the matrix
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # Compute distances
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # Deletion
                dp[i][j - 1] + 1,        # Insertion
                dp[i - 1][j - 1] + cost  # Substitution
            )
    return dp[m][n]

# Example usage:
print(levenshtein_distance("kitten", "sitting"))  # Output: 3
```

- Use case: best for typo correction, minor spelling differences, and short string comparisons (such as names).
Jaccard similarity: measures the overlap between two sets. For sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|

- Use case: useful for comparing unordered collections or tokenized text (e.g., sets of words, n-grams). It is robust to word order and works well for deduplication of lists or multi-word phrases.
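To make this concrete, here is a minimal sketch of Jaccard similarity over word tokens; the helper name jaccard_similarity and the sample strings are our own illustrations, not from a particular library:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between two collections of tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Example usage: word order does not affect the score
print(jaccard_similarity("apple iphone 11".split(), "iphone 11 apple".split()))  # Output: 1.0
print(jaccard_similarity("jon smith".split(), "john smith".split()))  # Output: 0.333...
```

Note the second result: token-level Jaccard treats "jon" and "john" as entirely different tokens, which is exactly why character-level measures like Levenshtein distance remain the better fit for typos.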
Cosine similarity: compares the orientation (angle) between two non-zero vectors in a multi-dimensional space. For vectors a and b:
cosine(a, b) = (a · b) / (‖a‖ ‖b‖)

- Use case: ideal for longer texts or when word frequency matters, such as document similarity and clustering. Commonly used with term frequency vectors in natural language processing.
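A minimal sketch, assuming raw term-frequency vectors built with collections.Counter (production pipelines often use TF-IDF weighting or a library such as scikit-learn instead):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts via term-frequency vectors."""
    vec_a, vec_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty text as dissimilar to everything
    return dot / (norm_a * norm_b)

# Example usage: shared vocabulary matters, word order does not
print(cosine_similarity("data cleaning with fuzzy matching",
                        "fuzzy matching for data cleaning"))  # Output: ~0.8
```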
Putting Levenshtein distance to work, the following example finds the closest match for a query name in a list of candidates:

```python
names = ["Jon Smith", "John Smith", "Jane Smyth", "Janet Smith", "J. Smith"]

def find_closest_name(query, name_list):
    min_distance = float("inf")
    closest_name = None
    for name in name_list:
        dist = levenshtein_distance(query, name)
        if dist < min_distance:
            min_distance = dist
            closest_name = name
    return closest_name, min_distance

# Example usage:
query_name = "John Smit"
match, distance = find_closest_name(query_name, names)
print(f"Closest match to '{query_name}': '{match}' (distance: {distance})")
```
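The same helper extends naturally to deduplication. Below is a hedged sketch: the function find_duplicates and the cutoff of 2 edits are illustrative choices, and the right threshold depends on how long your strings typically are.

```python
def find_duplicates(name_list, max_distance=2):
    """Pair up names whose edit distance is within max_distance."""
    duplicates = []
    for i in range(len(name_list)):
        for j in range(i + 1, len(name_list)):
            if levenshtein_distance(name_list[i], name_list[j]) <= max_distance:
                duplicates.append((name_list[i], name_list[j]))
    return duplicates

# Example usage:
print(find_duplicates(names))
# [('Jon Smith', 'John Smith'), ('Jon Smith', 'J. Smith'), ('Jane Smyth', 'Janet Smith')]
```

This pairwise scan makes O(n²) comparisons, so for large datasets you would typically add a blocking step (comparing only records that share, say, a first letter or a common token) before applying fuzzy matching.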