Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Oppiskele Understanding Fuzzy Matching | Fuzzy Matching and Similarity Detection
Quizzes & Challenges
Quizzes
Challenges
/
Data Cleaning Techniques in Python

bookUnderstanding Fuzzy Matching

Fuzzy matching is a powerful technique in data cleaning that helps you identify records that are similar but not exactly the same. In real-world datasets, you often encounter inconsistencies such as typos, abbreviations, or alternate spellings. For example, customer names like "Jon Smith" and "John Smith," or product names like "iPhone 11" and "iPhone11," may refer to the same entity but are not identical strings. Relying on exact matching would miss these subtle variations, potentially leading to duplicate records or missed connections. Fuzzy matching addresses these challenges by measuring the degree of similarity between strings, enabling more accurate data cleaning and deduplication.

123456789101112131415161718192021222324
def levenshtein_distance(s1, s2): # Create a matrix to store distances m, n = len(s1), len(s2) dp = [[0] * (n + 1) for _ in range(m + 1)] # Initialize the matrix for i in range(m + 1): dp[i][0] = i for j in range(n + 1): dp[0][j] = j # Compute distances for i in range(1, m + 1): for j in range(1, n + 1): cost = 0 if s1[i - 1] == s2[j - 1] else 1 dp[i][j] = min( dp[i - 1][j] + 1, # Deletion dp[i][j - 1] + 1, # Insertion dp[i - 1][j - 1] + cost # Substitution ) return dp[m][n] # Example usage: print(levenshtein_distance("kitten", "sitting")) # Output: 3
copy

Levenshtein distance: measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. The Levenshtein distance d(s1,s2)d(s_1, s_2) between strings s1s_1 and s2s_2 is computed using dynamic programming:

d(i,j)={max(i,j)if min(i,j)=0min{d(i1,j)+1d(i,j1)+1d(i1,j1)+[s1[i]s2[j]]otherwised(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + [s_1[i] \neq s_2[j]] \end{cases} & \text{otherwise} \end{cases}
  • Use case: best for typo correction, minor spelling differences, and short string comparisons (such as names).

Jaccard similarity: measures the overlap between two sets. For sets AA and BB:

J(A,B)=ABABJ(A, B) = \frac{|A \cap B|}{|A \cup B|}
  • Use case: useful for comparing unordered collections or tokenized text (e.g., sets of words, n-grams). It is robust to word order and works well for deduplication of lists or multi-word phrases.

Cosine similarity: compares the orientation (angle) between two non-zero vectors in a multi-dimensional space. For vectors a\vec{a} and b\vec{b}:

cosine(a,b)=abab\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}
  • Use case: ideal for longer texts or when word frequency matters, such as document similarity and clustering. Commonly used with term frequency vectors in natural language processing.
1234567891011121314151617
names = ["Jon Smith", "John Smith", "Jane Smyth", "Janet Smith", "J. Smith"] def find_closest_name(query, name_list): min_distance = float("inf") closest_name = None for name in name_list: dist = levenshtein_distance(query, name) if dist < min_distance: min_distance = dist closest_name = name return closest_name, min_distance # Example usage: query_name = "John Smit" match, distance = find_closest_name(query_name, names) print(f"Closest match to '{query_name}': '{match}' (distance: {distance})")
copy
question mark

Which of the following best describes the difference between exact matching and fuzzy matching in data cleaning?

Select the correct answer

Oliko kaikki selvää?

Miten voimme parantaa sitä?

Kiitos palautteestasi!

Osio 1. Luku 1

Kysy tekoälyä

expand

Kysy tekoälyä

ChatGPT

Kysy mitä tahansa tai kokeile jotakin ehdotetuista kysymyksistä aloittaaksesi keskustelumme

Suggested prompts:

Can you explain how the Levenshtein distance function works step by step?

What are some practical applications of fuzzy matching in real-world scenarios?

How do I choose between Levenshtein, Jaccard, and Cosine similarity for my data?

bookUnderstanding Fuzzy Matching

Pyyhkäise näyttääksesi valikon

Fuzzy matching is a powerful technique in data cleaning that helps you identify records that are similar but not exactly the same. In real-world datasets, you often encounter inconsistencies such as typos, abbreviations, or alternate spellings. For example, customer names like "Jon Smith" and "John Smith," or product names like "iPhone 11" and "iPhone11," may refer to the same entity but are not identical strings. Relying on exact matching would miss these subtle variations, potentially leading to duplicate records or missed connections. Fuzzy matching addresses these challenges by measuring the degree of similarity between strings, enabling more accurate data cleaning and deduplication.

123456789101112131415161718192021222324
def levenshtein_distance(s1, s2): # Create a matrix to store distances m, n = len(s1), len(s2) dp = [[0] * (n + 1) for _ in range(m + 1)] # Initialize the matrix for i in range(m + 1): dp[i][0] = i for j in range(n + 1): dp[0][j] = j # Compute distances for i in range(1, m + 1): for j in range(1, n + 1): cost = 0 if s1[i - 1] == s2[j - 1] else 1 dp[i][j] = min( dp[i - 1][j] + 1, # Deletion dp[i][j - 1] + 1, # Insertion dp[i - 1][j - 1] + cost # Substitution ) return dp[m][n] # Example usage: print(levenshtein_distance("kitten", "sitting")) # Output: 3
copy

Levenshtein distance: measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. The Levenshtein distance d(s1,s2)d(s_1, s_2) between strings s1s_1 and s2s_2 is computed using dynamic programming:

d(i,j)={max(i,j)if min(i,j)=0min{d(i1,j)+1d(i,j1)+1d(i1,j1)+[s1[i]s2[j]]otherwised(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + [s_1[i] \neq s_2[j]] \end{cases} & \text{otherwise} \end{cases}
  • Use case: best for typo correction, minor spelling differences, and short string comparisons (such as names).

Jaccard similarity: measures the overlap between two sets. For sets AA and BB:

J(A,B)=ABABJ(A, B) = \frac{|A \cap B|}{|A \cup B|}
  • Use case: useful for comparing unordered collections or tokenized text (e.g., sets of words, n-grams). It is robust to word order and works well for deduplication of lists or multi-word phrases.

Cosine similarity: compares the orientation (angle) between two non-zero vectors in a multi-dimensional space. For vectors a\vec{a} and b\vec{b}:

cosine(a,b)=abab\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}
  • Use case: ideal for longer texts or when word frequency matters, such as document similarity and clustering. Commonly used with term frequency vectors in natural language processing.
1234567891011121314151617
names = ["Jon Smith", "John Smith", "Jane Smyth", "Janet Smith", "J. Smith"] def find_closest_name(query, name_list): min_distance = float("inf") closest_name = None for name in name_list: dist = levenshtein_distance(query, name) if dist < min_distance: min_distance = dist closest_name = name return closest_name, min_distance # Example usage: query_name = "John Smit" match, distance = find_closest_name(query_name, names) print(f"Closest match to '{query_name}': '{match}' (distance: {distance})")
copy
question mark

Which of the following best describes the difference between exact matching and fuzzy matching in data cleaning?

Select the correct answer

Oliko kaikki selvää?

Miten voimme parantaa sitä?

Kiitos palautteestasi!

Osio 1. Luku 1
some-alt