Data Cleaning Techniques in Python

Deduplication Algorithms in Practice

Deduplication is a crucial process in data cleaning, aiming to identify and remove redundant records from datasets. As data grows in size and complexity, duplicate entries can occur due to user input errors, system merges, or inconsistent data sources. These duplicates can negatively impact analysis, reporting, and machine learning models. Simple manual methods quickly become inefficient and unreliable as data scales, making algorithmic solutions essential for robust deduplication. The main challenges include recognizing duplicates that are not exactly identical, handling large datasets efficiently, and minimizing false positives or negatives in the deduplication process.

import pandas as pd

# Sample data with exact duplicates
data = {
    "id": [1, 2, 3, 2, 4, 1],
    "name": ["Alice", "Bob", "Charlie", "Bob", "David", "Alice"]
}
df = pd.DataFrame(data)

# Deduplication using hashing (removes exact duplicates)
deduped_df = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDeduplicated DataFrame (Exact Duplicates Removed):")
print(deduped_df)
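The drop_duplicates call above hashes row values internally to identify exact matches. The same idea can be sketched directly with the standard library's hashlib: hash a canonical form of each record and keep only the first occurrence. This is a minimal illustration of hash-based deduplication, not the pandas implementation:

```python
import hashlib

records = [
    (1, "Alice"), (2, "Bob"), (3, "Charlie"),
    (2, "Bob"), (4, "David"), (1, "Alice"),
]

seen = set()
unique_records = []
for record in records:
    # Hash a canonical string form of the record
    key = hashlib.sha256(repr(record).encode()).hexdigest()
    if key not in seen:
        seen.add(key)
        unique_records.append(record)

print(unique_records)
```

Because each record is reduced to a fixed-size digest, membership checks stay fast even when individual records are large.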
Note

Exact deduplication, such as using hashing or drop_duplicates, works only when records are perfectly identical. However, real-world data often contains near-duplicates with minor variations, such as typos or formatting differences. Approximate deduplication methods, including fuzzy matching and similarity scoring, are required to handle these cases effectively.
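As a lightweight illustration of similarity scoring, the standard library's difflib.SequenceMatcher can flag near-duplicate strings without any external dependencies. The 0.8 threshold below is chosen for illustration only:

```python
from difflib import SequenceMatcher

names = ["Jon Smith", "John Smith", "Jane Doe", "Janet Doe"]

threshold = 0.8
kept = []
for name in names:
    # Keep a name only if it is not too similar to any name already kept
    if all(SequenceMatcher(None, name, k).ratio() < threshold for k in kept):
        kept.append(name)

print(kept)
```

SequenceMatcher.ratio() compares raw character sequences pairwise, which is simple but O(n²) in the number of records; the TF-IDF approach shown next scales the same pairwise idea to richer character-pattern features.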

The code below demonstrates fuzzy deduplication of names using vectorization and cosine similarity.

  1. Each name is broken down into character n-grams (substrings of length 2 to 4) using TfidfVectorizer. This converts each name into a numeric vector that captures its character patterns.
  2. Cosine similarity is used to compare these vectors, producing a similarity score between every pair of names.
  3. If two names have a similarity score above the chosen threshold (0.85), they are considered near-duplicates.
  4. The code removes all but one occurrence from each group of similar names, resulting in a deduplicated list that preserves only unique or sufficiently different names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data with near-duplicates
data = {
    "name": ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]
}
df = pd.DataFrame(data)

# Vectorize the names for similarity comparison
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df["name"])

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# Mark duplicates using a similarity threshold
threshold = 0.85
duplicates = set()
for i in range(similarity_matrix.shape[0]):
    for j in range(i + 1, similarity_matrix.shape[0]):
        if similarity_matrix[i, j] > threshold:
            duplicates.add(j)

# Create deduplicated DataFrame
deduped_df = df.drop(df.index[list(duplicates)])

print("Original names:")
print(df["name"])
print("\nDeduplicated names (Fuzzy Matching):")
print(deduped_df["name"])
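The 0.85 threshold is a tuning knob: lowering it merges more aggressively (risking false positives), while raising it is stricter (risking missed duplicates). The sketch below reruns the same similarity comparison at several thresholds to show how the surviving set changes; the threshold values are illustrative, not recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]

# Build the similarity matrix once; only the threshold varies
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
similarity = cosine_similarity(vectorizer.fit_transform(names))

def dedupe(threshold):
    duplicates = set()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i, j] > threshold:
                duplicates.add(j)
    return [n for k, n in enumerate(names) if k not in duplicates]

# A lower threshold removes more names; a higher one keeps more
for t in (0.5, 0.85, 0.95):
    print(t, dedupe(t))
```

In practice the threshold is usually calibrated against a small labeled sample of known duplicate pairs rather than picked by eye.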
Question

Which deduplication approach is most suitable for a dataset containing both exact and near-duplicate customer names (e.g., "John Smith", "Jon Smith", "J. Smith")?


Section 2. Chapter 1