Deduplication Algorithms in Practice
Data Cleaning Techniques in Python | Deduplication Strategies

Deduplication is a crucial process in data cleaning, aiming to identify and remove redundant records from datasets. As data grows in size and complexity, duplicate entries can occur due to user input errors, system merges, or inconsistent data sources. These duplicates can negatively impact analysis, reporting, and machine learning models. Simple manual methods quickly become inefficient and unreliable as data scales, making algorithmic solutions essential for robust deduplication. The main challenges include recognizing duplicates that are not exactly identical, handling large datasets efficiently, and minimizing false positives or negatives in the deduplication process.

import pandas as pd

# Sample data with exact duplicates
data = {
    "id": [1, 2, 3, 2, 4, 1],
    "name": ["Alice", "Bob", "Charlie", "Bob", "David", "Alice"]
}
df = pd.DataFrame(data)

# Deduplication using hashing (removes exact duplicates)
deduped_df = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDeduplicated DataFrame (Exact Duplicates Removed):")
print(deduped_df)
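In practice, duplicates are often defined by a subset of columns rather than whole rows. As a minimal sketch (the column names here are illustrative), drop_duplicates accepts subset and keep parameters that control which columns are compared and which occurrence survives:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Bob", "Alice"],
    "year": [2021, 2020, 2022, 2021]
})

# Compare only the "name" column and keep the last occurrence;
# keep="first" is the default, and keep=False drops every duplicate.
deduped = df.drop_duplicates(subset=["name"], keep="last")
print(deduped)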
Note

Exact deduplication, such as using hashing or drop_duplicates, works only when records are perfectly identical. However, real-world data often contains near-duplicates with minor variations, such as typos or formatting differences. Approximate deduplication methods, including fuzzy matching and similarity scoring, are required to handle these cases effectively.
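A quick way to see similarity scoring in action is Python's standard-library difflib; the following is a minimal sketch comparing two name pairs. For more than a handful of records, vectorized approaches such as the TF-IDF method below scale better than per-pair string comparison:

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Proportion of matching characters between the two strings, in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("John Smith", "Jon Smith"))  # high score: likely near-duplicates
print(similarity("John Smith", "Jane Doe"))   # low score: clearly distinct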

The code below demonstrates fuzzy deduplication of names using vectorization and cosine similarity.

  1. Each name is broken down into character n-grams (substrings of length 2 to 4) using TfidfVectorizer. This converts each name into a numeric vector that captures its character patterns.
  2. Cosine similarity is used to compare these vectors, producing a similarity score between every pair of names.
  3. If two names have a similarity score above the chosen threshold (0.85), they are considered near-duplicates.
  4. The code removes all but one occurrence from each group of similar names, resulting in a deduplicated list that preserves only unique or sufficiently different names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data with near-duplicates
data = {
    "name": ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]
}
df = pd.DataFrame(data)

# Vectorize the names for similarity comparison
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df["name"])

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# Mark duplicates using a similarity threshold
threshold = 0.85
duplicates = set()
for i in range(similarity_matrix.shape[0]):
    for j in range(i + 1, similarity_matrix.shape[0]):
        if similarity_matrix[i, j] > threshold:
            duplicates.add(j)

# Create deduplicated DataFrame
deduped_df = df.drop(df.index[list(duplicates)])

print("Original names:")
print(df["name"])
print("\nDeduplicated names (Fuzzy Matching):")
print(deduped_df["name"])
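Comparing every pair of records scales quadratically with dataset size, which the introduction flags as a key challenge. A common mitigation is blocking: group records by a cheap key and run the expensive similarity checks only within each group. Below is a minimal sketch, assuming the first letter of the final name token is an adequate blocking key (an illustrative choice, not a general rule):

from collections import defaultdict

names = ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]

# Group records by a cheap blocking key so that pairwise similarity
# checks run only inside each (much smaller) block.
blocks = defaultdict(list)
for name in names:
    key = name.split()[-1][0].lower()  # first letter of the last token
    blocks[key].append(name)

for key, block in blocks.items():
    print(key, block)  # candidate pairs are drawn only from within a block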
Question

Which deduplication approach is most suitable for a dataset containing both exact and near-duplicate customer names (e.g., "John Smith", "Jon Smith", "J. Smith")?


