Deduplication Algorithms in Practice
Deduplication is a crucial process in data cleaning, aiming to identify and remove redundant records from datasets. As data grows in size and complexity, duplicate entries can occur due to user input errors, system merges, or inconsistent data sources. These duplicates can negatively impact analysis, reporting, and machine learning models. Simple manual methods quickly become inefficient and unreliable as data scales, making algorithmic solutions essential for robust deduplication. The main challenges include recognizing duplicates that are not exactly identical, handling large datasets efficiently, and minimizing false positives or negatives in the deduplication process.
```python
import pandas as pd

# Sample data with exact duplicates
data = {
    "id": [1, 2, 3, 2, 4, 1],
    "name": ["Alice", "Bob", "Charlie", "Bob", "David", "Alice"]
}
df = pd.DataFrame(data)

# Deduplication using hashing (removes exact duplicates)
deduped_df = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDeduplicated DataFrame (Exact Duplicates Removed):")
print(deduped_df)
```
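pandas handles the hashing internally, but the same idea works outside a DataFrame. The sketch below is a minimal illustration, not the library's actual mechanism: it hashes a stable string form of each record (the sample records are hypothetical) and keeps only the first occurrence of each digest.

```python
import hashlib

# Hypothetical records; in practice these might come from a file or stream
records = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (2, "Bob"),
    (4, "David"),
    (1, "Alice"),
]

seen = set()   # digests of records encountered so far
deduped = []   # first occurrence of each distinct record

for record in records:
    # Hash a stable string form of the record; identical records
    # always produce the same digest
    digest = hashlib.sha256(repr(record).encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        deduped.append(record)

print(deduped)  # [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David')]
```

Because membership checks on a set of digests are constant time on average, this approach scales to streams far larger than memory-resident DataFrames, as long as the digest set itself fits in memory.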
Exact deduplication, such as using hashing or drop_duplicates, works only when records are perfectly identical. However, real-world data often contains near-duplicates with minor variations, such as typos or formatting differences. Approximate deduplication methods, including fuzzy matching and similarity scoring, are required to handle these cases effectively.
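Before reaching for approximate methods, it is often worth checking whether the variation is purely cosmetic. Normalizing case and whitespace first lets plain drop_duplicates catch records that only look different. A minimal sketch, with illustrative sample data:

```python
import pandas as pd

# Illustrative records that differ only in case and surrounding whitespace
df = pd.DataFrame({
    "name": ["Alice Smith", "alice smith", "  Alice Smith ", "Bob Jones"]
})

# Normalize formatting, then fall back to exact deduplication
df["name_clean"] = df["name"].str.strip().str.lower()
deduped_df = df.drop_duplicates(subset="name_clean").drop(columns="name_clean")

print(deduped_df["name"])
```

When differences go beyond formatting, such as genuine typos or abbreviations, normalization alone is not enough and similarity-based matching is needed.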
The code below demonstrates fuzzy deduplication of names using vectorization and cosine similarity:

- Each name is broken down into character n-grams (substrings of length 2 to 4) using TfidfVectorizer, which converts the name into a numeric vector that captures its character patterns.
- Cosine similarity compares these vectors, producing a similarity score between every pair of names.
- If two names have a similarity score above the chosen threshold (0.85), they are considered near-duplicates.
- From each group of similar names, only the first occurrence is kept, resulting in a deduplicated list that preserves only unique or sufficiently different names.
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data with near-duplicates
data = {
    "name": ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]
}
df = pd.DataFrame(data)

# Vectorize the names for similarity comparison
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df["name"])

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# Mark duplicates using a similarity threshold
threshold = 0.85
duplicates = set()
for i in range(similarity_matrix.shape[0]):
    for j in range(i + 1, similarity_matrix.shape[0]):
        if similarity_matrix[i, j] > threshold:
            duplicates.add(j)

# Create deduplicated DataFrame
deduped_df = df.drop(df.index[list(duplicates)])

print("Original names:")
print(df["name"])
print("\nDeduplicated names (Fuzzy Matching):")
print(deduped_df["name"])
```
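Comparing every pair of records is O(n²), which becomes impractical for the large datasets mentioned at the start. A common mitigation, not shown in the course code above, is blocking: records are grouped by a cheap key and similarity is computed only within each block. The sketch below applies the same TF-IDF approach per block; the blocking key (last token of each name) is a deliberately simplistic assumption for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "name": ["Jon Smith", "John Smith", "J. Smith",
             "Jane Doe", "Jane D.", "Janet Doe"]
})

# Simplistic blocking key: the last token of each name (an assumption for
# illustration; real pipelines often use phonetic or sorted-neighborhood keys)
df["block"] = df["name"].str.split().str[-1].str.upper()

threshold = 0.85
drop_indices = set()

# Compare names only within each block instead of across the whole dataset
for _, block in df.groupby("block"):
    if len(block) < 2:
        continue  # nothing to compare in singleton blocks
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    sims = cosine_similarity(vectorizer.fit_transform(block["name"]))
    idx = block.index.to_list()
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            if sims[i, j] > threshold:
                drop_indices.add(idx[j])

deduped_df = df.drop(index=list(drop_indices)).drop(columns="block")
print(deduped_df["name"])
```

Note the trade-off: "Jane D." falls into a different block than "Jane Doe", so that pair is never compared. Blocking trades some recall for a large reduction in comparisons, which is why key design matters as much as the similarity threshold.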