Deduplication Algorithms in Practice
Deduplication is a crucial process in data cleaning, aiming to identify and remove redundant records from datasets. As data grows in size and complexity, duplicate entries can occur due to user input errors, system merges, or inconsistent data sources. These duplicates can negatively impact analysis, reporting, and machine learning models. Simple manual methods quickly become inefficient and unreliable as data scales, making algorithmic solutions essential for robust deduplication. The main challenges include recognizing duplicates that are not exactly identical, handling large datasets efficiently, and minimizing false positives or negatives in the deduplication process.
```python
import pandas as pd

# Sample data with exact duplicates
data = {
    "id": [1, 2, 3, 2, 4, 1],
    "name": ["Alice", "Bob", "Charlie", "Bob", "David", "Alice"]
}
df = pd.DataFrame(data)

# Deduplication using hashing (removes exact duplicates)
deduped_df = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDeduplicated DataFrame (Exact Duplicates Removed):")
print(deduped_df)
```
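pandas handles the hashing internally, but the same idea works outside a DataFrame. The sketch below is a minimal illustration, not the library's actual mechanism: it hashes a stable string form of each record (the sample records are hypothetical) and keeps only the first occurrence of each digest.

```python
import hashlib

# Hypothetical records; in practice these might come from a file or stream
records = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (2, "Bob"),
    (4, "David"),
    (1, "Alice"),
]

seen = set()   # digests of records encountered so far
deduped = []   # first occurrence of each distinct record

for record in records:
    # Hash a stable string form of the record; identical records
    # always produce the same digest
    digest = hashlib.sha256(repr(record).encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        deduped.append(record)

print(deduped)  # [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David')]
```

Because membership checks on a set of digests are constant time on average, this approach scales to streams far larger than memory-resident DataFrames, as long as the digest set itself fits in memory.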
Exact deduplication, such as using hashing or drop_duplicates, works only when records are perfectly identical. However, real-world data often contains near-duplicates with minor variations, such as typos or formatting differences. Approximate deduplication methods, including fuzzy matching and similarity scoring, are required to handle these cases effectively.
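Before reaching for approximate methods, it is often worth checking whether the variation is purely cosmetic. Normalizing case and whitespace first lets plain drop_duplicates catch records that only look different. A minimal sketch, with illustrative sample data:

```python
import pandas as pd

# Illustrative records that differ only in case and surrounding whitespace
df = pd.DataFrame({
    "name": ["Alice Smith", "alice smith", "  Alice Smith ", "Bob Jones"]
})

# Normalize formatting, then fall back to exact deduplication
df["name_clean"] = df["name"].str.strip().str.lower()
deduped_df = df.drop_duplicates(subset="name_clean").drop(columns="name_clean")

print(deduped_df["name"])
```

When differences go beyond formatting, such as genuine typos or abbreviations, normalization alone is not enough and similarity-based matching is needed.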
The code below demonstrates fuzzy deduplication of names using vectorization and cosine similarity:

- Each name is broken down into character n-grams (substrings of length 2 to 4) using TfidfVectorizer, which converts the name into a numeric vector that captures its character patterns.
- Cosine similarity compares these vectors, producing a similarity score between every pair of names.
- If two names have a similarity score above the chosen threshold (0.85), they are considered near-duplicates.
- From each group of similar names, only the first occurrence is kept, resulting in a deduplicated list that preserves only unique or sufficiently different names.
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data with near-duplicates
data = {
    "name": ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]
}
df = pd.DataFrame(data)

# Vectorize the names for similarity comparison
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df["name"])

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# Mark duplicates using a similarity threshold
threshold = 0.85
duplicates = set()
for i in range(similarity_matrix.shape[0]):
    for j in range(i + 1, similarity_matrix.shape[0]):
        if similarity_matrix[i, j] > threshold:
            duplicates.add(j)

# Create deduplicated DataFrame
deduped_df = df.drop(df.index[list(duplicates)])

print("Original names:")
print(df["name"])
print("\nDeduplicated names (Fuzzy Matching):")
print(deduped_df["name"])
```
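Comparing every pair of records is O(n²), which becomes impractical for the large datasets mentioned at the start. A common mitigation, not shown in the course code above, is blocking: records are grouped by a cheap key and similarity is computed only within each block. The sketch below applies the same TF-IDF approach per block; the blocking key (last token of each name) is a deliberately simplistic assumption for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "name": ["Jon Smith", "John Smith", "J. Smith",
             "Jane Doe", "Jane D.", "Janet Doe"]
})

# Simplistic blocking key: the last token of each name (an assumption for
# illustration; real pipelines often use phonetic or sorted-neighborhood keys)
df["block"] = df["name"].str.split().str[-1].str.upper()

threshold = 0.85
drop_indices = set()

# Compare names only within each block instead of across the whole dataset
for _, block in df.groupby("block"):
    if len(block) < 2:
        continue  # nothing to compare in singleton blocks
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    sims = cosine_similarity(vectorizer.fit_transform(block["name"]))
    idx = block.index.to_list()
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            if sims[i, j] > threshold:
                drop_indices.add(idx[j])

deduped_df = df.drop(index=list(drop_indices)).drop(columns="block")
print(deduped_df["name"])
```

Note the trade-off: "Jane D." falls into a different block than "Jane Doe", so that pair is never compared. Blocking trades some recall for a large reduction in comparisons, which is why key design matters as much as the similarity threshold.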