Deduplication Algorithms in Practice
Data Cleaning Techniques in Python | Deduplication Strategies

Deduplication is a crucial process in data cleaning, aiming to identify and remove redundant records from datasets. As data grows in size and complexity, duplicate entries can occur due to user input errors, system merges, or inconsistent data sources. These duplicates can negatively impact analysis, reporting, and machine learning models. Simple manual methods quickly become inefficient and unreliable as data scales, making algorithmic solutions essential for robust deduplication. The main challenges include recognizing duplicates that are not exactly identical, handling large datasets efficiently, and minimizing false positives or negatives in the deduplication process.

import pandas as pd

# Sample data with exact duplicates
data = {
    "id": [1, 2, 3, 2, 4, 1],
    "name": ["Alice", "Bob", "Charlie", "Bob", "David", "Alice"]
}
df = pd.DataFrame(data)

# Deduplication using hashing (removes exact duplicates)
deduped_df = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDeduplicated DataFrame (Exact Duplicates Removed):")
print(deduped_df)
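In practice, duplicates are often defined by a subset of columns rather than whole rows. As a minimal sketch (the column names here are illustrative), drop_duplicates accepts subset and keep parameters that control which columns are compared and which occurrence survives:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Bob", "Alice"],
    "year": [2021, 2020, 2022, 2021]
})

# Compare only the "name" column and keep the last occurrence;
# keep="first" is the default, and keep=False drops every duplicate.
deduped = df.drop_duplicates(subset=["name"], keep="last")
print(deduped)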
Note

Exact deduplication, such as using hashing or drop_duplicates, works only when records are perfectly identical. However, real-world data often contains near-duplicates with minor variations, such as typos or formatting differences. Approximate deduplication methods, including fuzzy matching and similarity scoring, are required to handle these cases effectively.
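A quick way to see similarity scoring in action is Python's standard-library difflib; the following is a minimal sketch comparing two name pairs. For more than a handful of records, vectorized approaches such as the TF-IDF method below scale better than per-pair string comparison:

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Proportion of matching characters between the two strings, in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("John Smith", "Jon Smith"))  # high score: likely near-duplicates
print(similarity("John Smith", "Jane Doe"))   # low score: clearly distinct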

The code below demonstrates fuzzy deduplication of names using vectorization and cosine similarity.

  1. Each name is broken down into character n-grams (substrings of length 2 to 4) using TfidfVectorizer. This converts each name into a numeric vector that captures its character patterns.
  2. Cosine similarity is used to compare these vectors, producing a similarity score between every pair of names.
  3. If two names have a similarity score above the chosen threshold (0.85), they are considered near-duplicates.
  4. The code removes all but one occurrence from each group of similar names, resulting in a deduplicated list that preserves only unique or sufficiently different names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data with near-duplicates
data = {
    "name": ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]
}
df = pd.DataFrame(data)

# Vectorize the names for similarity comparison
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df["name"])

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# Mark duplicates using a similarity threshold
threshold = 0.85
duplicates = set()
for i in range(similarity_matrix.shape[0]):
    for j in range(i + 1, similarity_matrix.shape[0]):
        if similarity_matrix[i, j] > threshold:
            duplicates.add(j)

# Create deduplicated DataFrame
deduped_df = df.drop(df.index[list(duplicates)])

print("Original names:")
print(df["name"])
print("\nDeduplicated names (Fuzzy Matching):")
print(deduped_df["name"])
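Comparing every pair of records scales quadratically with dataset size, which the introduction flags as a key challenge. A common mitigation is blocking: group records by a cheap key and run the expensive similarity checks only within each group. Below is a minimal sketch, assuming the first letter of the final name token is an adequate blocking key (an illustrative choice, not a general rule):

from collections import defaultdict

names = ["Jon Smith", "John Smith", "J. Smith", "Jane Doe", "Jane D.", "Janet Doe"]

# Group records by a cheap blocking key so that pairwise similarity
# checks run only inside each (much smaller) block.
blocks = defaultdict(list)
for name in names:
    key = name.split()[-1][0].lower()  # first letter of the last token
    blocks[key].append(name)

for key, block in blocks.items():
    print(key, block)  # candidate pairs are drawn only from within a block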
Question

Which deduplication approach is most suitable for a dataset containing both exact and near-duplicate customer names (e.g., "John Smith", "Jon Smith", "J. Smith")?


