Evaluating Deduplication Results | Deduplication Strategies
Data Cleaning Techniques in Python

Evaluating Deduplication Results

Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:

  • Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
  • Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
  • F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.
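
Written as formulas, the three definitions above look like this. The symbols are the conventional ones and are an assumption here, since the lesson does not name them: TP (true positives) counts records correctly flagged as duplicates, FP (false positives) counts unique records wrongly flagged, and FN (false negatives) counts duplicates that were missed.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```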

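To calculate these metrics, track counts before and after deduplication, such as total records, detected duplicates, and true duplicates removed. Recall also depends on how many true duplicates the data contained in the first place, so you need ground-truth labels for at least a sample of the records.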
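
A minimal sketch of that calculation from raw counts is shown here. The function name `dedup_metrics`, its parameter names, and the example counts are hypothetical and chosen only for illustration; returning 0.0 when a ratio is undefined is also an assumption, not a rule from the lesson.

```python
def dedup_metrics(flagged: int, true_removed: int, actual_duplicates: int) -> dict:
    """Compute precision, recall, and F1 from deduplication counts.

    flagged           -- records the process flagged as duplicates (TP + FP)
    true_removed      -- flagged records that really were duplicates (TP)
    actual_duplicates -- all true duplicates present in the data (TP + FN)
    """
    # Guard against division by zero; 0.0 is an assumed fallback value.
    precision = true_removed / flagged if flagged else 0.0
    recall = true_removed / actual_duplicates if actual_duplicates else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 50 records flagged, 40 of them true duplicates,
# and 80 true duplicates present in the data overall.
metrics = dedup_metrics(flagged=50, true_removed=40, actual_duplicates=80)
print(metrics)
```

With these made-up counts, precision is 40/50 and recall is 40/80, so the process is fairly safe about what it deletes but misses half of the real duplicates.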

```python
import pandas as pd
from difflib import SequenceMatcher
from sklearn.metrics import precision_score, recall_score, f1_score

# Two datasets from different sources
df_a = pd.DataFrame({
    "id_a": [1, 2, 3, 4],
    "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"],
    "price": [999, 899, 350, 700]
})

df_b = pd.DataFrame({
    "id_b": ["A", "B", "C", "D"],
    "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"],
    "price": [995, 900, 349, 705]
})

# Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate)
y_true = [1, 1, 1, 0]  # Last pair intentionally marked as non-duplicate

# Generate similarity features
def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def price_difference(a, b):
    return abs(a - b) / max(a, b)

pairs = []
for i in range(len(df_a)):
    sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"])
    diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"])
    pairs.append([sim_name, diff_price])

pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"])

# Simple duplicate classification rule
# High name similarity AND low price difference → duplicate
y_pred = []
for _, row in pairs_df.iterrows():
    if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05:
        y_pred.append(1)
    else:
        y_pred.append(0)

# Metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Similarity pairs:")
print(pairs_df)
print("\nMetrics:")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
```
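
Because the F1-score is meant for comparing deduplication strategies, one way to use it is to score the same labeled pairs under several similarity thresholds. The sketch below does that with the standard library only. It is a simplification, not the lesson's method: it reuses the product names and labels from the example, drops the price rule, and the threshold values and the `score` helper are hypothetical.

```python
from difflib import SequenceMatcher

# Same candidate pairs and labels as the lesson example (names only).
pairs = [
    ("Apple iPhone 14", "Iphone 14"),
    ("Samsung Galaxy S22", "Galaxy S-22"),
    ("Sony WH1000 XM5", "Sony WH-1000XM5"),
    ("Dell Inspiron 15", "Inspiron 15 DELL"),
]
y_true = [1, 1, 1, 0]


def score(y_true, y_pred):
    """Return (precision, recall, f1), using 0.0 when a ratio is undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Name similarity for each candidate pair (case-sensitive, as in the lesson).
similarities = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]

results = {}
for threshold in (0.60, 0.75, 0.90):  # hypothetical candidate thresholds
    y_pred = [int(s > threshold) for s in similarities]
    results[threshold] = score(y_true, y_pred)
    p, r, f = results[threshold]
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")
```

On this small sample, the lowest threshold catches every true duplicate but also flags the pair labeled as non-duplicate, while the stricter thresholds avoid that false positive and miss two real duplicates. Four pairs are far too few to choose a threshold in practice; the point is only the comparison pattern.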

Which metric would be most important if you want to minimize false positives in deduplication?

