Evaluating Deduplication Results
Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:
- Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
- Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
- F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.
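The three definitions above can be written in terms of pair counts. The symbols here are the standard ones rather than names from the lesson: TP is the number of flagged pairs that are real duplicates, FP the flagged pairs that are not duplicates, and FN the real duplicates that were not flagged.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

As a purely illustrative case with made-up counts, TP = 3, FP = 1, FN = 2 gives a precision of 3/4 = 0.75, a recall of 3/5 = 0.60, and an F1-score of 0.9/1.35 ≈ 0.67.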
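Track counts before and after deduplication, such as total records, detected duplicates, and true duplicates removed, to calculate these metrics. Recall also depends on how many duplicates actually exist in the data, so you need ground-truth labels for at least a sample of records.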
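A minimal sketch of how tracked counts could be turned into the three metrics. The function name `dedup_metrics` and all the numbers are hypothetical, chosen only to illustrate the arithmetic; `actual_duplicates` is assumed to come from a labeled sample.

```python
def dedup_metrics(detected, true_removed, actual_duplicates):
    """Return (precision, recall, f1) from deduplication counts.

    detected          -- records flagged as duplicates (TP + FP)
    true_removed      -- flagged records that really were duplicates (TP)
    actual_duplicates -- all real duplicates in the data (TP + FN)
    """
    precision = true_removed / detected if detected else 0.0
    recall = true_removed / actual_duplicates if actual_duplicates else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only
total_before = 1000        # records before deduplication
detected = 120             # records flagged and removed
true_removed = 108         # of those, confirmed real duplicates
actual_duplicates = 135    # real duplicates known from labeled data

precision, recall, f1 = dedup_metrics(detected, true_removed, actual_duplicates)
total_after = total_before - detected
unique_lost = detected - true_removed  # unique records removed by mistake

print(f"Records after deduplication: {total_after}")
print(f"Unique records lost: {unique_lost}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-score: {f1:.2f}")
```

Guarding each division against a zero denominator keeps the function usable when nothing was flagged or when the data contains no duplicates; returning 0.0 in those cases is a convention chosen for this sketch.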
    import pandas as pd
    from difflib import SequenceMatcher
    from sklearn.metrics import precision_score, recall_score, f1_score

    # Two datasets from different sources
    df_a = pd.DataFrame({
        "id_a": [1, 2, 3, 4],
        "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"],
        "price": [999, 899, 350, 700]
    })

    df_b = pd.DataFrame({
        "id_b": ["A", "B", "C", "D"],
        "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"],
        "price": [995, 900, 349, 705]
    })

    # Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate)
    y_true = [1, 1, 1, 0]  # Last pair intentionally marked as non-duplicate

    # Generate similarity features
    def name_similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

    def price_difference(a, b):
        return abs(a - b) / max(a, b)

    pairs = []
    for i in range(len(df_a)):
        sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"])
        diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"])
        pairs.append([sim_name, diff_price])

    pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"])

    # Simple duplicate classification rule
    # High name similarity AND low price difference → duplicate
    y_pred = []
    for _, row in pairs_df.iterrows():
        if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05:
            y_pred.append(1)
        else:
            y_pred.append(0)

    # Metrics
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print("Similarity pairs:")
    print(pairs_df)
    print("\nMetrics:")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1-score: {f1:.2f}")
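Since the F1-score is meant for comparing deduplication strategies, one way to use it is to sweep the name-similarity threshold and watch how the metrics move. The sketch below reuses the product names and ground truth from the example above. The `evaluate` helper and the list of thresholds are assumptions added for illustration, and the price rule is left out so that only one parameter varies. It needs only the standard library.

```python
from difflib import SequenceMatcher

# Product names and ground truth taken from the example above
names_a = ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"]
names_b = ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"]
y_true = [1, 1, 1, 0]

# Name similarity for each aligned pair (case-sensitive, as in the example)
scores = [SequenceMatcher(None, a, b).ratio() for a, b in zip(names_a, names_b)]


def evaluate(threshold):
    """Hypothetical helper: metrics for the rule 'similarity > threshold'."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Thresholds chosen arbitrarily for illustration
for threshold in [0.5, 0.6, 0.7, 0.75, 0.8, 0.9]:
    p, r, f = evaluate(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")
```

Lower thresholds tend to flag more pairs, which raises recall at the risk of lower precision, and higher thresholds do the opposite. With only four pairs the numbers are coarse, so treat the output as a demonstration of the procedure rather than a tuning result.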