Evaluating Deduplication Results
Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:
- Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
- Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
- F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.
Track counts before and after deduplication—like total records, detected duplicates, and true duplicates removed—to calculate these metrics.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556import pandas as pd from difflib import SequenceMatcher from sklearn.metrics import precision_score, recall_score, f1_score # Two datasets from different sources df_a = pd.DataFrame({ "id_a": [1, 2, 3, 4], "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"], "price": [999, 899, 350, 700] }) df_b = pd.DataFrame({ "id_b": ["A", "B", "C", "D"], "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"], "price": [995, 900, 349, 705] }) # Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate) y_true = [1, 1, 1, 0] # Last pair intentionally marked as non-duplicate # Generate similarity features def name_similarity(a, b): return SequenceMatcher(None, a, b).ratio() def price_difference(a, b): return abs(a - b) / max(a, b) pairs = [] for i in range(len(df_a)): sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"]) diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"]) pairs.append([sim_name, diff_price]) pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"]) # Simple duplicate classification rule # High name similarity AND low price difference → duplicate y_pred = [] for _, row in pairs_df.iterrows(): if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05: y_pred.append(1) else: y_pred.append(0) # Metrics precision = precision_score(y_true, y_pred) recall = recall_score(y_true, y_pred) f1 = f1_score(y_true, y_pred) print("Similarity pairs:") print(pairs_df) print("\nMetrics:") print(f"Precision: {precision:.2f}") print(f"Recall: {recall:.2f}") print(f"F1-score: {f1:.2f}")
¡Gracias por tus comentarios!
Pregunte a AI
Pregunte a AI
Pregunte lo que quiera o pruebe una de las preguntas sugeridas para comenzar nuestra charla
Can you explain how to interpret the precision, recall, and F1-score values in this context?
How do I calculate these metrics using the table data provided?
What are some ways to improve deduplication performance based on these results?
Genial!
Completion tasa mejorada a 8.33
Evaluating Deduplication Results
Desliza para mostrar el menú
Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:
- Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
- Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
- F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.
Track counts before and after deduplication—like total records, detected duplicates, and true duplicates removed—to calculate these metrics.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556import pandas as pd from difflib import SequenceMatcher from sklearn.metrics import precision_score, recall_score, f1_score # Two datasets from different sources df_a = pd.DataFrame({ "id_a": [1, 2, 3, 4], "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"], "price": [999, 899, 350, 700] }) df_b = pd.DataFrame({ "id_b": ["A", "B", "C", "D"], "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"], "price": [995, 900, 349, 705] }) # Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate) y_true = [1, 1, 1, 0] # Last pair intentionally marked as non-duplicate # Generate similarity features def name_similarity(a, b): return SequenceMatcher(None, a, b).ratio() def price_difference(a, b): return abs(a - b) / max(a, b) pairs = [] for i in range(len(df_a)): sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"]) diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"]) pairs.append([sim_name, diff_price]) pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"]) # Simple duplicate classification rule # High name similarity AND low price difference → duplicate y_pred = [] for _, row in pairs_df.iterrows(): if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05: y_pred.append(1) else: y_pred.append(0) # Metrics precision = precision_score(y_true, y_pred) recall = recall_score(y_true, y_pred) f1 = f1_score(y_true, y_pred) print("Similarity pairs:") print(pairs_df) print("\nMetrics:") print(f"Precision: {precision:.2f}") print(f"Recall: {recall:.2f}") print(f"F1-score: {f1:.2f}")
¡Gracias por tus comentarios!