Data Cleaning Techniques in Python

Evaluating Deduplication Results

Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:

  • Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
  • Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
  • F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.

Track counts before and after deduplication—like total records, detected duplicates, and true duplicates removed—to calculate these metrics.
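To make that arithmetic concrete, here is a minimal sketch of turning tracked counts into the three metrics. All counts below are hypothetical, invented purely for illustration:

# Minimal sketch: computing the three metrics from tracked counts.
# Every count here is a made-up example value, not real data.
flagged_as_duplicates = 120       # records your process flagged
true_duplicates_removed = 100     # flagged records that really were duplicates (TP)
actual_duplicates = 130           # all duplicates present in the data (TP + FN)

tp = true_duplicates_removed
fp = flagged_as_duplicates - tp   # unique records wrongly flagged
fn = actual_duplicates - tp       # duplicates the process missed

precision = tp / (tp + fp)        # 100 / 120 ≈ 0.83
recall = tp / (tp + fn)           # 100 / 130 ≈ 0.77
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.80

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")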

import pandas as pd
from difflib import SequenceMatcher
from sklearn.metrics import precision_score, recall_score, f1_score

# Two datasets from different sources
df_a = pd.DataFrame({
    "id_a": [1, 2, 3, 4],
    "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"],
    "price": [999, 899, 350, 700]
})

df_b = pd.DataFrame({
    "id_b": ["A", "B", "C", "D"],
    "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"],
    "price": [995, 900, 349, 705]
})

# Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate)
y_true = [1, 1, 1, 0]  # Last pair intentionally marked as non-duplicate

# Generate similarity features
def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def price_difference(a, b):
    return abs(a - b) / max(a, b)

pairs = []
for i in range(len(df_a)):
    sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"])
    diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"])
    pairs.append([sim_name, diff_price])

pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"])

# Simple duplicate classification rule
# High name similarity AND low price difference → duplicate
y_pred = []
for _, row in pairs_df.iterrows():
    if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05:
        y_pred.append(1)
    else:
        y_pred.append(0)

# Metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Similarity pairs:")
print(pairs_df)
print("\nMetrics:")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
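Because the F1-score condenses precision and recall into one number, it is a convenient way to compare alternative matching rules. The sketch below is a possible extension, not part of the original example: it assumes pairs_df and y_true from the code above are in scope, and the grid of candidate thresholds is an arbitrary choice.

# Sketch: comparing candidate name-similarity thresholds by F1.
# Assumes pairs_df and y_true from the example above are defined.
best_threshold, best_f1 = None, -1.0
for threshold in [0.55, 0.65, 0.75, 0.85]:
    candidate_pred = [
        1 if row["name_similarity"] > threshold and row["price_diff"] < 0.05 else 0
        for _, row in pairs_df.iterrows()
    ]
    score = f1_score(y_true, candidate_pred, zero_division=0)
    print(f"threshold={threshold:.2f} -> F1={score:.2f}")
    if score > best_f1:
        best_threshold, best_f1 = threshold, score

print(f"Best threshold: {best_threshold} (F1={best_f1:.2f})")

In practice you would tune a threshold on a larger labeled sample and validate it on held-out pairs; a value picked from four pairs will not generalize.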

Which metric would be most important if you want to minimize false positives in deduplication?

Select the correct answer

