Data Cleaning Techniques in Python

Evaluating Deduplication Results

Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics:

  • Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
  • Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
  • F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.

Track counts before and after deduplication—like total records, detected duplicates, and true duplicates removed—to calculate these metrics.
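To make that arithmetic concrete, here is a minimal sketch of turning tracked counts into the three metrics. All counts below are hypothetical, invented purely for illustration:

# Minimal sketch: computing the three metrics from tracked counts.
# Every count here is a made-up example value, not real data.
flagged_as_duplicates = 120       # records your process flagged
true_duplicates_removed = 100     # flagged records that really were duplicates (TP)
actual_duplicates = 130           # all duplicates present in the data (TP + FN)

tp = true_duplicates_removed
fp = flagged_as_duplicates - tp   # unique records wrongly flagged
fn = actual_duplicates - tp       # duplicates the process missed

precision = tp / (tp + fp)        # 100 / 120 ≈ 0.83
recall = tp / (tp + fn)           # 100 / 130 ≈ 0.77
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.80

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")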

import pandas as pd
from difflib import SequenceMatcher
from sklearn.metrics import precision_score, recall_score, f1_score

# Two datasets from different sources
df_a = pd.DataFrame({
    "id_a": [1, 2, 3, 4],
    "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"],
    "price": [999, 899, 350, 700]
})

df_b = pd.DataFrame({
    "id_b": ["A", "B", "C", "D"],
    "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"],
    "price": [995, 900, 349, 705]
})

# Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate)
y_true = [1, 1, 1, 0]  # Last pair intentionally marked as non-duplicate

# Generate similarity features
def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def price_difference(a, b):
    return abs(a - b) / max(a, b)

pairs = []
for i in range(len(df_a)):
    sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"])
    diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"])
    pairs.append([sim_name, diff_price])

pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"])

# Simple duplicate classification rule
# High name similarity AND low price difference → duplicate
y_pred = []
for _, row in pairs_df.iterrows():
    if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05:
        y_pred.append(1)
    else:
        y_pred.append(0)

# Metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Similarity pairs:")
print(pairs_df)
print("\nMetrics:")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
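Because the F1-score condenses precision and recall into one number, it is a convenient way to compare alternative matching rules. The sketch below is a possible extension, not part of the original example: it assumes pairs_df and y_true from the code above are in scope, and the grid of candidate thresholds is an arbitrary choice.

# Sketch: comparing candidate name-similarity thresholds by F1.
# Assumes pairs_df and y_true from the example above are defined.
best_threshold, best_f1 = None, -1.0
for threshold in [0.55, 0.65, 0.75, 0.85]:
    candidate_pred = [
        1 if row["name_similarity"] > threshold and row["price_diff"] < 0.05 else 0
        for _, row in pairs_df.iterrows()
    ]
    score = f1_score(y_true, candidate_pred, zero_division=0)
    print(f"threshold={threshold:.2f} -> F1={score:.2f}")
    if score > best_f1:
        best_threshold, best_f1 = threshold, score

print(f"Best threshold: {best_threshold} (F1={best_f1:.2f})")

In practice you would tune a threshold on a larger labeled sample and validate it on held-out pairs; a value picked from four pairs will not generalize.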

Which metric would be most important if you want to minimize false positives in deduplication?

Select the correct answer

