Feature Encoding Methods in Python

Understanding Weight-of-Evidence Encoding

Note
Definition

Weight-of-Evidence (WoE) encoding is a powerful technique that originated in the field of credit scoring. It is used to transform categorical variables into numerical values, making them suitable for machine learning models, especially in binary classification tasks.

WoE encoding works by quantifying the predictive power of each category with respect to the target variable. In its original context, WoE helped credit analysts assess the risk associated with different customer groups by comparing the proportion of "good" and "bad" outcomes.

The mathematical formula for WoE is grounded in the concept of odds. For a given category, you calculate the proportion of positive outcomes (such as loan repayments) and the proportion of negative outcomes (such as loan defaults). The WoE for a category is then the natural logarithm of the ratio of these proportions. Formally, for a category c, the WoE is calculated as:

WoE(c) = \ln \left( \frac{P(c | target = 1)}{P(c | target = 0)} \right)

where target = 1 represents the positive class, and target = 0 the negative class. This transformation creates a monotonic relationship between the encoded value and the likelihood of the target, which is particularly helpful for certain models like logistic regression.
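
For a quick worked example with hypothetical counts (not data from this chapter): suppose a category contains 20 of the 100 "good" records and 5 of the 50 "bad" records. Then P(c | target = 1) = 20/100 = 0.20, P(c | target = 0) = 5/50 = 0.10, and WoE(c) = \ln(0.20 / 0.10) = \ln 2 ≈ 0.69. A positive WoE means the category is relatively richer in positive outcomes, a negative WoE means the opposite, and a WoE near zero means the category carries little information about the target.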

from math import log

def compute_woe(feature, target):
    # Count total goods (target=1) and bads (target=0)
    total_good = sum(target)
    total_bad = len(target) - total_good

    # Get unique categories
    categories = set(feature)

    woe_dict = {}
    for cat in categories:
        # Indices where feature == cat
        idx = [i for i, val in enumerate(feature) if val == cat]

        # Number of goods and bads in this category
        good = sum(target[i] for i in idx)
        bad = len(idx) - good

        # Avoid division by zero
        good_prop = good / total_good if total_good > 0 else 0.0001
        bad_prop = bad / total_bad if total_bad > 0 else 0.0001

        # Avoid log(0)
        if good_prop == 0:
            good_prop = 0.0001
        if bad_prop == 0:
            bad_prop = 0.0001

        woe = log(good_prop / bad_prop)
        woe_dict[cat] = woe

    return woe_dict

# Example usage:
feature = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C']
target = [1, 0, 1, 0, 1, 0, 0, 1]
woe_values = compute_woe(feature, target)
print(woe_values)
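Running this example prints the WoE value for each category; with this toy data you should see roughly {'A': 0.693, 'B': 0.0, 'C': -0.693} (the key order may differ, since sets are unordered).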

Step-by-step explanation of the compute_woe function

The compute_woe function takes a list of categorical feature values and a corresponding list of binary target values (0 or 1). It returns a dictionary mapping each category to its calculated Weight-of-Evidence (WoE) value. Here is how the function works:

  1. Calculate total goods and bads:

    • You first compute the total number of "good" outcomes by summing the target list (where target = 1).
    • The total number of "bad" outcomes (where target = 0) is the length of the target list minus the number of goods.
  2. Identify unique categories:

    • The set of unique categories in the feature list is determined. This allows you to process each category separately.
  3. Iterate over each category:

    • For each unique category, you find the indices where the feature equals that category.
    • You then count how many of those indices correspond to goods (target = 1) and how many correspond to bads (target = 0).
  4. Compute proportions:

    • For each category, you calculate the proportion of goods in that category by dividing the number of goods by the total number of goods in the dataset.
    • Similarly, you calculate the proportion of bads in that category by dividing the number of bads by the total number of bads in the dataset.
  5. Handle zero values to avoid division errors:

    • To prevent division by zero, if the total number of goods or bads is zero, the corresponding proportion is set to a small constant (0.0001) instead of being computed.
    • Likewise, if a computed proportion is zero, it is replaced with 0.0001 so that the logarithm in the next step stays defined.
  6. Calculate WoE for each category:

    • The WoE value for a category is the natural logarithm of the ratio of the good proportion to the bad proportion for that category.
    • This value is stored in a dictionary with the category as the key.
  7. Return the dictionary of WoE values:

    • After processing all categories, the function returns a dictionary where each key is a category name and each value is its WoE encoding.

This approach allows you to transform categorical variables into meaningful numerical values that capture their relationship with the target variable, making them more suitable for machine learning models.
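
Note that compute_woe only returns the mapping; to finish the encoding you still replace each category in the feature with its WoE value. A minimal sketch, reusing the feature and woe_values names from the example above (the 0.0 fallback for categories missing from the mapping is an assumption, not part of the original code):

# Replace each category with its WoE value.
# Categories absent from woe_values fall back to 0.0 (assumed neutral value).
encoded_feature = [woe_values.get(val, 0.0) for val in feature]
print(encoded_feature)

In a real pipeline you would fit the mapping on the training data only and reuse it to transform validation and test data, so that target information does not leak across splits.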

Note
Definition

Information Value (IV) is a metric derived from WoE values that quantifies the predictive power of a feature. It is calculated as the sum over all categories of the difference between the good and bad proportions, multiplied by the WoE for each category. Features with higher IV are generally more useful for classification tasks, and IV provides a direct link to the WoE encoding by summarizing its overall impact.
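
To see how IV follows from the same quantities used for WoE, here is a rough sketch that mirrors the compute_woe logic above (compute_iv is a hypothetical helper name, and the 0.0001 floor is carried over from the earlier code as an assumption):

from math import log

def compute_iv(feature, target):
    # IV = sum over categories of (good proportion - bad proportion) * WoE
    total_good = sum(target)
    total_bad = len(target) - total_good
    iv = 0.0
    for cat in set(feature):
        idx = [i for i, val in enumerate(feature) if val == cat]
        good = sum(target[i] for i in idx)
        bad = len(idx) - good
        # Same small-value floor as compute_woe so the log stays defined
        good_prop = good / total_good if total_good > 0 else 0.0001
        bad_prop = bad / total_bad if total_bad > 0 else 0.0001
        good_prop = max(good_prop, 0.0001)
        bad_prop = max(bad_prop, 0.0001)
        woe = log(good_prop / bad_prop)
        iv += (good_prop - bad_prop) * woe
    return iv

# Example usage with the same toy data as above:
feature = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C']
target = [1, 0, 1, 0, 1, 0, 0, 1]
print(compute_iv(feature, target))  # roughly 0.35 for this toy data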
