Feature Encoding Methods in Python

Understanding Weight-of-Evidence Encoding

Note
Definition

Weight-of-Evidence (WoE) encoding is a powerful technique that originated in the field of credit scoring. It is used to transform categorical variables into numerical values, making them suitable for machine learning models, especially in binary classification tasks.

WoE encoding works by quantifying the predictive power of each category with respect to the target variable. In its original context, WoE helped credit analysts assess the risk associated with different customer groups by comparing the proportion of "good" and "bad" outcomes.

The mathematical formula for WoE is grounded in the concept of odds. For a given category, you calculate the proportion of positive outcomes (such as loan repayments) and the proportion of negative outcomes (such as loan defaults). The WoE for a category is then the natural logarithm of the ratio of these proportions. Formally, for a category c, the WoE is calculated as:

WoE(c) = \ln \left( \frac{P(c | target = 1)}{P(c | target = 0)} \right)

where target = 1 represents the positive class, and target = 0 the negative class. This transformation creates a monotonic relationship between the encoded value and the likelihood of the target, which is particularly helpful for certain models like logistic regression.
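
For a quick worked example with hypothetical counts (not data from this chapter): suppose a category contains 20 of the 100 "good" records and 5 of the 50 "bad" records. Then P(c | target = 1) = 20/100 = 0.20, P(c | target = 0) = 5/50 = 0.10, and WoE(c) = \ln(0.20 / 0.10) = \ln 2 ≈ 0.69. A positive WoE means the category is relatively richer in positive outcomes, a negative WoE means the opposite, and a WoE near zero means the category carries little information about the target.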

from math import log

def compute_woe(feature, target):
    # Count total goods (target=1) and bads (target=0)
    total_good = sum(target)
    total_bad = len(target) - total_good

    # Get unique categories
    categories = set(feature)

    woe_dict = {}
    for cat in categories:
        # Indices where feature == cat
        idx = [i for i, val in enumerate(feature) if val == cat]

        # Number of goods and bads in this category
        good = sum(target[i] for i in idx)
        bad = len(idx) - good

        # Avoid division by zero
        good_prop = good / total_good if total_good > 0 else 0.0001
        bad_prop = bad / total_bad if total_bad > 0 else 0.0001

        # Avoid log(0)
        if good_prop == 0:
            good_prop = 0.0001
        if bad_prop == 0:
            bad_prop = 0.0001

        woe = log(good_prop / bad_prop)
        woe_dict[cat] = woe

    return woe_dict

# Example usage:
feature = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C']
target = [1, 0, 1, 0, 1, 0, 0, 1]
woe_values = compute_woe(feature, target)
print(woe_values)
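Running this example prints the WoE value for each category; with this toy data you should see roughly {'A': 0.693, 'B': 0.0, 'C': -0.693} (the key order may differ, since sets are unordered).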

Step-by-step explanation of the compute_woe function

The compute_woe function takes a list of categorical feature values and a corresponding list of binary target values (0 or 1). It returns a dictionary mapping each category to its calculated Weight-of-Evidence (WoE) value. Here is how the function works:

  1. Calculate total goods and bads:

    • You first compute the total number of "good" outcomes by summing the target list (where target = 1).
    • The total number of "bad" outcomes (where target = 0) is the length of the target list minus the number of goods.
  2. Identify unique categories:

    • The set of unique categories in the feature list is determined. This allows you to process each category separately.
  3. Iterate over each category:

    • For each unique category, you find the indices where the feature equals that category.
    • You then count how many of those indices correspond to goods (target = 1) and how many correspond to bads (target = 0).
  4. Compute proportions:

    • For each category, you calculate the proportion of goods in that category by dividing the number of goods by the total number of goods in the dataset.
    • Similarly, you calculate the proportion of bads in that category by dividing the number of bads by the total number of bads in the dataset.
  5. Handle zero values to avoid division errors:

    • To prevent division by zero, if the total number of goods or bads is zero, the corresponding proportion is set to a small constant (0.0001) instead of being computed.
    • Likewise, if a computed proportion is zero, it is replaced with 0.0001 so that the logarithm in the next step stays defined.
  6. Calculate WoE for each category:

    • The WoE value for a category is the natural logarithm of the ratio of the good proportion to the bad proportion for that category.
    • This value is stored in a dictionary with the category as the key.
  7. Return the dictionary of WoE values:

    • After processing all categories, the function returns a dictionary where each key is a category name and each value is its WoE encoding.

This approach allows you to transform categorical variables into meaningful numerical values that capture their relationship with the target variable, making them more suitable for machine learning models.
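
Note that compute_woe only returns the mapping; to finish the encoding you still replace each category in the feature with its WoE value. A minimal sketch, reusing the feature and woe_values names from the example above (the 0.0 fallback for categories missing from the mapping is an assumption, not part of the original code):

# Replace each category with its WoE value.
# Categories absent from woe_values fall back to 0.0 (assumed neutral value).
encoded_feature = [woe_values.get(val, 0.0) for val in feature]
print(encoded_feature)

In a real pipeline you would fit the mapping on the training data only and reuse it to transform validation and test data, so that target information does not leak across splits.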

Note
Definition

Information Value (IV) is a metric derived from WoE values that quantifies the predictive power of a feature. It is calculated as the sum over all categories of the difference between the good and bad proportions, multiplied by the WoE for each category. Features with higher IV are generally more useful for classification tasks, and IV provides a direct link to the WoE encoding by summarizing its overall impact.
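
To see how IV follows from the same quantities used for WoE, here is a rough sketch that mirrors the compute_woe logic above (compute_iv is a hypothetical helper name, and the 0.0001 floor is carried over from the earlier code as an assumption):

from math import log

def compute_iv(feature, target):
    # IV = sum over categories of (good proportion - bad proportion) * WoE
    total_good = sum(target)
    total_bad = len(target) - total_good
    iv = 0.0
    for cat in set(feature):
        idx = [i for i, val in enumerate(feature) if val == cat]
        good = sum(target[i] for i in idx)
        bad = len(idx) - good
        # Same small-value floor as compute_woe so the log stays defined
        good_prop = good / total_good if total_good > 0 else 0.0001
        bad_prop = bad / total_bad if total_bad > 0 else 0.0001
        good_prop = max(good_prop, 0.0001)
        bad_prop = max(bad_prop, 0.0001)
        woe = log(good_prop / bad_prop)
        iv += (good_prop - bad_prop) * woe
    return iv

# Example usage with the same toy data as above:
feature = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C']
target = [1, 0, 1, 0, 1, 0, 0, 1]
print(compute_iv(feature, target))  # roughly 0.35 for this toy data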
