Evaluation Metrics in Machine Learning

ROC Curve and AUC

To assess how well a binary classifier distinguishes between two classes across all possible thresholds, you use the Receiver Operating Characteristic (ROC) curve. The ROC curve visualizes the trade-off between the true positive rate (TPR, also called sensitivity or recall) and the false positive rate (FPR) as you vary the classification threshold.

  • True Positive Rate (TPR) is the proportion of actual positives correctly identified by the classifier. It is calculated as:

    \text{TPR} = \frac{TP}{TP + FN}
  • False Positive Rate (FPR) is the proportion of actual negatives that are incorrectly classified as positive. It is calculated as follows (see the short sketch after this list for a worked example of both rates):

    \text{FPR} = \frac{FP}{FP + TN}
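
A minimal sketch of both formulas, using made-up toy labels and predictions (not the lesson's dataset) and scikit-learn's confusion_matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and hard predictions (illustrative values only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate (sensitivity / recall)
fpr = fp / (fp + tn)  # false positive rate

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")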

By plotting TPR against FPR for every threshold, the ROC curve provides a comprehensive picture of a model’s performance, rather than focusing on a single decision point. The Area Under the Curve (AUC) summarizes this performance: a higher AUC means the model is better at distinguishing between the positive and negative classes across all thresholds.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Generate synthetic binary classification data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=2,
    n_redundant=10,
    n_classes=2,
    random_state=42
)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Get predicted probabilities for the positive class
y_scores = clf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

# Compute AUC
auc_score = roc_auc_score(y_test, y_scores)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc_score:.2f})")
plt.plot([0, 1], [0, 1], "k--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()
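
To make the threshold trade-off concrete, a small follow-up sketch (assuming it runs immediately after the code above, so fpr, tpr, and thresholds are still in scope) prints a few of the operating points that roc_curve returns:

# Inspect a handful of operating points along the ROC curve
# (assumes fpr, tpr, thresholds from roc_curve above are in scope)
# Note: the first threshold is an artificial value above every score,
# so at that point nothing is predicted positive (TPR = FPR = 0).
step = max(1, len(thresholds) // 5)
for i in range(0, len(thresholds), step):
    print(f"threshold={thresholds[i]:.3f}  FPR={fpr[i]:.3f}  TPR={tpr[i]:.3f}")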

When you interpret the ROC curve, a curve that bows toward the top left corner indicates a strong classifier, as it achieves high true positive rates with low false positive rates. The AUC quantifies this: an AUC of 0.5 means the classifier performs no better than random guessing, while an AUC of 1.0 indicates perfect discrimination between classes. Generally, an AUC above 0.8 is considered good, while values closer to 1.0 are excellent. However, the context of your problem and the class distribution should always guide your interpretation of ROC and AUC results.
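
As a quick sanity check of these reference points, the sketch below uses made-up toy labels (not the lesson's dataset): uninformative random scores should land near an AUC of 0.5, while scores that rank every positive above every negative give exactly 1.0.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)       # toy binary labels

random_scores = rng.random(1000)             # uninformative scores
perfect_scores = y_true.astype(float)        # positives always scored above negatives

print(roc_auc_score(y_true, random_scores))  # approximately 0.5
print(roc_auc_score(y_true, perfect_scores)) # exactly 1.0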
