Confusion Matrix

When making a prediction for a binary classification problem, there are only four possible outcomes:

In the image above, the actual values (true labels) are ordered from top to bottom in descending order, while the predicted values are ordered from left to right in ascending order. This is the default layout used by Scikit-learn when displaying confusion matrices.

These outcomes are called true positive (TP), true negative (TN), false positive (FP), and false negative (FN). "true" or "false" indicates whether the prediction is correct, while "positive" or "negative" refers to whether the predicted class is 1 or 0.

This means there are two types of errors we can make: false positives and false negatives. false positive prediction is also known as a type 1 error, while a false negative prediction is referred to as a type 2 error.

Confusion Matrix

The first way to look at the model's performance is to organize the predictions into a confusion matrix like this:

You can build a confusion matrix in Python using the confusion_matrix() function from sklearn:

from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true, y_pred)

For better visualization, you can use the heatmap() function from seaborn:

sns.heatmap(conf_matrix)

Here is an example of how to calculate the confusion matrix for a Random Forest prediction on the Titanic dataset:


              12345678910111213141516
            
import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True);

We can also plot the percentages instead of the instance counts by using the normalize parameter:

conf_matrix = confusion_matrix(y_true, y_pred, normalize='all')


              12345678910111213141516
            
import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, normalize='all')
sns.heatmap(conf_matrix, annot=True);

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 5. Capítulo 1

Pergunte à IA

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Conteúdo do Curso

Classification with Python

Confusion Matrix

When making a prediction for a binary classification problem, there are only four possible outcomes:

Confusion Matrix

The first way to look at the model's performance is to organize the predictions into a confusion matrix like this:

You can build a confusion matrix in Python using the confusion_matrix() function from sklearn:

from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true, y_pred)

For better visualization, you can use the heatmap() function from seaborn:

sns.heatmap(conf_matrix)

Here is an example of how to calculate the confusion matrix for a Random Forest prediction on the Titanic dataset:


              12345678910111213141516
            
import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True);

We can also plot the percentages instead of the instance counts by using the normalize parameter:

conf_matrix = confusion_matrix(y_true, y_pred, normalize='all')


              12345678910111213141516
            
import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, normalize='all')
sns.heatmap(conf_matrix, annot=True);

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 5. Capítulo 1