Classification with Python
Challenge: Classifying Inseparable Data
You will use the following dataset with two features:
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
print(df.head())
If you run the code below and take a look at the resulting scatter plot, you'll see that the dataset is not linearly separable:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
plt.scatter(df['X1'], df['X2'], c=df['y'])
plt.show()
Let's use cross-validation to evaluate a simple logistic regression on this data:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
X = df[['X1', 'X2']]
y = df['y']

# Scale the features and fit a plain logistic regression
X = StandardScaler().fit_transform(X)
lr = LogisticRegression().fit(X, y)

# Plot the model's predictions on the training data
y_pred = lr.predict(X)
plt.scatter(df['X1'], df['X2'], c=y_pred)
plt.show()

print(f'Cross-validation accuracy: {cross_val_score(lr, X, y).mean():.2f}')
As you can see, plain logistic regression is not suited for this task. Adding polynomial features may improve the model's performance, and GridSearchCV lets you find the optimal value of the C hyperparameter for better accuracy.
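To see why polynomial features help here, you can add the squared terms by hand and repeat the same cross-validation. The snippet below is only a minimal sketch of that idea; it reuses the column names X1, X2 and y from the dataset above, and the exact accuracy you get depends on the data:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')

# If the classes form concentric circles, membership depends on the distance
# from the center, so the squared terms X1**2 and X2**2 carry the useful signal.
X = df[['X1', 'X2']].copy()
X['X1_sq'] = X['X1'] ** 2
X['X2_sq'] = X['X2'] ** 2
X = StandardScaler().fit_transform(X)
y = df['y']

print(f'Cross-validation accuracy: {cross_val_score(LogisticRegression(), X, y).mean():.2f}')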
This task also uses the Pipeline class. You can think of it as a sequence of preprocessing steps chained together: its .fit_transform() method applies .fit_transform() of each step in order, feeding the output of one step into the next.
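As a small, self-contained illustration (not the challenge code itself), here is how a pipeline of PolynomialFeatures and StandardScaler, built with the make_pipeline helper, transforms a toy array:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# fit_transform() first expands X with degree-2 polynomial features,
# then scales the expanded columns.
pipe = make_pipeline(PolynomialFeatures(2), StandardScaler())
print(pipe.fit_transform(X))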
Swipe to start coding
You are given a dataset stored as a DataFrame in the df variable.
- Create a pipeline that builds degree-2 polynomial features from X and then scales them, and store the resulting pipeline in the pipe variable.
- Create a param_grid dictionary with the values [0.01, 0.1, 1, 10, 100] for the C hyperparameter.
- Initialize and train a GridSearchCV object, and store the trained object in the grid_cv variable. One possible way to combine these steps is sketched below.
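The following is a minimal sketch of one way to approach these steps, not necessarily the exact solution the checker expects. It assumes the features and target live in the X1, X2 and y columns of df, and that GridSearchCV wraps a LogisticRegression trained on the pipeline's output:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# In the challenge df is already provided; loading it here keeps the sketch runnable.
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')

# Assumed column names, matching the earlier examples.
X = df[['X1', 'X2']]
y = df['y']

# Pipeline: degree-2 polynomial features followed by scaling.
pipe = make_pipeline(PolynomialFeatures(2), StandardScaler())
X_poly = pipe.fit_transform(X)

# Candidate values of the C hyperparameter.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Cross-validated search over C, fit on the transformed features.
grid_cv = GridSearchCV(LogisticRegression(), param_grid)
grid_cv.fit(X_poly, y)

print(grid_cv.best_params_, grid_cv.best_score_)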