Challenge: Classifying Inseparable Data
You will use the following dataset with two features:
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
print(df.head())
If you run the code below and take a look at the resulting scatter plot, you'll see that the dataset is not linearly separable:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
plt.scatter(df['X1'], df['X2'], c=df['y'])
plt.show()
Let's use cross-validation to evaluate a simple logistic regression on this data:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
X = df[['X1', 'X2']]
y = df['y']

# Scale the features and fit a plain logistic regression
X = StandardScaler().fit_transform(X)
lr = LogisticRegression().fit(X, y)
y_pred = lr.predict(X)

# Plot the model's predictions to see the linear decision boundary fail
plt.scatter(df['X1'], df['X2'], c=y_pred)
plt.show()

print(f'Cross-validation accuracy: {cross_val_score(lr, X, y).mean():.2f}')
As you can see, a plain logistic regression is not suited for this task. Adding polynomial features may help improve the model's performance. Additionally, employing GridSearchCV allows you to find the optimal value of the C parameter for better accuracy.
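As a reminder of how GridSearchCV is typically used, here is a minimal sketch that tunes C for a plain LogisticRegression. It assumes X and y are defined as in the snippet above; the C values here are only illustrative, not the grid required by the task below.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative grid of C values; assumes X and y are already defined as above
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)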
This task also uses the Pipeline class. You can think of it as a sequence of preprocessing steps: its .fit_transform() method applies each step's .fit_transform() in order.
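Here is a minimal sketch of such a preprocessing pipeline, built with make_pipeline for brevity. The steps mirror the task below, but the variable names are illustrative and X is assumed to be defined as in the snippet above.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two preprocessing steps chained together: polynomial expansion, then scaling
preprocessing = make_pipeline(PolynomialFeatures(2), StandardScaler())
# .fit_transform() runs each step's .fit_transform() one after another
X_poly_scaled = preprocessing.fit_transform(X)
print(X_poly_scaled.shape)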
Swipe to start coding
You are given a dataset stored as a DataFrame in the df variable.
- Create a pipeline that builds polynomial features of degree 2 from X and then scales them, and store the resulting pipeline in the pipe variable.
- Create a param_grid dictionary with the values [0.01, 0.1, 1, 10, 100] for the C hyperparameter.
- Initialize and train a GridSearchCV object and store the trained object in the grid_cv variable.