Linear Regression for ML
Building the Linear Regression with scikit-learn
You already know what Simple Linear Regression is and how to find the line that fits the data best. Let's go through all the steps of building a linear regression for a real dataset.
Loading data and looking at it
We have a file, simple_height_data.csv, with the data from our examples. Let's load the file and take a look at it.
```python
import pandas as pd

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link)  # Read the file
print(df.head())  # Print the first 5 instances of the dataset
```
The dataset has two columns: 'Father', the father's height, which is our feature, and 'Height', which is our target.
Let's assign the target values to the y variable and the feature values to X, then build a scatterplot.
```python
import pandas as pd
import matplotlib.pyplot as plt

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link)  # Read the file
X = df['Father']  # Assign the feature
y = df['Height']  # Assign the target
plt.scatter(X, y)  # Build the scatterplot
plt.show()  # Display the plot when running as a script
```
Now that we are acquainted with our data, let's build a model!
Building a Linear Regression
Building a Linear Regression model with scikit-learn is quite simple! There is a LinearRegression class for that.
You need to:
1. Initialize the LinearRegression class.
2. Train the model on a training set with .fit().
3. Predict the target for new instances with .predict().
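The three steps above can be sketched as follows. This is a minimal illustration on made-up toy data (the names X_train, y_train and the values are assumptions, not the course dataset), chosen so that the points lie exactly on the line y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data invented for illustration: points on the line y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D: one row per instance
y_train = np.array([3.0, 5.0, 7.0, 9.0])          # 1-D target

model = LinearRegression()    # 1. Initialize the class
model.fit(X_train, y_train)   # 2. Train the model on the training set

X_new = np.array([[5.0], [6.0]])
y_pred = model.predict(X_new)  # 3. Predict the target for new instances
print(y_pred)
```

Because the toy points are perfectly linear, the predictions should land on the same line, up to floating-point rounding.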
Before putting it all together, there is one more thing to figure out.
Both the .fit() and .predict() methods of the LinearRegression class expect X (or X_new) to be a 2-D array (or a pandas DataFrame).
Choosing a single column from a DataFrame (df['col_name']) returns a pandas Series, which is 1-D and not what .fit() or .predict() expects, so the following error will be raised:
ValueError: Expected 2D array, got 1D array instead
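The failure can be reproduced with a short sketch. The small DataFrame df_demo below is made up for illustration and is not the course dataset; the exact wording of the error message may differ between scikit-learn versions, so the code only captures and prints it.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up heights for illustration only
df_demo = pd.DataFrame({
    'Father': [65.0, 68.5, 70.0, 72.0],
    'Height': [66.0, 69.0, 71.5, 72.5],
})

error_message = None
try:
    # Single brackets give a 1-D Series, which .fit() rejects as X
    LinearRegression().fit(df_demo['Father'], df_demo['Height'])
except ValueError as err:
    error_message = str(err)

print(error_message)
```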
To avoid it, select the single column with double square brackets, df[['col_name']], which returns a one-column DataFrame instead of a Series.
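A quick way to see the difference between the two selection styles is to compare their shapes. The sketch below uses a made-up DataFrame (df_demo is a hypothetical name, not the course data).

```python
import pandas as pd

# Made-up heights for illustration only
df_demo = pd.DataFrame({
    'Father': [65.0, 68.5, 70.0],
    'Height': [66.0, 69.0, 71.5],
})

as_series = df_demo['Father']    # single brackets: 1-D pandas Series
as_frame = df_demo[['Father']]   # double brackets: 2-D one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a single dimension, while the one-column DataFrame has rows and columns, which is the 2-D layout .fit() and .predict() expect for X.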
Now let's build a Linear Regression and predict new values!
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression  # Import LinearRegression

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link)  # Read the file
X = df[['Father']]  # Assign the feature (with double square brackets)
y = df['Height']  # Assign the target (no double square brackets needed for the target)

model = LinearRegression()  # Initialize a model
model.fit(X, y)  # Train the model

X_new = np.array([[61], [64], [67]])  # Create a 2-D array of new instances
print(model.predict(X_new))  # Predict the target for new instances
```
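Once a model is trained, the fitted line itself can be inspected through the coef_ and intercept_ attributes of LinearRegression. The sketch below uses made-up heights (X_toy and y_toy are assumptions, not the course dataset) and checks that computing intercept + slope * x by hand matches what .predict() returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up heights in inches, for illustration only
X_toy = np.array([[60.0], [64.0], [68.0], [72.0]])
y_toy = np.array([62.0, 65.0, 67.5, 71.0])

model_toy = LinearRegression().fit(X_toy, y_toy)

slope = model_toy.coef_[0]         # one coefficient per feature
intercept = model_toy.intercept_   # the constant term of the line

X_new = np.array([[61.0], [64.0], [67.0]])
manual = intercept + slope * X_new[:, 0]   # apply the line equation by hand
matches = np.allclose(manual, model_toy.predict(X_new))
print(slope, intercept, matches)
```

With a single feature, the prediction is just the equation of the best-fit line evaluated at each new value.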