Machine Learning with PySpark

Regression with Linear Regression

Instead of predicting whether a flight is delayed, you can predict the exact number of minutes it will be delayed. This is a regression task – the label is a continuous value.

Setting Up the Regression Dataset

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
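
The two derived columns can be checked by hand. The sketch below mirrors the same arithmetic in plain Python, outside Spark. It assumes that SCHEDULED_DEPARTURE is an HHMM-encoded integer (for example, 1435 for 14:35) and that DAY_OF_WEEK runs from 1 (Monday) to 7 (Sunday). Neither encoding is stated in the lesson, and the helper names are illustrative only.

```python
import math


def departure_hour(scheduled_departure: int) -> int:
    """Extract the hour from an HHMM-encoded time (assumed encoding)."""
    return math.floor(scheduled_departure / 100)


def is_weekend(day_of_week: int) -> int:
    """Return 1 for days 6 and 7, else 0 (assumes 1=Monday ... 7=Sunday)."""
    return int(day_of_week >= 6)


print(departure_hour(1435))          # 14
print(departure_hour(5))             # 0 (00:05)
print(is_weekend(6), is_weekend(3))  # 1 0
```

If the dataset encodes days differently, the `>= 6` threshold would need to change.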

Training the Model

indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")

assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
               "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)

scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)

predictions = model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)
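
With withMean=True and withStd=True, the StandardScaler stage centers each feature and rescales it to unit standard deviation. The plain-Python sketch below shows that transformation on one column. The distance values are made up for illustration, and the sample standard deviation (n - 1 denominator) is used on the understanding that Spark's scaler does the same. Treat it as an approximation of the idea, not Spark's implementation.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1 denominator)
    return [(v - mean) / std for v in values]


# Made-up distances, for illustration only
distances = [200.0, 400.0, 600.0]
scaled = standardize(distances)
print(scaled)  # [-1.0, 0.0, 1.0]
```

After scaling, features measured in very different units (minutes, miles, hour of day) contribute on a comparable scale.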

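regParam is the regularization strength: it adds a penalty on large coefficients to the training loss to reduce overfitting. With the default elasticNetParam of 0.0, this is an L2 (ridge) penalty. Higher values shrink the coefficients more, which produces a simpler model; a value of 0 disables regularization.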
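
To see the shrinking effect, here is a minimal sketch of an L2-penalized fit for a single feature with no intercept, written in plain Python. It is a toy closed-form solution under an assumed objective of (1/(2n)) * sum((y - w*x)^2) + (reg_param/2) * w^2. It is not Spark's solver, and the data points and function name are made up for illustration.

```python
def ridge_slope(xs, ys, reg_param):
    """Slope w minimizing (1/(2n)) * sum((y - w*x)^2) + (reg_param/2) * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * reg_param)


# Made-up data lying exactly on y = 2x
xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]

for reg_param in (0.0, 0.1, 1.0):
    print(reg_param, round(ridge_slope(xs, ys, reg_param), 4))
# 0.0 2.0
# 0.1 1.7391
# 1.0 0.8
```

With no penalty the fit recovers the true slope of 2. As reg_param grows, the slope is pulled toward zero.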

Inspecting Model Coefficients

# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)
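
A linear regression prediction is the intercept plus the weighted sum of the feature values. The sketch below spells that out in plain Python. The coefficient, intercept, and feature values are hypothetical placeholders, not output from the trained model. In the pipeline above, the coefficients apply to the scaled FEATURES column and follow the order of the assembler's inputCols.

```python
def predict(coefficients, intercept, features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Hypothetical values, for illustration only
coefficients = [2.0, -0.5]
intercept = 4.0
features = [1.5, 2.0]  # already-scaled feature values

print(predict(coefficients, intercept, features))  # 6.0
```

Because the features are standardized before training, each coefficient can be read as the change in predicted delay (in minutes) per one standard deviation of that feature.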