Machine Learning with PySpark

Regression with Linear Regression


Instead of predicting whether a flight is delayed, you can predict the exact number of minutes it will be delayed. This is a regression task – the label is a continuous value.

Setting Up the Regression Dataset

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
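The two derived columns come down to simple arithmetic. The plain-Python sketch below mirrors them outside Spark. It assumes SCHEDULED_DEPARTURE is stored as an HHMM integer (for example 1435 for 14:35) and that DAY_OF_WEEK runs from 1 (Monday) to 7 (Sunday). Both are assumptions about the dataset, and the helper names are hypothetical.

```python
def departure_hour(scheduled_departure: int) -> int:
    """Mirror floor(SCHEDULED_DEPARTURE / 100): keep the hour, drop the minutes."""
    return scheduled_departure // 100


def is_weekend(day_of_week: int) -> int:
    """Mirror (DAY_OF_WEEK >= 6).cast("integer"): 1 for days 6 and 7, else 0."""
    return int(day_of_week >= 6)


print(departure_hour(1435))          # 14
print(departure_hour(5))             # 0 (five minutes past midnight)
print(is_weekend(6), is_weekend(3))  # 1 0
```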

Training the Model

indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)

predictions = model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)
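The StandardScaler stage, with withMean=True and withStd=True, centers each feature and rescales it to unit standard deviation. That puts a column like DISTANCE on a comparable scale to a 0/1 column like IS_WEEKEND. The plain-Python sketch below shows the idea for a single column. It assumes the sample (n - 1) standard deviation, which is how Spark's documentation describes the scaler. The function name and the distance values are made up for illustration.

```python
from statistics import mean, stdev


def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Hypothetical flight distances in miles, for illustration only
distances = [200.0, 500.0, 800.0, 1100.0, 1400.0]
scaled = standardize(distances)

print([round(s, 3) for s in scaled])
print("mean after scaling:", round(mean(scaled), 6))
print("std after scaling:", round(stdev(scaled), 6))
```

After scaling, the column has mean 0 and standard deviation 1, so coefficients learned on different features can be compared more directly.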

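regParam is the regularization parameter: it adds a penalty on large coefficients to the training loss to reduce overfitting. With the default elasticNetParam of 0.0, this is an L2 (ridge) penalty. Higher values shrink the coefficients more strongly and produce a simpler model.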
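The shrinking effect is easiest to see with a single feature. The sketch below fits a one-feature ridge regression without an intercept, using the closed-form slope that minimizes (1/(2n)) * sum((y - w*x)^2) + (reg_param/2) * w^2. It illustrates the general technique on toy data and does not reproduce Spark's solver, which also standardizes features internally, so the numbers will not match a Spark run. The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, reg_param):
    """Slope w minimizing (1/(2n)) * sum((y - w*x)^2) + (reg_param/2) * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sxy / (sxx + n * reg_param)
    return sxy / (sxx + n * reg_param)


# Toy centered data, roughly y = 2x (illustrative values only)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 1.9, 4.1]

for reg_param in [0.0, 0.1, 1.0, 10.0]:
    print(reg_param, round(ridge_slope(xs, ys, reg_param), 3))
```

With reg_param set to 0.0 the slope is the ordinary least-squares estimate, and each larger value pulls it closer to zero.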

Inspecting Model Coefficients

# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)
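The coefficient vector follows the order of the VectorAssembler inputCols, so each value can be paired with its feature name. The sketch below does this with a hypothetical helper, rank_coefficients, and placeholder numbers that are not real model output. With the trained model, lr_model.coefficients.toArray() could be passed in their place.

```python
feature_names = [
    "DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
    "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX",
]


def rank_coefficients(names, coefficients):
    """Pair each feature with its coefficient, largest magnitude first."""
    if len(names) != len(coefficients):
        raise ValueError("names and coefficients must have the same length")
    pairs = zip(names, coefficients)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Placeholder numbers for illustration only, not real model output
example_coefficients = [30.0, -1.5, 2.0, 0.4, -0.1, 0.3]

for name, value in rank_coefficients(feature_names, example_coefficients):
    print(f"{name:16s} {value:+.3f}")
```

Because the features were standardized before training, the magnitudes are on a common scale, so a ranking like this gives a rough sense of which inputs move the prediction most.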

What does the regParam parameter control in LinearRegression?


Section 1. Chapter 6
