Impara Regression with Linear Regression

Scorri per mostrare il menu

Instead of predicting whether a flight is delayed, you can predict the exact number of minutes it will be delayed. This is a regression task – the label is a continuous value.

Setting Up the Regression Dataset


              123456789101112131415161718192021222324252627
            
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

Training the Model


              12345678910111213
            
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

predictions.select("LABEL", "prediction").show(5)

regParam is the regularization parameter – it penalizes large coefficients to prevent overfitting. Higher values produce a simpler model.

Inspecting Model Coefficients


              1234
            
# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)

Tutto è chiaro?

Grazie per i tuoi commenti!

Sezione 1. Capitolo 6

Chieda ad AI

Chieda pure quello che desidera o provi una delle domande suggerite per iniziare la nostra conversazione

Sezione 1. Capitolo 6