Lernen Regression with Linear Regression

Swipe um das Menü anzuzeigen

Instead of predicting whether a flight is delayed, you can predict the exact number of minutes it will be delayed. This is a regression task – the label is a continuous value.

Setting Up the Regression Dataset


              123456789101112131415161718192021222324252627
            
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

Training the Model


              12345678910111213
            
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

predictions.select("LABEL", "prediction").show(5)

regParam is the regularization parameter – it penalizes large coefficients to prevent overfitting. Higher values produce a simpler model.

Inspecting Model Coefficients


              1234
            
# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)

War alles klar?

Danke für Ihr Feedback!

Abschnitt 1. Kapitel 6

Fragen Sie AI

Fragen Sie alles oder probieren Sie eine der vorgeschlagenen Fragen, um unser Gespräch zu beginnen

Abschnitt 1. Kapitel 6