Regression with Linear Regression

Instead of predicting whether a flight is delayed, you can predict how many minutes late it will arrive. This is a regression task: the label is a continuous value rather than a class.

Setting Up the Regression Dataset


import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

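# Drop rows without a recorded arrival delay (the label) rather than treating
# them as 0 minutes; missing numeric features are still filled with 0.
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .dropna(subset=["ARRIVAL_DELAY"]) \
    .fillna(0, subset=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME"])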

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
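
The two derived columns are easier to check on plain values. The sketch below mirrors the same arithmetic in pure Python. It assumes SCHEDULED_DEPARTURE is stored as an HHMM integer (for example 1435 for 14:35) and that DAY_OF_WEEK runs from 1 (Monday) to 7 (Sunday); both are assumptions about the dataset, and the helper names are hypothetical.

```python
def departure_hour(scheduled_departure: int) -> int:
    """Return the hour part of an HHMM-encoded time, e.g. 1435 -> 14."""
    # Integer division by 100 drops the minutes, like floor(col / 100) in Spark.
    return scheduled_departure // 100


def is_weekend(day_of_week: int) -> int:
    """Return 1 for days 6 and 7, else 0 (assumes 1 = Monday ... 7 = Sunday)."""
    return int(day_of_week >= 6)


# A few hand-picked (HHMM, day-of-week) pairs, for illustration only.
for hhmm, dow in [(5, 1), (1435, 6), (2359, 7)]:
    print(hhmm, dow, departure_hour(hhmm), is_weekend(dow))
```

If the dataset encodes times or weekdays differently, the Spark expressions would need the same adjustment as these helpers.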

Training the Model


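# Map each airline code to a numeric index (most frequent airline gets 0).
# Note: the index enters the model as one numeric feature, so its ordering
# reflects frequency, not any meaningful scale.
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
# Combine all input columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)
# Center each feature to zero mean and scale it to unit standard deviation.
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)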

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

predictions.select("LABEL", "prediction").show(5)
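
To make the StandardScaler stage in the pipeline above less abstract, here is a minimal pure-Python sketch of what centering and scaling one feature column amounts to. It is an illustration, not Spark's implementation: the `standardize` helper and the sample distances are made up, and it uses the sample standard deviation (n - 1 in the denominator), which is what Spark's documentation describes for unit-variance scaling.

```python
from statistics import mean, stdev


def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up distances in miles, for illustration only.
distances = [100.0, 500.0, 900.0, 2500.0]
scaled = standardize(distances)
print(scaled)
```

After scaling, features measured in very different units (minutes, miles, hours) contribute on a comparable scale, which also makes the regularization penalty treat them more evenly.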

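regParam sets the regularization strength: it penalizes large coefficients to reduce overfitting. With the default elasticNetParam of 0.0 this is an L2 (ridge) penalty, so higher values shrink the coefficients toward zero and produce a simpler model.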
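
The shrinking effect of a regularization parameter can be seen in a simplified one-feature case. The sketch below is not Spark's solver: it assumes a single, already centered feature, no intercept, and an L2 penalty, and the `ridge_slope` helper and the data points are made up for illustration.

```python
def ridge_slope(xs, ys, reg_param):
    """Slope of a one-feature, no-intercept ridge fit.

    Minimizes (1 / (2n)) * sum((y - w * x) ** 2) + (reg_param / 2) * w ** 2,
    whose closed-form solution is mean(x * y) / (mean(x * x) + reg_param).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy / (sxx + reg_param)


# Made-up, already centered data where y is roughly 2 * x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]

# A larger penalty pulls the fitted slope further toward zero.
for reg_param in [0.0, 0.1, 1.0, 10.0]:
    print(reg_param, round(ridge_slope(xs, ys, reg_param), 4))
```

With no penalty the slope matches the ordinary least-squares fit; as the penalty grows, the slope moves toward zero, which is the sense in which the model becomes simpler.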

Inspecting Model Coefficients


# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)
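
A bare coefficient vector is hard to read without the feature names next to it. The sketch below pairs names with values and sorts them by magnitude. The `rank_coefficients` helper is hypothetical and the numbers are placeholders, not real model output; with the pipeline above, one option would be to pass `assembler.getInputCols()` and `lr_model.coefficients.toArray()` instead.

```python
def rank_coefficients(feature_names, coefficients):
    """Pair each feature name with its coefficient, largest magnitude first."""
    if len(feature_names) != len(coefficients):
        raise ValueError("feature_names and coefficients must have the same length")
    pairs = zip(feature_names, coefficients)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Same order as the VectorAssembler inputCols in the training code.
feature_names = ["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
                 "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"]
# Placeholder numbers, NOT real model output.
placeholder_coefficients = [30.0, -2.0, 1.5, 0.4, -0.1, 0.2]

for name, value in rank_coefficients(feature_names, placeholder_coefficients):
    print(f"{name:>16}: {value:+.3f}")
```

Because the features were standardized before fitting, each coefficient describes the change in predicted delay per one standard deviation of that feature, so magnitudes are roughly comparable across features.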

Section 1. Chapter 6
