Machine Learning with PySpark

Regression with Linear Regression


Instead of predicting whether a flight is delayed, you can predict how many minutes late it will arrive. This is a regression task: the label is a continuous value rather than a class.

Setting Up the Regression Dataset

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("LinearRegression") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Using ARRIVAL_DELAY directly as a continuous label
flights_df = flights_df \
    .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
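The two derived columns in the setup code can be checked by hand. The sketch below mirrors them in plain Python, with no Spark needed. It assumes that SCHEDULED_DEPARTURE is stored as an HHMM integer (for example 1435 for 14:35) and that DAY_OF_WEEK runs from 1 to 7 with 6 and 7 as the weekend. Both are assumptions about the dataset, and the helper names are made up for this illustration.

```python
def departure_hour(scheduled_departure: int) -> int:
    """Hour part of an HHMM integer; same result as floor(value / 100) for non-negative values."""
    return scheduled_departure // 100


def is_weekend(day_of_week: int) -> int:
    """1 if the day number is 6 or 7 (assumed Saturday/Sunday), else 0."""
    return int(day_of_week >= 6)


# Hypothetical sample values, not rows taken from flights.csv
for hhmm, day in [(1435, 3), (5, 6), (2359, 7)]:
    print(hhmm, "->", departure_hour(hhmm), "| day", day, "->", is_weekend(day))
```

A departure time of 0005 is read as the integer 5, so its hour comes out as 0. That is why integer division works without zero-padding.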

Training the Model

indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")

assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)

scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1)
pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])

model = pipeline.fit(train_df)
predictions = model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)

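regParam sets the regularization strength: it penalizes large coefficients to reduce overfitting. With the default elasticNetParam of 0.0, the penalty is L2 (ridge). Higher values shrink the coefficients more and produce a simpler model.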
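To see the shrinking effect in isolation, here is a minimal sketch of a one-feature ridge fit with no intercept, written in plain Python. It is a simplified illustration under an assumed objective, (1 / (2n)) * sum((y - w*x)^2) + (reg_param / 2) * w^2, and not a reproduction of what Spark does internally. The function name and the toy data are hypothetical.

```python
def ridge_slope(xs, ys, reg_param):
    """Closed-form slope w minimizing
    (1 / (2n)) * sum((y - w*x)^2) + (reg_param / 2) * w^2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + n * reg_param).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * reg_param)


# Toy data that follows y = 2x exactly (made up for this illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for reg_param in (0.0, 0.1, 1.0, 10.0):
    print(f"reg_param={reg_param:<5} slope={ridge_slope(xs, ys, reg_param):.4f}")
```

With no penalty the slope matches the true value of 2, and each larger reg_param pulls it closer to zero. The same trade-off applies to the pipeline: a larger regParam accepts some bias in exchange for smaller, more stable coefficients.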

Inspecting Model Coefficients

# Extracting the trained LinearRegression stage
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)
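The printed coefficient vector has no labels, so it helps to pair each value with its feature name. The sketch below does that in plain Python, assuming the coefficients come back in the same order as the VectorAssembler inputCols. The helper name and the numbers are hypothetical and are not output from the model above.

```python
FEATURE_NAMES = [
    "DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
    "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX",
]


def rank_coefficients(names, coefficients):
    """Pair names with coefficients and sort by absolute size, largest first."""
    if len(names) != len(coefficients):
        raise ValueError("names and coefficients must have the same length")
    return sorted(zip(names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical values for illustration only, not results from the trained model
example_coefficients = [30.5, -1.2, 0.8, 0.3, -0.1, 0.05]

for name, value in rank_coefficients(FEATURE_NAMES, example_coefficients):
    print(f"{name:>16}: {value:+.3f}")
```

With the trained pipeline, the list could be replaced by lr_model.coefficients.toArray(). Because the features pass through StandardScaler before the regression, the magnitudes are roughly comparable across features. AIRLINE_IDX is the exception: it is a category index treated as a number, so its coefficient should be read with caution.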

What does the regParam parameter control in LinearRegression?


Section 1. Chapter 6
