Regression with Linear Regression
メニューを表示するにはスワイプしてください
Instead of predicting whether a flight is delayed, you can predict the exact number of minutes it will be delayed. This is a regression task – the label is a continuous value.
Setting Up the Regression Dataset
123456789101112131415161718192021222324252627import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler from pyspark.ml.regression import LinearRegression urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("LinearRegression") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) # Using ARRIVAL_DELAY directly as a continuous label flights_df = flights_df \ .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
Training the Model
12345678910111213indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX") assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"], outputCol="FEATURES_RAW" ) scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True) lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1) pipeline = Pipeline(stages=[indexer, assembler, scaler, lr]) model = pipeline.fit(train_df) predictions = model.transform(test_df) predictions.select("LABEL", "prediction").show(5)
regParam is the regularization parameter – it penalizes large coefficients to prevent overfitting. Higher values produce a simpler model.
Inspecting Model Coefficients
1234# Extracting the trained LinearRegression stage lr_model = model.stages[-1] print("Coefficients:", lr_model.coefficients) print("Intercept:", lr_model.intercept)
すべて明確でしたか?
フィードバックありがとうございます!
セクション 1. 章 6
AIに質問する
AIに質問する
何でも質問するか、提案された質問の1つを試してチャットを始めてください
セクション 1. 章 6