Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lernen Evaluating Regression Models | Section
Machine Learning with PySpark

Evaluating Regression Models

Swipe um das Menü anzuzeigen

Classification metrics like accuracy and AUC do not apply to regression. Instead, you measure how far predictions are from the true values using error-based metrics.

Key Metrics

  • RMSE (Root Mean Squared Error) – square root of the average squared error. Penalizes large errors more heavily. In the same units as the label (minutes);
  • MAE (Mean Absolute Error) – average absolute difference between predicted and actual values. More robust to outliers than RMSE;
  • R² (R-squared) – proportion of variance in the label explained by the model. Ranges from 0 to 1 – higher is better. A value of 1 means perfect prediction.

Evaluating with MLlib

123456789101112131415161718192021222324252627282930313233343536373839404142434445
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler from pyspark.ml.regression import LinearRegression from pyspark.ml.evaluation import RegressionEvaluator urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("RegressionEval") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) flights_df = flights_df \ .withColumn("LABEL", col("ARRIVAL_DELAY").cast("double")) \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42) indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX") assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"], outputCol="FEATURES_RAW" ) scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True) lr = LinearRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10, regParam=0.1) model = Pipeline(stages=[indexer, assembler, scaler, lr]).fit(train_df) predictions = model.transform(test_df) # Evaluating with multiple metrics evaluator = RegressionEvaluator(labelCol="LABEL", predictionCol="prediction") for metric in ["rmse", "mae", "r2"]: evaluator.setMetricName(metric) print(f"{metric.upper()}: {evaluator.evaluate(predictions):.4f}")
question mark

What does an R² value of 0.85 mean for a regression model?

Wählen Sie die richtige Antwort aus

War alles klar?

Wie können wir es verbessern?

Danke für Ihr Feedback!

Abschnitt 1. Kapitel 7

Fragen Sie AI

expand

Fragen Sie AI

ChatGPT

Fragen Sie alles oder probieren Sie eine der vorgeschlagenen Fragen, um unser Gespräch zu beginnen

Abschnitt 1. Kapitel 7
some-alt