Machine Learning with PySpark

Evaluating Classification Models

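Accuracy alone is a poor metric when classes are imbalanced: if 80% of flights are on time, a model that always predicts "on time" achieves 80% accuracy without learning anything, and it never flags a single delay. You need metrics that capture both types of errors: false positives (on-time flights flagged as delayed) and false negatives (delayed flights the model missed).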
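
The 80% figure can be checked in a few lines of plain Python. This is a minimal sketch on made-up labels (100 hypothetical flights, 80 of them on time), not on the course dataset; the names labels and always_on_time are illustrative only.

```python
# Hypothetical labels: 0 = on time, 1 = delayed (80 on time, 20 delayed).
labels = [0] * 80 + [1] * 20

# A "model" that ignores its input and always predicts "on time".
always_on_time = [0] * len(labels)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(p == y for p, y in zip(always_on_time, labels)) / len(labels)

# Recall for the delayed class: fraction of delayed flights the model caught.
delayed_total = sum(y == 1 for y in labels)
delayed_caught = sum(p == 1 and y == 1 for p, y in zip(always_on_time, labels))
recall_delayed = delayed_caught / delayed_total

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.80
print(f"recall (delayed): {recall_delayed:.2f}")  # recall (delayed): 0.00
```

The accuracy looks respectable, yet the recall of zero shows that the model is useless for the class you actually care about.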

Key Metrics

  • Accuracy – fraction of correct predictions. Misleading for imbalanced classes;
  • Precision – of all flights predicted as delayed, what fraction actually were;
  • Recall – of all flights that were actually delayed, what fraction did the model catch;
  • F1 score – harmonic mean of precision and recall. Balances both;
  • AUC-ROC – area under the ROC curve. Measures the model's ability to distinguish classes regardless of threshold.
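
The definitions above reduce to simple arithmetic on the confusion matrix. The sketch below assumes "delayed" is the positive class and uses hypothetical counts and scores; classification_metrics and pairwise_auc are illustrative helper names, not MLlib functions. The AUC helper uses the pairwise interpretation (the probability that a randomly chosen delayed flight gets a higher score than a randomly chosen on-time flight), which is why no threshold appears in it.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics for the positive (delayed) class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def pairwise_auc(scores, labels):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts: 15 delays caught, 5 false alarms, 10 delays missed, 70 correct on-time.
metrics = classification_metrics(tp=15, fp=5, fn=10, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Hypothetical model scores (higher = more likely delayed) with their true labels.
auc = pairwise_auc(scores=[0.9, 0.8, 0.4, 0.3, 0.2], labels=[1, 0, 1, 0, 0])
print(f"AUC-ROC: {auc:.4f}")
```

The pairwise loop is quadratic in the number of rows, so it is only meant to illustrate the definition on small inputs.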

Evaluating with MLlib

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("ClassificationEval") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES"
)
rf = RandomForestClassifier(featuresCol="FEATURES", labelCol="LABEL", numTrees=20, maxDepth=5, seed=42)

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

# AUC-ROC
binary_evaluator = BinaryClassificationEvaluator(labelCol="LABEL", metricName="areaUnderROC")
print(f"AUC-ROC: {binary_evaluator.evaluate(predictions):.4f}")

# Accuracy, F1, Precision, Recall
multi_evaluator = MulticlassClassificationEvaluator(labelCol="LABEL", predictionCol="prediction")
for metric in ["accuracy", "f1", "weightedPrecision", "weightedRecall"]:
    multi_evaluator.setMetricName(metric)
    print(f"{metric}: {multi_evaluator.evaluate(predictions):.4f}")
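
One caveat when reading this output: weightedPrecision and weightedRecall average the per-class values over both classes, weighted by class size, so they are not the delayed-class precision and recall defined under Key Metrics. Weighted recall in particular always equals accuracy. The sketch below shows this in plain Python on a hypothetical confusion table of the shape that predictions.groupBy("LABEL", "prediction").count() would produce; the counts are made up, and recall_for and support_share are illustrative helper names.

```python
# Hypothetical (label, prediction) -> count table; the numbers are made up,
# not results from the flights dataset. 0.0 = on time, 1.0 = delayed.
counts = {
    (0.0, 0.0): 70,  # on time, predicted on time
    (0.0, 1.0): 5,   # on time, predicted delayed
    (1.0, 0.0): 10,  # delayed, predicted on time
    (1.0, 1.0): 15,  # delayed, predicted delayed
}

total = sum(counts.values())
classes = sorted({label for label, _ in counts})


def recall_for(cls):
    """Recall of one class: correct predictions of cls / all rows labelled cls."""
    support = sum(n for (label, _), n in counts.items() if label == cls)
    return counts.get((cls, cls), 0) / support


def support_share(cls):
    """Fraction of all rows whose true label is cls."""
    return sum(n for (label, _), n in counts.items() if label == cls) / total


accuracy = sum(counts.get((c, c), 0) for c in classes) / total
weighted_recall = sum(support_share(c) * recall_for(c) for c in classes)
recall_delayed = recall_for(1.0)

print(f"accuracy:         {accuracy:.4f}")
print(f"weighted recall:  {weighted_recall:.4f}")
print(f"recall (delayed): {recall_delayed:.4f}")
```

Spark 3.0 and later also accept metricName="recallByLabel" (or "precisionByLabel") together with metricLabel=1.0 on MulticlassClassificationEvaluator, which should return the delayed-class value directly; treat that as an assumption to check against the installed version.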

Why is accuracy a misleading metric for imbalanced classification problems?

Section 1. Chapter 4
