Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lære Classification with Logistic Regression | Section
Machine Learning with PySpark

Classification with Logistic Regression

Sveip for å vise menyen

Logistic Regression is the standard baseline for binary classification. Despite the name, it is a classification algorithm – it outputs a probability between 0 and 1 and classifies each row based on a threshold (default 0.5).

Building the Feature Pipeline

1234567891011121314151617181920212223242526
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler from pyspark.ml.classification import LogisticRegression urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("LogisticRegression") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) flights_df = flights_df \ .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

Training the Model

1234567891011
# Defining pipeline stages indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX") assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"], outputCol="FEATURES_RAW" ) scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True) lr = LogisticRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10) pipeline = Pipeline(stages=[indexer, assembler, scaler, lr]) model = pipeline.fit(train_df)

Generating Predictions

1234
predictions = model.transform(test_df) # Showing label, probability, and prediction for each row predictions.select("LABEL", "probability", "prediction").show(5)

The probability column is a vector of two values – the probability of class 0 and class 1. The prediction column is the final binary output.

question mark

What does the probability column in logistic regression predictions contain?

Velg det helt riktige svaret

Alt var klart?

Hvordan kan vi forbedre det?

Takk for tilbakemeldingene dine!

Seksjon 1. Kapittel 2

Spør AI

expand

Spør AI

ChatGPT

Spør om hva du vil, eller prøv ett av de foreslåtte spørsmålene for å starte chatten vår

Seksjon 1. Kapittel 2
some-alt