Classification with Logistic Regression

Logistic Regression is a standard baseline for binary classification. Despite the name, it is a classification algorithm, not a regression one: it outputs a probability between 0 and 1 and assigns each row to a class by comparing that probability with a threshold (0.5 by default).
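
To make the probability-then-threshold step concrete, here is a minimal plain-Python sketch. The weights, bias, and feature values are made up for illustration and do not come from the flights data, and the function names are hypothetical helpers rather than part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum of the features."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def classify(probability, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, otherwise 0."""
    return 1 if probability > threshold else 0

# Hypothetical weights, bias, and a single hypothetical row
weights = [0.8, -0.3]
bias = -0.5
row = [2.0, 1.0]

p = predict_proba(row, weights, bias)
print(round(p, 3), classify(p))
```

A fitted model learns the weights and bias from the training data; the thresholding step stays the same.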

Building the Feature Pipeline


import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

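# Downloading the flights dataset to the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

# Starting a local Spark session that uses all available cores
spark = SparkSession.builder \
    .appName("LogisticRegression") \
    .master("local[*]") \
    .getOrCreate()

# Reading the CSV and replacing missing values in the numeric columns with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Deriving the label and two extra features:
#   LABEL          - 1.0 when the flight arrived more than 15 minutes late
#   DEPARTURE_HOUR - hour part of SCHEDULED_DEPARTURE (assumed to be stored as HHMM)
#   IS_WEEKEND     - 1 when DAY_OF_WEEK is 6 or higher
flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

# Splitting into 80% training and 20% test data with a fixed seed
train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)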
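
The three derived columns can be checked by hand with a plain-Python sketch of the same arithmetic. It assumes that SCHEDULED_DEPARTURE is an integer in HHMM form (for example 1435 for 14:35) and that DAY_OF_WEEK runs from 1 to 7 with 6 and 7 as the weekend; both are assumptions about the dataset, and the function names are hypothetical.

```python
def departure_hour(scheduled_departure: int) -> int:
    """Hour part of an HHMM integer, mirroring floor(SCHEDULED_DEPARTURE / 100)."""
    return scheduled_departure // 100

def is_weekend(day_of_week: int) -> int:
    """1 when the day number is 6 or higher, mirroring DAY_OF_WEEK >= 6."""
    return int(day_of_week >= 6)

def delay_label(arrival_delay: float) -> float:
    """1.0 when the arrival delay is strictly above 15 minutes, else 0.0."""
    return float(arrival_delay > 15)

# Hypothetical rows: (SCHEDULED_DEPARTURE, DAY_OF_WEEK, ARRIVAL_DELAY)
sample_rows = [(1435, 3, 22.0), (5, 6, 15.0), (2359, 7, -4.0)]

for dep, day, delay in sample_rows:
    print(departure_hour(dep), is_weekend(day), delay_label(delay))
```

Note that a delay of exactly 15 minutes falls into class 0, because the comparison is strict.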

Training the Model


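# Defining pipeline stages
# Encoding the airline code as a numeric index; the index is used as a plain
# numeric feature here (one-hot encoding is a common alternative)
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
# Combining all input columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES_RAW"
)
# Standardizing each feature to zero mean and unit standard deviation
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
# Limiting the optimizer to 10 iterations
lr = LogisticRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10)

# Chaining the stages and fitting them on the training data only
pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(train_df)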
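
As a rough illustration of what the scaling stage does to a single feature, the sketch below standardizes a list of values in plain Python. It assumes the scaler uses the sample standard deviation (dividing by n - 1), which is how the Spark documentation describes StandardScaler; the distance values and the function name are hypothetical.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation.

    Needs at least two values. A constant column has zero deviation,
    so it is mapped to zeros here instead of dividing by zero.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical distances in miles
distances = [200.0, 400.0, 600.0, 800.0]
scaled = standardize(distances)
print([round(v, 3) for v in scaled])
```

After scaling, features measured in very different units (minutes of delay, miles of distance) sit on a comparable scale, which generally helps the optimizer converge.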

Generating Predictions


predictions = model.transform(test_df)

# Showing label, probability, and prediction for each row
predictions.select("LABEL", "probability", "prediction").show(5)

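The probability column is a vector of two values: the probability of class 0 followed by the probability of class 1. The prediction column is the final binary output, 1.0 when the class 1 probability exceeds the threshold and 0.0 otherwise.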
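
The relationship between the two columns can be sketched in plain Python. The probability pairs below are invented for illustration, and `to_prediction` is a hypothetical helper, not a Spark function. In PySpark itself the cut-off can be changed through the `threshold` parameter of `LogisticRegression`.

```python
def to_prediction(probability, threshold=0.5):
    """Turn a [P(class 0), P(class 1)] pair into a 0.0 / 1.0 prediction."""
    p_class_1 = probability[1]
    return 1.0 if p_class_1 > threshold else 0.0

# Hypothetical probability vectors; each pair sums to 1
rows = [[0.91, 0.09], [0.45, 0.55], [0.20, 0.80]]

default_preds = [to_prediction(p) for p in rows]                # threshold 0.5
strict_preds = [to_prediction(p, threshold=0.7) for p in rows]  # stricter cut-off

print(default_preds)  # [0.0, 1.0, 1.0]
print(strict_preds)   # [0.0, 0.0, 1.0]
```

Raising the threshold flags fewer flights as delayed, trading missed delays for fewer false alarms.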
