Classification with Logistic Regression
Stryg for at vise menuen
Logistic Regression is the standard baseline for binary classification. Despite the name, it is a classification algorithm – it outputs a probability between 0 and 1 and classifies each row based on a threshold (default 0.5).
Building the Feature Pipeline
1234567891011121314151617181920212223242526import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler from pyspark.ml.classification import LogisticRegression urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("LogisticRegression") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) flights_df = flights_df \ .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)
Training the Model
1234567891011# Defining pipeline stages indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX") assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"], outputCol="FEATURES_RAW" ) scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True) lr = LogisticRegression(featuresCol="FEATURES", labelCol="LABEL", maxIter=10) pipeline = Pipeline(stages=[indexer, assembler, scaler, lr]) model = pipeline.fit(train_df)
Generating Predictions
1234predictions = model.transform(test_df) # Showing label, probability, and prediction for each row predictions.select("LABEL", "probability", "prediction").show(5)
The probability column is a vector of two values – the probability of class 0 and class 1. The prediction column is the final binary output.
Var alt klart?
Tak for dine kommentarer!
Sektion 1. Kapitel 2
Spørg AI
Spørg AI
Spørg om hvad som helst eller prøv et af de foreslåede spørgsmål for at starte vores chat
Sektion 1. Kapitel 2