Machine Learning with PySpark

Clustering with K-Means


Clustering is unsupervised: there are no labels. The algorithm finds natural groupings in the data based on feature similarity. K-Means partitions rows into k clusters by minimizing the sum of squared distances between each point and the center (centroid) of the cluster it is assigned to.
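To make the objective concrete, here is a minimal pure-Python sketch of the classic assign-then-update loop (Lloyd's algorithm). It is a toy illustration only, not how Spark implements `KMeans` internally; the function names and the sample points are made up for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=20, seed=42):
    """Toy Lloyd's algorithm: returns (centers, assignments, cost)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k rows as the initial centers
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the centers would not change either
        assignments = new_assignments
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # The quantity K-Means tries to minimize: sum of squared distances
    # from every point to the center of its own cluster.
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignments))
    return centers, assignments, cost


# Made-up points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments, cost = kmeans(points, k=2)
print(assignments, cost)
```

Each iteration can only lower (or keep) the cost, which is why the loop stops once the assignments no longer change.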

Preparing Data for Clustering

For clustering, you group airports by their delay patterns – average departure delay, average arrival delay, and total flight volume:

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, round
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("KMeansClustering") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Aggregating per airport
airport_df = flights_df.groupBy("ORIGIN_AIRPORT").agg(
    round(avg("DEPARTURE_DELAY"), 2).alias("AVG_DEP_DELAY"),
    round(avg("ARRIVAL_DELAY"), 2).alias("AVG_ARR_DELAY"),
    count("*").alias("TOTAL_FLIGHTS")
).filter(col("TOTAL_FLIGHTS") > 100)

airport_df.show(5)

Training K-Means

assembler = VectorAssembler(
    inputCols=["AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS"],
    outputCol="FEATURES_RAW"
)
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
kmeans = KMeans(featuresCol="FEATURES", k=3, seed=42, maxIter=20)

pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(airport_df)

clustered_df = model.transform(airport_df)
clustered_df.select("ORIGIN_AIRPORT", "AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS", "prediction").show(10)
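The pipeline standardizes the features before clustering because K-Means relies on distances, and a feature with a large numeric range (here, flight counts) would otherwise dominate them. The following pure-Python sketch illustrates the effect on three hypothetical airports; the numbers are invented for illustration and are not taken from flights.csv. It uses the sample standard deviation, which is also what Spark's `StandardScaler` is documented to use.

```python
import statistics

# Hypothetical airports (made-up numbers, not taken from flights.csv):
# (average departure delay in minutes, total flights)
airports = {
    "A": (5.0, 200),
    "B": (6.0, 50000),
    "C": (20.0, 300),
}


def standardize(column):
    """Z-score a column using the sample standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [(x - mean) / std for x in column]


def squared_gap_per_feature(p, q):
    """Per-feature squared differences; their sum is the squared distance."""
    return [(a - b) ** 2 for a, b in zip(p, q)]


names = list(airports)
delays = [airports[n][0] for n in names]
flights = [airports[n][1] for n in names]

# Rebuild each airport's feature pair from the standardized columns.
scaled = dict(zip(names, zip(standardize(delays), standardize(flights))))

# Compare airports A and C: very different delays, nearly equal volume.
raw_gap = squared_gap_per_feature(airports["A"], airports["C"])
scaled_gap = squared_gap_per_feature(scaled["A"], scaled["C"])

print("raw    (delay, flights):", raw_gap)
print("scaled (delay, flights):", scaled_gap)
```

On the raw values, the small difference in flight counts contributes more to the distance than the fourfold difference in delay; after standardization the delay difference is what separates A from C.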

Inspecting Cluster Centers

# Extracting cluster centers from the KMeans model stage
kmeans_model = model.stages[-1]
centers = kmeans_model.clusterCenters()
for i, center in enumerate(centers):
    print(f"Cluster {i}: {center}")
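Because K-Means was trained on the standardized FEATURES column, the printed centers are in standardized units rather than minutes and flight counts. One way to interpret them is to reverse the scaling. The sketch below shows the arithmetic with hypothetical mean and standard deviation values; `unscale_center` is a helper invented for this example, and the comment about reading `mean` and `std` from the fitted scaler stage is an assumption about how you would wire it to the pipeline above.

```python
def unscale_center(center, mean, std):
    """Map a standardized center back to original units: x = z * std + mean."""
    return [z * s + m for z, m, s in zip(center, mean, std)]


# Hypothetical values for illustration only. With the fitted pipeline,
# the real ones would presumably come from the scaler stage, e.g.:
#   scaler_model = model.stages[1]
#   mean, std = scaler_model.mean, scaler_model.std
mean = [10.0, 8.0, 5000.0]   # AVG_DEP_DELAY, AVG_ARR_DELAY, TOTAL_FLIGHTS
std = [2.0, 4.0, 1000.0]
center = [0.0, 1.0, -1.0]    # a center in standardized space

original_units = unscale_center(center, mean, std)
print(original_units)
```

A standardized coordinate of 0 corresponds to the feature's mean, and each unit away from 0 is one standard deviation.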

What does K-Means minimize when assigning points to clusters?


Section 1. Chapter 8
