Clustering with K-Means
Clustering is unsupervised: the data has no label column. The algorithm finds natural groupings in the data based on feature similarity. K-Means partitions rows into k clusters by minimizing the sum of squared distances between each point and the center (centroid) of its assigned cluster.
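To make the idea concrete, here is a minimal pure-Python sketch of the classic two-step K-Means loop (assign each point to its nearest center, then move each center to the mean of its members). It is an illustration only, not Spark's implementation: the function names, the toy points, and the random choice of starting centers are assumptions made for this example, and Spark's `KMeans` uses a different initialization by default.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20, seed=42):
    """Toy K-Means: returns (centers, assignments)."""
    rng = random.Random(seed)
    # Simplification: start from k randomly chosen data points
    centers = rng.sample(points, k)
    assignments = None

    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments

        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return centers, assignments


# Hypothetical toy data: two visibly separate groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

The `max_iter` and `seed` arguments mirror the `maxIter` and `seed` parameters passed to Spark's `KMeans` later in this chapter: one caps the number of assign/update rounds, the other makes the starting centers reproducible.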
Preparing Data for Clustering
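To cluster airports by their delay patterns, aggregate the flight records per origin airport into three features: average departure delay, average arrival delay, and total flight volume. In the code below, missing delay values are filled with 0 before aggregation, and only airports with more than 100 flights are kept: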
```python
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, round
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("KMeansClustering") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Aggregating per airport
airport_df = flights_df.groupBy("ORIGIN_AIRPORT").agg(
    round(avg("DEPARTURE_DELAY"), 2).alias("AVG_DEP_DELAY"),
    round(avg("ARRIVAL_DELAY"), 2).alias("AVG_ARR_DELAY"),
    count("*").alias("TOTAL_FLIGHTS")
).filter(col("TOTAL_FLIGHTS") > 100)

airport_df.show(5)
```
Training K-Means
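```python
# Combine the three numeric columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS"],
    outputCol="FEATURES_RAW"
)

# Standardize each feature (zero mean, unit standard deviation) so that
# TOTAL_FLIGHTS, which is on a much larger scale, does not dominate distances
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)

# k=3 clusters; seed makes the result reproducible; maxIter caps the iterations
kmeans = KMeans(featuresCol="FEATURES", k=3, seed=42, maxIter=20)

pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(airport_df)

# transform() adds a "prediction" column holding each airport's cluster index
clustered_df = model.transform(airport_df)
clustered_df.select("ORIGIN_AIRPORT", "AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS", "prediction").show(10)
```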
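The pipeline standardizes the features before clustering because K-Means relies on distances, and a column measured in thousands of flights would otherwise swamp columns measured in minutes. The sketch below shows the arithmetic of that scaling step in plain Python, assuming the sample standard deviation (dividing by n - 1), which is what Spark's `StandardScaler` documents. The helper name and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [(x - mu) / sigma for x in column]


# Hypothetical per-airport values, not taken from flights.csv
avg_dep_delay = [5.0, 10.0, 15.0]             # minutes
total_flights = [1000.0, 20000.0, 300000.0]   # flight counts

scaled_delay = standardize(avg_dep_delay)
scaled_flights = standardize(total_flights)

print(scaled_delay)
print(scaled_flights)
```

After scaling, both columns are expressed in standard deviations from their own mean, so a difference of one unit means the same thing in either column when distances are computed.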
Inspecting Cluster Centers
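```python
# Extracting cluster centers from the KMeans model stage (the last pipeline stage)
kmeans_model = model.stages[-1]
centers = kmeans_model.clusterCenters()

# Note: K-Means was fit on the standardized FEATURES column, so each center is
# expressed in scaled units, not in minutes or raw flight counts
for i, center in enumerate(centers):
    print(f"Cluster {i}: {center}")
```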
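Since the model was trained on standardized features, the printed centers are easier to interpret once they are mapped back to the original units. With `withMean=True` and `withStd=True`, that mapping is `original = scaled * std + mean` for each feature. The sketch below shows the calculation in plain Python; the helper name and all numbers are hypothetical. In the pipeline above, the per-feature means and standard deviations would presumably come from the fitted scaler stage (`model.stages[1]`, via its `mean` and `std` attributes), which is an assumption to verify against your Spark version.

```python
def unscale_center(center, means, stds):
    """Map a standardized cluster center back to original feature units."""
    return [c * s + m for c, m, s in zip(center, means, stds)]


# Hypothetical scaler statistics for
# [AVG_DEP_DELAY, AVG_ARR_DELAY, TOTAL_FLIGHTS]
means = [10.0, 6.0, 50000.0]
stds = [4.0, 5.0, 80000.0]

# Hypothetical center in standardized space
scaled_center = [0.5, -0.2, 1.5]

original_center = unscale_center(scaled_center, means, stds)
print(original_center)
```

Reading a center in original units makes it possible to describe a cluster in plain terms, for example as a group of high-volume airports with above-average departure delays.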
Section 1. Chapter 8