Learn Clustering with K-Means

Clustering is unsupervised: there are no labels to predict. The algorithm finds natural groupings in the data based on feature similarity. K-Means partitions rows into k clusters by minimizing the total squared distance between each point and the center (centroid) of the cluster it is assigned to.
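
To make the idea concrete, here is a minimal pure-Python sketch of the classic K-Means loop (often called Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. It is only an illustration; Spark's implementation is distributed and uses a smarter initialization, and the function names and sample points below are made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous center.
            new_centers.append(old_center)
            continue
        new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centers


def kmeans(points, k, max_iter=20, seed=42):
    """Toy K-Means: random initial centers, then alternate assign/update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assign(points, centers)


# Two visibly separated groups of made-up 2-D points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

The `k`, `seed`, and `maxIter` arguments passed to Spark's `KMeans` later in this chapter play the same roles as `k`, `seed`, and `max_iter` in this sketch.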

Preparing Data for Clustering

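Here, airports are grouped by their delay patterns, described by three features per airport: average departure delay, average arrival delay, and total flight volume. Airports with 100 or fewer flights are dropped, since averages computed from only a few flights are noisy: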


import urllib.request
from pyspark.sql import SparkSession
# Note: this `round` is Spark's column function and shadows Python's built-in round()
from pyspark.sql.functions import col, avg, count, round
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("KMeansClustering") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Aggregating per airport
airport_df = flights_df.groupBy("ORIGIN_AIRPORT").agg(
    round(avg("DEPARTURE_DELAY"), 2).alias("AVG_DEP_DELAY"),
    round(avg("ARRIVAL_DELAY"), 2).alias("AVG_ARR_DELAY"),
    count("*").alias("TOTAL_FLIGHTS")
).filter(col("TOTAL_FLIGHTS") > 100)

airport_df.show(5)

Training K-Means


assembler = VectorAssembler(
    inputCols=["AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS"],
    outputCol="FEATURES_RAW"
)
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES", withMean=True, withStd=True)
kmeans = KMeans(featuresCol="FEATURES", k=3, seed=42, maxIter=20)

pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(airport_df)
clustered_df = model.transform(airport_df)

clustered_df.select("ORIGIN_AIRPORT", "AVG_DEP_DELAY", "AVG_ARR_DELAY", "TOTAL_FLIGHTS", "prediction").show(10)
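
The `StandardScaler` stage matters because K-Means relies on distances: without scaling, `TOTAL_FLIGHTS` (thousands of flights) would dominate the delay columns (minutes). The sketch below shows, in plain Python, what standardizing a single column with `withMean=True, withStd=True` amounts to. The numbers are made up, and it assumes the sample standard deviation (n - 1 denominator), which to my understanding is also what Spark's `StandardScaler` uses.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a column on 0 and scale it to unit (sample) standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation, n - 1 denominator
    return [(x - mu) / sigma for x in column]


# Hypothetical per-airport values on very different scales
avg_dep_delay = [5.0, 10.0, 15.0]            # minutes
total_flights = [1000.0, 20000.0, 39000.0]   # flight counts

print(standardize(avg_dep_delay))
print(standardize(total_flights))
```

After standardization both columns live on the same scale, so neither one outweighs the other when distances to the cluster centers are computed.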

Inspecting Cluster Centers


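# Extracting cluster centers from the KMeans model stage (the last pipeline stage).
# KMeans was fit on the standardized FEATURES column, so the centers are in
# standardized units, not raw minutes or flight counts.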
kmeans_model = model.stages[-1]
centers = kmeans_model.clusterCenters()
for i, center in enumerate(centers):
    print(f"Cluster {i}: {center}")
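
Because the centers are reported in standardized units, they are easier to interpret after mapping them back to the original scale by reversing the scaler: multiply by the standard deviation and add the mean. The sketch below does this in plain Python with made-up numbers; `unscale_center` is a hypothetical helper. In the pipeline above, the per-feature means and standard deviations should be available from the fitted scaler stage (`model.stages[1]`) through its `mean` and `std` attributes, but treat that as an assumption to verify against your Spark version.

```python
def unscale_center(center, means, stds):
    """Map a standardized cluster center back to the original feature units."""
    return [c * s + m for c, m, s in zip(center, means, stds)]


# Hypothetical values, ordered as AVG_DEP_DELAY, AVG_ARR_DELAY, TOTAL_FLIGHTS
means = [10.0, 8.0, 20000.0]
stds = [5.0, 4.0, 19000.0]
scaled_center = [1.0, -0.5, 2.0]

print(unscale_center(scaled_center, means, stds))
```

A center one standard deviation above the mean departure delay then reads directly as a delay in minutes, which makes clusters such as "busy, frequently delayed airports" easier to label.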
