Learn Decision Trees and Random Forests

Decision Trees split the data recursively on feature thresholds, forming a tree of if-else rules. A single tree is easy to interpret but prone to overfitting. Random Forests address this by training many trees, each on a random sample of the rows and with a random subset of the features considered at each split, then combining their predictions (a vote for classification, an average for regression).
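To make the "if-else rules" and "combining predictions" ideas concrete, here is a minimal pure-Python sketch. The three hand-written trees, their feature names, and their thresholds are hypothetical and chosen only for illustration; a real forest learns its trees from data, and Spark's own aggregation may differ in detail.

```python
from collections import Counter

# Each "tree" is just nested if-else rules on feature thresholds.
# All thresholds below are made up for illustration, not learned from data.
def tree_a(flight):
    if flight["departure_delay"] > 10:
        return 1
    return 0

def tree_b(flight):
    if flight["departure_delay"] > 20:
        return 1
    if flight["distance"] > 2000 and flight["departure_delay"] > 5:
        return 1
    return 0

def tree_c(flight):
    if flight["departure_hour"] >= 18:
        return 1 if flight["departure_delay"] > 0 else 0
    return 1 if flight["departure_delay"] > 15 else 0

def forest_predict(trees, flight):
    """Combine the individual trees' predictions by majority vote."""
    votes = Counter(tree(flight) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
flight = {"departure_delay": 12, "distance": 800, "departure_hour": 9}

votes = [tree(flight) for tree in trees]
print("individual votes:", votes)
print("forest prediction:", forest_predict(trees, flight))
```

A single tree can be swayed by one quirky rule; here one tree flags the flight as delayed, but the other two outvote it.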

Decision Tree Classifier


import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("TreeModels") \
    .master("local[*]") \
    .getOrCreate()

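# Load the CSV and replace missing values in the numeric columns with 0.
# Note: a row with a missing ARRIVAL_DELAY becomes 0 and is labeled "not delayed" below.
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# LABEL: 1.0 when ARRIVAL_DELAY exceeds 15, otherwise 0.0.
# DEPARTURE_HOUR: hour part of SCHEDULED_DEPARTURE (assumes an HHMM-style value).
# IS_WEEKEND: 1 when DAY_OF_WEEK is 6 or higher, otherwise 0.
flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))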

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

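# Encode the AIRLINE string column as a numeric category index.
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
# Combine the feature columns into the single vector column that Spark ML expects.
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES"
)

# Chain the stages so the same preprocessing runs at fit time and at prediction time.
dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5)
pipeline = Pipeline(stages=[indexer, assembler, dt])
dt_model = pipeline.fit(train_df)

# Score the held-out split and preview actual vs. predicted labels.
predictions = dt_model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)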

maxDepth caps how many levels of splits the tree can grow (Spark's default is 5, and the maximum supported value is 30). Deeper trees fit the training data more closely but overfit more easily.
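One way to see why depth matters is to count how large a tree can get. Spark's trees are binary, so each extra level can double the number of leaves. The helper below is a hypothetical illustration, not part of PySpark, and it gives an upper bound only: real trees are often smaller because splitting stops early when a node has too few rows or a split gains too little.

```python
def max_tree_size(max_depth):
    """Upper bound on (internal split nodes, leaf nodes) for a binary tree."""
    leaves = 2 ** max_depth
    internal = 2 ** max_depth - 1
    return internal, leaves

for depth in (0, 1, 5, 10):
    internal, leaves = max_tree_size(depth)
    print(f"maxDepth={depth}: up to {internal} splits and {leaves} leaves")
```

With maxDepth=5 the tree can carve the data into at most 32 regions; at maxDepth=10 that ceiling is 1024, which leaves far more room to memorize noise in the training set.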

Random Forest Classifier


from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="FEATURES",
    labelCol="LABEL",
    numTrees=20,
    maxDepth=5,
    seed=42
)

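# Reuse the indexer and assembler stages defined for the decision tree pipeline.
pipeline = Pipeline(stages=[indexer, assembler, rf])
rf_model = pipeline.fit(train_df)
predictions = rf_model.transform(test_df)

# Inspect feature importances: the fitted forest is the last stage of the pipeline.
rf_stage = rf_model.stages[-1]
print(rf_stage.featureImportances)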

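featureImportances returns a vector with one score per input feature, in the same order as the VectorAssembler's inputCols. The scores are normalized to sum to 1 and reflect how much each feature contributed to the splits across all trees, which is useful for understanding which columns matter most.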
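The printed vector is hard to read on its own, so it can help to pair each score with its column name. The sketch below uses a hypothetical helper, rank_features, and placeholder scores that are not results from the model; with the fitted pipeline above, the real scores would come from rf_stage.featureImportances.toArray().

```python
def rank_features(names, importances):
    """Pair each feature name with its importance score, largest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    pairs = zip(names, (float(value) for value in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Same order as the VectorAssembler's inputCols.
feature_names = [
    "DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
    "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX",
]

# Placeholder values for illustration only, NOT output from the trained forest.
# Real usage (assumption): example_importances = rf_stage.featureImportances.toArray()
example_importances = [0.5, 0.1, 0.1, 0.2, 0.0, 0.1]

ranked = rank_features(feature_names, example_importances)
for name, score in ranked:
    print(f"{name:16s} {score:.2f}")
```

Sorting by score puts the most influential columns at the top, which makes it easier to spot features that contribute little and could be dropped.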
