Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Learn Decision Trees and Random Forests | Section
Machine Learning with PySpark

Decision Trees and Random Forests

Swipe to show menu

Decision Trees split the data recursively based on feature thresholds, forming a tree of if-else rules. They are interpretable but prone to overfitting. Random Forests address this by training many trees on random subsets of the data and features, then averaging their predictions.

Decision Tree Classifier

123456789101112131415161718192021222324252627282930313233343536373839
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer, VectorAssembler from pyspark.ml.classification import DecisionTreeClassifier urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("TreeModels") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) flights_df = flights_df \ .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42) indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX") assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"], outputCol="FEATURES" ) dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5) pipeline = Pipeline(stages=[indexer, assembler, dt]) dt_model = pipeline.fit(train_df) predictions = dt_model.transform(test_df) predictions.select("LABEL", "prediction").show(5)

maxDepth controls how deep the tree can grow. Deeper trees fit training data better but overfit more easily.

Random Forest Classifier

1234567891011121314151617
from pyspark.ml.classification import RandomForestClassifier rf = RandomForestClassifier( featuresCol="FEATURES", labelCol="LABEL", numTrees=20, maxDepth=5, seed=42 ) pipeline = Pipeline(stages=[indexer, assembler, rf]) rf_model = pipeline.fit(train_df) predictions = rf_model.transform(test_df) # Inspecting feature importances rf_stage = rf_model.stages[-1] print(rf_stage.featureImportances)

featureImportances shows how much each feature contributed to the splits across all trees – useful for understanding which columns matter most.

question mark

Why do Random Forests generally outperform a single Decision Tree?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

Section 1. Chapter 3

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Section 1. Chapter 3
some-alt