Machine Learning with PySpark

Decision Trees and Random Forests


Decision Trees split the data recursively on feature thresholds, forming a tree of if-else rules. They are easy to interpret but prone to overfitting. Random Forests reduce this by training many trees, each on a random sample of the rows and with a random subset of features considered at each split, then combining the trees' predictions.
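To make "splitting on a feature threshold" concrete, here is a minimal pure-Python sketch of how a single split could be chosen using Gini impurity (the default impurity measure for Spark's tree classifiers). The helper names `gini` and `best_threshold` and the toy data are assumptions made for illustration; Spark itself searches over binned candidate thresholds rather than every distinct value.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best rule 'value <= threshold'."""
    best = (None, gini(labels))  # baseline: impurity without any split
    n = len(values)
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: departure delay in minutes and a late-arrival label.
delays = [0, 3, 5, 12, 20, 35, 50, 90]
late = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(delays, late)
print(threshold, impurity)  # prints: 12 0.0
```

A tree repeats this search on each resulting subset until a stopping rule, such as a maximum depth, is reached.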

Decision Tree Classifier

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Download the flights dataset to the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("TreeModels") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV and replace missing values in the numeric columns with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# LABEL is 1.0 when the flight arrived more than 15 minutes late;
# DEPARTURE_HOUR and IS_WEEKEND are derived features
flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

# Encode the airline code as a numeric index, then assemble all features into one vector
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
               "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES"
)

dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5)

# Fit the whole pipeline on the training split and predict on the test split
pipeline = Pipeline(stages=[indexer, assembler, dt])
dt_model = pipeline.fit(train_df)

predictions = dt_model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)

maxDepth controls how deep the tree can grow. Deeper trees fit training data better but overfit more easily.
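One way to see why depth matters: Spark's trees are binary, so the number of distinct leaf rules a tree can hold grows exponentially with its depth. The small sketch below only computes these upper bounds; the function names `max_leaves` and `max_nodes` are illustrative, and an actual fitted tree is usually smaller than the bound.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth


def max_nodes(max_depth):
    """Upper bound on the total node count (internal nodes plus leaves)."""
    return 2 ** (max_depth + 1) - 1


for depth in (1, 5, 10):
    print(depth, max_leaves(depth), max_nodes(depth))
# prints:
# 1 2 3
# 5 32 63
# 10 1024 2047
```

With maxDepth=5 the tree can describe at most 32 regions of the feature space; at depth 10 it can describe up to 1024, which leaves far more room to memorize noise in the training data.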

Random Forest Classifier

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="FEATURES",
    labelCol="LABEL",
    numTrees=20,
    maxDepth=5,
    seed=42
)

pipeline = Pipeline(stages=[indexer, assembler, rf])
rf_model = pipeline.fit(train_df)

predictions = rf_model.transform(test_df)

# Inspecting feature importances
rf_stage = rf_model.stages[-1]
print(rf_stage.featureImportances)

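featureImportances estimates how much each feature contributed to the splits across all trees, which is useful for understanding which columns matter most. The values are normalized to sum to 1 and follow the order of the VectorAssembler's inputCols.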
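The printed vector holds only numbers, so it helps to pair each value with its column name. The sketch below shows one way to do that in plain Python; `rank_importances` is a hypothetical helper, and the importance values are made-up placeholders, not output from the model above.

```python
def rank_importances(feature_names, importances):
    """Pair feature names with importance values, sorted from highest to lowest."""
    values = [float(v) for v in importances]
    if len(values) != len(feature_names):
        raise ValueError("expected one importance value per feature")
    return sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)


# Same order as the VectorAssembler's inputCols.
feature_names = ["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
                 "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"]

# Placeholder values for illustration only (NOT real model output).
example_importances = [0.82, 0.05, 0.04, 0.06, 0.01, 0.02]

ranked = rank_importances(feature_names, example_importances)
for name, score in ranked:
    print(f"{name:<16} {score:.2f}")
```

With the fitted pipeline, the real values could be passed in as `rank_importances(feature_names, rf_stage.featureImportances.toArray())`.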


Why do Random Forests generally outperform a single Decision Tree?
