Decision Trees and Random Forests
Decision Trees split the data recursively on feature thresholds, forming a tree of if-else rules. They are easy to interpret but prone to overfitting, because a deep enough tree can memorize its training rows. Random Forests reduce this by training many trees, each on a random sample of the rows and a random subset of the features, and then averaging their predictions.
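To make the idea of a threshold split concrete, here is a minimal pure-Python sketch of how a single split on one numeric feature could be chosen. It assumes Gini impurity as the split criterion and tries every midpoint between distinct values; the function names and the toy data are hypothetical, and Spark itself bins continuous features rather than testing every midpoint.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    Returns (None, impurity_of_whole_node) if no split improves on the node.
    """
    n = len(values)
    best = (None, gini(labels))
    # Candidate thresholds: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data: departure delay in minutes, label 1 = arrived late.
delays = [0, 2, 5, 12, 20, 35, 50]
late = [0, 0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(delays, late)
print(threshold, impurity)
```

A full tree repeats this search on each resulting subset, over all features, until a stopping rule such as a maximum depth is reached.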
Decision Tree Classifier
```python
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Download the flights dataset to the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("TreeModels") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV and replace missing numeric values with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# LABEL is 1.0 when the flight arrived more than 15 minutes late;
# SCHEDULED_DEPARTURE is in HHMM form, so dividing by 100 yields the hour
flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

# Encode the airline code as a numeric index, then assemble one feature vector
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES"
)

dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5)
pipeline = Pipeline(stages=[indexer, assembler, dt])
dt_model = pipeline.fit(train_df)

# Compare true labels with predicted labels on the held-out rows
predictions = dt_model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)
```
maxDepth caps how many levels of splits the tree can grow (5 here, which is also Spark's default). Deeper trees fit the training data better but overfit more easily.
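A bit of arithmetic shows why depth and overfitting go together. The sketch below assumes binary splits, so a tree with `depth` levels of splits has at most 2**depth leaves; the helper names and the training-set size are hypothetical and chosen only for illustration.

```python
def max_leaves(depth):
    """A binary tree with `depth` levels of splits has at most 2**depth leaves."""
    return 2 ** depth


def avg_rows_per_leaf(n_rows, depth):
    """Average number of training rows per leaf if the tree is fully grown."""
    return n_rows / max_leaves(depth)


# Hypothetical training-set size, used only to illustrate the trend.
n_rows = 1_000_000
for depth in (2, 5, 10, 20):
    print(depth, max_leaves(depth), round(avg_rows_per_leaf(n_rows, depth), 1))
```

At depth 5 the tree has at most 32 leaves, each summarizing many rows. At depth 20 it can have over a million leaves, so on a dataset of this size each leaf could end up describing roughly a single training row, which is memorization rather than generalization.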
Random Forest Classifier
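Before the Spark code, here is a small pure-Python sketch of the two ideas behind a forest: each tree sees a bootstrap sample of the rows, and the forest combines the trees' outputs. It assumes soft voting (averaging per-tree class probabilities); whether a given library uses soft or hard voting is worth checking in its documentation. All names and numbers below are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training sample."""
    return [rng.choice(rows) for _ in rows]


def average_probabilities(per_tree_probs):
    """Average per-tree class-probability vectors element-wise."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    return [sum(p[k] for p in per_tree_probs) / n_trees for k in range(n_classes)]


def predict_class(per_tree_probs):
    """Pick the class index with the highest averaged probability."""
    avg = average_probabilities(per_tree_probs)
    return max(range(len(avg)), key=avg.__getitem__)


rng = random.Random(42)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(len(sample), len(set(sample)))  # same size; usually fewer distinct rows

# Made-up outputs of three trees for one flight: [P(on time), P(late)].
tree_outputs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
print(average_probabilities(tree_outputs), predict_class(tree_outputs))
```

In this made-up case two of the three trees lean towards "late", yet the averaged probabilities favour "on time" because the first tree is much more confident. That is the practical difference between averaging probabilities and counting votes.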
```python
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="FEATURES",
    labelCol="LABEL",
    numTrees=20,
    maxDepth=5,
    seed=42
)

pipeline = Pipeline(stages=[indexer, assembler, rf])
rf_model = pipeline.fit(train_df)
predictions = rf_model.transform(test_df)

# Inspecting feature importances
rf_stage = rf_model.stages[-1]
print(rf_stage.featureImportances)
```
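featureImportances estimates how much each feature contributed to the splits across all trees, normalized so that the values sum to 1. The entries follow the order of the VectorAssembler inputCols, which makes it useful for understanding which columns matter most.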
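The printed vector holds only numbers, so it helps to pair each value with its column name. The helper below is a hypothetical sketch that works on plain sequences; the commented call shows how it might be used with the pipeline objects, and the sample values are made up purely to show the output shape.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, sorted from most to least important."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# With the pipeline from this chapter, the call would look roughly like:
#   rank_importances(assembler.getInputCols(), rf_stage.featureImportances.toArray())
# The values below are invented and are not results from the flights data.
names = ["DEPARTURE_DELAY", "DISTANCE", "AIRLINE_IDX"]
values = [0.8, 0.05, 0.15]
for name, score in rank_importances(names, values):
    print(f"{name:<16} {score:.3f}")
```

Sorting by score puts the most influential columns first, which is usually easier to read than the raw vector.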
Section 1. Chapter 3