Decision Trees and Random Forests
Decision trees split the data recursively on feature thresholds, forming a tree of if-else rules. They are interpretable but prone to overfitting. Random forests address this by training many trees, each on a random sample of the rows and with a random subset of the features considered at each split, then combining their predictions: a vote across trees for classification, an average for regression.
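To make the two ideas above concrete, here is a minimal pure-Python sketch, not Spark's implementation. It assumes Gini impurity as the split criterion and tries the midpoints between consecutive values as candidate thresholds. The toy delay data and all function names are hypothetical. For brevity it uses a single feature and depth-1 trees, so the per-split feature subsampling of a real random forest is left out.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best 'value <= threshold' split.

    The threshold is None when no split lowers the impurity.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(values, labels):
    """Fit a depth-1 tree (one if-else rule) and return it as a function."""
    t, _ = best_threshold(values, labels)
    if t is None:  # nothing to split on: always predict the majority class
        constant = majority(labels)
        return lambda x: constant
    left_label = majority([y for x, y in zip(values, labels) if x <= t])
    right_label = majority([y for x, y in zip(values, labels) if x > t])
    return lambda x: left_label if x <= t else right_label


def train_forest(values, labels, num_trees, seed):
    """Train one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(values)
    forest = []
    for _ in range(num_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(train_stump([values[i] for i in rows], [labels[i] for i in rows]))
    return forest


def forest_predict(forest, x):
    """Combine the per-tree predictions with a majority vote."""
    return majority([tree(x) for tree in forest])


# Hypothetical toy data: departure delay in minutes -> late arrival (1) or not (0).
delays = [0, 5, 8, 12, 20, 35, 50, 90]
late = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(delays, late)
print(threshold, impurity)  # 16.0 0.0

forest = train_forest(delays, late, num_trees=25, seed=42)
print(forest_predict(forest, 2), forest_predict(forest, 80))
```

Spark does not scan every midpoint like this. It discretizes continuous features into bins (the maxBins parameter) and evaluates splits at bin boundaries, which scales better on large datasets.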
Decision Tree Classifier
```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Download the flights dataset into the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("TreeModels") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV and replace missing values in the numeric columns with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# LABEL is 1.0 when the flight arrives more than 15 minutes late
flights_df = flights_df \
    .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

# Encode the airline code as a numeric index, then pack all features into one vector
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
               "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
    outputCol="FEATURES"
)

dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5)

pipeline = Pipeline(stages=[indexer, assembler, dt])
dt_model = pipeline.fit(train_df)

predictions = dt_model.transform(test_df)
predictions.select("LABEL", "prediction").show(5)
```
maxDepth controls how deep the tree can grow. Deeper trees fit training data better but overfit more easily.
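The effect of a depth limit can be seen on a toy problem. The following is a simplified pure-Python sketch, not Spark code. It assumes Gini impurity and midpoint thresholds on a single feature, and the noisy data points and function names are made up for illustration. It reports training accuracy only. On data like this, the deepest tree memorizes the noisy points, which is the overfitting the text warns about.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build_tree(points, max_depth):
    """Build a tree from (x, label) pairs; a leaf is a label, a node is a dict."""
    labels = [y for _, y in points]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the depth budget is used up or the node is already pure.
    if max_depth == 0 or gini(labels) == 0.0:
        return majority
    xs = sorted({x for x, _ in points})
    best = None
    for low, high in zip(xs, xs[1:]):
        t = (low + high) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:  # all x values identical: cannot split further
        return majority
    _, t, left, right = best
    return {
        "threshold": t,
        "left": build_tree(left, max_depth - 1),
        "right": build_tree(right, max_depth - 1),
    }


def predict(tree, x):
    """Follow the if-else rules down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)


# Hypothetical noisy data: label is mostly 0 for small x and 1 for large x.
points = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 0),
          (7, 1), (8, 1), (9, 0), (10, 1), (11, 1), (12, 1)]

depths = [0, 1, 2, 4, 11]
train_accuracy = [accuracy(build_tree(points, d), points) for d in depths]
for d, acc in zip(depths, train_accuracy):
    print(f"max_depth={d:<2} training accuracy={acc:.2f}")
```

Training accuracy never decreases as the limit is raised, because each deeper tree only refines the leaves of the shallower one. Whether accuracy on held-out data also improves has to be checked separately, for example on test_df.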
Random Forest Classifier
```python
from pyspark.ml.classification import RandomForestClassifier

# Reuses Pipeline, indexer, assembler, train_df and test_df from the decision tree example
rf = RandomForestClassifier(
    featuresCol="FEATURES",
    labelCol="LABEL",
    numTrees=20,
    maxDepth=5,
    seed=42
)

pipeline = Pipeline(stages=[indexer, assembler, rf])
rf_model = pipeline.fit(train_df)
predictions = rf_model.transform(test_df)

# Inspecting feature importances
rf_stage = rf_model.stages[-1]
print(rf_stage.featureImportances)
```
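featureImportances shows how much each feature contributed to the splits across all trees, which is useful for understanding which columns matter most. The values follow the order of the VectorAssembler inputCols and are normalized to sum to 1.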
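The printed vector contains only positions and numbers, so it helps to pair each value with its column name. Below is a small sketch of one way to do that. The helper name rank_importances is hypothetical, and the importance values are made up for illustration. They are not results from the flights model.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, highest importance first."""
    values = [float(v) for v in importances]
    if len(values) != len(feature_names):
        raise ValueError("expected one importance value per feature name")
    return sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)


# Same order as the VectorAssembler inputCols in the examples above.
names = ["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
         "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"]

# Made-up values for illustration only, NOT output from the fitted model.
example_importances = [0.90, 0.02, 0.03, 0.03, 0.00, 0.02]

ranked = rank_importances(names, example_importances)
for name, score in ranked:
    print(f"{name:<16} {score:.2f}")
```

With the fitted pipeline, the same helper could be called as rank_importances(assembler.getInputCols(), rf_stage.featureImportances.toArray()).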