Decision Trees and Random Forests
Decision Trees split the data recursively on feature thresholds, forming a tree of if-else rules. They are easy to interpret but prone to overfitting. Random Forests reduce this by training many trees, each on a random sample of the rows and a random subset of the features, and then combining their predictions (a vote for classification, an average for regression).
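To make "splitting on a feature threshold" concrete, here is a minimal pure-Python sketch of how a single split could be chosen by minimizing Gini impurity. The function names and the toy data are made up for illustration, and this is a simplification: Spark does not scan every midpoint like this but evaluates a binned set of candidate thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 at most for two classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best rule `value <= threshold`.

    Returns (None, impurity_of_unsplit_node) if no split improves on the parent.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # equal values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the two children, weighted by their share of the rows
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2, score)
    return best

# Made-up toy data: departure delay in minutes and whether the flight arrived late
departure_delay = [-5, 0, 3, 8, 20, 35, 50, 90]
arrived_late = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(departure_delay, arrived_late)
print(f"split at departure_delay <= {threshold}, weighted impurity {impurity}")
```

A full tree repeats this search inside each resulting group, over every feature, until a stopping rule such as a depth limit is reached.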
Decision Tree Classifier
    import urllib.request
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, floor
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    # Download the flights dataset to the working directory
    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("TreeModels") \
        .master("local[*]") \
        .getOrCreate()

    # Load the CSV and replace missing numeric values with 0
    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
        .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

    # LABEL is 1.0 when the flight arrived more than 15 minutes late
    flights_df = flights_df \
        .withColumn("LABEL", (col("ARRIVAL_DELAY") > 15).cast("double")) \
        .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
        .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

    train_df, test_df = flights_df.randomSplit([0.8, 0.2], seed=42)

    # Encode the airline code as a numeric index, then assemble one feature vector
    indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
    assembler = VectorAssembler(
        inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"],
        outputCol="FEATURES"
    )

    dt = DecisionTreeClassifier(featuresCol="FEATURES", labelCol="LABEL", maxDepth=5)
    pipeline = Pipeline(stages=[indexer, assembler, dt])

    # Fit on the training split and predict on the held-out split
    dt_model = pipeline.fit(train_df)
    predictions = dt_model.transform(test_df)
    predictions.select("LABEL", "prediction").show(5)
maxDepth caps how many levels of splits the tree can have. Deeper trees fit the training data more closely, but they also overfit more easily, so accuracy on unseen data can drop.
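The effect of a depth limit can be seen on a toy problem. The sketch below is a hypothetical single-feature tree written in plain Python, not Spark's implementation. The data is made up, with two deliberately noisy training labels. It compares training and test accuracy for a few depth limits.

```python
from collections import Counter

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def build(rows, depth, max_depth):
    """rows: list of (x, label). Returns a leaf label or (threshold, left, right)."""
    labels = [label for _, label in rows]
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue  # equal values cannot be separated
        left, right = rows[:i], rows[i:]
        score = (len(left) * gini([label for _, label in left])
                 + len(right) * gini([label for _, label in right]))
        if best is None or score < best[0]:
            best = (score, (rows[i - 1][0] + rows[i][0]) / 2, left, right)
    if best is None:
        return majority(labels)
    _, threshold, left, right = best
    return (threshold,
            build(left, depth + 1, max_depth),
            build(right, depth + 1, max_depth))

def predict(tree, x):
    # Walk down until a leaf (a plain label, not a tuple) is reached
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == label for x, label in rows) / len(rows)

# Made-up data: the label is "x > 50", except for noisy training rows at x=30 and x=70
train = [(10, 0), (20, 0), (30, 1), (40, 0), (45, 0),
         (55, 1), (60, 1), (70, 0), (80, 1), (90, 1)]
test = [(15, 0), (22, 0), (32, 0), (52, 1), (62, 1), (72, 1), (85, 1)]

for max_depth in (1, 2, 3):
    tree = build(train, 0, max_depth)
    print(max_depth, accuracy(tree, train), round(accuracy(tree, test), 3))
```

On this toy data the depth-3 tree classifies every training row correctly, including the two noisy ones, and in doing so misclassifies the test points that fall next to them. With the Spark pipeline, the same comparison would be made by fitting with several maxDepth values and scoring each model on both train_df and test_df.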
Random Forest Classifier
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        featuresCol="FEATURES",
        labelCol="LABEL",
        numTrees=20,
        maxDepth=5,
        seed=42
    )

    # Reuses Pipeline, indexer, assembler, and the train/test split defined above
    pipeline = Pipeline(stages=[indexer, assembler, rf])
    rf_model = pipeline.fit(train_df)
    predictions = rf_model.transform(test_df)

    # Inspecting feature importances (the fitted forest is the last pipeline stage)
    rf_stage = rf_model.stages[-1]
    print(rf_stage.featureImportances)
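featureImportances shows how much each feature contributed to the splits across all trees, which is useful for understanding which columns matter most. It is a vector whose positions follow the order of the assembler's inputCols, and its values are normalized to sum to 1.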
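Since the printed vector carries no column names, it helps to pair each value with its feature. The helper below is a small sketch; rank_importances is a hypothetical name, and the numbers are invented purely to show the output shape and are not results from the flights model.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, highest first."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    return sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

# Same order as the VectorAssembler inputCols
feature_names = ["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME",
                 "DEPARTURE_HOUR", "IS_WEEKEND", "AIRLINE_IDX"]

# Invented values for illustration only, NOT output from the model above
example_importances = [0.82, 0.03, 0.04, 0.06, 0.01, 0.04]

ranked = rank_importances(feature_names, example_importances)
for name, score in ranked:
    print(f"{name:<16} {score:.3f}")
```

With the fitted pipeline, the real values could be passed in as rank_importances(assembler.getInputCols(), rf_stage.featureImportances.toArray()).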