Feature Engineering with PySpark

What Is Feature Engineering and Why It Matters


Machine learning models do not work on raw data directly – they work on numbers. Feature engineering is the process of transforming raw columns into a representation that a model can learn from effectively.

For the flights dataset, raw columns like AIRLINE, SCHEDULED_DEPARTURE, or DISTANCE are informative, but a model cannot consume a string like "AA" directly, and an integer like 930 (meaning 09:30) is misleading when treated as a plain number: 959 and 1000 are one minute apart, yet differ by 41. Feature engineering bridges that gap.
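To make the SCHEDULED_DEPARTURE point concrete, here is a minimal plain-Python sketch of decoding an HHMM-style integer. The helper names decode_hhmm and minutes_since_midnight are hypothetical, and the sketch assumes values always lie between 0000 and 2359, which may not hold for every row of the real dataset.

```python
def decode_hhmm(value: int) -> tuple:
    """Split an HHMM-encoded integer such as 930 into (hour, minute)."""
    hour, minute = divmod(value, 100)
    # Assumption: valid times run from 0000 to 2359.
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid HHMM time: {value}")
    return hour, minute


def minutes_since_midnight(value: int) -> int:
    """Convert an HHMM-encoded integer to minutes elapsed since 00:00."""
    hour, minute = decode_hhmm(value)
    return hour * 60 + minute


print(decode_hhmm(930))             # (9, 30)
print(minutes_since_midnight(930))  # 570

# As raw integers 959 and 1000 differ by 41; as times they differ by 1 minute.
print(minutes_since_midnight(1000) - minutes_since_midnight(959))  # 1
```

Once decoded, the hour or the minutes-since-midnight value behaves like an ordinary numeric feature, which the raw integer does not.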

What Feature Engineering Involves

  • Encoding categoricals – converting string columns like AIRLINE into numeric indices or binary vectors;
  • Scaling numerics – rescaling columns like DISTANCE and DEPARTURE_DELAY to comparable ranges so that no feature dominates merely because of its units;
  • Extracting information – deriving new columns from existing ones, such as extracting DEPARTURE_HOUR from SCHEDULED_DEPARTURE;
  • Handling text – tokenizing and vectorizing free-text fields;
  • Handling nulls – filling or dropping missing values before passing data to a model.

Why It Matters

The quality of features often has more impact on model performance than the choice of algorithm. A well-engineered dataset with a simple model frequently outperforms a poorly prepared dataset with a complex one.

In PySpark, feature engineering is done mainly with the pyspark.ml.feature module, whose classes integrate cleanly into ML pipelines. They come in two kinds. Estimators, such as StringIndexer or StandardScaler, call fit() on training data to learn statistics and return a model whose transform() applies them. Plain transformers, such as Tokenizer or VectorAssembler, have nothing to learn and expose only transform().
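The fit-then-transform pattern can be illustrated without Spark. The sketch below is plain Python, not the pyspark.ml.feature API: the class names MinMaxScalerSketch and MinMaxModelSketch are hypothetical, the distances are made-up values, and returning 0.0 for a constant column is a simplification chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinMaxModelSketch:
    """Holds the learned statistics and applies them to any dataset."""
    lo: float
    hi: float

    def transform(self, values: List[float]) -> List[float]:
        span = self.hi - self.lo
        if span == 0:
            # Simplification for this sketch: a constant column maps to 0.0.
            return [0.0 for _ in values]
        return [(v - self.lo) / span for v in values]


class MinMaxScalerSketch:
    """fit() learns the minimum and maximum from the training data only."""

    def fit(self, values: List[float]) -> MinMaxModelSketch:
        return MinMaxModelSketch(lo=min(values), hi=max(values))


train_distances = [100.0, 600.0, 1100.0]
test_distances = [350.0, 1600.0]

model = MinMaxScalerSketch().fit(train_distances)
print(model.transform(train_distances))  # [0.0, 0.5, 1.0]
print(model.transform(test_distances))   # [0.25, 1.5]
```

The second output contains a value above 1 because the statistics come from the training data alone. Keeping fit() and transform() separate is what allows the same learned statistics to be reused on unseen data without leaking information from it.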

import urllib.request

from pyspark.sql import SparkSession

# Download the flights dataset used in this course.
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

# Start a local Spark session that uses all available cores.
spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV with inferred column types and replace missing delays with 0.
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Preview a few of the columns that later chapters turn into features.
flights_df.select("AIRLINE", "DISTANCE", "DEPARTURE_DELAY", "ARRIVAL_DELAY").show(5)

What is the primary goal of feature engineering?