Feature Engineering with PySpark

What Is Feature Engineering and Why It Matters

Machine learning models do not work on raw data directly – they work on numbers. Feature engineering is the process of transforming raw columns into a representation that a model can learn from effectively.

For the flights dataset, raw columns like AIRLINE, SCHEDULED_DEPARTURE, or DISTANCE are useful, but a model cannot use a string like "AA" or an integer like 930 (meaning 09:30) without transformation. Feature engineering bridges that gap.
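As a small illustration of that gap, the sketch below decodes an HHMM-style integer such as 930 into an hour and a minute using plain Python. The helper name decode_hhmm and the sample values are hypothetical, and the range check assumes times run from 0000 to 2359; a real dataset may follow other conventions (for example 2400 for midnight).

```python
def decode_hhmm(value: int) -> tuple[int, int]:
    """Split an HHMM-encoded integer such as 930 into (hour, minute)."""
    # Integer division by 100 gives the hour, the remainder gives the minute.
    hour, minute = divmod(value, 100)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid HHMM time: {value}")
    return hour, minute


# Hypothetical sample values in the same encoding as SCHEDULED_DEPARTURE.
scheduled = [930, 5, 1745, 2359]
decoded = [decode_hhmm(v) for v in scheduled]
print(decoded)  # [(9, 30), (0, 5), (17, 45), (23, 59)]
```

The hour component is the kind of derived value a model can use directly, whereas the raw integer 930 hides the fact that 959 and 1000 are only one minute apart.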

What Feature Engineering Involves

  • Encoding categoricals – converting string columns like AIRLINE into numeric indices or binary vectors;
  • Scaling numerics – rescaling columns like DISTANCE and DEPARTURE_DELAY to a comparable range so that a column with large raw values does not dominate the others;
  • Extracting information – deriving new columns from existing ones, such as extracting DEPARTURE_HOUR from SCHEDULED_DEPARTURE;
  • Handling text – tokenizing and vectorizing free-text fields;
  • Handling nulls – filling or dropping missing values before passing data to a model.
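
Two of the steps above, null handling and scaling, can be sketched in plain Python to show the arithmetic involved. This is only an illustration on made-up values, not PySpark code: the helpers fill_nulls and standardize are hypothetical, and the sketch uses the population standard deviation, whereas a library scaler may use a different convention.

```python
from statistics import mean, pstdev


def fill_nulls(values, default=0.0):
    """Replace missing entries (None) with a default value."""
    return [default if v is None else float(v) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up values standing in for DISTANCE and DEPARTURE_DELAY.
distance = [100, 300, 500, 2500]
delay = [5, None, -3, 40]

delay_filled = fill_nulls(delay)          # nulls become 0.0
distance_scaled = standardize(distance)
delay_scaled = standardize(delay_filled)

print(delay_filled)
print(distance_scaled)
print(delay_scaled)
```

After standardization both columns are centered on zero with the same spread, so the much larger raw magnitudes of the distance column no longer outweigh the delay column.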

Why It Matters

The quality of features often has more impact on model performance than the choice of algorithm. A well-engineered dataset with a simple model often outperforms a poorly prepared dataset with a complex one.

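In PySpark, feature engineering is done using the pyspark.ml.feature module, which provides transformers and estimators that integrate cleanly into ML pipelines. Estimators such as StringIndexer or StandardScaler follow the same pattern: fit() on training data to learn statistics, which returns a fitted model, then transform() on that model to apply the transformation. Stateless transformers such as Tokenizer or VectorAssembler have nothing to learn and expose transform() only.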
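
The fit-then-transform pattern can be mimicked in a few lines of plain Python. The classes ToyStringIndexer and ToyStringIndexerModel below are hypothetical stand-ins written for this sketch, not part of pyspark.ml.feature; they assign index 0 to the most frequent label, which is the default ordering StringIndexer uses, and the airline codes are made-up sample data.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Fitted state: a fixed label-to-index mapping learned from training data."""

    def __init__(self, index_of):
        self.index_of = index_of

    def transform(self, labels):
        # Apply the learned mapping; an unseen label raises KeyError here.
        return [self.index_of[label] for label in labels]


class ToyStringIndexer:
    """Estimator: fit() learns the mapping and returns a model."""

    def fit(self, labels):
        counts = Counter(labels)
        # Most frequent label first; ties broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return ToyStringIndexerModel({label: i for i, label in enumerate(ordered)})


train = ["AA", "DL", "AA", "UA", "DL", "AA"]  # made-up training labels
test = ["UA", "AA"]                           # made-up held-out labels

model = ToyStringIndexer().fit(train)  # statistics come from training data only
print(model.index_of)                  # {'AA': 0, 'DL': 1, 'UA': 2}
print(model.transform(test))           # [2, 0]
```

Keeping the learned mapping in a separate model object means the same indices are applied to training and held-out data, so nothing is re-learned from the data being transformed.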

import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

flights_df.select("AIRLINE", "DISTANCE", "DEPARTURE_DELAY", "ARRIVAL_DELAY").show(5)
What is the primary goal of feature engineering?
