What Is Feature Engineering and Why It Matters

Machine learning models do not work on raw data directly – they work on numbers. Feature engineering is the process of transforming raw columns into a representation that a model can learn from effectively.

For the flights dataset, raw columns like AIRLINE, SCHEDULED_DEPARTURE, or DISTANCE are useful, but a model cannot use a string like "AA" or an integer like 930 (meaning 09:30) without transformation. Feature engineering bridges that gap.
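
To make the 930 example concrete, here is a minimal plain-Python sketch of the arithmetic, assuming SCHEDULED_DEPARTURE is stored as an HHMM integer as described above. The function name split_hhmm is hypothetical and is not part of the course code.

```python
from typing import Tuple


def split_hhmm(hhmm: int) -> Tuple[int, int]:
    """Split an HHMM-encoded integer such as 930 into (hour, minute)."""
    # Integer division by 100 drops the last two digits (the minutes);
    # the remainder keeps only those two digits.
    return hhmm // 100, hhmm % 100


hour, minute = split_hhmm(930)
print(hour, minute)  # prints: 9 30
```

The same integer division can be applied to a whole column once the data is loaded, which is how a feature like DEPARTURE_HOUR would be derived.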

What Feature Engineering Involves

Encoding categoricals – converting string columns like AIRLINE into numeric indices or binary vectors;
Scaling numerics – rescaling columns like DISTANCE and DEPARTURE_DELAY to comparable ranges so that no column dominates the model simply because of its units;
Extracting information – deriving new columns from existing ones, such as extracting DEPARTURE_HOUR from SCHEDULED_DEPARTURE;
Handling text – tokenizing and vectorizing free-text fields;
Handling nulls – filling or dropping missing values before passing data to a model.
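
Independent of any library, the steps above can be sketched in a few lines of plain Python to show what each one computes. The rows are made-up toy values (not taken from the flights dataset), and the choices made here (filling nulls with 0.0, alphabetical index order, min-max scaling) are assumptions for illustration only.

```python
# Toy rows standing in for the flights data (values are made up).
rows = [
    {"AIRLINE": "AA", "SCHEDULED_DEPARTURE": 930, "DISTANCE": 1000.0, "DEPARTURE_DELAY": 5.0},
    {"AIRLINE": "DL", "SCHEDULED_DEPARTURE": 1745, "DISTANCE": 500.0, "DEPARTURE_DELAY": None},
    {"AIRLINE": "AA", "SCHEDULED_DEPARTURE": 5, "DISTANCE": 2000.0, "DEPARTURE_DELAY": -3.0},
]

# Handling nulls: replace a missing delay with 0.0.
for row in rows:
    if row["DEPARTURE_DELAY"] is None:
        row["DEPARTURE_DELAY"] = 0.0

# Encoding categoricals: map each distinct airline code to an integer index.
airline_index = {code: i for i, code in enumerate(sorted({r["AIRLINE"] for r in rows}))}

# Scaling numerics: learn the min and max of DISTANCE for min-max scaling.
d_min = min(r["DISTANCE"] for r in rows)
d_max = max(r["DISTANCE"] for r in rows)

# Build one numeric feature vector per row.
features = [
    [
        float(airline_index[r["AIRLINE"]]),         # encoded airline
        float(r["SCHEDULED_DEPARTURE"] // 100),     # extracted departure hour
        (r["DISTANCE"] - d_min) / (d_max - d_min),  # distance scaled into [0, 1]
        r["DEPARTURE_DELAY"],                       # delay with nulls filled
    ]
    for r in rows
]

for vector in features:
    print(vector)
```

Every row ends up as a list of floats, which is the kind of representation a model can consume. A plain integer index implies an ordering between airlines, which is one reason binary (one-hot) vectors are often used instead.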

Why It Matters

The quality of the features often has more impact on model performance than the choice of algorithm. A simple model trained on well-engineered features frequently outperforms a complex model trained on poorly prepared data.

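In PySpark, feature engineering is done using the pyspark.ml.feature module, which provides components that integrate cleanly into ML pipelines. Components that need to learn something from the data (such as StringIndexer or StandardScaler) are estimators: call fit() on the training data to get a fitted model, then call transform() on that model to apply it. Components that learn nothing (such as VectorAssembler or Tokenizer) are plain transformers and only have transform().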

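import urllib.request
from pyspark.sql import SparkSession

# Download the flights dataset to the working directory.
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

# Start (or reuse) a local Spark session that uses all available cores.
spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .master("local[*]") \
    .getOrCreate()

# Read the CSV with inferred column types and replace missing delays with 0.
# Filling with 0 is a simplification: it treats a missing delay as "no delay".
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Preview a few of the columns used in the examples.
flights_df.select("AIRLINE", "DISTANCE", "DEPARTURE_DELAY", "ARRIVAL_DELAY").show(5)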
