What Is Feature Engineering and Why It Matters
Machine learning models do not work on raw data directly – they work on numbers. Feature engineering is the process of transforming raw columns into a representation that a model can learn from effectively.
For the flights dataset, raw columns like AIRLINE, SCHEDULED_DEPARTURE, or DISTANCE are useful, but a model cannot use a string like "AA" or an integer like 930 (meaning 09:30) without transformation. Feature engineering bridges that gap.
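To make the 930 example concrete, here is a minimal plain-Python sketch of decoding such a value. The helper name `split_hhmm` is hypothetical, and the sketch assumes the column stores times as hour × 100 + minute, as the 09:30 example suggests.

```python
def split_hhmm(value: int) -> tuple:
    """Split an HHMM-encoded integer (e.g. 930 for 09:30) into (hour, minute)."""
    # Assumption: the value is hour * 100 + minute, so divmod by 100 separates them.
    hour, minute = divmod(value, 100)
    return hour, minute


print(split_hhmm(930))   # (9, 30)
print(split_hhmm(1745))  # (17, 45)
print(split_hhmm(5))     # (0, 5)
```

The decoding matters because the raw integer distorts distances in time: 959 and 1000 differ by 41 as numbers, yet they are only one minute apart as clock times.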
What Feature Engineering Involves
- Encoding categoricals – converting string columns like AIRLINE into numeric indices or binary vectors;
- Scaling numerics – normalizing columns like DISTANCE and DEPARTURE_DELAY so they contribute equally to the model;
- Extracting information – deriving new columns from existing ones, such as extracting DEPARTURE_HOUR from SCHEDULED_DEPARTURE;
- Handling text – tokenizing and vectorizing free-text fields;
- Handling nulls – filling or dropping missing values before passing data to a model.
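Several of the steps above can be sketched in plain Python on a handful of toy values. This is only a conceptual illustration, not the PySpark API: the sample values, the `UNKNOWN` placeholder, and the free-text `notes` field are all made up for the example, and labels are indexed alphabetically purely for simplicity.

```python
# Toy values standing in for flight records (made up for illustration).
airlines = ["AA", "DL", None, "AA", "UA"]
notes = ["Delayed due to weather", "On time"]  # hypothetical free-text field

# Handling nulls: replace a missing airline with a placeholder label.
filled = [a if a is not None else "UNKNOWN" for a in airlines]

# Encoding categoricals, step 1: map each distinct label to an integer index.
labels = sorted(set(filled))
index_of = {label: i for i, label in enumerate(labels)}
indexed = [index_of[a] for a in filled]

# Encoding categoricals, step 2: expand each index into a binary (one-hot) vector.
one_hot = [[1 if i == idx else 0 for i in range(len(labels))] for idx in indexed]

# Handling text: lowercase each note and split on whitespace to get tokens.
tokens = [note.lower().split() for note in notes]

print(labels)      # ['AA', 'DL', 'UA', 'UNKNOWN']
print(indexed)     # [0, 1, 3, 0, 2]
print(one_hot[0])  # [1, 0, 0, 0]
print(tokens[0])   # ['delayed', 'due', 'to', 'weather']
```

Library transformers may make different choices than this sketch, for example in how labels are ordered or how unseen categories are treated.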
Why It Matters
The quality of features often has more impact on model performance than the choice of algorithm. A well-engineered dataset with a simple model often outperforms a poorly prepared dataset with a complex one.
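In PySpark, feature engineering is done using the pyspark.ml.feature module, which provides transformers and estimators that integrate cleanly into ML pipelines. Estimators such as StringIndexer or StandardScaler follow a two-step pattern: fit() on training data learns statistics and returns a fitted model, and that model's transform() applies the transformation. Stateless transformers such as Tokenizer or VectorAssembler need no fitting and expose only transform().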
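The fit-then-transform pattern can be illustrated with a small plain-Python sketch. The classes `MinMaxScalerSketch` and `MinMaxModelSketch` are hypothetical stand-ins written for this example, not classes from pyspark.ml.feature; they only mimic the idea that statistics are learned once on training data and then reused on any other data.

```python
class MinMaxModelSketch:
    """Hypothetical fitted model: transform() applies the learned statistics."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def transform(self, values):
        span = self.hi - self.lo
        if span == 0:
            # All training values were identical; map everything to 0.0.
            return [0.0 for _ in values]
        return [(v - self.lo) / span for v in values]


class MinMaxScalerSketch:
    """Hypothetical estimator: fit() learns min and max and returns a fitted model."""

    def fit(self, values):
        return MinMaxModelSketch(min(values), max(values))


train_distances = [200.0, 600.0, 1000.0]
test_distances = [400.0, 1200.0]

# Statistics come from the training data only.
model = MinMaxScalerSketch().fit(train_distances)

print(model.transform(train_distances))  # [0.0, 0.5, 1.0]
print(model.transform(test_distances))   # [0.25, 1.25]
```

Note that the test value 1200.0 maps to 1.25, outside the [0, 1] range. That is expected: the minimum and maximum were learned from the training data, and refitting on the test data would leak information from it into the features.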
```python
import urllib.request

from pyspark.sql import SparkSession

# Download the flights dataset into the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

# Start a local Spark session that uses all available cores
spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV with inferred column types and replace missing delays with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Preview a few of the raw columns discussed above
flights_df.select("AIRLINE", "DISTANCE", "DEPARTURE_DELAY", "ARRIVAL_DELAY").show(5)
```