Feature Engineering with PySpark
Section 1. Chapter 4

Challenge: Building a Feature Pipeline for Customer Data

Task

You are given a flights dataset as a list of rows. Load it into a DataFrame using createDataFrame and apply the encoding and scaling techniques from the previous chapters. Store results in the specified variables:

  1. Fill nulls in Delay and Length with 0;
  2. Apply StringIndexer to Airline – store the result in a column AIRLINE_IDX;
  3. Apply OneHotEncoder to AIRLINE_IDX – store the result in a column AIRLINE_VEC;
  4. Assemble Length, Time, and AIRLINE_IDX into a vector column FEATURES_RAW;
  5. Apply StandardScaler with withMean=True and withStd=True to FEATURES_RAW – store the result in FEATURES_SCALED;
  6. Store the final DataFrame in features_df and count its rows in features_count.

Print features_count and show all rows of Airline, AIRLINE_VEC, FEATURES_SCALED.
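
Before wiring up the Spark stages, it can help to see what each one computes. The sketch below reproduces the default behaviour of StringIndexer, OneHotEncoder, and StandardScaler in plain Python, assuming Spark 3.x defaults (frequency-descending index order with alphabetical tie-breaking, dropLast=True, and the sample standard deviation). The sample values and the helper names string_index, one_hot, and standardize are hypothetical and exist only for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample values; the real rows come from the exercise dataset.
airlines = ["AA", "DL", "AA", "UA"]
lengths = [120.0, 90.0, 0.0, 200.0]


def string_index(values):
    """Map each label to an index: most frequent label first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """One-hot encode an index; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


def standardize(column):
    """Centre on the mean and divide by the sample standard deviation (n - 1).

    Assumes the column is not constant; a zero deviation is not handled here.
    """
    mu, sigma = mean(column), stdev(column)
    return [(x - mu) / sigma for x in column]


index_of = string_index(airlines)
airline_idx = [index_of[a] for a in airlines]
airline_vec = [one_hot(i, len(index_of)) for i in airline_idx]
length_scaled = standardize(lengths)

print(index_of)
print(airline_vec)
print(length_scaled)
```

With three distinct airlines, each encoded vector has two slots, and the least frequent (last-indexed) airline is represented by all zeros. Spark applies the scaling to every component of the assembled vector at once; the sketch scales a single column to keep the arithmetic visible.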

Solution
