Feature Engineering with PySpark
Section 1. Chapter 4

Challenge: Building a Feature Pipeline for Customer Data

Task

You are given a flights dataset as a list of rows. Load it into a DataFrame using createDataFrame and apply the encoding and scaling techniques from the previous chapters. Store results in the specified variables:

  1. Fill nulls in Delay and Length with 0;
  2. Apply StringIndexer to Airline – store the result in a column AIRLINE_IDX;
  3. Apply OneHotEncoder to AIRLINE_IDX – store the result in a column AIRLINE_VEC;
  4. Assemble Length, Time, and AIRLINE_IDX into a vector column FEATURES_RAW;
  5. Apply StandardScaler with withMean=True and withStd=True to FEATURES_RAW – store the result in FEATURES_SCALED;
  6. Store the final DataFrame in features_df and count its rows in features_count.

Print features_count and show all rows of Airline, AIRLINE_VEC, FEATURES_SCALED.

Solution
