Feature Engineering with PySpark

Working with Text Data in PySpark


Text columns require their own preprocessing pipeline before they can be used in a model. Raw strings need to be split into tokens, cleaned, and eventually converted into numeric vectors.
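To make the last step concrete, here is a minimal plain-Python sketch of turning token lists into count vectors (a bag-of-words representation). It only illustrates the idea and is not how Spark implements it; in PySpark this step is usually delegated to a transformer such as CountVectorizer or HashingTF. The token lists and the alphabetical vocabulary order are assumptions made for the example.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one list per description)
docs = [
    ["flight", "delayed", "bad", "weather"],
    ["flight", "cancelled", "severe", "weather"],
]

# One vector position per distinct token; sorted so the order is deterministic
vocabulary = sorted({token for doc in docs for token in doc})


def to_count_vector(tokens, vocabulary):
    """Return the count of each vocabulary term in tokens, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
# ['bad', 'cancelled', 'delayed', 'flight', 'severe', 'weather']
for vec in vectors:
    print(vec)
# [1, 0, 1, 1, 0, 1]
# [0, 1, 0, 1, 1, 1]
```

Each description becomes a fixed-length list of numbers, which is the shape a model expects.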

The flights dataset does not contain free-text fields; its CANCELLATION_REASON column is a categorical code, not text. To demonstrate text processing, you will work with a small synthetic dataset of flight status descriptions alongside the main dataset.

Creating a Text DataFrame

    import urllib.request
    from pyspark.sql import SparkSession

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("TextData") \
        .master("local[*]") \
        .getOrCreate()

    # Synthetic flight status descriptions
    text_df = spark.createDataFrame([
        (1, "flight delayed due to bad weather conditions"),
        (2, "technical issue caused significant departure delay"),
        (3, "flight cancelled due to severe weather"),
        (4, "late aircraft arrival caused departure delay"),
        (5, "air traffic control delay affected departure time"),
    ], ["id", "description"])

    text_df.show(truncate=False)

Tokenization

Tokenizer converts a string to lowercase and splits it on whitespace, producing an array of tokens:

    from pyspark.ml.feature import Tokenizer

    tokenizer = Tokenizer(inputCol="description", outputCol="tokens")
    tokenized_df = tokenizer.transform(text_df)

    tokenized_df.select("description", "tokens").show(truncate=False)
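The per-row behaviour can be approximated in plain Python, which makes one consequence easy to see: splitting on whitespace alone leaves punctuation attached to the tokens. The function name `simple_tokenize` and the second input string are made up for this sketch; it mimics the lowercase-then-split idea rather than reproducing Spark's code.

```python
import re


def simple_tokenize(text):
    """Lowercase the text, then split it on whitespace characters."""
    return re.split(r"\s", text.lower())


print(simple_tokenize("Flight delayed due to bad weather conditions"))
# ['flight', 'delayed', 'due', 'to', 'bad', 'weather', 'conditions']

# Punctuation is not whitespace, so it stays glued to the neighbouring word
print(simple_tokenize("Flight cancelled, severe weather!"))
# ['flight', 'cancelled,', 'severe', 'weather!']
```

The synthetic descriptions used here contain no punctuation, so this does not matter for them. For messier text, pyspark.ml.feature also provides RegexTokenizer, which splits on a pattern you supply.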

Removing Stop Words

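Common words like "to", "the", and "a" carry little meaning on their own. StopWordsRemover drops every token that appears in its stop word list, which defaults to a built-in English list; whether a less common word such as "due" is removed depends on the list in use: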

    from pyspark.ml.feature import StopWordsRemover

    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
    filtered_df = remover.transform(tokenized_df)

    filtered_df.select("tokens", "filtered_tokens").show(truncate=False)
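Conceptually, the filtering is a membership test applied to each token, with the order of the surviving tokens preserved. The plain-Python sketch below illustrates that idea with a tiny, hypothetical stop word list; Spark's built-in lists are much longer, and the helper name `remove_stop_words` is invented for the example.

```python
# Hypothetical stop word list for illustration only
STOP_WORDS = {"to", "the", "a", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Return the tokens that are not stop words, keeping their original order."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


tokens = ["flight", "delayed", "due", "to", "bad", "weather", "conditions"]

print(remove_stop_words(tokens))
# ['flight', 'delayed', 'due', 'bad', 'weather', 'conditions']

# A word is only removed if it is on the list, so extend the list when needed
print(remove_stop_words(tokens, STOP_WORDS | {"due"}))
# ['flight', 'delayed', 'bad', 'weather', 'conditions']
```

StopWordsRemover exposes the same two knobs as parameters: stopWords to supply your own list and caseSensitive (False by default) to control matching.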

What does StopWordsRemover do to a list of tokens?

Section 1. Chapter 5
