Feature Engineering with PySpark

Tokenization and TF-IDF with MLlib

After tokenization and stop word removal, you need to convert token lists into numeric vectors. TF-IDF (Term Frequency–Inverse Document Frequency) is the standard approach: it weights each word by how often it appears in a document relative to how common it is across all documents.

HashingTF

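HashingTF maps each document's token list to a fixed-size vector using the hashing trick. Each token is hashed to an index (a bucket) in the vector, and the value stored there is the term frequency. Distinct tokens can collide in the same bucket, especially with a small numFeatures such as the 20 used here: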

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

spark = SparkSession.builder \
    .appName("TFIDF") \
    .master("local[*]") \
    .getOrCreate()

text_df = spark.createDataFrame([
    (1, "flight delayed due to bad weather conditions"),
    (2, "technical issue caused significant departure delay"),
    (3, "flight cancelled due to severe weather"),
    (4, "late aircraft arrival caused departure delay"),
    (5, "air traffic control delay affected departure time"),
], ["id", "description"])

# Tokenizing and removing stop words
tokenized_df = Tokenizer(inputCol="description", outputCol="tokens").transform(text_df)
filtered_df = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens").transform(tokenized_df)

# Computing term frequencies
hashing_tf = HashingTF(inputCol="filtered_tokens", outputCol="raw_features", numFeatures=20)
tf_df = hashing_tf.transform(filtered_df)

tf_df.select("filtered_tokens", "raw_features").show(truncate=False)
```
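To make the hashing trick concrete, here is a minimal pure-Python sketch of the same idea. It is not Spark's implementation: the helper name `hashing_tf_sketch` is hypothetical, and CRC32 stands in for the MurmurHash3 function Spark uses, so the bucket indices will differ from what HashingTF produces. The token list is illustrative.

```python
import zlib

def hashing_tf_sketch(tokens, num_features=20):
    """Count tokens into a fixed-size vector by hashing each token to a bucket."""
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash (assumption): CRC32 is stable across runs,
        # unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] += 1.0
    return vector

# Illustrative token list with one repeated term
tokens = ["departure", "delay", "weather", "delay"]
vec = hashing_tf_sketch(tokens, num_features=20)
print(vec)
```

The vector length is always num_features regardless of vocabulary size, and the counts always sum to the number of tokens. Two different tokens landing in the same bucket simply add their counts together, which is why a larger num_features reduces collisions.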

IDF

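IDF down-weights terms that appear in many documents, so common words get a lower score even if they appear frequently in a single document. IDF is an estimator: it is first fit on the whole corpus to learn document frequencies, and the fitted model then rescales the term-frequency vectors: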

```python
# Fitting IDF on the full corpus and transforming
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")
idf_model = idf.fit(tf_df)
tfidf_df = idf_model.transform(tf_df)

tfidf_df.select("filtered_tokens", "tfidf_features").show(truncate=False)
```
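The weighting itself can be worked through by hand. The sketch below applies the smoothed formula from the Spark MLlib documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The helper `idf_weights` is hypothetical, and the token lists are hand-written approximations of the filtered tokens, not actual StopWordsRemover output. It works on terms directly rather than hash buckets, so it ignores collisions.

```python
import math

# Hand-written token lists (assumption: roughly what remains after stop word removal)
docs = [
    ["flight", "delayed", "bad", "weather", "conditions"],
    ["technical", "issue", "caused", "significant", "departure", "delay"],
    ["flight", "cancelled", "severe", "weather"],
    ["late", "aircraft", "arrival", "caused", "departure", "delay"],
    ["air", "traffic", "control", "delay", "affected", "departure", "time"],
]

def idf_weights(docs):
    """Smoothed IDF per term: log((m + 1) / (df + 1)), with m = number of documents."""
    m = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

weights = idf_weights(docs)
print(round(weights["delay"], 3))   # in 3 of 5 documents -> 0.405
print(round(weights["severe"], 3))  # in 1 of 5 documents -> 1.099
```

A term's TF-IDF value in a document is its term frequency multiplied by this weight, so "severe" counts for more than "delay" even when both occur once.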

The resulting tfidf_features column holds sparse vectors: each document is now represented as a numeric feature vector ready for model input.
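When displayed, Spark prints a sparse vector as (size,[indices],[values]), storing only the non-zero positions. As a rough illustration of how to read that triple, the hypothetical helper `to_dense` below expands one into a plain list. The numbers are made up for the example and are not output from the code above.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Made-up values: a vector of length 6 with non-zero entries at positions 1 and 4
dense = to_dense(6, [1, 4], [0.41, 1.1])
print(dense)  # [0.0, 0.41, 0.0, 0.0, 1.1, 0.0]
```

In PySpark itself, calling toArray() on a vector performs this expansion.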

What does the IDF component in TF-IDF do?

Section 1. Chapter 6
