Learn Stemming | Stemming and Lemmatization
Introduction to NLP
Stemming

Understanding Stemming

Definition

Stemming is a text normalization technique used in NLP to reduce inflected words to their stem.

More precisely, stemming removes affixes (usually only suffixes) from a word to obtain its base form, known as the stem. For example, "running", "runs", and "run" all share the stem "run".
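The suffix-stripping idea can be sketched in a few lines of plain Python. This is only a toy illustration, not the algorithm used by any NLTK stemmer: the function name `naive_stem`, the two-entry suffix list, and the minimum stem length of 3 are all assumptions made for the example.

```python
# Hypothetical toy stemmer: strips a couple of common suffixes.
# Real stemmers (Porter, Lancaster) use far larger, ordered rule sets.
SUFFIXES = ("ing", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least 3 characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "run"]:
    print(word, "->", naive_stem(word))
# running -> run
# runs -> run
# run -> run
```

A rule set this small breaks quickly on other words (it would turn "pass" into "pas", for instance), which is why practical stemmers rely on much more carefully designed rules.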

The purpose of stemming is to simplify analysis by treating related word forms as a single term, which reduces vocabulary size and can improve efficiency in many NLP tasks.
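As a rough illustration of how stemming merges related forms into one term, the sketch below counts distinct terms before and after stemming. The word list is made up for the example, and it assumes NLTK is installed (the Porter Stemmer itself needs no extra downloads).

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens: five surface forms of the same underlying word.
tokens = ["connect", "connected", "connecting", "connection", "connections"]

raw_counts = Counter(tokens)
stem_counts = Counter(stemmer.stem(token) for token in tokens)

print("Distinct terms before stemming:", len(raw_counts))   # 5
print("Distinct terms after stemming:", len(stem_counts))   # 1
print(stem_counts)  # Counter({'connect': 5})
```

Five separate vocabulary entries collapse into one, so any downstream count-based model sees a single, more frequent term.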

Stemming with NLTK

NLTK provides various stemming algorithms, with the most popular being the Porter Stemmer and the Lancaster Stemmer. These algorithms apply specific rules to strip affixes and derive the stem of a word.

All stemmer classes in NLTK share a common interface: create an instance of the stemmer class, then call its stem() method on each token.
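Because the interface is shared, the same helper can work with any stemmer instance. The following is a minimal sketch; the helper name `stem_tokens` and the sample tokens are assumptions for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

def stem_tokens(stemmer, tokens):
    """Apply any object exposing a stem() method to a list of tokens."""
    return [stemmer.stem(token) for token in tokens]

stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
}

tokens = ["running", "runs", "run"]
results = {name: stem_tokens(stemmer, tokens) for name, stemmer in stemmers.items()}

for name, stems in results.items():
    print(f"{name}: {stems}")
```

Swapping in a different stemmer then only requires changing the instance passed to the helper, not the surrounding code.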

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()
# Create a Lancaster Stemmer instance
lancaster_stemmer = LancasterStemmer()

text = "Stemming is an essential technique for natural language processing."
text = text.lower()
tokens = word_tokenize(text)

# Filter out the stop words
tokens = [token for token in tokens if token.lower() not in stop_words]

# Apply stemming to each token
porter_stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
lancaster_stemmed_tokens = [lancaster_stemmer.stem(token) for token in tokens]

# Display the results
print("Original Tokens:", tokens)
print("Stemmed Tokens (Porter Stemmer):", porter_stemmed_tokens)
print("Stemmed Tokens (Lancaster Stemmer):", lancaster_stemmed_tokens)

We first tokenized the text, then filtered out the stop words, and finally stemmed each token using a list comprehension. The two stemmers produce noticeably different results. This is because the Lancaster Stemmer has about twice as many rules as the Porter Stemmer and is one of the most "aggressive" stemmers.
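To see exactly where the two stemmers disagree, it can help to line up their outputs word by word. The sketch below is one possible way to do that; the helper name `compare_stems` and the table layout are assumptions, and the word list reuses the content words from the sentence above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

def compare_stems(words):
    """Return (word, porter_stem, lancaster_stem, differs) for each word."""
    rows = []
    for word in words:
        porter_stem = porter.stem(word)
        lancaster_stem = lancaster.stem(word)
        rows.append((word, porter_stem, lancaster_stem, porter_stem != lancaster_stem))
    return rows

words = ["stemming", "essential", "technique", "natural", "language", "processing"]

print(f"{'word':<12} {'porter':<12} {'lancaster':<12} status")
for word, porter_stem, lancaster_stem, differs in compare_stems(words):
    status = "differs" if differs else "same"
    print(f"{word:<12} {porter_stem:<12} {lancaster_stem:<12} {status}")
```

Rows marked "differs" are good candidates for checking by hand whether the shorter stem still carries the word's meaning.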

Note

Overall, the Porter Stemmer is the more popular option and generally produces more meaningful results than the Lancaster Stemmer, which tends to overstem words.

Section 2. Chapter 1
