Learn Stop Words | Text Preprocessing Fundamentals
Introduction to NLP

Stop Words

Understanding Stop Words

Definition

Stop words are common words that usually contribute little to the meaning of a sentence, at least for the purposes of most analyses and algorithms. Examples include "the", "is", "in", and "on".

Stop words are typically filtered out after tokenization for NLP tasks, such as sentiment analysis, topic modeling, or keyword extraction. The rationale behind removing stop words is to decrease the dataset size, thereby improving computational efficiency, and to increase the relevance of the analysis by focusing on the words that carry significant meaning.
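The idea of filtering after tokenization can be sketched without any library. This is only an illustration: `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names, the stop word set is a tiny hand-written stand-in, and `str.split()` is assumed in place of a real tokenizer.

```python
# Hypothetical, hand-written stop word set for illustration only;
# real stop word lists are much longer.
SAMPLE_STOP_WORDS = {"the", "is", "in", "on", "a", "of"}


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]


# str.split() stands in for a real tokenizer here.
tokens = "the cat is sleeping on the sofa".split()
content_tokens = remove_stop_words(tokens)

print(content_tokens)
print(f"Kept {len(content_tokens)} of {len(tokens)} tokens")
```

In this toy sentence more than half of the tokens are dropped, which is the reduction in dataset size the paragraph above refers to.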

Removing Stop Words with NLTK

To make things easier, NLTK provides stop word lists for multiple languages, which can be easily accessed and used to filter stop words from text data.

Here's how you can get the list of English stop words in NLTK and convert it to a set:

```python
import nltk
from nltk.corpus import stopwords

# Download the stop words list
nltk.download('stopwords')
# Load English stop words
stop_words = set(stopwords.words('english'))
print(stop_words)
```

Note

Converting this list to a set improves the efficiency of lookups: a membership check in a set takes constant time on average, whereas a list has to be scanned element by element.
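The difference can be observed with a rough timing comparison. The sketch below uses a made-up list of 200 strings as a stand-in for a stop word list (an assumption, not NLTK's data); the exact timings depend on the machine, so no specific numbers are claimed.

```python
import timeit

# Hypothetical stand-in for a stop word list: 200 made-up entries.
words_list = [f"word{i}" for i in range(200)]
words_set = set(words_list)

# The last element is the worst case for a linear scan of the list.
probe = "word199"

list_time = timeit.timeit(lambda: probe in words_list, number=100_000)
set_time = timeit.timeit(lambda: probe in words_set, number=100_000)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup:  {set_time:.4f} s")
```

Both checks return the same answer; only the time needed to reach it differs, and the gap grows with the size of the collection.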

With this in mind, let's take a look at a complete example of how to filter out stop words from a given text:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "This is an example sentence demonstrating the removal of stop words."
text = text.lower()
# Tokenize the text
tokens = word_tokenize(text)
# Remove stop words
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)
```

As you can see, we first download the required resources, load the stop words, and tokenize the text. The next step is to use a list comprehension to create a list containing only the tokens that are not stop words. Converting the text to lower case with text.lower() before tokenization is essential, since nltk contains stop words exclusively in lower case.
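If the original casing of the remaining tokens should be preserved, the lower-case conversion can instead be applied only at comparison time. The following is a minimal sketch under stated assumptions: `SAMPLE_STOP_WORDS` is a small hand-written set rather than NLTK's list, and `str.split()` stands in for `word_tokenize`.

```python
# Hypothetical lower-case stop word set for illustration only.
SAMPLE_STOP_WORDS = {"the", "is", "in"}

# str.split() stands in for a real tokenizer here.
tokens = "The Eiffel Tower is in Paris".split()

# Case-sensitive comparison: "The" slips through because only "the" is listed.
case_sensitive = [word for word in tokens if word not in SAMPLE_STOP_WORDS]

# Lower-casing only for the comparison removes "The" but keeps "Eiffel" intact.
case_insensitive = [word for word in tokens if word.lower() not in SAMPLE_STOP_WORDS]

print("Case-sensitive:  ", case_sensitive)
print("Case-insensitive:", case_insensitive)
```

This variant is useful when capitalization carries information later in the pipeline, for example for recognizing proper nouns.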

Note

Alternatively, we could use a regular for loop instead of a list comprehension; however, the list comprehension is more concise and typically somewhat faster.
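For comparison, the for loop version could look like the sketch below. It again assumes a small hand-written `SAMPLE_STOP_WORDS` set and a pre-tokenized list instead of NLTK's resources, and it checks that both approaches give the same result.

```python
# Hypothetical stop word set and token list for illustration only.
SAMPLE_STOP_WORDS = {"this", "is", "an", "the", "of"}
tokens = ["this", "is", "an", "example", "sentence"]

# Explicit loop: append every token that is not a stop word.
filtered_tokens = []
for word in tokens:
    if word not in SAMPLE_STOP_WORDS:
        filtered_tokens.append(word)

# Equivalent list comprehension.
comprehension_result = [word for word in tokens if word not in SAMPLE_STOP_WORDS]

print(filtered_tokens)
print(filtered_tokens == comprehension_result)
```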



Section 1. Chapter 7
