Learn Bag of Words | Basic Text Models
Introduction to NLP
Bag of Words

Understanding the BoW Model

The bag of words (BoW) model represents documents as vectors where each dimension corresponds to a unique word. Each dimension can either represent the presence of a word within the document (1 if present, 0 if absent) or its frequency (word count). Therefore, BoW models can be either binary or frequency-based.

Let's take a look at how the same sentence (document) is represented by each type:

A binary model represents this document as the vector [1, 1, 1], while a frequency-based model represents it as [2, 1, 2], taking word frequency into account.
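The example sentence itself is not reproduced in this text, so the sketch below uses a made-up document (an assumption, not the original example) chosen so that it yields the same two vectors. It builds both representations by hand with the standard library. Sorting the vocabulary alphabetically is simply the convention adopted for this sketch.

```python
from collections import Counter

# Hypothetical document, chosen only so that the counts come out as [2, 1, 2]
document = "apple banana apple cherry cherry"

tokens = document.lower().split()
vocabulary = sorted(set(tokens))  # one dimension per unique word
counts = Counter(tokens)          # how often each word occurs

# Frequency-based BoW: the count of each vocabulary word
frequency_vector = [counts[word] for word in vocabulary]
# Binary BoW: 1 if the word occurs at all, 0 otherwise
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(vocabulary)        # ['apple', 'banana', 'cherry']
print(frequency_vector)  # [2, 1, 2]
print(binary_vector)     # [1, 1, 1]
```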

BoW Implementation

Implementing the BoW model is a straightforward process, especially with the help of the sklearn (scikit-learn) library and its CountVectorizer class.

Here is an implementation of a binary bag of words model:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a binary Bag of Words model
vectorizer = CountVectorizer(binary=True)
# Generate a BoW matrix
bow_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a dense array
print(bow_matrix.toarray())
```

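Each row in the matrix corresponds to a document, and each column to a token (word). By default, CountVectorizer lowercases the text before tokenizing, so 'Global' and 'global' are counted in the same column. To display the matrix, we converted it from a sparse matrix into a dense 2D array using the .toarray() method.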

Note

A sparse matrix is a matrix in which most of the elements are zero. Sparse formats store only the non-zero elements, which saves memory and computation when the data contains many zeros. BoW matrices are typically sparse because each document uses only a small part of the vocabulary.
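To make the memory saving concrete, here is a minimal pure-Python sketch of the idea. It is not how scikit-learn or SciPy actually store their matrices; it only keeps the non-zero entries, keyed by (row, column). The helper names to_sparse and to_dense are hypothetical and exist only for this illustration.

```python
# A small dense matrix in which most entries are zero
dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 3, 0],
]

def to_sparse(matrix):
    """Keep only the non-zero entries as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }

def to_dense(entries, n_rows, n_cols):
    """Rebuild the full matrix, filling missing positions with zeros."""
    return [[entries.get((i, j), 0) for j in range(n_cols)] for i in range(n_rows)]

sparse = to_sparse(dense)
print(len(sparse), "stored values instead of", len(dense) * len(dense[0]))
print(to_dense(sparse, 3, 5) == dense)  # True
```

Only 4 of the 15 cells are stored here, and the dense form can still be recovered exactly, which is the same trade-off .toarray() resolves in the other direction.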

To create a frequency-based bag of words model, simply remove the binary=True argument, since binary defaults to False:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a frequency-based Bag of Words model
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
```
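As a quick check of how the two variants relate, the following sketch fits both vectorizers on the same corpus and confirms that the binary matrix is the frequency matrix with every non-zero count clipped to 1. The variable names counts and binary are chosen for this illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Frequency-based and binary matrices built from the same corpus
counts = CountVectorizer().fit_transform(corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()

# Every non-zero count becomes 1 in the binary model
print(np.array_equal(binary, (counts > 0).astype(int)))  # True
```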

Converting the Matrix to a DataFrame

It is often convenient to convert the resulting bag of words matrix into a pandas DataFrame. The CountVectorizer instance provides the get_feature_names_out() method, which returns an array of the unique words (feature names) used in the model. These feature names can then be used as the columns of the DataFrame:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)
```

With this representation, we can now easily access not only the vector for a particular document, but also the vector of a particular word:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
# Print the vector for 'global' as a NumPy array
print(f"Vector for the word 'global': {bow_df['global'].values}")
```

Since each unique word corresponds to a column, accessing a word vector is as simple as accessing a column in the DataFrame by specifying the word (for example, 'global'). We also use the values attribute to obtain an array instead of a Series as the result.


Given a BoW matrix, what do different components of this matrix represent?

Rows:
Columns:

A particular element of the matrix:
