
Bag of Words

Understanding the BoW Model

The bag of words (BoW) model represents each document as a vector in which every dimension corresponds to a unique word in the vocabulary. A dimension holds either the presence of that word in the document (1 if present, 0 if absent) or its frequency (the word count). BoW models are therefore either binary or frequency-based.

Let's take a look at how the same sentence (document) is represented by each type:

A binary model represents this document as the vector [1, 1, 1], while a frequency-based model represents it as [2, 1, 2], taking word frequency into account.
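To make the two vectors concrete, here is a minimal sketch in plain Python. The sentence used is an assumption chosen for illustration (it is not necessarily the one from the original example), picked so that it yields the same two vectors.

```python
from collections import Counter

# Hypothetical example sentence (an assumption, chosen for illustration)
sentence = "the dog saw the dog"

tokens = sentence.lower().split()
vocabulary = sorted(set(tokens))   # one dimension per unique word
counts = Counter(tokens)           # how often each word occurs

# Frequency-based vector: the count of each vocabulary word
frequency_vector = [counts[word] for word in vocabulary]
# Binary vector: 1 if the word occurs at all, 0 otherwise
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(vocabulary)        # ['dog', 'saw', 'the']
print(binary_vector)     # [1, 1, 1]
print(frequency_vector)  # [2, 1, 2]
```

Both vectors share the same dimensions; only the values stored in them differ.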

BoW Implementation

Implementing the BoW model is a straightforward process, especially with the help of the sklearn (scikit-learn) library and its CountVectorizer class.

Here is an implementation of a binary bag of words model:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        'Global climate change poses significant risks to global ecosystems.',
        'Global warming and climate change demand urgent action.',
        'Sustainable environmental practices support environmental conservation.',
    ]
    # Create a binary Bag of Words model
    vectorizer = CountVectorizer(binary=True)
    # Generate a BoW matrix
    bow_matrix = vectorizer.fit_transform(corpus)
    # Convert a sparse matrix into a dense array
    print(bow_matrix.toarray())

Each row in the matrix corresponds to a document, and each column to a token (word). To display the matrix, we converted the sparse matrix into a dense 2D array using the .toarray() method.
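The row and column layout can be checked directly. The sketch below reuses the lesson's corpus and assumes the default CountVectorizer settings (lowercasing, no stop-word removal); it reads the matrix shape and uses the fitted vocabulary_ attribute, which maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = CountVectorizer(binary=True)
bow_matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per unique token
n_documents, n_tokens = bow_matrix.shape
print(f"{n_documents} documents, {n_tokens} unique tokens")

# vocabulary_ maps each token to the index of its column
global_column_index = vectorizer.vocabulary_['global']
global_column = bow_matrix.toarray()[:, global_column_index]
print(f"Column for 'global': {global_column}")
```

With these settings the column for 'global' is 1 for the first two documents and 0 for the third, which does not contain the word.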

Note

A sparse matrix is a matrix in which most of the elements are zero. It is used to efficiently represent and process data with a high volume of zero values, saving memory and computational resources by only storing the non-zero elements.
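The idea of storing only the non-zero elements can be sketched in plain Python. This is a simplified dictionary-based illustration, not the layout scikit-learn actually uses (its matrices rely on SciPy's more compact formats); the helper name to_dense is hypothetical.

```python
# A small dense matrix in which most elements are zero
dense = [
    [0, 0, 2, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Keep only the non-zero entries as {(row, column): value}
sparse = {
    (row_idx, col_idx): value
    for row_idx, row in enumerate(dense)
    for col_idx, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(sparse)
print(f"{len(sparse)} stored values instead of {total_cells}")


def to_dense(entries, n_rows, n_cols):
    """Rebuild the full matrix (hypothetical helper, for illustration)."""
    result = [[0] * n_cols for _ in range(n_rows)]
    for (row_idx, col_idx), value in entries.items():
        result[row_idx][col_idx] = value
    return result


print(to_dense(sparse, 3, 4) == dense)
```

The saving grows with the share of zeros, which is typically very high for BoW matrices because each document uses only a small part of the vocabulary.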

To create a frequency-based bag of words model, simply remove the binary=True argument, since the parameter defaults to False:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        'Global climate change poses significant risks to global ecosystems.',
        'Global warming and climate change demand urgent action.',
        'Sustainable environmental practices support environmental conservation.',
    ]
    # Create a frequency-based Bag of Words model
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(corpus)
    print(bow_matrix.toarray())

Converting the Matrix to a DataFrame

It is often convenient to convert the resulting bag of words matrix into a pandas DataFrame. The CountVectorizer instance provides the get_feature_names_out() method, which returns an array of the unique words (feature names) used in the model. These feature names can then be used as the columns of the DataFrame:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    corpus = [
        'Global climate change poses significant risks to global ecosystems.',
        'Global warming and climate change demand urgent action.',
        'Sustainable environmental practices support environmental conservation.',
    ]
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(corpus)
    # Convert a sparse matrix to a DataFrame
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    print(bow_df)

With this representation, we can now easily access not only the vector for a particular document, but also the vector of a particular word:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    corpus = [
        'Global climate change poses significant risks to global ecosystems.',
        'Global warming and climate change demand urgent action.',
        'Sustainable environmental practices support environmental conservation.',
    ]
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(corpus)
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    # Print the vector for 'global' as a NumPy array
    print(f"Vector for the word 'global': {bow_df['global'].values}")

Since each unique word corresponds to a column, accessing a word vector is as simple as accessing a column in the DataFrame by specifying the word (for example, 'global'). We also use the values attribute to obtain an array instead of a Series as the result.


Given a BoW matrix, what do the different components of this matrix represent?

Rows:
Columns:

A particular element of the matrix:

