Bag of Words

Understanding the BoW Model

The bag of words (BoW) model represents documents as vectors where each dimension corresponds to a unique word. Each dimension can either represent the presence of a word within the document (1 if present, 0 if absent) or its frequency (word count). Therefore, BoW models can be either binary or frequency-based.

Let's take a look at how the same sentence (document) is represented by each type:

A binary model represents this document as the vector [1, 1, 1], while a frequency-based model represents it as [2, 1, 2], taking word frequency into account.
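As a minimal sketch of how such vectors arise, the snippet below builds both representations by hand using only the standard library. The sentence is a hypothetical one, chosen so that it yields the same vectors as above; it is not taken from the original illustration. Splitting on whitespace is also a simplifying assumption.

```python
from collections import Counter

# Hypothetical example sentence (an assumption made for illustration)
sentence = "data science rocks data science"

# Naive tokenization: lowercase and split on whitespace
tokens = sentence.lower().split()

# The vocabulary defines the dimensions; sort it for a stable order
vocabulary = sorted(set(tokens))
counts = Counter(tokens)

# Frequency-based vector: how many times each word occurs
frequency_vector = [counts[word] for word in vocabulary]
# Binary vector: 1 if the word occurs at all, 0 otherwise
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(vocabulary)        # ['data', 'rocks', 'science']
print(binary_vector)     # [1, 1, 1]
print(frequency_vector)  # [2, 1, 2]
```

Both vectors share the same dimensions; they differ only in whether repeated words are counted.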

BoW Implementation

Implementing the BoW model is a straightforward process, especially with the help of the sklearn (scikit-learn) library and its CountVectorizer class.

Here is an implementation of a binary bag of words model:


from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a binary Bag of Words model
vectorizer = CountVectorizer(binary=True)
# Generate a BoW matrix
bow_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a dense array
print(bow_matrix.toarray())

Each row in the matrix corresponds to a document, and each column to a token (word). To print the matrix in a readable form, we converted it from a sparse matrix into a dense 2D array using the .toarray() method.

Study More

A sparse matrix is a matrix in which most of the elements are zero. It is used to efficiently represent and process data with a high volume of zero values, saving memory and computational resources by only storing the non-zero elements.
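To make the idea concrete, here is a small sketch that stores only the non-zero entries of a matrix in a dictionary keyed by (row, column). The values and the helper to_dense are hypothetical. This dictionary layout is only one simple way to hold sparse data, not necessarily the format used by the SciPy sparse matrix that CountVectorizer returns.

```python
# A small dense matrix that is mostly zeros (illustrative values)
dense = [
    [0, 0, 2, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]

# Keep only the non-zero entries, keyed by (row, column)
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}


def to_dense(entries, n_rows, n_cols):
    """Rebuild the full matrix, filling missing positions with zeros."""
    return [[entries.get((i, j), 0) for j in range(n_cols)] for i in range(n_rows)]


print(sparse)
print(len(sparse), "stored values instead of", len(dense) * len(dense[0]))
```

Rebuilding the full matrix from the stored entries plays the same role as calling .toarray() on the BoW matrix.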

In order to create a frequency-based bag of words model, all we have to do is remove the parameter binary=True since the default value for it is False:


from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a frequency-based Bag of Words model
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

Converting the Matrix to a DataFrame

It is often convenient to convert the resulting bag of words matrix into a pandas DataFrame. The CountVectorizer instance offers the get_feature_names_out() method, which returns an array of the unique words (feature names) used in the model. These feature names can then be used as the columns of the DataFrame:


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)

With this representation, we can now easily access not only the vector for a particular document, but also the vector of a particular word:


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
# Print the vector for 'global' as a NumPy array
print(f"Vector for the word 'global': {bow_df['global'].values}")

Since each unique word corresponds to a column, accessing a word vector is as simple as accessing a column in the DataFrame by specifying the word (for example, 'global'). We also use the values attribute to obtain an array instead of a Series as the result.
