Introduction to NLP
Bag of Words
Understanding the BoW Model
As mentioned in the previous chapter, the bag of words (BoW) model represents documents as vectors in which each dimension corresponds to a unique word. Each dimension can encode either the presence of a word in the document (1 if present, 0 if absent) or its frequency (word count). BoW models are therefore either binary or frequency-based, respectively.
Let's take a look at how the same sentence (document) is represented by each type:
As you can see, the binary model represents this document as the vector [1, 1, 1], while the frequency-based model represents it as [2, 1, 2], taking word frequencies into account.
BoW Implementation
Let's now delve into implementing the BoW model in Python. This is straightforward with the sklearn (scikit-learn) library and its CountVectorizer class.
Without further ado, let's proceed with an example of a binary bag of words:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a binary Bag of Words model
vectorizer = CountVectorizer(binary=True)

# Generate a BoW matrix
bow_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse matrix into a dense array
print(bow_matrix.toarray())
Each row in the matrix corresponds to a document, and each column to a token (word). To display it, we converted the sparse matrix into a dense 2D array using the .toarray() method.
To create a frequency-based bag of words model, all we have to do is remove the binary=True parameter, since its default value is False:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a frequency-based Bag of Words model
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
Converting the Matrix to a DataFrame
It can be quite convenient to convert the resulting bag of words matrix into a pandas DataFrame. Moreover, the CountVectorizer instance provides the get_feature_names_out() method, which returns an array of the unique words (feature names) used in the model. These feature names can be set as the columns of the resulting DataFrame. Here is an example:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse matrix to a DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)
print(bow_df)
With this representation, we can now easily access not only the vector for a particular document but also the vector for a particular word:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)

# Print the vector for 'global' as a NumPy array
print(f"Vector for the word 'global': {bow_df['global'].values}")
Since each unique word corresponds to a column, accessing a word's vector is as simple as selecting that column of the DataFrame by the word itself (for example, 'global'). We also use the values attribute to obtain a NumPy array instead of a Series as the result.