Introduction to NLP
Bag of Words
Understanding the BoW Model
The bag of words (BoW) model represents documents as vectors where each dimension corresponds to a unique word. Each dimension can either represent the presence of a word within the document (1 if present, 0 if absent) or its frequency (word count). Therefore, BoW models can be either binary or frequency-based.
Let's take a look at how the same sentence (document) is represented by each type:
A binary model represents this document as the vector [1, 1, 1], while the frequency-based model represents it as [2, 1, 2], taking word frequency into account.
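The original example sentence is not reproduced here, but any sentence with three unique words, two of which occur twice, produces exactly these two vectors. As a hypothetical illustration (the sentence below is our own, not the one from the lesson):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical document: three unique words, two of them repeated twice
# (CountVectorizer sorts the vocabulary alphabetically: blue, sea, sky)
document = ["blue sky blue sea sky"]

# Binary model: 1 if the word is present, 0 otherwise
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(document).toarray()[0])  # [1 1 1]

# Frequency-based model: each entry is the word's count
count_vectorizer = CountVectorizer()
print(count_vectorizer.fit_transform(document).toarray()[0])  # [2 1 2]
```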
BoW Implementation
Implementing the BoW model is a straightforward process, especially with the help of the `sklearn` (scikit-learn) library and its `CountVectorizer` class.

Here is an implementation of a binary bag of words model:
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a binary Bag of Words model
vectorizer = CountVectorizer(binary=True)

# Generate a BoW matrix
bow_matrix = vectorizer.fit_transform(corpus)

# Convert a sparse matrix into a dense array
print(bow_matrix.toarray())
```
Each row in the matrix corresponds to a document, and each column to a token (word). To represent it visually, we converted this sparse matrix into a dense 2D array using the `.toarray()` method.
A sparse matrix is a matrix in which most of the elements are zero. It is used to efficiently represent and process data with a high volume of zero values, saving memory and computational resources by only storing the non-zero elements.
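To make this concrete, here is a small sketch, using the same corpus as the examples in this lesson, that compares how many values the sparse matrix actually stores against the total number of cells the dense matrix would hold:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

bow_matrix = CountVectorizer().fit_transform(corpus)

# .nnz counts the stored (non-zero) entries;
# the dense matrix holds rows * columns cells in total
total_cells = bow_matrix.shape[0] * bow_matrix.shape[1]
print(f"Stored non-zero values: {bow_matrix.nnz} out of {total_cells} cells")
```

Most cells are zero because each document uses only a small fraction of the overall vocabulary, which is exactly the situation the sparse representation is designed for.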
To create a frequency-based bag of words model, all we have to do is remove the `binary=True` parameter, since its default value is `False`:
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a frequency-based Bag of Words model
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
```
Converting the Matrix to a DataFrame
It can be quite convenient to convert the resulting bag of words matrix into a pandas `DataFrame`. Moreover, the `CountVectorizer` instance offers the `get_feature_names_out()` method, which retrieves an array of unique words (feature names) used in the model. These feature names can then be used as the columns of the `DataFrame`:
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# Convert a sparse matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)
```
With this representation, we can now easily access not only the vector for a particular document, but also the vector of a particular word:
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Print the vector for 'global' as a NumPy array
print(f"Vector for the word 'global': {bow_df['global'].values}")
```
Since each unique word corresponds to a column, accessing a word vector is as simple as accessing a column in the `DataFrame` by specifying the word (for example, `'global'`). We also use the `values` attribute to obtain a NumPy array instead of a `Series` as the result.