Introduction to NLP
Customizing Bag of Words
The Bag of Words model, as implemented by scikit-learn's CountVectorizer class, exposes several parameters for customization. Tuning them tailors the vocabulary to the needs of a specific text analysis task, which can improve the model's effectiveness.
Minimum and Maximum Document Frequency
The min_df parameter sets the minimum number of documents a term must appear in to be included in the vocabulary. It accepts either an absolute count (an integer) or a proportion of documents (a float between 0.0 and 1.0). It helps exclude rare terms, which are often less informative.
Similarly, max_df sets the maximum number of documents a term may appear in and still remain in the vocabulary, also given as an absolute count or a proportion. It filters out overly common terms that contribute little to distinguishing between documents.
Let's take a look at an example:
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "Quick brown foxes leap over lazy dogs in summer.",
        "The quick brown fox is often seen jumping over lazy dogs.",
        "In summer, the lazy dog plays while the quick brown fox rests.",
        "A quick brown fox is quicker than the laziest dog."
    ]

    # Exclude words which appear in more than 3 documents
    vectorizer = CountVectorizer(max_df=3)
    bow_matrix = vectorizer.fit_transform(corpus)
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    print(bow_df)
Setting max_df=3 excludes words that appear in more than 3 documents. In our corpus, these include words like "quick" and "brown". Because such words appear in every or almost every document, they do little to differentiate between documents. Alternatively, we could set max_df=0.6, since 60% of 5 documents is 3 documents.
N-gram Range
The ngram_range parameter allows you to define the range of n-gram sizes to be included in the vocabulary.
By default, CountVectorizer considers only unigrams (single words). However, including bigrams (pairs of words), trigrams (triplets of words), or larger n-grams can enrich the model by capturing more context and semantic information, potentially improving performance.
This is achieved by passing a tuple (min_n, max_n) to the ngram_range parameter, where min_n represents the minimum n-gram size to include, and max_n represents the maximum size.
Let's now focus exclusively on trigrams that appear in two or more documents within our corpus:
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "Quick brown foxes leap over lazy dogs in summer.",
        "The quick brown fox is often seen jumping over lazy dogs.",
        "In summer, the lazy dog plays while the quick brown fox rests.",
        "A quick brown fox is quicker than the laziest dog."
    ]

    # Include trigrams which appear in 2 or more documents
    vectorizer = CountVectorizer(min_df=2, ngram_range=(3, 3))
    bow_matrix = vectorizer.fit_transform(corpus)
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    print(bow_df)
These are the most commonly used parameters. If you want to explore the others, refer to the documentation.