Introduction to NLP
Customizing Bag of Words
The Bag of Words model, particularly its implementation through the CountVectorizer class, offers several parameters for customization. This allows it to be tailored to the specific needs of various text analysis tasks, significantly enhancing the model's effectiveness.
Minimum and Maximum Document Frequency
The min_df parameter defines the minimum number of documents a term must appear in to be included in the vocabulary, specified either as an absolute number or as a proportion. It helps exclude rare terms, which are often less informative.
Similarly, max_df determines the maximum number of documents a term can appear in while remaining in the vocabulary, also specifiable as an absolute number or a proportion. It filters out overly common terms that don't help distinguish between documents.
Let's take a look at an example:
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Quick brown foxes leap over lazy dogs in summer.",
    "The quick brown fox is often seen jumping over lazy dogs.",
    "In summer, the lazy dog plays while the quick brown fox rests.",
    "A quick brown fox is quicker than the laziest dog."
]

# Exclude words which appear in more than 3 documents
vectorizer = CountVectorizer(max_df=3)
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)
```
Setting max_df=3 excludes words that appear in more than 3 documents. In our corpus, these include words like "quick" and "brown": since they appear in every or almost every document, they do not really help differentiate between documents. Alternatively, we could set max_df=0.6, as 60% of 5 documents is 3 documents.
N-gram Range
The ngram_range parameter allows you to define the range of n-gram sizes to be included in the vocabulary.
By default, CountVectorizer considers only unigrams (single words). However, including bigrams (pairs of words), trigrams (triplets of words), or larger n-grams can enrich the model by capturing more context and semantic information, potentially improving performance.
This is achieved by passing a tuple (min_n, max_n) to the ngram_range parameter, where min_n represents the minimum n-gram size to include, and max_n represents the maximum size.
Let's now focus exclusively on trigrams that appear in two or more documents within our corpus:
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Quick brown foxes leap over lazy dogs in summer.",
    "The quick brown fox is often seen jumping over lazy dogs.",
    "In summer, the lazy dog plays while the quick brown fox rests.",
    "A quick brown fox is quicker than the laziest dog."
]

# Include trigrams which appear in 2 or more documents
vectorizer = CountVectorizer(min_df=2, ngram_range=(3, 3))
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)
```
These are the most commonly used parameters; if you want to explore more of them, refer to the documentation.