Implementing TF-IDF

Default Implementation

The implementation of the TF-IDF model in sklearn is similar to that of the bag of words model. To train this model on a corpus, we use the TfidfVectorizer class and call the familiar .fit_transform() method, which learns the vocabulary and the IDF weights and returns the TF-IDF matrix in sparse form.


from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)

Aside from using a different class, the rest of the implementation is identical to that of the bag of words model. By default, each row of the TF-IDF matrix is L2-normalized, so every non-empty document vector has unit Euclidean length.
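
To make the default weighting concrete, the sketch below recomputes the matrix by hand and compares it with the output of TfidfVectorizer. It assumes scikit-learn's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), under which the IDF of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Raw term counts (the bag of words matrix); same tokenization as TfidfVectorizer
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, assuming the default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference result from the default TF-IDF model
sklearn_tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

matches = np.allclose(manual_tfidf, sklearn_tfidf)
row_norms = np.linalg.norm(sklearn_tfidf, axis=1)
print(matches)
print(row_norms)
```

If the defaults are changed (for example norm=None or smooth_idf=False), the manual steps above would need to be adjusted accordingly.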

Customizing TF-IDF

As with CountVectorizer, we can set the min_df and max_df parameters to keep only terms that occur in at least min_df documents and at most max_df documents. Each can be given either as an integer (an absolute number of documents) or as a float between 0.0 and 1.0 (a proportion of the total number of documents).

Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:


from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)

To specify the n-grams to include in our matrix, we can use the ngram_range parameter, a (min_n, max_n) tuple giving the smallest and largest n-gram sizes to extract. Let's include only bigrams in the resulting matrix:


from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)

These are the most commonly used parameters. To explore the others, refer to the documentation.
