Implementing TF-IDF
Default Implementation
The implementation of the TF-IDF model in sklearn is similar to that of the bag of words model. To train this model on a corpus, we use the TfidfVectorizer class and its familiar .fit_transform() method.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
Aside from using a different class, the rest of the implementation is identical to that of the bag of words model. By default, the TF-IDF matrix is computed with L2 normalization.
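To make the default L2 normalization concrete, here is a minimal sketch that checks the length of each document vector and then rebuilds the matrix from raw counts. It assumes the default settings of both vectorizers, so that their vocabularies line up column by column. The variable names (row_norms, raw_scores, manual_tfidf) are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()

# With the default norm='l2', every document vector has Euclidean length 1
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)

# Rebuild the same matrix by hand: raw counts * learned IDF weights,
# then divide each row by its L2 norm
counts = CountVectorizer().fit_transform(corpus).toarray()
raw_scores = counts * vectorizer.idf_
manual_tfidf = raw_scores / np.linalg.norm(raw_scores, axis=1, keepdims=True)
print(np.allclose(manual_tfidf, tfidf))
```

Passing norm=None to TfidfVectorizer skips the final division and returns the unnormalized scores, which can be useful when inspecting the raw weights.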
Customizing TF-IDF
Once again, similar to CountVectorizer, we can specify the min_df and max_df parameters to include only terms that occur in at least min_df documents and at most max_df documents. These can be specified either as absolute numbers of documents (integers) or as proportions of the total number of documents (floats between 0.0 and 1.0).
Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
To specify the n-grams to include in our matrix, we can use the ngram_range parameter. Let's include only bigrams in the resulting matrix:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
These are the most commonly used parameters. To explore the others, refer to the documentation.