Introduction to NLP
Implementing TF-IDF
Default Implementation
The implementation of the TF-IDF model in sklearn is similar to that of the bag of words model. To train this model on a corpus, we use the TfidfVectorizer class, utilizing the familiar .fit_transform() method.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
Aside from using a different class, the rest of the implementation is identical to that of the bag of words model. By default, the TF-IDF matrix is computed with L2 normalization.
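To make the L2 normalization concrete, here is a minimal sketch that checks the length of each document vector. It assumes NumPy is available and uses the norm=None setting of TfidfVectorizer for comparison; the variable names are illustrative and not part of the original lesson.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Default settings: each row (document vector) is scaled to unit Euclidean length
normalized = TfidfVectorizer().fit_transform(corpus)
normalized_row_norms = np.linalg.norm(normalized.toarray(), axis=1)
print(normalized_row_norms)

# norm=None switches normalization off, so rows keep their raw TF-IDF magnitudes
raw = TfidfVectorizer(norm=None).fit_transform(corpus)
raw_row_norms = np.linalg.norm(raw.toarray(), axis=1)
print(raw_row_norms)
```

With the default settings every row norm should come out as 1 (up to floating-point error), which makes documents of different lengths directly comparable.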
Customizing TF-IDF
Once again, similar to CountVectorizer, we can specify the min_df and max_df parameters to include only terms that occur in at least min_df documents and at most max_df documents. Each can be given either as an integer (an absolute number of documents) or as a float between 0.0 and 1.0 (a proportion of the total number of documents).
Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
To specify the n-grams to include in our matrix, we can use the ngram_range parameter, which takes a (min_n, max_n) tuple. Let's include only bigrams in the resulting matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
These are the most commonly used parameters. To explore the others, refer to the TfidfVectorizer documentation.