Implementing TF-IDF
Default Implementation
The implementation of the TF-IDF model in sklearn is similar to that of the bag of words model. To train this model on a corpus, we use the TfidfVectorizer class, utilizing the familiar .fit_transform() method.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
Aside from using a different class, the rest of the implementation is identical to that of the bag of words model. By default, the TF-IDF matrix is computed with L2 normalization.
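To see what the default L2 normalization means in practice, here is a minimal sketch that measures the Euclidean length of each document's row in the matrix. The variable names `row_norms` and `raw_norms` are ours, chosen for illustration, and the comparison with `norm=None` is included only to show the unnormalized case.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Default model: rows are L2-normalized
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)
# Euclidean (L2) length of each document vector
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)

# For comparison: norm=None switches the normalization off
raw_matrix = TfidfVectorizer(norm=None).fit_transform(corpus)
raw_norms = np.linalg.norm(raw_matrix.toarray(), axis=1)
print(raw_norms)
```

With the default settings, every value in `row_norms` should equal 1 up to floating-point rounding, while the values in `raw_norms` depend on the length and content of each document.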
Customizing TF-IDF
Once again, similar to CountVectorizer, we can specify the min_df and max_df parameters to include only terms that occur in at least min_df documents and at most max_df documents. These can be given either as an absolute number of documents (an integer) or as a proportion of the total number of documents (a float between 0.0 and 1.0).
Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
To specify the n-grams to include in our matrix, we can use the ngram_range parameter. Let's include only bigrams in the resulting matrix:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]
# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
```
These are the most commonly used parameters. To explore the others, refer to the documentation.