Implementing TF-IDF

Default Implementation

The implementation of the TF-IDF model in sklearn is similar to that of the Bag of Words model. To train this model on a corpus, we use the TfidfVectorizer class together with the already familiar .fit_transform() method.

Let's take a look at an example:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert the sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)

As you can see, aside from using a different class, the rest of the implementation is identical to that of the Bag of Words model. By default, the TF-IDF matrix is computed as described in the previous chapter, with L2 normalization applied to each document vector (row).
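
If you want the raw, unnormalized TF-IDF weights instead, the norm parameter can be set to None. The snippet below is a minimal sketch (reusing the corpus from the example above) that checks the default L2 normalization and then disables it:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Default: each document vector is L2-normalized, so its Euclidean norm is 1
tfidf_l2 = TfidfVectorizer().fit_transform(corpus)
print(np.linalg.norm(tfidf_l2.toarray(), axis=1))  # approximately [1. 1. 1.]

# norm=None keeps the raw TF-IDF weights without normalization
tfidf_raw = TfidfVectorizer(norm=None).fit_transform(corpus)
print(np.linalg.norm(tfidf_raw.toarray(), axis=1))  # generally not 1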

Customizing TF-IDF

Once again, similar to CountVectorizer, we can specify the min_df and max_df parameters to include only terms that occur in at least min_df documents and at most max_df documents. Both parameters can be given either as absolute document counts (integers) or as a proportion of the total number of documents (floats between 0.0 and 1.0).

Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
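
As a sketch of the proportional form (same corpus as above), passing a float such as min_df=0.5 keeps only terms that appear in roughly at least half of the documents; with our three documents, that means a term must occur in at least two of them:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# min_df as a proportion: drop terms appearing in fewer than 50% of documents
vectorizer = TfidfVectorizer(min_df=0.5)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)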

To specify the n-grams to include in our matrix, we can use the ngram_range parameter. Let's include only bigrams in the resulting matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
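
Note that ngram_range=(2, 2) keeps bigrams only. To combine several n-gram sizes, pass the minimum and maximum sizes, e.g. (1, 2) for unigrams and bigrams together. Here is a quick sketch with the same corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include both unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)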

These are the most commonly used parameters; however, if you want to explore more of them, you can refer to the documentation.

Task

Your task is to display the vector for the 'medical' unigram in a TF-IDF model with unigrams, bigrams, and trigrams (one possible approach is sketched after the steps):

  1. Import the TfidfVectorizer class to create a TF-IDF model.
  2. Instantiate the TfidfVectorizer class as tfidf_vectorizer so that it includes unigrams, bigrams, and trigrams.
  3. Utilize the appropriate method of tfidf_vectorizer to generate a TF-IDF matrix from the 'Document' column in the corpus.
  4. Convert tfidf_matrix to a dense array and create a DataFrame from it, setting the unique features (terms) as its columns. Assign this to the variable tfidf_matrix_df.
  5. Display the vector for 'medical' as an array, rather than as a pandas Series.
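
Below is a minimal sketch of the kind of solution these steps describe. It assumes a pandas DataFrame named corpus with a 'Document' column; the exercise provides the real corpus, so the sample documents used here are purely illustrative stand-ins:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Illustrative stand-in for the exercise's corpus DataFrame with a 'Document' column
corpus = pd.DataFrame({'Document': [
    'Medical research improves patient outcomes.',
    'New medical equipment supports modern hospitals.',
    'Doctors rely on medical data for diagnosis.',
]})

# Include unigrams, bigrams, and trigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus['Document'])

# Dense DataFrame with the unique terms as columns
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the vector for the 'medical' unigram as an array, not a pandas Series
print(tfidf_matrix_df['medical'].values)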
