TF-IDF

Understanding TF-IDF

While the bag of words model is simple and effective, it tends to overvalue common terms, making it harder to identify less frequent but more informative words. To address this, the TF-IDF model is often used instead.

Definition

TF-IDF (term frequency-inverse document frequency) is a statistical measure that reflects how important a word is to a specific document relative to a larger corpus.

Unlike BoW, which relies on raw term counts, TF-IDF accounts for both a term's frequency within a document and its inverse frequency across the entire corpus. This reduces the influence of common terms and highlights rarer, more informative ones.

How TF-IDF Works

The TF-IDF score for a term in a document is calculated as:

\operatorname{tf\text{-}idf}(t, d) = \operatorname{tf}(t, d) \times \operatorname{idf}(t)

where:

  • t is the term (a word or n-gram);

  • d is the document.

There are many variants for computing the tf and idf values. Let's look at one common option for each:

Term frequency (TF)

Indicates how often a term appears in a document, capturing its relative importance within that document. As in the bag of words model, a simple count is often used:

\operatorname{tf}(t, d) = \operatorname{count}(t, d)
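As a quick illustration, here is a minimal Python sketch of this count-based term frequency, assuming lowercased, whitespace-tokenized text (the example sentence is made up for illustration):

```python
from collections import Counter

def tf(term: str, document: str) -> int:
    """Raw count of a term in a lowercased, whitespace-tokenized document."""
    return Counter(document.lower().split())[term]

print(tf("cat", "A cat sat on a mat with a cat"))  # 2
```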

Inverse document frequency (IDF)

Measures how rare a term is across the entire corpus. It can be computed as the natural logarithm of the (smoothed) ratio between the total number of documents and the number of documents containing the term:

\operatorname{idf}(t) = \log\Bigl(\frac{1 + N_{documents}}{1 + \operatorname{df}(t)}\Bigr) + 1

This formula uses smoothing (adding 1) to avoid division by zero and ensures that even common terms receive a non-zero IDF score. In effect, IDF downweights frequent terms and emphasizes more informative, rare ones.

Without the IDF component, TF-IDF would reduce to a simple term count — essentially reverting to a bag of words model.
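To make the formulas concrete, here is a small from-scratch sketch that computes the smoothed IDF and the resulting TF-IDF score. The two-document corpus below is hypothetical and used only to show the arithmetic:

```python
import math
from collections import Counter

def idf(term: str, documents: list[str]) -> float:
    """Smoothed IDF: ln((1 + N) / (1 + df(t))) + 1."""
    n = len(documents)
    df = sum(term in doc.lower().split() for doc in documents)
    return math.log((1 + n) / (1 + df)) + 1

def tf_idf(term: str, document: str, documents: list[str]) -> float:
    """tf-idf(t, d) = count(t, d) * idf(t)."""
    return Counter(document.lower().split())[term] * idf(term, documents)

# Hypothetical two-document corpus, used only to illustrate the calculation.
corpus = ["a cat is on a mat", "a dog is here"]

print(tf_idf("a", corpus[0], corpus))    # 2 * 1.0 = 2.0 ("a" appears in both documents)
print(tf_idf("cat", corpus[0], corpus))  # 1 * (ln(1.5) + 1) ≈ 1.405465
```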

Calculating TF-IDF

Let's walk through a simple example:

In this case, we have just two documents and are using only unigrams (individual words), so the calculations are straightforward. We begin by computing the term frequencies for each word in both documents, followed by the IDF values for the terms "a" and "is".

Note

Since there are only two documents in our corpus, every term that appears in both documents will have an IDF value of 1, while terms that appear in only one document will have an IDF value of ~1.405465.

Finally, we can compute the TF-IDF values for each term in each document by multiplying TF by IDF, resulting in the following matrix:
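The chapter's original example documents are not reproduced here, but the same computation can be sketched with scikit-learn's TfidfVectorizer, whose default smooth_idf=True setting matches the IDF formula above. The two sentences below are hypothetical stand-ins; norm=None keeps the raw (unnormalized) TF-IDF values so the output mirrors this step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Hypothetical two-document corpus (the chapter's original documents are not shown here).
corpus = ["a cat is on a mat", "a dog is here"]

# token_pattern keeps single-character tokens such as "a";
# norm=None disables L2 normalization so we see the raw TF-IDF values;
# smooth_idf=True (the default) matches idf(t) = ln((1 + N) / (1 + df(t))) + 1.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", norm=None, smooth_idf=True)
tfidf_matrix = vectorizer.fit_transform(corpus)

print(pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out()))
```

Each row of the printed matrix is one document, and each column holds the TF-IDF value of one vocabulary term.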

L2 Normalization

The resulting TF-IDF vectors can vary significantly in magnitude, especially in large corpora, due to differences in document length. That's why L2 normalization is commonly applied — to scale all vectors to a uniform length, enabling fair and unbiased comparisons of documents of different lengths.

Study More

L2 normalization, also known as Euclidean normalization, is a process applied to individual vectors that adjusts their values to ensure that the length of the vector is 1.

L2 normalization is performed by dividing each term in the vector by the vector's Euclidean norm.

If the document vector looks like this:

d = (w_1, w_2, w_3, ..., w_N)

where w_i is the weight of term i,

then the Euclidean norm looks like this:

\|d\|_2 = \sqrt{w_1^2 + w_2^2 + w_3^2 + ... + w_N^2}

and the normalized vector looks like this:

d_{norm} = \Bigl(\frac{w_1}{\|d\|_2}, \frac{w_2}{\|d\|_2}, \frac{w_3}{\|d\|_2}, ..., \frac{w_N}{\|d\|_2}\Bigr)

Here is how L2 normalization works for a 2-dimensional vector (a document with 2 terms):

Note

Don't worry if the formulas look complex. All we're doing is dividing each TF-IDF value in a document by the length (or magnitude) of that document's TF-IDF vector. This scales the vector so that its length becomes 1, ensuring consistent comparisons of vectors.
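As a sketch of this idea in code, here is how a single (hypothetical) unnormalized TF-IDF vector can be L2-normalized with NumPy, dividing each weight by the vector's Euclidean norm:

```python
import numpy as np

# A hypothetical unnormalized TF-IDF vector for one document.
d = np.array([2.0, 1.405465, 1.0, 1.405465, 1.405465])

l2_norm = np.linalg.norm(d)   # Euclidean length: sqrt(w1^2 + w2^2 + ... + wN^2)
d_normalized = d / l2_norm    # divide every weight by that length

print(d_normalized)
print(np.linalg.norm(d_normalized))  # 1.0 — the normalized vector has unit length
```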

Let's now apply L2 normalization to the TF-IDF matrix we calculated above:

The resulting matrix is exactly what we had as an example in one of the previous chapters.
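In practice, scikit-learn's TfidfVectorizer applies this L2 normalization by default (norm='l2'), so the normalized matrix can also be obtained directly. A short sketch, reusing the hypothetical corpus from the earlier example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

corpus = ["a cat is on a mat", "a dog is here"]  # same hypothetical corpus as before

# Default norm='l2': each row of the TF-IDF matrix is scaled to unit length.
normalized = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(corpus)

# Equivalent: build the raw matrix first, then L2-normalize its rows afterwards.
raw = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", norm=None).fit_transform(corpus)
assert np.allclose(normalized.toarray(), normalize(raw, norm="l2").toarray())
```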

