TF-IDF
Understanding TF-IDF
While the bag of words model is simple and effective, it tends to overvalue common terms, making it harder to identify less frequent but more informative words. To address this, the TF-IDF model is often used instead.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that reflects how important a word is to a specific document relative to a larger corpus.
Unlike BoW, which relies on raw term counts, TF-IDF accounts for both a term's frequency within a document and its inverse frequency across the entire corpus. This reduces the influence of common terms and highlights rarer, more informative ones.
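To make the difference concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer`; the toy corpus is made up for illustration, and the exact weighting formulas are covered in the next section:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration: "the" appears in every document,
# while words like "garden" and "galaxy" are rarer and more informative
corpus = [
    "the cat sat in the garden",
    "the dog ran in the garden",
    "the probe photographed the galaxy",
]

# Bag of words: raw counts, so the ubiquitous "the" gets the largest values
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# TF-IDF: same corpus, but terms shared by all documents are downweighted
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(3))
print(tfidf.get_feature_names_out())
```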
How TF-IDF Works
The TF-IDF score for a term in a document is calculated as:

$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$

where:

- $t$ is the term (a word or n-gram);
- $d$ is the document.
There are many variants for computing the $\mathrm{tf}$ and $\mathrm{idf}$ values. Let's look at one common option for each:
Term frequency (TF)

Indicates how often a term appears in a document, capturing its relative importance within that document. Similar to the bag of words model, a simple raw count is often used:

$$\mathrm{tf}(t, d) = \mathrm{count}(t, d)$$

where $\mathrm{count}(t, d)$ is the number of times the term $t$ occurs in the document $d$.
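As a quick sketch (not the course's own implementation), this raw-count variant might look like the following in Python, using naive whitespace tokenization purely for illustration:

```python
from collections import Counter

def tf(term: str, document: str) -> int:
    """Raw term frequency: the number of times `term` occurs in `document`."""
    return Counter(document.split())[term]  # Counter returns 0 for absent terms

print(tf("cat", "the cat sat near the other cat"))  # 2
```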
Inverse document frequency (IDF)

Measures how rare a term is across the entire corpus. It can be computed as the natural logarithm of the ratio between the total number of documents and the number of documents containing the term:

$$\mathrm{idf}(t) = \ln\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1$$

where $N$ is the total number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents containing the term $t$.
This formula uses smoothing: adding 1 to both the numerator and the denominator avoids division by zero when a term appears in no documents, and adding 1 outside the logarithm ensures that even terms appearing in every document receive a non-zero IDF score. In effect, IDF downweights frequent terms and emphasizes more informative, rare ones.
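For illustration, here is a minimal Python sketch of this smoothed variant (it matches scikit-learn's default `smooth_idf=True` behavior):

```python
import math

def idf(term: str, corpus: list[str]) -> float:
    """Smoothed IDF: ln((1 + N) / (1 + df(t))) + 1."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

docs = ["this is a cat", "that is a dog"]
print(idf("is", docs))   # 1.0: "is" appears in both documents
print(idf("cat", docs))  # ~1.405465: "cat" appears in only one document
```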
Without the IDF component, TF-IDF would reduce to a simple term count — essentially reverting to a bag of words model.
Calculating TF-IDF
Let's walk through a simple example with a corpus of just two documents. Since we are using only unigrams (individual words), the calculations are straightforward. We begin by computing the term frequencies for each word in both documents, followed by the IDF values for the terms "a" and "is".
Since there are only two documents in our corpus, every term that appears in both documents has an IDF value of $\ln(3/3) + 1 = 1$, while every other term has an IDF value of $\ln(3/2) + 1 \approx 1.405465$.
Finally, we can compute the TF-IDF value for each term in each document by multiplying its TF by its IDF, which yields the TF-IDF matrix.
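The original example documents aren't reproduced here, so the sketch below uses two hypothetical stand-ins in which, as in the chapter's example, "a" and "is" appear in both documents:

```python
import math
from collections import Counter

# Hypothetical stand-ins for the chapter's two example documents
corpus = ["this is a cat", "that is a dog"]
n_docs = len(corpus)
vocab = sorted({word for doc in corpus for word in doc.split()})

def idf(term):
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(vocab)
for doc in corpus:
    counts = Counter(doc.split())
    # tf-idf(t, d) = tf(t, d) * idf(t)
    print([round(counts[t] * idf(t), 6) for t in vocab])
```

Terms appearing in both documents keep their raw counts (IDF of 1), while terms unique to one document are scaled up by roughly 1.405465.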
L2 Normalization
The resulting TF-IDF vectors can vary significantly in magnitude, especially in large corpora, due to differences in document length. That's why L2 normalization is commonly applied — to scale all vectors to a uniform length, enabling fair and unbiased comparisons of documents of different lengths.
L2 normalization, also known as Euclidean normalization, is a process applied to individual vectors that adjusts their values to ensure that the length of the vector is 1.
L2 normalization is done by dividing each term weight in the vector by the Euclidean norm of the vector.
If the document vector looks like this:

$$d = (w_1, w_2, \dots, w_n)$$

where $w_i$ is the weight of term $t_i$,

then the Euclidean norm looks like this:

$$\|d\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}$$

and the normalized vector looks like this:

$$d_{\text{norm}} = \left(\frac{w_1}{\|d\|_2}, \frac{w_2}{\|d\|_2}, \dots, \frac{w_n}{\|d\|_2}\right)$$
For intuition, consider a 2-dimensional vector (a document with 2 terms): dividing it by its Euclidean norm rescales it so that it lies on the unit circle.

Don't worry if the formulas look complex. All we're doing is dividing each TF-IDF value in a document by the length (or magnitude) of that document's TF-IDF vector. This scales the vector so that its length becomes 1, ensuring consistent comparisons between vectors.
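As a quick numeric sketch, using a made-up 2-dimensional vector:

```python
import numpy as np

def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Divide each component by the vector's Euclidean (L2) norm."""
    return vec / np.linalg.norm(vec)

v = np.array([3.0, 4.0])                 # a document with 2 term weights
print(l2_normalize(v))                   # [0.6 0.8]
print(np.linalg.norm(l2_normalize(v)))   # 1.0: the vector now has unit length
```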
Let's now apply L2 normalization to the TF-IDF matrix we calculated above.
The resulting matrix is exactly what we had as an example in one of the previous chapters.
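In practice, the whole pipeline (raw term counts, smoothed IDF, and L2 normalization) corresponds to scikit-learn's `TfidfVectorizer` with its default settings. The documents below are the same hypothetical stand-ins used earlier, and the `token_pattern` is widened so single-character tokens like "a" aren't discarded:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["this is a cat", "that is a dog"]  # hypothetical example documents

# norm="l2" and smooth_idf=True are the defaults and match this chapter's
# formulas; the custom token_pattern keeps one-letter tokens such as "a"
vectorizer = TfidfVectorizer(norm="l2", token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(6))
```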