Introduction to NLP
Word Embeddings Basics
Understanding Word Embeddings
Traditional text representation methods like bag of words and TF-IDF have notable limitations. They treat words in isolation, ignoring semantic relationships, and produce high-dimensional, sparse vectors that become computationally inefficient with large corpora.
Word embeddings address these issues by considering the context in which words appear, providing a more nuanced understanding of language.
Word embeddings are dense vector representations of words in a continuous vector space where semantically similar words are mapped to proximate points.
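To make "proximate points" concrete, here is a minimal sketch using cosine similarity, a standard measure of how close two vectors point in the same direction. The three-dimensional vectors below are made-up toy values for illustration only; real embeddings typically have 100 to 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1 mean similar direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (invented values, not from a trained model)
embeddings = {
    "king":  [0.80, 0.65, 0.10],
    "queen": [0.78, 0.70, 0.12],
    "apple": [0.10, 0.05, 0.90],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

In a trained embedding space, semantically related words such as "king" and "queen" end up with a high cosine similarity, while unrelated words score much lower.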
Several models and techniques have been developed to generate meaningful word embeddings:
Word2Vec: developed by Google, Word2Vec represents words as dense vectors using two architectures: continuous bag of words (CBoW), which predicts a word from its surrounding context, and Skip-gram, which predicts surrounding words from a given word;
GloVe: created at Stanford, GloVe (global vectors) generates word embeddings by analyzing global word co-occurrence statistics across the entire corpus, capturing semantic relationships based on the frequency with which word pairs appear together;
FastText: introduced by Facebook AI Research, FastText builds on Word2Vec by representing words as a collection of character n-grams. This enables it to model subword information, improving its ability to handle rare and out-of-vocabulary words, as well as morphologically rich languages.
Word2Vec and FastText are the most commonly used models for generating word embeddings. However, since FastText is essentially an extension of Word2Vec, we will focus only on Word2Vec here.
How Word2Vec Works
Word2Vec transforms words into vectors using a process that starts with one-hot encoding, where each word in a vocabulary is represented by a unique vector marked by a single 1 among zeros. Let's take a look at an example:
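The one-hot step can be sketched in a few lines of plain Python. The vocabulary below is a toy example chosen for illustration:

```python
# Toy vocabulary (illustrative only); in practice it contains every unique word in the corpus
vocabulary = ["cat", "dog", "sat", "on", "the", "mat"]

def one_hot(word, vocab):
    # A vector of zeros with a single 1 at the word's index in the vocabulary
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("cat", vocabulary))  # [1, 0, 0, 0, 0, 0]
print(one_hot("sat", vocabulary))  # [0, 0, 1, 0, 0, 0]
```

Note that the vector's length equals the vocabulary size, which is exactly why these sparse representations become unwieldy for large corpora.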
This vector serves as an input to a neural network, which is designed to 'learn' the word embeddings. The network's architecture can follow one of two models:
CBoW (continuous bag of words): predicts a target word based on the context provided by surrounding words;
Skip-gram: predicts the surrounding context words based on the target word.
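The difference between the two architectures is easiest to see in how their training examples are built. Below is a minimal sketch under assumed inputs (the sentence and a window size of 1 are chosen only to keep the output short):

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 1  # one word on each side (assumption for brevity)

cbow_samples = []       # (context words -> target word)
skipgram_samples = []   # (target word -> one context word)

for i, target in enumerate(sentence):
    # Words to the left and right of the target, truncated at the boundaries
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_samples.append((context, target))
    for ctx_word in context:
        skipgram_samples.append((target, ctx_word))

print(cbow_samples[1])       # (['the', 'sat'], 'cat')
print(skipgram_samples[:2])  # [('the', 'cat'), ('cat', 'the')]
```

CBoW produces one example per target (many inputs, one output), while Skip-gram expands each target into several (target, context word) pairs (one input, many outputs).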
In both Word2Vec architectures, during each training iteration, the model is given a target word together with its surrounding context words, all represented as one-hot encoded vectors. The training dataset is thus effectively composed of these pairs or groups, where each target word is associated with its surrounding context words.
Every word in the vocabulary takes a turn being the target as the model iterates through the text using a sliding context window technique. This technique systematically moves across every word, ensuring comprehensive learning from all possible contexts within the corpus.
A context window is a fixed number of words surrounding a target word that the model uses to learn its context. It defines how many words before and after the target word are taken into account during training.
Let's take a look at an example with a window size equal to 2 to make things clear:
A context window size of 2 means the model will include up to 2 words from both the left and the right sides of the target word, as long as those words are available within the text boundaries. As you can see, if there are fewer than 2 words on either side, the model will include as many words as are available.
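The sliding window with boundary truncation can be sketched as follows (the sentence is an assumed toy input):

```python
sentence = ["natural", "language", "processing", "is", "fun"]
window = 2

def context_for(i, tokens, window):
    # Up to `window` words on each side of position i, truncated at the text boundaries
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

for i, target in enumerate(sentence):
    print(f"{target}: {context_for(i, sentence, window)}")
```

The first word, "natural", has no words to its left, so its context contains only the two words to its right; "processing", sitting in the middle, gets the full four-word context.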