CBoW and Skip-gram Models

A basic understanding of neural networks is recommended for this chapter. If you're unfamiliar with the topic, feel free to explore a dedicated neural networks course first.

Both the CBoW and Skip-gram architectures learn word embeddings through a neural network comprising the following layers:

  • an input layer;

  • a single hidden layer;

  • an output layer.

The weight matrix between the input and hidden layers, denoted as $W^1$ or $E$, serves as the embeddings matrix. Each row of this matrix represents an embedding vector for a corresponding word, with the $i$-th row matching the $i$-th word in the vocabulary.

This matrix contains $V$ (vocabulary size) embeddings, each of size $N$, a dimension we specify. Multiplying the transpose of this matrix (an $N \times V$ matrix) by a one-hot encoded vector (a $V \times 1$ vector) retrieves the embedding for a specific word, producing an $N \times 1$ vector.

The second weight matrix, between the hidden and output layers, is sized $N \times V$. Multiplying the transpose of this matrix (a $V \times N$ matrix) by the hidden layer's $N \times 1$ vector results in a $V \times 1$ vector.
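These shapes are easy to check with a small NumPy sketch. The sizes below, and the names E and W2, are arbitrary placeholders rather than values from the chapter:

```python
import numpy as np

V, N = 10, 4                      # vocabulary size and embedding dimension (arbitrary here)
E = np.random.rand(V, N)          # W^1: embeddings matrix, one row per word
W2 = np.random.rand(N, V)         # second weight matrix, between hidden and output layers

one_hot = np.zeros((V, 1))
one_hot[3] = 1                    # one-hot vector for the word at index 3

hidden = E.T @ one_hot            # (N x V) @ (V x 1) -> (N x 1): that word's embedding
scores = W2.T @ hidden            # (V x N) @ (N x 1) -> (V x 1): one score per vocabulary word

assert np.allclose(hidden.ravel(), E[3])   # the multiplication simply selects row 3 of E
print(hidden.shape, scores.shape)          # (4, 1) (10, 1)
```

The assertion makes the point explicit: multiplying by a one-hot vector is just a row lookup, which is why practical implementations skip the multiplication and index into the matrix directly.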

CBoW

Now, take a look at an example of using a CBoW model:

First, the transpose of the embeddings matrix is multiplied by the one-hot vectors of the context words to produce their embeddings. These embeddings are then summed or averaged, depending on the implementation, to form a single hidden vector. This vector is multiplied by the $W^2$ matrix, resulting in a $V \times 1$ vector.

Finally, this vector passes through the softmax activation function, converting it into a probability distribution, where each element represents the probability of a vocabulary word being the target word.

Afterward, the loss is calculated, and both weight matrices are updated to minimize this loss. Ideally, we want the probability of the target word to be close to 1, while the probabilities for all other words approach zero. This process is repeated for every combination of a target word and its context words.
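Putting these steps together, one CBoW training step could be sketched roughly as follows. Only the forward pass and the loss are shown; the weight updates would come from the gradients of this loss, and all sizes and indices are chosen purely for illustration:

```python
import numpy as np

V, N = 10, 4
E = np.random.rand(V, N)               # embeddings matrix W^1
W2 = np.random.rand(N, V)              # output weight matrix W^2

context_ids = [1, 2, 4, 5]             # indices of the context words
target_id = 3                          # index of the target word

# Hidden layer: average of the context word embeddings (summing is the other option)
hidden = E[context_ids].mean(axis=0)   # shape (N,)

# Output layer: one score per vocabulary word, then softmax
scores = W2.T @ hidden                 # shape (V,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # probability distribution over the vocabulary

# Cross-entropy loss: -log of the probability assigned to the true target word
loss = -np.log(probs[target_id])
print(loss)
```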

Once all combinations have been processed, an epoch is completed. Typically, the neural network is trained over several epochs to ensure accurate learning. Finally, the rows of the resulting embeddings matrix can be used as our word embeddings. Each row corresponds to the vector representation of a specific word in the vocabulary, effectively capturing its semantic properties within the trained model.
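After training, using the embeddings comes down to indexing rows of that matrix. A minimal sketch, where the matrix and the word-to-index mapping below stand in for the ones produced by training and preprocessing:

```python
import numpy as np

E = np.random.rand(10, 4)                       # stands in for the trained embeddings matrix
word2idx = {"king": 0, "queen": 1, "apple": 2}  # stands in for the vocabulary mapping

vec_king, vec_queen = E[word2idx["king"]], E[word2idx["queen"]]

# Cosine similarity is a common way to compare two word embeddings
cosine = vec_king @ vec_queen / (np.linalg.norm(vec_king) * np.linalg.norm(vec_queen))
print(cosine)
```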

Skip-gram

Let's now take a look at a skip-gram model:

As you can see, the process is mostly similar to CBoW. It begins by retrieving the embedding of the target word, which is then used in the hidden layer. This is followed by producing a $V \times 1$ vector in the output layer. This vector, obtained by multiplying the target word's embedding with the output layer's weight matrix, is then transformed by the softmax activation function into a vector of probabilities.

Note

Although this resulting vector of probabilities is the same for all context words associated with a single target word during a single training step, the loss for each context word is calculated individually.

The losses for all context words are summed, and the weight matrices are updated accordingly at each iteration to minimize the total loss. Once the specified number of epochs is completed, the embeddings matrix can be used to obtain the word embeddings.
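A rough sketch of a single Skip-gram training step in the same spirit, again showing only the forward pass and the summed loss, with placeholder sizes and indices:

```python
import numpy as np

V, N = 10, 4
E = np.random.rand(V, N)            # embeddings matrix W^1
W2 = np.random.rand(N, V)           # output weight matrix W^2

target_id = 3
context_ids = [1, 2, 4, 5]          # context words of the target word

# Hidden layer: the target word's embedding
hidden = E[target_id]               # shape (N,)

# One shared probability distribution over the vocabulary
scores = W2.T @ hidden              # shape (V,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# The loss is computed for each context word individually and then summed
loss = -sum(np.log(probs[c]) for c in context_ids)
print(loss)
```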

Study More

In practice, especially with large vocabularies, the softmax function can be too computationally intensive. Therefore, approximations such as negative sampling are often employed to make the computation more efficient.
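To give a flavor of the idea (this is not a full word2vec implementation), negative sampling replaces the full softmax with a few binary decisions per (target, context) pair. In the sketch below, the negative indices would normally be drawn from a noise distribution over the vocabulary:

```python
import numpy as np

V, N = 10, 4
E = np.random.rand(V, N)            # input (target) embeddings
W_out = np.random.rand(V, N)        # output (context) vectors, one row per word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target_id, context_id = 3, 5
negative_ids = [0, 7, 9]            # k "noise" words sampled from the vocabulary

v = E[target_id]                    # target word embedding

# Push the true context word's score up and the negative samples' scores down
loss = -np.log(sigmoid(W_out[context_id] @ v))
loss -= sum(np.log(sigmoid(-W_out[n] @ v)) for n in negative_ids)
print(loss)
```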

1. Fill in the blanks

The ___ architecture tries to predict the target word from its context.
The ___ architecture tries to predict the context of the target word.

2. What does the first weight matrix $W^1$ in the neural network represent?

Select the correct answer.
