Key Types of Vector Space Models

Vector space models can be broadly classified based on the nature of the representation they provide, each with unique characteristics and use cases. Let's now discuss the key concepts around these models, deferring their implementation for later chapters.

Bag of Words

Bag of Words (BoW) is a vector space model that represents documents as vectors, where each dimension corresponds to a unique word in the vocabulary. The representation can be binary (indicating whether a word is present) or frequency-based (indicating how many times it occurs).

Here is an example of a frequency-based BoW:
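
The original example figure is not reproduced here, so below is a minimal sketch of the same idea. The sample sentences and the use of scikit-learn's CountVectorizer are illustrative assumptions, not part of the course material:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short illustrative documents (arbitrary sample text)
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Frequency-based BoW: each dimension counts how often a word occurs in a document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())

# For a binary BoW, CountVectorizer(binary=True) would instead mark
# each word's presence (1) or absence (0)
```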

As you can see, each document is represented by a vector, with each dimension corresponding to the frequency of a specific word within that document. In the case of a binary bag-of-words model, each vector would contain only 0 or 1 for each word, indicating its absence or presence, respectively.

TF-IDF

The TF-IDF (Term Frequency-Inverse Document Frequency) model extends the Bag of Words (BoW) approach by adjusting word frequencies based on their occurrence across all documents. It emphasizes words that are unique to a document, thereby providing more specific insights into the document's content.

This is achieved by combining the term frequency (the number of times a word appears in a document) with the inverse document frequency (a measure of how common or rare a word is across the entire dataset).
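
As a rough preview (the exact calculation is covered in a later chapter), a commonly used formulation of this weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t; keep in mind that libraries often use slightly different variants of this formula.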

Let's modify our previous example with this model:
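
The modified example table is not shown here; the sketch below, which applies scikit-learn's TfidfVectorizer to the same illustrative sentences as before, is an assumption about how such vectors could be produced rather than the course's own example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same illustrative documents as in the BoW sketch above
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# TF-IDF down-weights words that appear in many documents ("the", "sat", "on")
# and emphasizes words unique to a document ("cat", "mat", "dog", "log")
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```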

In one of the upcoming chapters, we will learn how to calculate the TF-IDF value for each word. For now, it's important to note that the resulting vectors, enriched by TF-IDF, display greater variety, offering deeper insights into the document's content.

Word Embeddings and Document Embeddings

We have already mentioned word embeddings in the previous chapter. Essentially, this model maps individual words to dense vectors in a low-dimensional, continuous space, capturing semantic similarities; the individual dimensions of these vectors are not directly interpretable.

Document embeddings, on the other hand, generate dense vectors representing whole documents, capturing the overall semantic meaning.

Let's take a look at an example with the word embeddings for the words "cat", "kitten", "dog", and "house":
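
The embedding table itself is not reproduced here; the sketch below uses made-up 6-dimensional vectors (arbitrary illustrative values, not taken from the course) together with cosine similarity to show how related words end up close together:

```python
import numpy as np

# Arbitrary illustrative 6-dimensional embeddings (not real trained values)
embeddings = {
    "cat":    np.array([0.8, 0.1, 0.7, 0.2, 0.6, 0.1]),
    "kitten": np.array([0.7, 0.2, 0.8, 0.1, 0.5, 0.2]),
    "dog":    np.array([0.6, 0.3, 0.5, 0.3, 0.7, 0.1]),
    "house":  np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.7]),
}

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 for vectors pointing in similar directions
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # high: similar words
print(cosine_similarity(embeddings["cat"], embeddings["house"]))   # lower: unrelated words
```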

We have chosen the size of the embeddings to be 6. Although the numerical values are arbitrary, they illustrate how embeddings can reflect the similarities among words.

In a real-world scenario, these embeddings would be derived from training a model on a text corpus, allowing it to 'learn' the nuanced relationships between words based on actual language use. We will accomplish this in one of the upcoming chapters, so stay tuned!

