Key Types of Vector Space Models | Basic Text Models

Vector space models can be broadly classified based on the nature of the representation they provide, each with unique characteristics and use cases. Let's now discuss the key concepts around these models, deferring their implementation for later chapters.

Bag of Words

Bag of Words (BoW) is a vector space model which represents documents as vectors where each dimension corresponds to a unique word. It can be binary (indicating word presence) or frequency-based (indicating word count).

Here is an example of a frequency-based BoW:

As you can see, each document is represented by a vector, with each dimension corresponding to the frequency of a specific word within that document. In the case of a binary bag-of-words model, each vector would contain only 0 or 1 for each word, indicating its absence or presence, respectively.
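To make the idea concrete, here is a minimal sketch in plain Python that builds both frequency-based and binary BoW vectors. The three toy documents, the whitespace tokenization, and the helper names `frequency_vector` and `binary_vector` are assumptions made for illustration; they are not taken from the course's own example.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat",
]

# Naive whitespace tokenization (an assumption; real pipelines preprocess more carefully).
tokenized = [doc.split() for doc in documents]

# The vocabulary defines the dimensions: one per unique word, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def frequency_vector(tokens, vocabulary):
    """Return the word counts of one document, aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def binary_vector(tokens, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


frequency_bow = [frequency_vector(tokens, vocabulary) for tokens in tokenized]
binary_bow = [binary_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for doc, freq, binary in zip(documents, frequency_bow, binary_bow):
    print(f"{doc!r}: frequency={freq} binary={binary}")
```

Note that every vector has the same length as the vocabulary, and that word order within a document is lost: only presence or counts remain.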

TF-IDF

The TF-IDF (Term Frequency-Inverse Document Frequency) model extends the Bag of Words (BoW) approach by adjusting word frequencies based on their occurrence across all documents. It emphasizes words that are frequent in a given document but rare in the rest of the corpus, and downweights words that appear in most documents, thereby providing more specific insights into the document's content.

This is achieved by combining the term frequency (the number of times a word appears in a document) with the inverse document frequency (a measure of how common or rare a word is across the entire dataset).

Let's modify our previous example with this model:

In one of the upcoming chapters, we will learn how to calculate the TF-IDF value for each word. For now, it's important to note that the resulting vectors hold weights rather than raw counts: a word that is common across all documents contributes little, while a word specific to one document stands out, which tells us more about that document's content.
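Ahead of that chapter, here is a rough sketch of the idea in plain Python. It assumes one simple textbook variant, where the weight is the raw count multiplied by ln(N / df), with N the number of documents and df the number of documents containing the word. The course and common libraries may use a different (for example smoothed or normalized) formula, and the toy documents and the name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat",
]

tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
document_frequency = {
    word: sum(1 for tokens in tokenized if word in tokens)
    for word in vocabulary
}


def tfidf_vector(tokens):
    """Weight each raw count by ln(N / df); this is an assumed, unsmoothed variant."""
    counts = Counter(tokens)
    return [
        counts[word] * math.log(n_docs / document_frequency[word])
        for word in vocabulary
    ]


tfidf = [tfidf_vector(tokens) for tokens in tokenized]

for doc, vector in zip(documents, tfidf):
    weights = {word: round(value, 3) for word, value in zip(vocabulary, vector)}
    print(f"{doc!r}: {weights}")
```

With this variant, a word that occurs in every document (here "the") gets a weight of zero, while a word found in a single document (here "mat") gets the largest weight per occurrence.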

Word Embeddings and Document Embeddings

We have already mentioned word embeddings in the previous chapter. Essentially, this model maps individual words to dense vectors in a low-dimensional, continuous space, so that words with similar meanings end up close to each other. Unlike BoW dimensions, the individual dimensions of an embedding are not directly interpretable.

Document embeddings, on the other hand, generate dense vectors representing whole documents, capturing the overall semantic meaning.

Let's take a look at an example with the word embeddings for the words "cat", "kitten", "dog", and "house":

We have chosen the size of the embeddings to be 6. Although the numerical values are arbitrary, they illustrate how embeddings reflect similarities among words: related words have similar vectors, while unrelated words do not.
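One common way to compare such vectors is cosine similarity. The sketch below uses made-up 6-dimensional values for the same four words; these numbers are hypothetical and are not the values from the example above. The helper `cosine_similarity` is written by hand for illustration.

```python
import math

# Hypothetical 6-dimensional embeddings; the values are invented for illustration
# and do not come from a trained model.
embeddings = {
    "cat":    [0.9, 0.8, 0.1, 0.0, 0.2, 0.1],
    "kitten": [0.8, 0.9, 0.2, 0.1, 0.1, 0.1],
    "dog":    [0.7, 0.3, 0.8, 0.1, 0.2, 0.0],
    "house":  [0.0, 0.1, 0.1, 0.9, 0.8, 0.7],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for other in ("kitten", "dog", "house"):
    similarity = cosine_similarity(embeddings["cat"], embeddings[other])
    print(f"cat vs {other}: {similarity:.3f}")
```

With these invented values, "cat" is closest to "kitten", somewhat similar to "dog", and far from "house", which is the kind of pattern trained embeddings are expected to show.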

In a real-world scenario, these embeddings would be derived from training a model on a text corpus, allowing it to 'learn' the nuanced relationships between words based on actual language use. We will do this in one of the upcoming chapters, so stay tuned!

Order the models by their complexity, from simplest to most complex.
