Introduction to NLP
Vector Space Models

The Need for Numerical Representation

Computers can't interpret text the way humans do. While we derive meaning from language through context, culture, and experience, computers see nothing more than sequences of characters.

To make text accessible to machines, we must translate it into their native language: numbers. Representing text with vectors and matrices enables mathematical and statistical models to uncover patterns, relationships, and insights that would otherwise remain hidden in raw text.

Understanding Vector Space Models

Luckily for us, effective solutions for converting text into numerical form already exist. One of the most widely adopted approaches is the use of vector space models.

Definition

A vector space model (VSM) is a mathematical model that represents text documents, words, or any other items as vectors in a multidimensional space.

There are many ways to construct such vector spaces for text documents. One simple approach is to use the entire corpus vocabulary, assigning each dimension of the space to a unique term.

Definition

A vocabulary is the complete set of unique terms that appear across a given corpus.

Let the corpus vocabulary be denoted as $V$ and the set of documents as $D$. Then, each document $d_i \in D$ can be represented as a vector in $\mathbb{R}^N$:

$$d_i = (w_{1,i}, w_{2,i}, \dots, w_{N,i})$$

where:

  • $N = |V|$ is the total number of unique terms in the vocabulary;

  • $w_{j,i}$ denotes the weight or importance of the term $W_j \in V$ in document $d_i$.

Here's a simple example with just 2 documents and 2 unique terms, visualized in a 2D vector space:
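The construction above can be sketched in a few lines of code. The snippet below assumes raw term counts as the weights $w_{j,i}$ (one common choice; schemes such as TF-IDF are also widely used), and the corpus and naive whitespace tokenization are purely illustrative:

```python
def build_vectors(documents):
    """Build count-based document vectors over a shared vocabulary."""
    # Naive whitespace tokenization; a real pipeline would preprocess first
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the set of unique terms across the whole corpus
    vocabulary = sorted({term for doc in tokenized for term in doc})
    # Each document becomes a vector of term counts, one dimension per term
    vectors = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    return vocabulary, vectors

docs = ["cats chase mice", "mice fear cats and cats"]
vocab, vecs = build_vectors(docs)
print(vocab)  # ['and', 'cats', 'chase', 'fear', 'mice']
print(vecs)   # [[0, 1, 1, 0, 1], [1, 2, 0, 1, 1]]
```

Note that each vector has $N = |V| = 5$ dimensions, regardless of document length.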

Using these vector representations, we can compute a similarity score between documents by measuring the angle between their vectors, typically using cosine similarity.
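Cosine similarity follows directly from the definition $\cos\theta = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$. A minimal pure-Python sketch, applied to two hypothetical count vectors over a 5-term vocabulary:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [0, 1, 1, 0, 1]  # illustrative count vectors
d2 = [1, 2, 0, 1, 1]
print(round(cosine_similarity(d1, d2), 3))  # 0.655
```

A score of 1 means the vectors point in the same direction (maximally similar), while 0 means they are orthogonal (no shared terms).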

Words as Vectors

The idea behind VSMs can be extended to individual word representations through the technique known as word embeddings. Word embeddings operate under a similar mathematical principle, but focus on representing individual words as vectors rather than entire documents. The dimensions in these vectors capture latent semantic features that are not directly interpretable.

Here is an example with 2-dimensional embeddings for three words:

As illustrated in the image, the vectors for "woman" and "queen", as well as for "queen" and "king", are positioned closely, indicating strong semantic similarity. In contrast, the wider angle between "woman" and "king" suggests a greater semantic difference.
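This behavior can be checked numerically. The 2-D coordinates below are made-up illustrative values, not the output of a trained embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Illustrative 2-D embeddings (hypothetical coordinates)
embeddings = {
    "woman": [0.3, 0.8],
    "queen": [0.5, 0.7],
    "king":  [0.8, 0.3],
}

sim_wq = cosine_similarity(embeddings["woman"], embeddings["queen"])
sim_wk = cosine_similarity(embeddings["woman"], embeddings["king"])
print(sim_wq)  # close to 1: small angle, strong similarity
print(sim_wk)  # noticeably lower: wider angle, weaker similarity
```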

Note

Don't worry about word embeddings for now; we will discuss them in detail later.

Applications of Vector Space Models

Vector space models are used in a wide variety of NLP tasks:

  • Semantic similarity: computing the similarity between text documents or words based on their vector representations;

  • Information retrieval: enhancing search engines and recommendation systems to find content relevant to a user's query;

  • Text classification and clustering: automatically categorizing documents into predefined classes or grouping similar documents together;

  • Natural language understanding: facilitating deeper linguistic analysis that paves the way for applications like sentiment analysis, topic modeling, and more.
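As a toy illustration of the information-retrieval case, a query can itself be vectorized over the corpus vocabulary and the documents ranked by cosine similarity. The documents and whitespace tokenization below are simplified assumptions:

```python
import math

def vectorize(text, vocabulary):
    # Count-based vector over the shared vocabulary
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "vector space models represent documents",
    "stemming reduces words to their stems",
    "documents become vectors in a vector space",
]
vocabulary = sorted({t for d in docs for t in d.lower().split()})
query = "vector space documents"

# Rank documents by similarity to the query vector, most similar first
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(vectorize(query, vocabulary), vectorize(d, vocabulary)),
    reverse=True,
)
print(ranked[0])  # vector space models represent documents
```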

Section 3. Chapter 1
