Introduction to NLP
Vector Space Models
The Need for Numerical Representation
Computers can't interpret text the way humans do. While we derive meaning from language through context, culture, and experience, computers see nothing more than sequences of characters.
To make text accessible to machines, we must translate it into their native language: numbers. Representing text with vectors and matrices enables mathematical and statistical models to uncover patterns, relationships, and insights that would otherwise remain hidden in raw text.
Understanding Vector Space Models
Luckily for us, effective solutions for converting text into numerical form already exist. One of the most widely adopted approaches is the use of vector space models.
A vector space model (VSM) is a mathematical model that represents text documents, words, or other items as vectors in a multidimensional space.
There are many ways to construct such vector spaces for text documents. One simple approach is to use the entire corpus vocabulary, assigning each dimension of the space to a unique term.
The vocabulary is the complete set of unique terms that appear across a given corpus.
Let the corpus vocabulary be denoted as V and the set of documents as D. Then, each document d_j ∈ D can be represented as a vector in ℝ^N:

d_j = (w_{1,j}, w_{2,j}, ..., w_{N,j})

where:
N is the total number of unique terms in the vocabulary;
w_{i,j} denotes the weight or importance of the term t_i in document d_j.
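As a minimal sketch of this idea, the snippet below builds raw term-count vectors for two tiny documents over their shared vocabulary (the documents and the simple whitespace tokenization are illustrative; real pipelines normalize text and often use weighted counts):

```python
# Two toy documents (illustrative)
docs = ["cat sat on the mat", "dog sat on the log"]

# Vocabulary: the set of unique terms across the corpus, in a fixed order
# so that each term always maps to the same dimension
vocabulary = sorted({term for doc in docs for term in doc.split()})

def vectorize(doc, vocab):
    """Map a document to a vector of raw term counts (one dimension per term)."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocab]

vectors = [vectorize(doc, vocabulary) for doc in docs]
print(vocabulary)  # 7 unique terms -> a 7-dimensional space
print(vectors)     # one count vector per document
```

Here N = 7, and each document becomes a point in a 7-dimensional space; terms a document doesn't contain simply get weight 0.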
Here's a simple example with just 2 documents and 2 unique terms, visualized in a 2D vector space:
Using these vector representations, we can compute a similarity score between documents by measuring the angle between their vectors, typically using cosine similarity.
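The cosine-similarity computation can be sketched directly from its definition; the two count vectors below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy term-count vectors for two documents (illustrative values)
d1 = [1, 0, 0, 1, 1, 1, 1]
d2 = [0, 1, 1, 0, 1, 1, 1]
print(round(cosine_similarity(d1, d2), 3))  # 0.6
```

Because raw counts are never negative, document similarities here fall between 0 (no shared terms) and 1 (identical direction).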
Words as Vectors
The idea behind VSMs can be extended to individual word representations through the technique known as word embeddings. Word embeddings operate under a similar mathematical principle, but focus on representing individual words as vectors rather than entire documents. The dimensions in these vectors capture latent semantic features that are not directly interpretable.
Here is an example with 2-dimensional embeddings for three words:
As illustrated in the image, the vectors for "woman" and "queen", as well as for "queen" and "king", are positioned closely, indicating strong semantic similarity. In contrast, the wider angle between "woman" and "king" suggests a greater semantic difference.
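To make this concrete, here is a small sketch with made-up 2-dimensional vectors for the three words (real embeddings have hundreds of dimensions and are learned from data, so these numbers are purely illustrative):

```python
import math

# Hypothetical 2-D embedding vectors, chosen only to illustrate the geometry
embeddings = {
    "woman": [0.3, 0.9],
    "queen": [0.5, 0.8],
    "king":  [0.9, 0.4],
}

def cosine_similarity(a, b):
    """Similarity based on the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["woman"], embeddings["queen"]))  # highest
print(cosine_similarity(embeddings["queen"], embeddings["king"]))
print(cosine_similarity(embeddings["woman"], embeddings["king"]))   # lowest
```

With these toy vectors, "woman"–"queen" and "queen"–"king" come out more similar than "woman"–"king", mirroring the geometry described above.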
Don't worry about word embeddings for now; we will discuss them in detail later.
Applications of Vector Space Models
Vector space models are used in a wide variety of NLP tasks:
Semantic similarity: computing the similarity between text documents or words based on their vector representations;
Information retrieval: enhancing search engines and recommendation systems to find content relevant to a user's query;
Text classification and clustering: automatically categorizing documents into predefined classes or grouping similar documents together;
Natural language understanding: facilitating deeper linguistic analysis that paves the way for applications like sentiment analysis, topic modeling, and more.