Learn Vector Space Models | Basic Text Models

Swipe to show menu

The Need for Numerical Representation

Computers can't interpret text the way humans do. While we derive meaning from language through context, culture, and experience, computers see nothing more than sequences of characters.

To make text accessible to machines, we must translate it into their native language: numbers. Representing text with vectors and matrices enables mathematical and statistical models to uncover patterns, relationships, and insights that would otherwise remain hidden in raw text.

Understanding Vector Space Models

Luckily for us, effective solutions for converting text into numerical form already exist. One of the most widely adopted approaches is the use of vector space models.

Definition

Vector space model (VSM) is a mathematical model that represents text documents, words, or any other items as vectors in a multidimensional space.

There are many ways to construct such vector spaces for text documents. One simple approach is to use the entire corpus vocabulary, assigning each dimension of the space to a unique term.

Definition

Vocabulary is the complete set of unique terms that appear across a given corpus.

Let the corpus vocabulary be denoted as $V$ and the set of documents as $D$ . Then, each document $d_i \in D$ can be represented as a vector in $\R^N$ :

d_i = (w_{1,i}, w_{2,i}, ..., w_{N,i})

where:

$N = |V|$ is the total number of unique terms in the vocabulary;
$w_{j,i}$ denotes the weight or importance of the term $W_j \in V$ in document $d_i$ .

Here's a simple example with just 2 documents and 2 unique terms, visualized in a 2D vector space:

Using these vector representations, we can compute a similarity score between documents by measuring the angle between their vectors, typically using cosine similarity.

Words as Vectors

The idea behind VSMs can be extended to individual word representations through the technique known as word embeddings. Word embeddings operate under a similar mathematical principle, but focus on representing individual words as vectors rather than entire documents. The dimensions in these vectors capture latent semantic features that are not directly interpretable.

Here is an example with 2-dimensional embeddings for three words:

As illustrated in the image, the vectors for "woman" and "queen", as well as for "queen" and "king", are positioned closely, indicating strong semantic similarity. In contrast, the wider angle between "woman" and "king" suggests a greater semantic difference.

Note

Don't worry about word embeddings for now, we will discuss them later.

Applications of Vector Space Models

Vector space models are used in a wide variety of NLP tasks:

Semantic similarity: computing the similarity between text documents or words based on their vector representations;
Information retrieval: enhancing search engines and recommendation systems to find content relevant to a user's query;
Text classification and clustering: automatically categorizing documents into predefined classes or grouping similar documents together;
Natural language understanding: facilitating deeper linguistic analysis that paves the way for applications like sentiment analysis, topic modeling, and more.

Everything was clear?

Thanks for your feedback!

Section 3. Chapter 1

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Section 3. Chapter 1