Types of Vector Space Models
Vector space models can be categorized by the way they represent text, ranging from simple frequency-based methods to more advanced, context-aware embeddings. Each approach offers distinct advantages and is suited to different types of NLP tasks.
Bag of Words
Bag of words (BoW) is a vector space model that represents documents as vectors where each dimension corresponds to a unique word. It can be binary (indicating word presence) or frequency-based (indicating word count).
Here is an example of a frequency-based BoW:
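The sketch below shows one way such a representation could be built with scikit-learn's CountVectorizer; the three short documents are purely illustrative placeholders, not data from the course.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (placeholders, not from the original example)
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the rug",
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary: one word per dimension
print(bow_matrix.toarray())                # word counts for each document
```

Passing `binary=True` to `CountVectorizer` would produce the binary variant, where each entry is 0 or 1 instead of a count.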
As you can see, each document is represented by a vector, with each dimension corresponding to the frequency of a specific word within that document. In the case of a binary bag of words model, each vector would contain only 0 or 1 for each word, indicating its absence or presence, respectively.
Text preprocessing is a necessary step before applying BoW or similar models.
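As a minimal illustration of what such preprocessing might involve (the exact steps depend on the task and language), a simple pipeline could lowercase the text, strip punctuation, and split it into tokens:

```python
import re

def preprocess(text):
    """Lowercase, remove punctuation, and split into tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # keep only letters and whitespace
    return text.split()

print(preprocess("The cat sat on the mat!"))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```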
TF-IDF
The TF-IDF (term frequency-inverse document frequency) model extends the bag of words (BoW) approach by adjusting word frequencies based on their occurrence across all documents. It emphasizes words that are unique to a document, thereby providing more specific insights into the document's content.
This is achieved by combining the term frequency (the number of times a word appears in a document) with the inverse document frequency (a measure of how common or rare a word is across the entire dataset).
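One common formulation of this weighting is shown below; the exact variant differs between libraries (scikit-learn, for example, uses a smoothed IDF):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here N is the total number of documents and df(t) is the number of documents containing the term t, so rare terms receive a larger IDF and thus a higher overall weight.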
Here is the result of applying TF-IDF to the documents from the previous example:
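A sketch of how this could be computed with scikit-learn's TfidfVectorizer, reusing the same illustrative documents as in the BoW sketch above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the rug",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))  # words specific to a document get higher weights
```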
The resulting vectors differ more from one another than the raw count vectors do: common words are down-weighted, while words that are distinctive to a particular document receive higher weights, offering deeper insight into each document's content.
Word Embeddings and Document Embeddings
Word embeddings map individual words to dense vectors in a low-dimensional, continuous space. These vectors capture semantic similarities between words, even though their individual dimensions are not directly interpretable.
Document embeddings, on the other hand, generate dense vectors that represent entire documents, capturing their overall semantic meaning.
The dimensionality (size) of embeddings is typically chosen based on project requirements and available computational resources. Selecting the right size is crucial for striking a balance between capturing rich semantic information and maintaining model efficiency.
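As an illustration of document embeddings, a pre-trained model from the sentence-transformers library maps each document to a fixed-size dense vector; the model name below is just one common choice, and any sentence-embedding model would follow the same pattern.

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is one widely used pre-trained model (an assumption, not
# something prescribed by this course)
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The cat sat on the mat.",
    "Stock prices fell sharply today.",
]
embeddings = model.encode(documents)

print(embeddings.shape)  # (2, 384): one 384-dimensional vector per document
```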
Here's an example of what word embeddings for the words "cat", "kitten", "dog", and "house" might look like:
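The vectors below are hypothetical three-dimensional embeddings chosen purely for illustration; real embeddings typically have hundreds of dimensions. Comparing them with cosine similarity shows related words ending up close together.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (illustrative values only)
embeddings = {
    "cat":    np.array([0.80, 0.65, 0.10]),
    "kitten": np.array([0.78, 0.70, 0.12]),
    "dog":    np.array([0.60, 0.55, 0.20]),
    "house":  np.array([0.05, 0.10, 0.90]),
}

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # high: related words
print(cosine_similarity(embeddings["cat"], embeddings["house"]))   # low: unrelated words
```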
Although the numerical values in this example are arbitrary, they illustrate how embeddings can represent meaningful relationships between words.
In real-world applications, such embeddings are learned by training a model on a large text corpus, enabling it to discover subtle patterns and semantic relationships within natural language.
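A minimal sketch of this training process using gensim's Word2Vec; the tiny corpus and the hyperparameters below are placeholders, since real models are trained on far larger corpora.

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus: a list of tokenized sentences
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "kitten", "played", "with", "the", "cat"],
    ["the", "dog", "chased", "the", "cat"],
]

# vector_size controls the embedding dimensionality discussed above
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"].shape)         # (50,): the dense vector learned for "cat"
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity
```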
A further advancement in dense representations is contextual embeddings, generated by models such as BERT and GPT. These models take into account the context in which a word appears when producing its vector, so the same word can have different embeddings depending on the sentence it occurs in, providing a more nuanced understanding of language.
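A sketch of this idea with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint: the word "bank" receives a different vector in each sentence because its surrounding context differs.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word, sentence):
    """Return the contextual vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    # Locate the first occurrence of the word's token in the input
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = embedding_of("bank", "She sat on the bank of the river.")
v2 = embedding_of("bank", "He deposited cash at the bank.")

# The same word gets different vectors depending on its context
print(torch.cosine_similarity(v1, v2, dim=0).item())
```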