Lemmatization

Understanding Lemmatization

Definition

Lemmatization is a text normalization technique used in NLP to reduce words to their dictionary form, known as a lemma.

Unlike stemming, which crudely chops off affixes, lemmatization uses a vocabulary and morphological analysis to map each word to its dictionary form, taking the word's role in the sentence into account. For example, 'am', 'are', and 'is' are all lemmatized to 'be'. Because many inflected forms collapse into a single lemma, this can noticeably reduce the size of the vocabulary (the number of unique words) in large text corpora, which makes model training more efficient.
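To make the contrast concrete, here is a minimal toy sketch in pure Python. The lemma table, the suffix list, and the names toy_stem and toy_lemmatize are hand-written assumptions for illustration only; they are not the rules of any real stemmer or lemmatizer.

```python
# Toy illustration only: this lemma table and suffix list are hand-written
# assumptions, not the rules used by any real stemmer or lemmatizer.
TOY_LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "studies": "study", "studying": "study",
}
TOY_SUFFIXES = ("ing", "ies", "es", "s")


def toy_stem(word: str) -> str:
    """Chop off the first matching suffix, whether or not a real word remains."""
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


tokens = ["am", "are", "is", "studies", "studying", "study"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print("Unique tokens:", len(set(tokens)))                        # 6
print("Unique stems: ", len(set(stems)), sorted(set(stems)))     # 5
print("Unique lemmas:", len(set(lemmas)), sorted(set(lemmas)))   # 2
```

In this toy setup the suffix stripper turns 'studies' into the non-word 'stud' and cannot relate 'am', 'are', and 'is' at all, while the dictionary lookup collapses six distinct tokens into two lemmas.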

The trade-off is cost: lemmatization is usually more accurate than stemming, but it is also more computationally expensive and can be slow on large datasets. Furthermore, for the best accuracy, morphological analysis and part-of-speech tagging should be performed before lemmatization.

Note

Don't worry about part-of-speech tagging for now, as this is the next thing you will learn about.

Lemmatization with NLTK

The WordNet Lemmatizer, provided by the NLTK library, leverages the WordNet corpus to perform lemmatization.

Study More

WordNet is a semantically rich lexical database for English that goes far beyond a simple corpus. It groups words into synonym sets, or synsets, each of which captures a distinct concept and is accompanied by definitions and usage examples. Moreover, WordNet encodes meaningful relationships between these synsets — such as hypernyms (broader, more general terms) and hyponyms (narrower, more specific terms) — offering a powerful framework for exploring and disambiguating word meanings.

When you use the WordNet Lemmatizer, it looks up the target word in the WordNet database to find the most appropriate lemma. If no matching entry is found, the word is returned unchanged.

As mentioned above, the same word can play different roles in different contexts (e.g., "running" as a verb vs. "running" as a noun), so the lemmatizer lets you specify the part of speech (e.g., verb, noun, adjective). This helps it select the correct lemma for the word's role in the sentence.


import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet corpus (only needed once per environment)
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Parts of speech: 'v' for verb and 'n' for noun
parts_of_speech = ['v', 'n']

# Lemmatize the same word once per part of speech
lemmatized_words = [lemmatizer.lemmatize("running", pos) for pos in parts_of_speech]
print("Lemmatized words:", lemmatized_words)  # Lemmatized words: ['run', 'running']

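You can omit the part of speech by calling lemmatizer.lemmatize("running"), in which case NLTK treats the word as a noun by default. As the code above shows, though, the part of speech changes the result: "running" becomes "run" when treated as a verb but stays "running" when treated as a noun. That is why it is best to perform part-of-speech tagging beforehand.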
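As a preview of how tagging and lemmatization fit together, the sketch below maps Penn Treebank-style tags (the kind a tagger such as nltk.pos_tag typically produces) to the single-letter part-of-speech codes the WordNet Lemmatizer accepts. The helper names penn_to_wordnet and lemmatize_tagged, and the hand-tagged example sentence, are assumptions made for illustration; they are not part of NLTK.

```python
from typing import Callable, Iterable, List, Tuple


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag


def lemmatize_tagged(
    tagged_tokens: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize (word, tag) pairs using any callable that takes (word, pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-tagged example (hypothetical tagger output, written by hand here).
tagged = [("children", "NNS"), ("were", "VBD"), ("running", "VBG"), ("quickly", "RB")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

With NLTK and the WordNet corpus installed, the lemmatize method of the lemmatizer created earlier could be passed as the callable, for example lemmatize_tagged(tagged, lemmatizer.lemmatize). Falling back to noun mirrors the lemmatizer's own default.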
