Introduction to NLP
Lemmatization
Understanding Lemmatization
Lemmatization is a text normalization technique used in NLP to reduce words to their dictionary form, known as a lemma.
Unlike stemming, which crudely chops off affixes, lemmatization considers the word's context and maps it to a valid dictionary form. For example, 'am', 'are', and 'is' are all lemmatized to 'be'. This can significantly reduce the vocabulary size (the number of unique words) in a large text corpus, making model training more efficient.
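To see the difference in practice, here is a small sketch comparing NLTK's PorterStemmer with the WordNetLemmatizer, which is introduced in more detail below; the expected outputs assume the standard WordNet data bundled with NLTK.

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

# The lemmatizer needs the WordNet corpus
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming simply chops off affixes and may produce non-words
print(stemmer.stem("studies"))   # 'studi'
print(stemmer.stem("was"))       # 'wa'

# Lemmatization maps words to real dictionary forms
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("was", pos='v'))     # 'be'
print(lemmatizer.lemmatize("better", pos='a'))  # 'good'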
On the other hand, while lemmatization is more accurate than stemming, it is also more computationally expensive and can be time-consuming on large datasets. For even better accuracy, it is recommended to perform morphological analysis and part-of-speech tagging before lemmatization.
Don't worry about part-of-speech tagging for now, as this is the next thing you will learn about.
Lemmatization with NLTK
The WordNet Lemmatizer, provided by the NLTK library, leverages the WordNet corpus to perform lemmatization.
WordNet is a semantically rich lexical database for English that goes far beyond a simple corpus. It groups words into synonym sets, or synsets, each of which captures a distinct concept and is accompanied by definitions and usage examples. Moreover, WordNet encodes meaningful relationships between these synsets — such as hypernyms (broader, more general terms) and hyponyms (narrower, more specific terms) — offering a powerful framework for exploring and disambiguating word meanings.
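To get a feel for what this database contains, here is a short sketch that queries WordNet directly through nltk.corpus.wordnet; the exact synsets and definitions printed depend on the WordNet version bundled with NLTK.

import nltk
from nltk.corpus import wordnet

# Download the WordNet database
nltk.download('wordnet')

# Each synset groups words that express one concept and carries a definition
for synset in wordnet.synsets("run", pos=wordnet.VERB)[:3]:
    print(synset.name(), "-", synset.definition())

# Synsets are linked by relations such as hypernyms (broader) and hyponyms (narrower)
dog = wordnet.synset('dog.n.01')
print("Hypernyms:", [s.name() for s in dog.hypernyms()])
print("Hyponyms:", [s.name() for s in dog.hyponyms()[:3]])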
When you use the WordNet Lemmatizer, it looks up the target word in the WordNet database to find its most appropriate lemma.
As mentioned above, because words can have different meanings in different contexts (e.g., "running" as a verb vs. "running" as a noun), the lemmatizer may require you to specify the part of speech (e.g., verb, noun, adjective). This helps it select the correct lemma based on the word's role in a sentence.
from nltk.stem import WordNetLemmatizer
import nltk

# Download the WordNet corpus
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Parts of speech, 'v' for verb and 'n' for noun
parts_of_speech = ['v', 'n']

# Lemmatize "running" once for each part of speech
lemmatized_words = [lemmatizer.lemmatize("running", pos) for pos in parts_of_speech]

print("Lemmatized words:", lemmatized_words)
You could omit the part of speech by calling lemmatizer.lemmatize("running"), but as you can see, different parts of speech produce different results. That's why it is best to perform part-of-speech tagging beforehand.
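As a quick illustration, here is a minimal sketch of that default behavior (NLTK's WordNetLemmatizer treats the word as a noun when no part of speech is passed):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With no part of speech, lemmatize() defaults to treating the word as a noun
print(lemmatizer.lemmatize("running"))           # 'running'

# Specifying the verb part of speech gives the verb lemma
print(lemmatizer.lemmatize("running", pos='v'))  # 'run'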