Lemmatization

Understanding Lemmatization

Definition

Lemmatization is a text normalization technique used in NLP to reduce words to their dictionary form, known as a lemma.

Unlike stemming, which crudely chops off affixes, lemmatization considers the context and converts each word to its dictionary form. For example, 'am', 'are', and 'is' are all lemmatized to 'be'. Because many inflected forms collapse into a single lemma, this can significantly reduce the size of the vocabulary (the number of unique words) in large text corpora, which makes model training more efficient.
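To make the contrast concrete, here is a minimal toy sketch. Both `crude_stem` and `TOY_LEMMAS` are hypothetical names invented for illustration: the first is not a real stemming algorithm, and the second is a tiny hand-written stand-in for a real dictionary. The sketch only shows how a dictionary lookup can shrink the vocabulary more than blind suffix chopping.

```python
# Hypothetical hand-written lemma table standing in for a real dictionary.
TOY_LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cats": "cat", "studies": "study",
}

def crude_stem(word):
    """Chop a few common suffixes without any dictionary check (illustration only)."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if the toy table knows it, else the word itself."""
    return TOY_LEMMAS.get(word, word)

tokens = ["am", "are", "is", "cats", "cat", "studies", "study"]
stems = [crude_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print("Unique tokens:", len(set(tokens)))  # 7
print("Unique stems: ", len(set(stems)))   # 6 ('studies' becomes the non-word 'stud')
print("Unique lemmas:", len(set(lemmas)))  # 3 ('be', 'cat', 'study')
```

In this toy data, suffix chopping cannot relate 'am', 'are', and 'is' to each other, while the lookup maps all three to 'be'.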

The trade-off is cost: while lemmatization is more accurate than stemming, it is also more computationally expensive and can be slow on large datasets. For even better accuracy, it is recommended to perform morphological analysis and part-of-speech tagging before lemmatization.

Note

Don't worry about part-of-speech tagging for now, as this is the next thing you will learn about.

Lemmatization with NLTK

The WordNet Lemmatizer, provided by the NLTK library, leverages the WordNet corpus to perform lemmatization.

Study More

WordNet is a lexical database for English that goes well beyond a plain corpus. It groups words into synonym sets, or synsets, each of which captures a distinct concept and is accompanied by definitions and usage examples. WordNet also encodes relationships between these synsets, such as hypernyms (broader, more general terms) and hyponyms (narrower, more specific terms), which makes it useful for exploring and disambiguating word meanings.

When you use the WordNet Lemmatizer, it looks up the target word in the WordNet database to find the most appropriate lemma of the word.
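The lookup idea can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's actual implementation: the `LEXICON`, `EXCEPTIONS`, and `RULES` tables and the `toy_lookup_lemma` function are all hypothetical and contain only a handful of entries. The sketch assumes a three-step strategy: check a list of irregular forms, check whether the word is already a dictionary entry, then try suffix-replacement rules and accept the first candidate found in the dictionary.

```python
# Hypothetical miniature dictionary, keyed by part of speech ('n' noun, 'v' verb).
LEXICON = {
    "n": {"running", "run", "dog", "church"},
    "v": {"run", "be", "study"},
}

# Hypothetical irregular forms that no suffix rule could recover.
EXCEPTIONS = {
    "n": {"geese": "goose"},
    "v": {"am": "be", "is": "be", "are": "be", "ran": "run", "running": "run"},
}

# Hypothetical (suffix, replacement) rules tried in order.
RULES = {
    "n": [("s", ""), ("es", ""), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("ed", ""), ("ing", "")],
}

def toy_lookup_lemma(word, pos="n"):
    """Return a lemma for `word` using the toy tables above (illustration only)."""
    # Step 1: irregular forms are resolved directly.
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    # Step 2: a word that is already a dictionary entry is its own lemma.
    if word in LEXICON.get(pos, set()):
        return word
    # Step 3: try each suffix rule and keep the first candidate in the dictionary.
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    # Unknown words are returned unchanged.
    return word

print(toy_lookup_lemma("churches", "n"))  # church
print(toy_lookup_lemma("studies", "v"))   # study
print(toy_lookup_lemma("running", "v"))   # run
print(toy_lookup_lemma("running", "n"))   # running
```

Because the tables are separate for each part of speech, the same surface form can resolve to different lemmas, as the last two calls show.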

As mentioned above, the same word can play different roles in different contexts (e.g., "running" as a verb vs. "running" as a noun), so the lemmatizer lets you specify the part of speech (e.g., verb, noun, adjective). This helps it select the correct lemma based on the word's role in a sentence.

```python
from nltk.stem import WordNetLemmatizer
import nltk

# Download the WordNet corpus
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Parts of speech, 'v' for verb and 'n' for noun
parts_of_speech = ['v', 'n']

# Lemmatize words
lemmatized_words = [lemmatizer.lemmatize("running", pos) for pos in parts_of_speech]
print("Lemmatized words:", lemmatized_words)
```

You could omit the part of speech by calling lemmatizer.lemmatize("running"), in which case NLTK treats the word as a noun by default. However, as you can see, different parts of speech produce different results: 'run' for the verb and 'running' for the noun. That's why it is best to perform part-of-speech tagging beforehand.
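As a preview of how tagging output can feed the lemmatizer, here is a minimal sketch. It assumes the tagger emits Penn Treebank tags (such as 'VBG' or 'NNS') and that the lemmatizer expects WordNet's single-letter codes ('n', 'v', 'a', 'r'). The helper name `penn_to_wordnet_pos` and the hand-written `tagged` list are hypothetical; no tagger is actually run here.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (illustration only)."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("N"):   # NN, NNS, NNP, NNPS: nouns
        return "n"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return default            # fall back to noun for every other tag

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("The", "DT"), ("children", "NNS"), ("were", "VBD"),
          ("running", "VBG"), ("quickly", "RB")]

for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can then be passed as the second argument to lemmatizer.lemmatize, so that "running" tagged as 'VBG' is lemmatized as a verb rather than as the default noun.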


What's the primary benefit of using lemmatization compared to stemming?
