Learn Lemmatization with POS Tagging | Stemming and Lemmatization


The English language is full of words that can serve as multiple parts of speech with different meanings. For example, "running" can be a verb ("He is running.") or a noun ("Running is fun.").

As we have already seen, a lemmatizer can only accurately reduce a word to its base form if it knows the word's part of speech in the given context. POS tagging in turn provides this context, making lemmatization more precise.

Lemmatization with POS Tagging in NLTK

Since we are already familiar with both of these techniques separately, it's time to combine them. There is one important detail to account for, however: the POS tag format returned by pos_tag differs from the format that the WordNet Lemmatizer expects.


NLTK's pos_tag function utilizes the Penn Treebank tag set, which includes a wide range of tags for detailed grammatical categorization. The WordNet Lemmatizer, on the other hand, expects POS tags in a simplified form that aligns with WordNet's own categorization. Specifically, it only differentiates among nouns ('n'), verbs ('v'), adjectives ('a' or 's' for satellite adjectives), and adverbs ('r').

The mapping process converts the detailed Penn Treebank tags to the broader categories recognized by WordNet. For example, both 'VBD' (past tense verb) and 'VBG' (gerund or present participle) from Penn Treebank map to 'v' (verb) when passed to the WordNet Lemmatizer.

Let's write a function for this purpose:

from nltk.corpus import wordnet as wn


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        # Default to noun if no match is found or starts with 'N'
        return wn.NOUN

This function simply checks the first letter of the Penn Treebank tag: if it's 'J', it returns the WordNet tag for adjectives; if 'V', for verbs; if 'R', for adverbs. For all other cases, including when the tag starts with 'N' or doesn't match any specified condition, it defaults to returning the WordNet tag for nouns.

The ADJ, VERB and other constants you see in the code are taken from WordNet. Their values are ADJ = 'a', ADJ_SAT = 's', ADV = 'r', NOUN = 'n', VERB = 'v'.
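To check the mapping without loading WordNet, the sketch below repeats get_wordnet_pos using the literal constant values listed above. This substitution is an assumption made so that the snippet runs on its own; the tutorial's version uses the wn constants. The sample tags are chosen only for illustration.

```python
# Literal values of the WordNet POS constants listed above
# (stand-ins for wn.ADJ, wn.VERB, wn.ADV, wn.NOUN)
ADJ, VERB, ADV, NOUN = 'a', 'v', 'r', 'n'


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return ADJ
    elif treebank_tag.startswith('V'):
        return VERB
    elif treebank_tag.startswith('R'):
        return ADV
    else:
        # Default to noun if no match is found or starts with 'N'
        return NOUN


# A few Penn Treebank tags picked for illustration
sample_tags = ['JJ', 'JJR', 'VBD', 'VBG', 'RB', 'NN', 'NNS', 'DT', 'IN']
mapping = {tag: get_wordnet_pos(tag) for tag in sample_tags}

for tag, wn_pos in mapping.items():
    print(f"{tag:>4} -> {wn_pos}")
```

Note that tags such as 'DT' (determiner) and 'IN' (preposition) have no WordNet counterpart, so they fall through to the noun default.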

With this function, let's perform lemmatization with POS tagging:


from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn
import nltk

# Download the resources used below (NLTK skips any that are already up to date)
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()


# Function to map NLTK's POS tags to the format used by the WordNet lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        # Default to noun if no match is found or starts with 'N'
        return wn.NOUN  


text = "The leaves on the tree were turning a bright red, indicating that fall was leaving its mark."
text = text.lower()
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
# Lemmatize each token with its POS tag
lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tagged_tokens]
print("Original text:", text)
print("Lemmatized text:", ' '.join(lemmatized_tokens))

As you can see, we first tokenized the lowercased text and performed POS tagging using the pos_tag() function. Next, we used a list comprehension to build a list of lemmatized tokens, calling the lemmatize() method with each token and its correctly formatted tag (obtained from our get_wordnet_pos(tag) function) as arguments. We intentionally did not remove stop words to demonstrate that the code processes all tokens.
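The list comprehension can also be written as an explicit loop, which makes it easier to see what reaches lemmatize(). The sketch below is only an illustration: to_wordnet_pos and lemmatize_tagged are hypothetical helper names, the tags are hand-written rather than real pos_tag() output, and a stand-in function replaces the lemmatizer so the snippet runs without NLTK data.

```python
def to_wordnet_pos(treebank_tag):
    # Compact equivalent of get_wordnet_pos, using the literal WordNet values
    return {'J': 'a', 'V': 'v', 'R': 'r'}.get(treebank_tag[:1], 'n')


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(token, wordnet_pos) for each (token, treebank_tag) pair."""
    lemmas = []
    for token, tag in tagged_tokens:
        lemmas.append(lemmatize(token, to_wordnet_pos(tag)))
    return lemmas


# Hand-written tags for illustration only; not actual pos_tag() output
tagged = [('leaves', 'NNS'), ('were', 'VBD'), ('turning', 'VBG'), ('bright', 'JJ')]

# Stand-in for lemmatizer.lemmatize that just records the arguments it receives
calls = lemmatize_tagged(tagged, lambda token, pos: f"{token}/{pos}")
print(calls)  # ['leaves/n', 'were/v', 'turning/v', 'bright/a']
```

In real use, the second argument would be the lemmatize method of a WordNetLemmatizer instance, and tagged would come from pos_tag(tokens).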

Section 2. Chapter 7