Part of Speech Tagging

Understanding POS Tagging

We have already mentioned that part-of-speech tagging helps lemmatization, and this is its primary role in text preprocessing. Let's look at the process in more detail.

Definition

Part of speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech (e.g., noun or verb), based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

Using full part-of-speech names (e.g., "verb" or "noun") quickly becomes cumbersome, especially with a large corpus. That's why short codes, known as tags, are used instead: for example, "VB" instead of "verb". In practice, different POS taggers may use slightly different tag sets, and many include more detailed tags such as "VBD" for past-tense verbs.
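To make the idea of tags concrete, here is a minimal sketch of a tag lookup. It assumes the Penn Treebank tag set, which NLTK's default English tagger uses. The dictionary is only a small hand-picked subset, and the names PENN_TAGS, describe_tag, and coarse_category are hypothetical helpers introduced for illustration.

```python
# A small, hand-picked subset of Penn Treebank tags (not the full tag set).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "CD": "cardinal number",
}


def describe_tag(tag):
    """Return a readable description, or a fallback for tags outside the subset."""
    return PENN_TAGS.get(tag, "unknown (not in this subset)")


def coarse_category(tag):
    """Collapse a detailed tag to a coarse class by its prefix, e.g. VBD -> verb."""
    prefixes = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
    for prefix, category in prefixes.items():
        if tag.startswith(prefix):
            return category
    return "other"


print(describe_tag("VBD"))
print(coarse_category("VBD"))
```

The prefix check shows why detailed tags are convenient: "VB", "VBD", and "VBG" all share the "VB" prefix, so they can be grouped as verbs when the finer distinction is not needed.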

POS Tagging with NLTK

To perform part-of-speech tagging with NLTK, import the pos_tag() function directly from nltk and pass it the list of tokens (strings) as its argument.

    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
    import nltk

    nltk.download('punkt_tab')
    # Download the model needed for NLTK's POS tagging
    nltk.download('averaged_perceptron_tagger_eng')

    text = "One of the key NLP tasks is part of speech tagging"
    text = text.lower()
    tokens = word_tokenize(text)
    # Perform POS tagging
    tagged_tokens = pos_tag(tokens)
    print(tagged_tokens)

This function returns a list of tuples, each containing a token and its tag. The line nltk.download('averaged_perceptron_tagger_eng') downloads the model required by PerceptronTagger, which is the default POS tagger used by NLTK.

Study More

This tagger is based on the averaged perceptron model, a supervised learning algorithm effective for large-scale text processing, including POS tagging. The PerceptronTagger is chosen for its balance of speed and accuracy, making it suitable for a wide range of NLP tasks that require POS tagging. It learns weights for features based on the training data it's given, and it uses these weights to predict the POS tags in unseen text.
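The learning rule described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not NLTK's implementation: the features ("word=..." and "prev=...") and the four training examples are hand-written, the function names are hypothetical, and the averaging is done naively by summing the weights after every example. NLTK's real tagger uses a much richer feature set and a more efficient averaging scheme.

```python
from collections import defaultdict


def predict(weights, features, tags):
    """Score each tag by summing the weights of the active features."""
    scores = {tag: 0.0 for tag in tags}
    for feature in features:
        for tag, weight in weights.get(feature, {}).items():
            scores[tag] += weight
    # On a tie, the tag that appears first in `tags` wins.
    return max(tags, key=lambda tag: scores[tag])


def train_averaged_perceptron(examples, tags, epochs=5):
    """Train on (features, gold_tag) pairs and return the averaged weights."""
    weights = defaultdict(lambda: defaultdict(float))  # current weights
    totals = defaultdict(lambda: defaultdict(float))   # sum of weights over all steps
    steps = 0
    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(weights, features, tags)
            if guess != gold:
                # Reward the correct tag and penalize the wrong guess.
                for feature in features:
                    weights[feature][gold] += 1.0
                    weights[feature][guess] -= 1.0
            # Naive averaging: accumulate the current weights after each example.
            for feature, per_tag in weights.items():
                for tag, weight in per_tag.items():
                    totals[feature][tag] += weight
            steps += 1
    return {
        feature: {tag: total / steps for tag, total in per_tag.items()}
        for feature, per_tag in totals.items()
    }


TAGS = ["NN", "VB"]
# Hand-written toy data: the previous word disambiguates noun vs. verb.
EXAMPLES = [
    (["word=run", "prev=to"], "VB"),
    (["word=run", "prev=the"], "NN"),
    (["word=walk", "prev=to"], "VB"),
    (["word=walk", "prev=a"], "NN"),
]

model = train_averaged_perceptron(EXAMPLES, TAGS)
print(predict(model, ["word=jump", "prev=to"], TAGS))   # unseen word after "to"
print(predict(model, ["word=run", "prev=the"], TAGS))
```

Averaging the weights over all training steps, rather than keeping only the final values, dampens the effect of late updates and tends to generalize better to unseen text.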

For a more readable display, we can convert the result to a pandas DataFrame:

    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
    import nltk
    import pandas as pd

    nltk.download('punkt_tab')
    # Download the model needed for NLTK's POS tagging
    nltk.download('averaged_perceptron_tagger_eng')

    text = "One of the key NLP tasks is part of speech tagging"
    text = text.lower()
    tokens = word_tokenize(text)
    # Perform POS tagging
    tagged_tokens = pos_tag(tokens)
    # Convert to DataFrame
    print(pd.DataFrame(tagged_tokens, columns=['Token', 'POS tag']).T)

Alternatively, we could use pd.DataFrame(tagged_tokens, columns=['Token', 'POS tag']) without transposing the DataFrame, so that each row represents a token-tag pair.
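Since the main purpose of POS tagging here is to support lemmatization, the following sketch shows one common way to bridge the two steps. It assumes Penn Treebank tags as input and WordNet's single-letter part-of-speech codes ('n', 'v', 'a', 'r') as output. The function name penn_to_wordnet is hypothetical, and the tagged tokens are a hand-written sample rather than real tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Fall back to noun, which is also the lemmatizer's default POS.
    return "n"


# Hand-written (token, tag) pairs in the same shape that pos_tag() returns.
sample_tagged_tokens = [
    ("the", "DT"),
    ("cats", "NNS"),
    ("were", "VBD"),
    ("running", "VBG"),
    ("quickly", "RB"),
]

wordnet_ready = [(token, penn_to_wordnet(tag)) for token, tag in sample_tagged_tokens]
print(wordnet_ready)
```

Each pair can then be passed to NLTK's WordNetLemmatizer as lemmatize(token, pos=code), which requires the 'wordnet' resource to be downloaded first.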
