Part of Speech Tagging

Understanding POS Tagging

As mentioned in the previous chapter, part of speech (POS) tagging is beneficial for lemmatization, which is its primary role in text preprocessing, so let's discuss this process in more detail here.
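
To see why the POS tag matters for lemmatization, here is a minimal sketch using NLTK's WordNetLemmatizer. Note that it expects WordNet-style tags such as 'v' for verb rather than Penn Treebank tags like "VB", so in practice the tagger's output is mapped to these tags first:

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')  # lexical database used by the lemmatizer

lemmatizer = WordNetLemmatizer()
# With no POS hint the default is 'n' (noun), so the verb form is left unchanged
print(lemmatizer.lemmatize("running"))           # running
# Telling the lemmatizer the word is a verb gives the expected lemma
print(lemmatizer.lemmatize("running", pos="v"))  # run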

Using full lexical terms can become quite cumbersome, especially with a large corpus, so short representations known as tags are used instead, for example, "VB" instead of verb. In practice, different POS taggers may use slightly different tag sets, including more fine-grained tags such as "VBD" for verbs in the past tense.
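
To look up what a particular tag means, NLTK provides a small helper for the Penn Treebank tag set. A minimal sketch, assuming the 'tagsets' resource is available for download (the exact resource name may vary between NLTK versions):

import nltk

nltk.download('tagsets')  # descriptions of the Penn Treebank tags

# Print the definition and example usages of a specific tag
nltk.help.upenn_tagset('VBD')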

POS Tagging with NLTK

To perform part of speech tagging with NLTK, import the pos_tag() function directly from nltk and apply it to the list of tokens (strings) by passing the list as the argument.

Here is an example:

from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk

nltk.download('punkt_tab')
# Download the model needed for NLTK's POS tagging
nltk.download('averaged_perceptron_tagger_eng')

text = "One of the key NLP tasks is part of speech tagging"
text = text.lower()
tokens = word_tokenize(text)

# Perform POS tagging
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

As you can see, this function returns a list of tuples, each containing a token and its tag. The line nltk.download('averaged_perceptron_tagger_eng') downloads the data and trained model required by the PerceptronTagger, which is the default POS tagger used by NLTK. The model is based on the averaged perceptron algorithm, enabling POS tagging on text data.
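
Because each tuple pairs a token with its tag, the output is easy to post-process. For instance, an illustrative way to keep only the nouns (Penn Treebank noun tags start with "NN"):

# Assuming tagged_tokens from the example above
nouns = [token for token, tag in tagged_tokens if tag.startswith('NN')]
print(nouns)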

For better visual representation, we could convert the result to a pandas DataFrame:

from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk
import pandas as pd

nltk.download('punkt_tab')
# Download the model needed for NLTK's POS tagging
nltk.download('averaged_perceptron_tagger_eng')

text = "One of the key NLP tasks is part of speech tagging"
text = text.lower()
tokens = word_tokenize(text)

# Perform POS tagging
tagged_tokens = pos_tag(tokens)

# Convert to DataFrame
print(pd.DataFrame(tagged_tokens, columns=['Token', 'POS tag']).T)

Alternatively, we could use pd.DataFrame(tagged_tokens, columns=['Token', 'POS tag']) without transposing the DataFrame, so that each row represents a token-tag pair.
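
Once the result is in a DataFrame, ordinary pandas operations apply. For example, an illustrative way to count how many tokens received each tag:

df = pd.DataFrame(tagged_tokens, columns=['Token', 'POS tag'])
# Count the occurrences of each POS tag
print(df['POS tag'].value_counts())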

Task

Your task is to perform part of speech tagging on the tokens of the text string and convert the resulting list of tuples to a DataFrame.

  1. Import the correct function for POS tagging.
  2. Download the model needed for NLTK's POS tagging.
  3. Perform POS tagging.
  4. Convert the result to a DataFrame (do NOT transpose it).
