Tokenization | Text Preprocessing Fundamentals
Introduction to NLP
Tokenization

Before diving into the process of tokenization, we first have to define what tokens are. Tokens are the smaller units, such as sentences, words, or punctuation marks, into which a text is split.

Consequently, tokenization is the process of splitting text into tokens. For example, a paragraph, a text document, or a text corpus consists of components that can be divided into sentences, phrases, and words. In fact, the most popular tokenization methods are sentence tokenization and word tokenization, which are used to break a text document (or corpus) into sentences and each sentence into words.

Sentence Tokenization

Let's start off with sentence tokenization. Luckily for us, nltk provides the sent_tokenize() function in the tokenize module. The primary purpose of this function is to split a given text into a list of sentences.

sent_tokenize() utilizes a pre-trained model (the "Punkt" model), which has been trained on a large corpus of text, to identify the boundaries between sentences. It takes into consideration various cues in the text, such as punctuation marks (e.g., periods, exclamation points, question marks), capitalization, and other linguistic patterns that typically mark the end of one sentence and the beginning of another.

Let's take a look at an example to make things clear:

    # Importing the sent_tokenize() function
    from nltk.tokenize import sent_tokenize
    import nltk

    # Downloading the "Punkt" tokenizer models
    nltk.download('punkt_tab')

    text = "Hello world. This is an example of sentence tokenization. NLTK makes it easy!"

    # Sentence tokenization
    sentences = sent_tokenize(text)
    print(sentences)

As you can see, there is nothing complicated here. Simply pass a string with your text as the argument of sent_tokenize() to obtain a list of sentences. As for nltk.download('punkt_tab'), this command downloads the "Punkt" tokenizer models, which give NLTK the data it needs to perform accurate sentence and word tokenization.
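To see why a trained model is useful here, consider a minimal sketch of a fixed rule built only on punctuation and capitalization. This is not how sent_tokenize() works internally; the helper name naive_sent_split and the second sample text are assumptions made up for illustration, and only the standard re module is used.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when the next
    non-space character is an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

simple = "Hello world. This is an example of sentence tokenization. NLTK makes it easy!"
tricky = "Dr. Smith arrived at 5 p.m. sharp. He was late!"  # illustrative text

simple_sentences = naive_sent_split(simple)
tricky_sentences = naive_sent_split(tricky)

print(simple_sentences)
# ['Hello world.', 'This is an example of sentence tokenization.', 'NLTK makes it easy!']
print(tricky_sentences)
# ['Dr.', 'Smith arrived at 5 p.m. sharp.', 'He was late!']
```

The rule handles the simple text, but it wrongly treats the abbreviation "Dr." as a sentence of its own. Cases like this are the kind of ambiguity a sentence tokenizer has to resolve with more than a single pattern.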

Word Tokenization

There are several common ways to perform word tokenization; in this chapter, we'll discuss only the two most prevalent ones.

The simplest approach is to use the split() method of the string class, which by default treats any run of whitespace (spaces, tabs, and newlines) as a delimiter. You can also pass an arbitrary string as its argument to serve as the delimiter.

Here is an example:

    text = "This is an example of word tokenization."

    # Convert the text to lowercase
    text = text.lower()

    # Word tokenization using split()
    words = text.split()
    print(words)
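The difference between the default behavior and an explicit delimiter can be sketched as follows. The sample strings and variable names are made up for illustration.

```python
text = "one  two\tthree\nfour"  # two spaces, a tab, and a newline

# No argument: any run of whitespace is a single delimiter
default_split = text.split()
print(default_split)  # ['one', 'two', 'three', 'four']

# Explicit " ": each single space is a delimiter, tabs and newlines are not
space_split = text.split(" ")
print(space_split)  # ['one', '', 'two\tthree\nfour']

# Any other string can serve as the delimiter as well
csv_fields = "red,green,,blue".split(",")
print(csv_fields)  # ['red', 'green', '', 'blue']
```

Note that with an explicit delimiter, consecutive delimiters produce empty strings, whereas the default mode never does.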

A more flexible approach, however, is to use the word_tokenize() function in the tokenize module of the nltk library. This function identifies and separates words based on spaces and punctuation marks, effectively breaking sentences down into their constituent words. Like sent_tokenize(), it requires a string as its argument.

Let's compare this approach with using the split() method. The example below uses word_tokenize():

    from nltk import word_tokenize
    import nltk

    nltk.download('punkt_tab')

    text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
    text = text.lower()

    # Word tokenization using word_tokenize()
    words = word_tokenize(text)
    print(words)

Let's now see how the split() method performs with the same text:

    text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
    text = text.lower()

    # Word tokenization using split()
    words = text.split()
    print(words)

In our example, word_tokenize(), unlike split(), identifies punctuation and special characters as separate tokens. It separates the dollar sign from the numeral and recognizes periods as standalone tokens. This finer-grained tokenization is crucial for many NLP tasks, where the precise delineation of words and punctuation can significantly affect the accuracy of the analysis.
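The contrast can be reproduced without NLTK using a rough regular-expression sketch. The pattern below is an assumption chosen for this one sentence, not the algorithm word_tokenize() uses, and it would mishandle other input such as contractions.

```python
import re

text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks".lower()

# Illustrative pattern: decimal numbers, then runs of word characters,
# then any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\w+|[^\w\s]")

regex_tokens = TOKEN_PATTERN.findall(text)
split_tokens = text.split()

print(regex_tokens)
# ['good', 'muffins', 'cost', '$', '3.88', 'in', 'new', 'york', '.',
#  'please', 'buy', 'me', 'two', 'of', 'them', '.', 'thanks']
print(split_tokens)
# ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york.',
#  'please', 'buy', 'me', 'two', 'of', 'them.', 'thanks']
```

With split(), "$3.88", "york." and "them." each remain a single token, so "york." and "york" would be counted as different words in any later analysis.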

Given the sentence "It wasn't me, I swear!", what will be the result of applying the split() method on it?