Learn Tokenization | Text Preprocessing Fundamentals

Before diving into the process of tokenization, we first need to define what tokens are.

Definition

Tokens are independent, minimal text components that have a specific syntax and semantics.

Consequently, tokenization is the process of splitting text into tokens. For example, a paragraph of text, a text document, or a text corpus consists of several components that can be divided into sentences, phrases, and words. In fact, the most popular tokenization methods are sentence and word tokenization, which are used to break a text document (or corpus) into sentences and each sentence into words.

Definition

A text corpus (plural: corpora) is a large and structured set of texts used in linguistic and computational linguistics research. Essentially, it's a comprehensive collection of written or spoken material that serves as a representative sample of a particular language, dialect, or subject area.
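
To make the "document into sentences, each sentence into words" idea concrete, here is a minimal sketch that uses only Python's standard library. The function name naive_tokenize and the sample text are made up for illustration, and the splitting rules are deliberately crude (they are not how nltk works):

```python
import re

def naive_tokenize(document):
    """Split a document into sentences, then each sentence into words."""
    # Crude sentence boundary: whitespace that follows '.', '!' or '?'
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    # Crude word boundary: any run of whitespace
    return [sentence.split() for sentence in sentences]

document = "Tokenization splits text. Sentences come first! Then words."
nested = naive_tokenize(document)
print(nested)
# [['Tokenization', 'splits', 'text.'], ['Sentences', 'come', 'first!'], ['Then', 'words.']]
```

The result is a list of sentences, each of which is a list of word tokens. Note that punctuation stays glued to the neighboring word here, which is one of the limitations the dedicated tokenizers discussed below are meant to address.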

Sentence Tokenization

Let's start off with sentence tokenization. Luckily for us, nltk provides the sent_tokenize() function in the tokenize module. The primary purpose of this function is to split a given text into a list of sentences.

sent_tokenize() relies on a pre-trained model (the Punkt tokenizer), which has been trained on a large corpus of text to identify the boundaries between sentences. It takes into account various cues in the text, such as punctuation marks (e.g., periods, exclamation points, question marks), capitalization, and other linguistic patterns that typically mark the end of one sentence and the beginning of another.


# Importing the sent_tokenize() function
from nltk.tokenize import sent_tokenize
import nltk
# Downloading the "Punkt" tokenizer models
nltk.download('punkt_tab')
text = "Hello world. This is an example of sentence tokenization. NLTK makes it easy!"
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)

As you can see, there is nothing complicated here. You simply pass a string containing your text to sent_tokenize() and get back a list of sentences. The nltk.download('punkt_tab') call downloads the "Punkt" tokenizer models, which ensures that NLTK has the data it needs to perform sentence and word tokenization.

Note

The punctuation marks at the end of each sentence are included in the sentence.
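
To build intuition for the boundary cues mentioned in this section (end punctuation and capitalization), here is a toy splitter written with the standard re module. It is a hypothetical stand-in, not the Punkt model that sent_tokenize() uses, and the name toy_sent_split is made up for illustration:

```python
import re

# Toy rule (an assumption for illustration): a sentence ends at '.', '!' or '?'
# when it is followed by whitespace and then an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def toy_sent_split(text):
    """Split text on the toy boundary rule, keeping the end punctuation."""
    return BOUNDARY.split(text.strip())

print(toy_sent_split("Hello world. This is an example of sentence tokenization. NLTK makes it easy!"))
# ['Hello world.', 'This is an example of sentence tokenization.', 'NLTK makes it easy!']

# The same rule breaks on abbreviations:
print(toy_sent_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

The second call shows why a fixed rule is not enough: the period after an abbreviation looks exactly like a sentence boundary. Handling cases like this is the kind of problem a trained model is meant to solve.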

Word Tokenization

There are several common ways to perform word tokenization; however, we'll discuss only the two most prevalent ones.

The simplest approach is to use the split() method of the str class, which by default splits on any run of whitespace (spaces, tabs, and newlines). However, you can also pass a string as its argument to serve as the delimiter.


text = "This is an example of word tokenization."
# Convert the text to lowercase
text = text.lower()
# Word tokenization using split()
words = text.split()
print(words)
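
The difference between the default behavior and an explicit delimiter is easy to miss, so here is a small sketch using only built-in string methods. The sample strings are made up for illustration:

```python
messy = "alpha  beta\tgamma\ndelta"

# Default: any run of whitespace (spaces, tabs, newlines) is one delimiter
default_tokens = messy.split()
print(default_tokens)   # ['alpha', 'beta', 'gamma', 'delta']

# Explicit single space: every space counts, tabs and newlines do not
space_tokens = messy.split(" ")
print(space_tokens)     # ['alpha', '', 'beta\tgamma\ndelta']

# Arbitrary delimiter string
comma_tokens = "red,green,blue".split(",")
print(comma_tokens)     # ['red', 'green', 'blue']
```

Passing " " explicitly produces an empty string for the doubled space and leaves the tab and newline untouched, so calling split() with no argument is usually the safer choice for free-form text.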

Note

To ensure that tokens like 'This' and 'this' are treated as the same, it is important to convert the string to lowercase before tokenization.
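
The effect of lowercasing can be sketched by counting tokens with collections.Counter from the standard library. The sample sentence is made up for illustration:

```python
from collections import Counter

text = "This is a test. Is this the test?"

# Without lowercasing, 'This' and 'this' are counted as different tokens
raw_counts = Counter(text.split())
print(raw_counts["This"], raw_counts["this"])   # 1 1

# After lowercasing, both occurrences fall under the same token
lower_counts = Counter(text.lower().split())
print(lower_counts["this"])                     # 2
```

Note that 'test.' and 'test?' are still counted as two different tokens, because split() leaves punctuation attached to the words.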

A more flexible approach, however, is to use the word_tokenize() function in the tokenize module of the nltk library. This function identifies and separates words based on spaces and punctuation marks, effectively breaking sentences down into their constituent words. Like sent_tokenize(), this function takes a string as its argument.

Let's compare this approach with using the split() method. The example below uses word_tokenize():


from nltk import word_tokenize
import nltk
nltk.download('punkt_tab')
text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
text = text.lower()
# Word tokenization using word_tokenize()
words = word_tokenize(text)
print(words)

Let's now see how the split() method performs with the same text:


text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
text = text.lower()
# Word tokenization using split()
words = text.split()
print(words)

In our example, word_tokenize(), unlike split(), identifies punctuation and special characters as separate tokens: it separates the dollar sign from the number and treats the periods as standalone tokens. This finer-grained tokenization is crucial for many NLP tasks, where the precise separation of words and punctuation can significantly affect the accuracy of the analysis.
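
To see roughly what separating punctuation from words involves, here is a toy regular-expression tokenizer built with the standard re module. The pattern and the name toy_word_tokenize are assumptions made for illustration; this is not how word_tokenize() is implemented, and it covers far fewer cases:

```python
import re

# Toy pattern (an assumption, not NLTK's actual rules), tried left to right:
#   \d+(?:\.\d+)?  a number with an optional decimal part, e.g. 3.88
#   \w+            a run of letters, digits, or underscores
#   [^\w\s]        any single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def toy_word_tokenize(text):
    """Return words, numbers, and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
text = text.lower()

print(text.split()[3:7])
# ['$3.88', 'in', 'new', 'york.']
print(toy_word_tokenize(text)[3:9])
# ['$', '3.88', 'in', 'new', 'york', '.']
```

With split(), the dollar sign and the period stay attached to their neighbors, while the pattern-based version emits them as tokens of their own and keeps the decimal number 3.88 intact.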

Section 1. Chapter 3
