Introduction to NLP
Tokenization
Before diving into the process of tokenization, we first have to define what tokens are. Tokens are the smaller, self-contained components that a text can be broken into, such as sentences, phrases, or words.
Consequently, tokenization is the process of splitting text into tokens. For example, a paragraph of text, a text document, or a text corpus consists of several components that can be divided into sentences, phrases, and words. In fact, the most popular tokenization methods are sentence and word tokenization, which are used to break a text document (or corpus) into sentences and each sentence into words.
Sentence Tokenization
Let's start off with sentence tokenization. Luckily for us, nltk provides the sent_tokenize() function in the tokenize module. The primary purpose of this function is to split a given text into a list of sentences.
sent_tokenize() uses a pre-trained model, one that has been trained on a large corpus of text, to identify the boundaries between sentences. It takes into account various cues in the text, such as punctuation marks (e.g., periods, exclamation points, question marks), capitalization, and other linguistic patterns that typically mark the end of one sentence and the beginning of another.
Let's take a look at an example to make things clear:
    # Importing the sent_tokenize() function
    from nltk.tokenize import sent_tokenize
    import nltk

    # Downloading the "Punkt" tokenizer models
    nltk.download('punkt_tab')

    text = "Hello world. This is an example of sentence tokenization. NLTK makes it easy!"

    # Sentence tokenization
    sentences = sent_tokenize(text)
    print(sentences)
As you can see, there is nothing complicated here. Simply pass a string with your text as the argument of sent_tokenize() to obtain a list of sentences. As for nltk.download('punkt_tab'), this command downloads the "Punkt" tokenizer models, which gives NLTK the data it needs to perform accurate sentence and word tokenization.
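To see why a trained model is useful here, consider a minimal sketch of a naive rule-based splitter that relies on a single cue: whitespace following a period, exclamation point, or question mark. The sample text and the regular expression are assumptions made for illustration; this is not how sent_tokenize() works internally.

```python
import re

# Illustrative text (not from the lesson): contains an abbreviation and a decimal number
text = "Dr. Smith paid $3.88 for muffins. They were great!"

# Naive rule: split on whitespace that directly follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
# ['Dr.', 'Smith paid $3.88 for muffins.', 'They were great!']
```

The naive rule wrongly treats the abbreviation "Dr." as a sentence of its own. Cases like this are what a trained sentence tokenizer is designed to handle by weighing several cues together.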
Word Tokenization
There are several common ways to perform word tokenization; however, in this chapter, we'll discuss only the two most prevalent ones.
The simplest method is to use the split() method of the string class, which by default treats any run of whitespace (spaces, tabs, and newlines) as the delimiter. However, you can also pass an arbitrary string as its argument to serve as the delimiter.
Here is an example:
    text = "This is an example of word tokenization."

    # Convert the text to lowercase
    text = text.lower()

    # Word tokenization using split()
    words = text.split()
    print(words)
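The custom delimiter mentioned above can be sketched as follows. The sample strings are made up for illustration. Note that passing an explicit delimiter changes how consecutive delimiters are treated, which is easy to overlook.

```python
# Illustrative strings (not from the lesson)
csv_line = "red,green,blue"
spaced = "one  two   three"  # two spaces, then three spaces

# Splitting on a custom delimiter
by_comma = csv_line.split(",")

# Default behavior: any run of whitespace counts as one delimiter
default_split = spaced.split()

# Explicit single-space delimiter: every space splits, producing empty strings
single_space_split = spaced.split(" ")

print(by_comma)            # ['red', 'green', 'blue']
print(default_split)       # ['one', 'two', 'three']
print(single_space_split)  # ['one', '', 'two', '', '', 'three']
```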
A more flexible approach, however, is to use the word_tokenize() function in the tokenize module of the nltk library. This function identifies and separates words based on spaces and punctuation marks, effectively breaking down sentences into their constituent words. Like sent_tokenize(), this function requires a string as its argument.
Let's compare this approach with using the split() method. The example below uses word_tokenize():
    from nltk import word_tokenize
    import nltk

    nltk.download('punkt_tab')

    text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
    text = text.lower()

    # Word tokenization using word_tokenize()
    words = word_tokenize(text)
    print(words)
Let's now see how the split() method performs with the same text:
    text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
    text = text.lower()

    # Word tokenization using split()
    words = text.split()
    print(words)
In our example, word_tokenize(), unlike split(), identifies punctuation and special characters as separate tokens. It separates the dollar sign from the number and treats sentence-final periods as standalone tokens, while keeping the decimal point inside 3.88. With split(), by contrast, tokens such as "$3.88" and "york." keep their punctuation attached. This finer-grained tokenization matters for many NLP tasks, where the precise separation of words and punctuation can significantly affect the accuracy of the analysis.
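To make the difference concrete without any downloads, here is a rough sketch that approximates this kind of punctuation-aware splitting with a regular expression from the standard library. The pattern is an assumption chosen for this particular sentence; it is not the rule set word_tokenize() uses, and it will behave differently on contractions, abbreviations, and other harder cases.

```python
import re

text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks"
text = text.lower()

# Rough pattern (an illustrative assumption, not NLTK's rules):
#   \d+(?:\.\d+)?  a number, optionally with a decimal part
#   \w+            a run of letters, digits, or underscores
#   [^\w\s]        any single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

regex_tokens = TOKEN_PATTERN.findall(text)
split_tokens = text.split()

print(regex_tokens)
print(split_tokens)
```

Here regex_tokens contains "$", "3.88", and "." as separate items, whereas split_tokens contains "$3.88", "york.", and "them." with the punctuation still attached.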