Tokenization Using Regular Expressions

Why Regular Expressions?

The word_tokenize() and sent_tokenize() functions from the NLTK library offer convenient ways to tokenize text into words and sentences, but they do not always suit specific text processing needs. Let's explore an alternative approach: tokenization using regular expressions (regex).

Definition

A regular expression (regex) is a sequence of characters that defines a search pattern. Regular expressions can be used for various text processing tasks, including searching, replacing, and splitting text based on specific patterns.

In the context of tokenization, regex allows for defining custom patterns that can identify tokens, offering more control over the tokenization process than pre-built functions.
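The three tasks mentioned above (searching, replacing, and splitting) can be sketched with Python's built-in re module. The sample sentence and the `<NUM>` placeholder are made up for illustration only.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "Order 66 shipped on 2024-05-01, order 67 is pending."

# Searching: find the first run of one or more digits
first_number = re.search(r"\d+", text).group()

# Replacing: mask every run of digits with a placeholder
masked = re.sub(r"\d+", "<NUM>", text)

# Splitting: break the text on a comma followed by optional whitespace
parts = re.split(r",\s*", text)

print(first_number)
print(masked)
print(parts)
```

Tokenization builds on the same idea: a pattern describes either the tokens themselves or the gaps between them.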

Using regexp_tokenize()

Luckily, the NLTK library includes the regexp_tokenize() function in the tokenize module, which tokenizes a string into substrings using a regular expression. This function is particularly useful when you need to tokenize text based on patterns that are not well-handled by the standard tokenizers.

The two most important parameters of regexp_tokenize() are its first two: text (the string to be tokenized) and pattern (the regular expression pattern).


from nltk.tokenize import regexp_tokenize
text = "Let's try, regex tokenization. Does it work? Yes, it does!"
text = text.lower()
# Tokenize a sentence
tokens = regexp_tokenize(text, r'\w+')
print(tokens)

As you can see, the process is similar to using the word_tokenize() function; however, the results depend on the pattern. In our example, the pattern r'\w+' matches sequences of one or more word characters: letters, digits, and underscores.

This results in a list of words without punctuation marks, whereas word_tokenize() typically includes punctuation as separate tokens. Note that the apostrophe is not a word character, so "let's" is split into two tokens, 'let' and 's'.
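To see how much the output depends on the pattern, here is a small sketch that tries three patterns on the same text. It uses re.findall from the standard library as a stand-in, on the assumption that regexp_tokenize() returns all non-overlapping matches of the pattern in the same way when it is called with its default settings. The second and third patterns are illustrative choices, not part of NLTK.

```python
import re

text = "let's try, regex tokenization. does it work? yes, it does!"

# Word characters only: the apostrophe splits "let's" in two
words_only = re.findall(r"\w+", text)

# Allow an optional apostrophe suffix so contractions stay whole
# (a non-capturing group is used so that whole matches are returned)
with_contractions = re.findall(r"\w+(?:'\w+)?", text)

# Additionally keep each punctuation mark as its own token
with_punctuation = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(words_only)
print(with_contractions)
print(with_punctuation)
```

Any of these patterns should be usable as the pattern argument of regexp_tokenize(); the non-capturing form (?:...) is the safer choice because capturing groups change what findall-style matching returns.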

Using RegexpTokenizer

An alternative approach to custom tokenization is the RegexpTokenizer class from NLTK's tokenize module. First, create an instance of RegexpTokenizer with your desired regular expression pattern as an argument. Once an instance with the specified pattern is created, you can pass your text as an argument to its tokenize() method.


from nltk.tokenize import RegexpTokenizer
# Define a tokenizer with a regular expression
tokenizer = RegexpTokenizer(r'\w+')
text = "Let's try, regex tokenization. Does it work? Yes, it does!"
text = text.lower()
# Tokenize a sentence
tokens = tokenizer.tokenize(text)
print(tokens)

This approach yields the same result and is preferable when you need to apply one tokenizer to several texts: you create the tokenizer once and then reuse it on different inputs without redefining the pattern each time.

Let's proceed with another example. Suppose we want only numbers as tokens. The pattern r'\d+' matches one or more consecutive digits, as in the example below:


from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\d+')
text = "Give my $100 back right now or $20 each month"
# Extract every run of digits (lowercasing is not needed for digit matching)
tokens = tokenizer.tokenize(text)
print(tokens)

Overall, regexp tokenization allows for highly customized tokenization, making it well suited to complex patterns and specific tokenization rules that standard methods like word_tokenize() do not handle easily. Extracting only the numbers, as in the last example, is not something word_tokenize() can do on its own, since it returns every word and punctuation mark as a token.
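As one more illustration of a custom rule, the sketch below extends the digits example so that the dollar sign stays attached to each amount. The money pattern is an assumption chosen for this example, and the standard-library re.findall is used here only to preview what the pattern matches.

```python
import re

text = "Give my $100 back right now or $20 each month"

# Digits only, as in the example above
digits = re.findall(r"\d+", text)

# A literal dollar sign, digits, and an optional decimal part
amounts = re.findall(r"\$\d+(?:\.\d+)?", text)

print(digits)
print(amounts)
```

The dollar sign must be escaped as \$ because an unescaped $ is a regex anchor that matches the end of the string.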
