Tokenization Using Regular Expressions | Text Preprocessing Fundamentals

Tokenization Using Regular Expressions

Why Regular Expressions?

The word_tokenize() and sent_tokenize() functions from the NLTK library offer convenient ways to split text into words and sentences, but they do not suit every text processing need. Let's explore an alternative approach: tokenization using regular expressions (regex).

In the context of tokenization, regex lets you define custom patterns that identify tokens, which gives you more control over the tokenization process than the pre-built functions do.
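To make "custom patterns" concrete, here is a minimal sketch that uses Python's built-in re module rather than NLTK. The sample sentence and the pattern are illustrative assumptions, not part of the lesson: the pattern treats either a run of word characters or a single punctuation character as one token.

```python
import re

# Hypothetical sample sentence, chosen only for illustration.
sample = "Does it work? Yes, it does!"

# Two alternatives: a run of word characters, or a single character
# that is neither a word character nor whitespace (i.e. punctuation).
pattern = r"\w+|[^\w\s]"

custom_tokens = re.findall(pattern, sample)
print(custom_tokens)
# ['Does', 'it', 'work', '?', 'Yes', ',', 'it', 'does', '!']
```

Changing only the pattern changes what counts as a token, which is the kind of control the pre-built tokenizers do not expose.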

Using regexp_tokenize()

Luckily, the NLTK library includes the regexp_tokenize() function in its tokenize module, which splits a string into substrings using a regular expression. This function is particularly useful when you need to tokenize text based on patterns that the standard tokenizers do not handle well.

The most important parameters of regexp_tokenize() are the first two: text (the string to be tokenized) and pattern (the regular expression pattern).

Let's take a look at an example:

    from nltk.tokenize import regexp_tokenize

    text = "Let's try, regex tokenization. Does it work? Yes, it does!"
    text = text.lower()
    # Tokenize a sentence
    tokens = regexp_tokenize(text, r'\w+')
    print(tokens)

As you can see, the call looks much like word_tokenize(), but the result depends on the pattern. In our example, the pattern r'\w+' matches runs of one or more word characters: letters, digits, and the underscore.

This produces a list of words without punctuation marks, whereas word_tokenize() typically returns punctuation as separate tokens. Because the apostrophe is not a word character, "let's" is split into two tokens, so the output is ['let', 's', 'try', 'regex', 'tokenization', 'does', 'it', 'work', 'yes', 'it', 'does'].
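If contractions should stay in one piece, the pattern can be extended. The sketch below is one possible variation, not the lesson's own pattern: it allows an optional apostrophe followed by more word characters after each word. The group is written as non-capturing, (?:...), so that each match is returned as a whole token.

```python
from nltk.tokenize import regexp_tokenize

text = "Let's try, regex tokenization. Does it work? Yes, it does!"
text = text.lower()

# A run of word characters, optionally followed by an apostrophe
# and another run of word characters (e.g. "let's").
contraction_pattern = r"\w+(?:'\w+)?"

contraction_tokens = regexp_tokenize(text, contraction_pattern)
print(contraction_tokens)
# ["let's", 'try', 'regex', 'tokenization', 'does', 'it', 'work', 'yes', 'it', 'does']
```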

Using RegexpTokenizer

An alternative approach to custom tokenization is the RegexpTokenizer class from NLTK's tokenize module. To begin, create an instance of RegexpTokenizer, passing your regular expression pattern as an argument; this pattern defines how the text will be tokenized.

Unlike with regexp_tokenize(), you do not supply the text when you create the RegexpTokenizer instance. Instead, once the instance exists, you call its tokenize() method and pass the text you wish to tokenize as the argument.

Here is an example:

    from nltk.tokenize import RegexpTokenizer

    # Define a tokenizer with a regular expression
    tokenizer = RegexpTokenizer(r'\w+')
    text = "Let's try, regex tokenization. Does it work? Yes, it does!"
    text = text.lower()
    # Tokenize a sentence
    tokens = tokenizer.tokenize(text)
    print(tokens)

This approach yields the same result as regexp_tokenize() with the same pattern. It is the better choice when you need one tokenizer for several texts, because you create the tokenizer once and then apply it to each input without repeating the pattern.

Let's proceed with another example. Suppose we want only numbers to be our tokens. The pattern r'\d+' matches one or more consecutive digits, as in the example below:
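The reuse described above might look like the following sketch. The two input strings and the variable names are hypothetical; only RegexpTokenizer and its tokenize() method come from NLTK.

```python
from nltk.tokenize import RegexpTokenizer

# Create the tokenizer once, with the pattern fixed at construction time.
word_tokenizer = RegexpTokenizer(r"\w+")

# Hypothetical inputs, used only to show reuse across several texts.
documents = [
    "First text, short.",
    "Second one: also short!",
]

# Apply the same tokenizer to every text without repeating the pattern.
tokenized = [word_tokenizer.tokenize(doc.lower()) for doc in documents]
for doc_tokens in tokenized:
    print(doc_tokens)
# ['first', 'text', 'short']
# ['second', 'one', 'also', 'short']
```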

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\d+')
    text = "Give my 100$ back right now or 20 each month"
    text = text.lower()
    # Tokenize a sentence
    tokens = tokenizer.tokenize(text)
    print(tokens)

Overall, regexp tokenization allows for highly customized tokenization, making it well suited to complex patterns and specific tokenization rules that standard methods like word_tokenize() do not easily handle. The digit example illustrates this: the tokenizer returns only ['100', '20'], whereas word_tokenize() would return the other words in the sentence as well, leaving you to filter out the numbers yourself.
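As a further illustration of a rule that r'\d+' alone would not cover, the sketch below keeps decimal numbers together. Both the pattern and the sample sentence are assumptions made for this example; with plain r'\d+', "19.99" would be split into '19' and '99'.

```python
from nltk.tokenize import RegexpTokenizer

# One or more digits, optionally followed by a dot and more digits.
# The optional part is a non-capturing group so whole matches are returned.
number_tokenizer = RegexpTokenizer(r"\d+(?:\.\d+)?")

# Hypothetical sample sentence.
price_text = "Pay 19.99 now or 5 per month for 12 months"

numbers = number_tokenizer.tokenize(price_text)
print(numbers)
# ['19.99', '5', '12']
```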
