Introduction to NLP
Tokenization Using Regular Expressions
Why Regular Expressions?
While the word_tokenize() and sent_tokenize() functions from the NLTK library offer convenient ways to tokenize text into words and sentences, they might not always suit specific text processing needs, so let's explore an alternative approach: tokenization using regular expressions (regex).
A regular expression (regex) is a sequence of characters that defines a search pattern. Regular expressions can be used for various text processing tasks, including searching, replacing, and splitting text based on specific patterns.
In the context of tokenization, regex allows for defining custom patterns that can identify tokens, offering more control over the tokenization process than pre-built functions.
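The three tasks mentioned above (searching, replacing, and splitting) can be sketched with Python's built-in re module. The sample sentence and the <NUM> placeholder are made up for illustration and are not part of the lesson.

```python
import re

# Illustrative sample text (an assumption, not from the lesson).
text = "Order 66 shipped; order 7 is pending."

# Searching: find every run of one or more digits.
numbers = re.findall(r"\d+", text)

# Replacing: mask each number with a placeholder.
masked = re.sub(r"\d+", "<NUM>", text)

# Splitting: break the text at semicolons, dropping any spaces that follow.
parts = re.split(r";\s*", text)

print(numbers)  # ['66', '7']
print(masked)   # Order <NUM> shipped; order <NUM> is pending.
print(parts)    # ['Order 66 shipped', 'order 7 is pending.']
```

The same pattern syntax is what the NLTK tokenizers in this chapter accept.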
Using regexp_tokenize()
Luckily, the NLTK library includes the regexp_tokenize() function in the tokenize module, which tokenizes a string into substrings using a regular expression. This function is particularly useful when you need to tokenize text based on patterns that the standard tokenizers do not handle well.
The most important parameters of regexp_tokenize() are its first two: text (the string to be tokenized) and pattern (the regular expression pattern).
from nltk.tokenize import regexp_tokenize

text = "Let's try, regex tokenization. Does it work? Yes, it does!"
text = text.lower()
# Tokenize a sentence
tokens = regexp_tokenize(text, r'\w+')
print(tokens)
As you can see, the process is similar to using the word_tokenize() function; however, the results depend on the pattern. In our example, the pattern '\w+' matches sequences of one or more word characters: letters, digits, and underscores.
This results in a list of words without punctuation marks, which differs from word_tokenize() in that the latter typically includes punctuation as separate tokens. Note that the apostrophe is not a word character, so "let's" is split into the two tokens let and s.
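Because the result depends entirely on the pattern, a small change to the pattern changes the tokens. The sketch below compares '\w+' with a hypothetical alternative pattern that allows an apostrophe-joined suffix; the second pattern is an illustration chosen here, not one prescribed by the lesson.

```python
from nltk.tokenize import regexp_tokenize

text = "let's try, regex tokenization. does it work?"

# \w+ stops at the apostrophe, so "let's" becomes two tokens.
plain = regexp_tokenize(text, r"\w+")

# Hypothetical alternative: a word optionally followed by an apostrophe
# and more word characters. The group is non-capturing, (?:...), so the
# whole match is returned rather than only the group's contents.
with_contractions = regexp_tokenize(text, r"\w+(?:'\w+)?")

print(plain)
# ['let', 's', 'try', 'regex', 'tokenization', 'does', 'it', 'work']
print(with_contractions)
# ["let's", 'try', 'regex', 'tokenization', 'does', 'it', 'work']
```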
Using RegexpTokenizer
An alternative approach to custom tokenization is the RegexpTokenizer class from NLTK's tokenize module. First, create an instance of RegexpTokenizer with your desired regular expression pattern as an argument. Once an instance with the specified pattern is created, you can pass your text as an argument to its tokenize() method.
from nltk.tokenize import RegexpTokenizer

# Define a tokenizer with a regular expression
tokenizer = RegexpTokenizer(r'\w+')
text = "Let's try, regex tokenization. Does it work? Yes, it does!"
text = text.lower()
# Tokenize a sentence
tokens = tokenizer.tokenize(text)
print(tokens)
This approach yields the same result. It is the better choice when you need to apply one tokenizer to many texts, because you create the tokenizer once and then reuse it on each input without repeating the pattern.
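As a minimal sketch of that reuse, the snippet below builds one RegexpTokenizer and applies it to a short list of texts. The reviews list is invented sample data for illustration.

```python
from nltk.tokenize import RegexpTokenizer

# Create the tokenizer once.
tokenizer = RegexpTokenizer(r"\w+")

# Invented sample inputs (an assumption, not from the lesson).
reviews = ["Great course!", "Too fast, too short."]

# Reuse the same tokenizer for every text.
tokenized = [tokenizer.tokenize(review.lower()) for review in reviews]

print(tokenized)
# [['great', 'course'], ['too', 'fast', 'too', 'short']]
```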
Let's proceed with another example. Suppose we want only digits to be our tokens. In that case, the pattern '\d+' matches sequences of one or more digits, as in the example below:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\d+')
text = "Give my $100 back right now or $20 each month"
text = text.lower()
# Tokenize a sentence
tokens = tokenizer.tokenize(text)
print(tokens)
Overall, regexp tokenization allows for highly customized tokenization, making it ideal for handling complex patterns and specific tokenization rules not easily managed by standard methods like word_tokenize(). In our example, where only the numbers should become tokens, word_tokenize() would not be suitable, because it would also return the surrounding words and symbols as tokens.
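To illustrate how far such customization can go, here is a sketch that runs two patterns on the same sentence: '\d+' for bare numbers, and a hypothetical variant '\$\d+' that keeps the dollar sign attached to each amount. The second pattern is an assumption added for illustration.

```python
from nltk.tokenize import RegexpTokenizer

text = "Give my $100 back right now or $20 each month"

# Digits only, as in the lesson's example.
digits_only = RegexpTokenizer(r"\d+").tokenize(text)

# Hypothetical variant: a literal dollar sign followed by digits.
# The backslash escapes "$", which otherwise means "end of string".
amounts = RegexpTokenizer(r"\$\d+").tokenize(text)

print(digits_only)  # ['100', '20']
print(amounts)      # ['$100', '$20']
```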