Text Normalization Techniques
Text normalization is a crucial step in preparing textual data for analysis, ensuring that variations in formatting and structure do not interfere with downstream tasks. You often encounter inconsistencies such as mixed case, punctuation, and varying word forms in raw text. Common normalization techniques include lowercasing, which converts all characters to lower case to ensure uniformity; punctuation removal, which strips non-alphanumeric symbols that rarely contribute to the analysis; stemming, which reduces words to a root form by heuristically chopping off suffixes; and lemmatization, which maps words to their base or dictionary form, taking context and part of speech into account.
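As a minimal sketch of how these basic steps fit together (assuming NLTK and its WordNet data are available; the sample sentence is invented for illustration), stemming and lemmatization can be compared side by side:

```python
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # data used by the WordNet lemmatizer

raw = "The Studies were FINISHED quickly!"
# Lowercase, then strip punctuation
cleaned = re.sub(r"[^\w\s]", "", raw.lower())

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in cleaned.split():
    # Stemming applies heuristic suffix-stripping; lemmatization looks the word up in WordNet
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word)}")
```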
Advanced text normalization techniques further improve the quality and consistency of textual data:
- Context-aware lemmatization: uses linguistic context and part-of-speech information to accurately map words to their correct dictionary form (for example, "better" becomes "good" only when recognized as an adjective);
- Part-of-speech tagging: assigns grammatical categories to each word, allowing normalization processes to distinguish between words like "running" as a noun or verb (a POS-aware lemmatization sketch follows this list);
- Handling compound words: separates or standardizes terms such as "e-mail" and "email" or merges split words like "data base" into "database" for consistency;
- Use of domain-specific lexicons: applies specialized dictionaries to resolve terms unique to a particular field, such as medical abbreviations or legal jargon, ensuring accurate normalization for domain-relevant vocabulary.
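A minimal sketch of how part-of-speech tagging can feed context-aware lemmatization, assuming NLTK's Penn Treebank tagger and WordNet data are installed; the helper `to_wordnet_pos` is a name introduced here purely for illustration:

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
# Newer NLTK releases may also require 'punkt_tab' and 'averaged_perceptron_tagger_eng'

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (JJ, VB, RB, NN, ...) to the POS constants the lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("They have been running")
tagged = pos_tag(tokens)  # e.g. [('They', 'PRP'), ('have', 'VBP'), ...]

# With the verb tag, 'running' lemmatizes to 'run'; the default noun POS would leave it unchanged
lemmas = [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]
print(lemmas)
```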
Applying both fundamental and advanced normalization techniques helps standardize the data, making it easier to compare, group, and analyze textual entries while preserving important contextual meaning.
```python
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required tokenizer + WordNet data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by word_tokenize on newer NLTK releases
nltk.download('wordnet', quiet=True)

texts = [
    "Running quickly!",
    "He runs faster.",
    "Runner's motivation is high.",
    "They have been running."
]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

normalized_results = []

for text in texts:
    # Lowercase
    text_lower = text.lower()
    # Remove punctuation
    cleaned = re.sub(r"[^\w\s]", "", text_lower)
    # Tokenize
    tokens = word_tokenize(cleaned)
    # Stemming
    stems = [stemmer.stem(token) for token in tokens]
    # Lemmatization (no POS tags supplied, so every token defaults to noun)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]

    normalized_results.append({
        'original': text,
        'cleaned': cleaned,
        'stemmed': ' '.join(stems),
        'lemmatized': ' '.join(lemmas)
    })

for result in normalized_results:
    print(f"Original: {result['original']}")
    print(f"Cleaned: {result['cleaned']}")
    print(f"Stemmed: {result['stemmed']}")
    print(f"Lemmatized: {result['lemmatized']}\n")
```
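Because this pipeline passes no part-of-speech information to the lemmatizer, every token is treated as a noun: verb forms such as "running" come back unchanged, while the Porter stemmer reduces them aggressively, sometimes to non-dictionary strings. Supplying POS tags, as in the sketch after the list above, lets the lemmatizer resolve such forms correctly.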