Generative AI
Data Collection and Preprocessing
Training generative models requires not only good architectures and loss functions, but also clean, well-structured, and diverse data. This section introduces high-quality datasets across vision and text modalities, details preprocessing techniques suited to modern generative pipelines, and discusses robust data splitting strategies with practical tooling.
Data Collection
Collecting data for generative modeling depends on the domain, source availability, scale, and licensing. For text and vision data, common sources include open datasets, scraped content, and structured repositories (e.g., academic archives, social media, or e-commerce platforms).
Web Scraping Techniques
When datasets are not readily available, data can be collected from the web using scraping tools. Web scraping allows you to programmatically extract information from HTML pages. It is a powerful approach to collecting real-world, unstructured data when APIs are unavailable. However, scraping comes with technical and ethical responsibilities.
Scraping methods typically involve:
Sending HTTP requests to retrieve web pages. This enables access to the raw HTML content of a page;
Parsing HTML content to extract structured data. Tools like BeautifulSoup convert unstructured HTML into accessible tags and elements;
Navigating dynamic pages using browser automation. JavaScript-heavy websites require tools like Selenium to fully render content;
Storing extracted data in usable formats like CSV or JSON. This ensures compatibility with later preprocessing and model training steps.
Below are two common scraping strategies:
Scraping Text with BeautifulSoup
BeautifulSoup is a Python library used to parse static HTML pages.
import requests
from bs4 import BeautifulSoup

url = "https://docs.python.org/3/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract paragraphs
paragraphs = [p.text for p in soup.find_all('p')]
text = "\n".join(paragraphs)
print(text)
Scraping Images with Selenium
Selenium automates a browser to scrape content from pages rendered with JavaScript.
# Requires: pip install selenium and a matching browser driver (e.g., ChromeDriver)
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import urllib.request

url = "https://example.com/gallery"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)  # give JavaScript-rendered content time to load

# Download every image found on the page
images = driver.find_elements(By.TAG_NAME, "img")
for idx, img in enumerate(images):
    src = img.get_attribute("src")
    if src:
        urllib.request.urlretrieve(src, f"image_{idx}.jpg")

driver.quit()
Always review a website’s terms of service before scraping. Use polite request rates and respect robots.txt. Improper scraping can lead to IP bans or legal consequences.
In GenAI contexts, web scraping is often a precursor to building pretraining datasets, particularly for domain-specific or low-resource languages. Tools like Scrapy, Playwright, or browserless APIs are also frequently used for large-scale jobs.
Preprocessing Techniques
Data preprocessing must be tailored to the modality, model type, and quality constraints. For production-grade generative modeling, pipelines often include domain-specific transformations, resolution adaptation, and content-based filtering.
Image Preprocessing
Resizing: match dataset resolution to model input (e.g., 64x64 for early GANs, 512x512 for diffusion models);
Normalization: scales pixel values to a standard range, typically [−1, 1] or [0, 1];
Color Space Handling: ensure color consistency — convert to RGB or grayscale. For conditional generation, retain alpha channels if present;
Data augmentation: introduces variation during training via transformations; a combined sketch of these steps follows below.
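A minimal sketch of such a pipeline using torchvision (the 64x64 target resolution, the horizontal-flip augmentation, and the file name example.jpg are illustrative assumptions):
from PIL import Image
from torchvision import transforms

# Illustrative pipeline: resize, augment, convert, and normalize to [-1, 1]
preprocess = transforms.Compose([
    transforms.Resize((64, 64)),              # resizing to the model's input resolution
    transforms.RandomHorizontalFlip(p=0.5),   # data augmentation
    transforms.ToTensor(),                    # pixel values scaled to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # shift to [-1, 1]
])

image = Image.open("example.jpg").convert("RGB")  # color space handling: force RGB
tensor = preprocess(image)
print(tensor.shape, tensor.min().item(), tensor.max().item())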
Text Preprocessing
Cleaning: removes special characters, extra whitespace, and noise;
import re

text = "Example text — with symbols!"
cleaned = re.sub(r"[^\w\s]", "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
r"[^\w\s]"
:\w
: matches any alphanumeric character (letters A-Z, a-z, digits 0-9) and underscore_
;\s
: matches any whitespace character (spaces, tabs, newlines);[^...]
: a negated character class—matches anything not listed inside;Meaning: this pattern matches all characters except letters, digits, underscores, and whitespace. So it removes punctuation and symbols (like
—
,!
, etc.).
r"\s+"
:\s
: matches any whitespace character;+
: matches one or more of the preceding token;Meaning: this replaces multiple consecutive whitespace characters with a single space.
.strip()
: removes leading and trailing whitespace from the final cleaned string.
For more information on RegEx syntax, refer to the documentation.
Lowercasing: standardizes text to lowercase for consistency. Apply it selectively, since some models (e.g., BERT) come in separate cased and uncased variants;
text = "This Is A Sentence." print(text.lower())
Tokenization: splits text into tokens or subwords for modeling;
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

encodings = tokenizer(
    ["Example text."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
Stemming/Lemmatization: rare in deep learning pipelines but used in traditional NLP or pretraining filters;
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))
Padding/Truncation: see the tokenization example above with max_length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

inputs = tokenizer(
    "Short text.",
    max_length=10,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
Build modular preprocessing pipelines with reproducibility in mind. Use DVC, wandb artifacts, or huggingface/datasets with streaming and caching.
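As an illustration, a minimal sketch of streaming a large corpus with huggingface/datasets (the dataset name allenai/c4 is an assumption chosen for this example):
from datasets import load_dataset

# Stream the corpus instead of downloading it fully; records are fetched lazily
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Inspect the first record without materializing the whole dataset
first_example = next(iter(dataset))
print(first_example["text"][:200])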
Data Splitting Strategies
Effective data splitting is essential for generalization, especially in generative modeling where overfitting to modes or memorization is common.
Train/Validation/Test Split
Conventional ratios: 80/10/10 or 70/15/15 depending on dataset size;
Content-aware splitting: stratify splits by class (vision), topic (text).
Use case:
Training: drives model optimization;
Validation: guides checkpointing, early stopping, and metric tuning (e.g., FID);
Test: held back completely until final model benchmarking.
Example using train_test_split:
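A minimal sketch with scikit-learn, assuming an 80/10/10 split over placeholder data:
from sklearn.model_selection import train_test_split

data = list(range(1000))  # placeholder for image paths or text samples

# Hold out 10% as the final test set, then carve a validation set from the rest
# (pass stratify=labels here for content-aware, class-stratified splits)
train_val, test = train_test_split(data, test_size=0.10, random_state=42)
train, val = train_test_split(train_val, test_size=len(test), random_state=42)

print(len(train), len(val), len(test))  # 800 / 100 / 100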
For Hugging Face Datasets:
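A minimal sketch, assuming WikiText-2 as the example dataset:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# First split off 20%, then divide that 20% half and half into validation and test
splits = dataset.train_test_split(test_size=0.2, seed=42)
val_test = splits["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = splits["train"], val_test["train"], val_test["test"]
print(len(train_ds), len(val_ds), len(test_ds))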
Cross-Validation and Bootstrapping
For low-resource or zero-shot domains, use K-fold CV (e.g., K=5 or 10);
In diffusion models, use bootstrapped FID/LPIPS to evaluate generation stability;
Visual or perceptual inspection should accompany numerical validation.
Example K-fold setup:
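A minimal sketch with scikit-learn, assuming K=5 and placeholder indices:
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100)  # placeholder for dataset indices
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(data)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")
    # train on data[train_idx], evaluate generations (e.g., FID/LPIPS) on data[val_idx]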
Commonly Used Datasets
Choosing the right dataset depends on modality, data scale, licensing, and the specific generative objective (e.g., unconditional generation, conditional synthesis, or style transfer).
Computer Vision Datasets
CIFAR-10: 60,000 low-resolution 32×32 RGB images in 10 classes. Lightweight, ideal for rapid prototyping, unit testing, and benchmarking training loops for image GANs;
CelebA: 200K+ aligned celebrity faces annotated with 40 binary attributes. Frequently used in attribute-conditioned generation, identity-preserving face editing, and encoder-decoder models;
LSUN: large-scale scene dataset containing millions of images in categories like bedrooms, churches, and dining rooms. Essential for high-resolution synthesis and progressive GAN training;
ImageNet: over 14M high-quality images labeled across 20K classes. Used primarily for transfer learning, diffusion model pretraining, and as a base dataset for style-guided generation.
Text Datasets
WikiText: clean Wikipedia articles (WikiText-2: 2M tokens, WikiText-103: 100M+). Valuable for evaluating language modeling and fine-tuning decoder-only models like GPT;
BookCorpus: over 11,000 free novels. Critical for narrative-style generation, long-context transformers, and pretraining of foundational models (e.g., BERT, GPT-2);
Common Crawl / C4: petabyte-scale multilingual web data. C4 is a deduplicated, filtered variant curated for high-quality language model training (e.g., T5);
The Pile: 825GB of diverse data (books, ArXiv, StackExchange, GitHub, etc.). Designed to train GPT-style models competitively with OpenAI’s LLMs.
Summary
Choose datasets based on quality, licensing, scale, and alignment with generative goals;
Apply preprocessing pipelines tailored to each modality using robust, production-grade tools;
Ensure rigorous splitting strategies to support reproducibility, avoid leakage, and enable fair evaluation.
1. Why is data quality more important than quantity in training generative AI models?
2. What is one common challenge when collecting diverse data for training generative models?
3. What is the primary goal of data augmentation in the context of generative AI training?