Data Collection and Preprocessing

Training generative models requires not only a good architecture and loss function, but also clean, well-structured, and diverse data. This section introduces widely used datasets for vision and text, details preprocessing techniques suited to modern generative pipelines, and discusses robust data splitting strategies with practical tooling.

Data Collection

Collecting data for generative modeling depends on the domain, source availability, scale, and licensing. For text and vision data, common sources include open datasets, scraped content, and structured repositories (e.g., academic archives, social media, or e-commerce platforms).

Web Scraping Techniques

When datasets are not readily available, data can be collected from the web using scraping tools. Web scraping allows you to programmatically extract information from HTML pages. It is a powerful approach to collecting real-world, unstructured data when APIs are unavailable. However, scraping comes with technical and ethical responsibilities.

Scraping methods typically involve:

  • Sending HTTP requests to retrieve web pages. This enables access to the raw HTML content of a page;

  • Parsing HTML content to extract structured data. Tools like BeautifulSoup convert unstructured HTML into accessible tags and elements;

  • Navigating dynamic pages using browser automation. JavaScript-heavy websites require tools like Selenium to fully render content;

  • Storing extracted data in usable formats like CSV or JSON. This ensures compatibility with later preprocessing and model training steps.

Below are two common scraping strategies:

Scraping Text with BeautifulSoup

BeautifulSoup is a Python library used to parse static HTML pages.

import requests
from bs4 import BeautifulSoup

url = "https://docs.python.org/3/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract paragraph text from the page
paragraphs = [p.text for p in soup.find_all('p')]
text = "\n".join(paragraphs)
print(text)

Scraping Images with Selenium

Selenium automates a browser to scrape content from pages rendered with JavaScript.

# Requires Selenium and a matching browser driver (e.g., ChromeDriver)
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import urllib.request

url = "https://example.com/gallery"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)  # wait for JavaScript-rendered content to load

# Download every image found on the page
images = driver.find_elements(By.TAG_NAME, "img")
for idx, img in enumerate(images):
    src = img.get_attribute('src')
    if src:
        urllib.request.urlretrieve(src, f"image_{idx}.jpg")

driver.quit()
Note

Always review a website’s terms of service before scraping. Use polite request rates and respect robots.txt. Improper scraping can lead to IP bans or legal consequences.
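As a quick illustration of the robots.txt point, Python's standard library includes urllib.robotparser for checking whether a path may be fetched; the URL below is just an example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.python.org/robots.txt")
rp.read()

# True only if the site's robots.txt allows this user agent to fetch the path
print(rp.can_fetch("*", "https://docs.python.org/3/"))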

In GenAI contexts, web scraping is often a precursor to building pretraining datasets, particularly for domain-specific or low-resource languages. Tools like Scrapy, Playwright, or hosted headless-browser APIs are also frequently used for large-scale jobs.

Preprocessing Techniques

Data preprocessing must be tailored to the modality, model type, and quality constraints. For production-grade generative modeling, pipelines often include domain-specific transformations, resolution adaptation, and content-based filtering.

Image Preprocessing

  • Resizing: match dataset resolution to model input (e.g., 64×64 for early GANs, 512×512 for diffusion models);

  • Normalization: scales pixel values to a standard range, typically [−1, 1] or [0, 1];

  • Color Space Handling: ensure color consistency — convert to RGB or grayscale. For conditional generation, retain alpha channels if present;

  • Data augmentation: introduces variation during training via transformations such as flips, crops, or color jitter (see the sketch below).
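A minimal sketch of such a pipeline using torchvision transforms; the exact resolution and augmentations are assumptions that depend on your model and dataset:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),                  # match the model's input resolution
    transforms.RandomHorizontalFlip(p=0.5),       # simple augmentation
    transforms.ToTensor(),                        # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),    # maps [0, 1] to [-1, 1]
])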

Text Preprocessing

  • Cleaning: removes special characters, extra whitespace, and noise;

import re

text = "Example text — with symbols!"
cleaned = re.sub(r"[^\w\s]", "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
  1. r"[^\w\s]":

    • \w: matches any alphanumeric character (letters A-Z, a-z, digits 0-9) and underscore _;

    • \s: matches any whitespace character (spaces, tabs, newlines);

    • [^...]: a negated character class—matches anything not listed inside;

    • Meaning: this pattern matches all characters except letters, digits, underscores, and whitespace, so the substitution removes punctuation and symbols (such as the dash and ! in the example above).

  2. r"\s+":

    • \s: matches any whitespace character;

    • +: matches one or more of the preceding token;

    • Meaning: this replaces multiple consecutive whitespace characters with a single space.

  3. .strip(): removes leading and trailing whitespace from the final cleaned string.

For more information on regex syntax, refer to the Python re module documentation.

  • Lowercasing: standardizes text to lowercase for consistency. Apply it selectively, since some models (e.g., BERT) ship in separate cased and uncased variants;

text = "This Is A Sentence."
print(text.lower())
  • Tokenization: splits text into tokens or subwords for modeling;

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
encodings = tokenizer(
    ["Example text."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
  • Stemming/Lemmatization: rare in deep learning pipelines but used in traditional NLP or pretraining filters;

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # "run"
  • Padding/Truncation: pads or truncates sequences to a fixed length, controlled by max_length as in the tokenization example above and the snippet below.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
inputs = tokenizer(
    "Short text.",
    max_length=10,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
Note

Build modular preprocessing pipelines with reproducibility in mind. Use DVC, Weights & Biases artifacts, or the Hugging Face datasets library with streaming and caching.
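As a minimal sketch of the streaming approach with the Hugging Face datasets library (the dataset identifier allenai/c4 is used purely for illustration):

from datasets import load_dataset

# Stream examples without downloading the full corpus to disk
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])
    if i == 2:
        break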

Data Splitting Strategies

Effective data splitting is essential for generalization, especially in generative modeling where overfitting to modes or memorization is common.

Train/Validation/Test Split

  • Conventional ratios: 80/10/10 or 70/15/15 depending on dataset size;

  • Content-aware splitting: stratify splits by class (vision) or topic (text);

  • Role of each split:

    • Training: drives model optimization;

    • Validation: guides checkpointing, early stopping, and metric tuning (e.g., FID);

    • Test: held back completely until final model benchmarking.

Example using train_test_split:
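A minimal sketch with scikit-learn's train_test_split, producing an 80/10/10 split in two steps (the data list is a placeholder for real samples):

from sklearn.model_selection import train_test_split

data = list(range(1000))  # placeholder samples

# Hold out 10% for test, then 1/9 of the remainder (10% of the original) for validation
train_data, test_data = train_test_split(data, test_size=0.1, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=1/9, random_state=42)
print(len(train_data), len(val_data), len(test_data))  # 800 100 100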

For Hugging Face Datasets:
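A sketch using the built-in train_test_split method of a Hugging Face Dataset (the imdb dataset name is illustrative):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))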

Cross-Validation and Bootstrapping

  • For low-resource or zero-shot domains, use K-fold CV (e.g., K=5 or 10);

  • In diffusion models, use bootstrapped FID/LPIPS to evaluate generation stability;

  • Visual or perceptual inspection should accompany numerical validation.

Example K-fold setup:
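A minimal K-fold sketch with scikit-learn, using K=5 and integer indices in place of real samples:

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100)  # placeholder sample indices
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")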

Commonly Used Datasets

Choosing the right dataset depends on modality, data scale, licensing, and the specific generative objective (e.g., unconditional generation, conditional synthesis, or style transfer).

Computer Vision Datasets

  • CIFAR-10: 60,000 low-resolution 32×32 RGB images in 10 classes. Lightweight, ideal for rapid prototyping, unit testing, and benchmarking training loops for image GANs;

  • CelebA: 200K+ aligned celebrity faces annotated with 40 binary attributes. Frequently used in attribute-conditioned generation, identity-preserving face editing, and encoder-decoder models;

  • LSUN: large-scale scene dataset containing millions of images in categories like bedrooms, churches, and dining rooms. Essential for high-resolution synthesis and progressive GAN training;

  • ImageNet: over 14M high-quality images labeled across 20K classes. Used primarily for transfer learning, diffusion model pretraining, and as a base dataset for style-guided generation.
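For quick experimentation, several of these datasets ship with standard loaders; below is a minimal sketch downloading CIFAR-10 via torchvision (the ./data path is a placeholder):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale to [-1, 1]
])
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
print(len(cifar10))  # 50,000 training images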

Text Datasets

  • WikiText: clean Wikipedia articles (WikiText-2: 2M tokens, WikiText-103: 100M+). Valuable for evaluating language modeling and fine-tuning decoder-only models like GPT;

  • BookCorpus: over 11,000 free novels. Critical for narrative-style generation, long-context transformers, and pretraining of foundational models (e.g., BERT, GPT-2);

  • Common Crawl / C4: petabyte-scale multilingual web data. C4 is a deduplicated, filtered variant curated for high-quality language model training (e.g., T5);

  • The Pile: 825GB of diverse data (books, ArXiv, StackExchange, GitHub, etc.). Designed to train GPT-style models competitively with OpenAI’s LLMs.
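Similarly, many of the text corpora above are published on the Hugging Face Hub; a sketch loading WikiText-2 (assuming the commonly published wikitext-2-raw-v1 configuration):

from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(wikitext[0]["text"][:100])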

Summary

  • Choose datasets based on quality, licensing, scale, and alignment with generative goals;

  • Apply preprocessing pipelines tailored to each modality using robust, production-grade tools;

  • Ensure rigorous splitting strategies to support reproducibility, avoid leakage, and enable fair evaluation.

1. Why is data quality more important than quantity in training generative AI models?

2. What is one common challenge when collecting diverse data for training generative models?

3. What is the primary goal of data augmentation in the context of generative AI training?

