Data Collection and Preprocessing

Training generative models requires not only a good architecture and loss function, but also clean, well-structured, and diverse data. This section introduces widely used datasets for vision and text, details preprocessing techniques suited to modern generative pipelines, and discusses robust data splitting strategies with practical tooling.

Data Collection

Collecting data for generative modeling depends on the domain, source availability, scale, and licensing. For text and vision data, common sources include open datasets, scraped content, and structured repositories (e.g., academic archives, social media, or e-commerce platforms).

Web Scraping Techniques

When datasets are not readily available, data can be collected from the web using scraping tools. Web scraping allows you to programmatically extract information from HTML pages. It is a powerful approach to collecting real-world, unstructured data when APIs are unavailable. However, scraping comes with technical and ethical responsibilities.

Scraping methods typically involve:

  • Sending HTTP requests to retrieve web pages. This enables access to the raw HTML content of a page;

  • Parsing HTML content to extract structured data. Tools like BeautifulSoup convert unstructured HTML into accessible tags and elements;

  • Navigating dynamic pages using browser automation. JavaScript-heavy websites require tools like Selenium to fully render content;

  • Storing extracted data in usable formats like CSV or JSON. This ensures compatibility with later preprocessing and model training steps.

Below are two common scraping strategies:

Scraping Text with BeautifulSoup

BeautifulSoup is a Python library used to parse static HTML pages.

import requests
from bs4 import BeautifulSoup

url = "https://docs.python.org/3/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract paragraph text from the page
paragraphs = [p.text for p in soup.find_all('p')]
text = "\n".join(paragraphs)
print(text)

Scraping Images with Selenium

Selenium automates a browser to scrape content from pages rendered with JavaScript.

# Requires Selenium and a matching browser driver (e.g., ChromeDriver)
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import urllib.request

url = "https://example.com/gallery"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)  # wait for JavaScript-rendered content to load

# Download every image found on the page
images = driver.find_elements(By.TAG_NAME, "img")
for idx, img in enumerate(images):
    src = img.get_attribute('src')
    if src:
        urllib.request.urlretrieve(src, f"image_{idx}.jpg")

driver.quit()
Note

Always review a website’s terms of service before scraping. Use polite request rates and respect robots.txt. Improper scraping can lead to IP bans or legal consequences.
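As a quick illustration of the robots.txt point, Python's standard library includes urllib.robotparser for checking whether a path may be fetched; the URL below is just an example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.python.org/robots.txt")
rp.read()

# True only if the site's robots.txt allows this user agent to fetch the path
print(rp.can_fetch("*", "https://docs.python.org/3/"))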

In GenAI contexts, web scraping is often a precursor to building pretraining datasets, particularly for domain-specific or low-resource languages. Tools like Scrapy, Playwright, or hosted headless-browser APIs are also frequently used for large-scale jobs.

Preprocessing Techniques

Data preprocessing must be tailored to the modality, model type, and quality constraints. For production-grade generative modeling, pipelines often include domain-specific transformations, resolution adaptation, and content-based filtering.

Image Preprocessing

  • Resizing: match dataset resolution to model input (e.g., 64×64 for early GANs, 512×512 for diffusion models);

  • Normalization: scales pixel values to a standard range, typically [−1, 1] or [0, 1];

  • Color Space Handling: ensure color consistency — convert to RGB or grayscale. For conditional generation, retain alpha channels if present;

  • Data augmentation: introduces variation during training via transformations such as flips, crops, or color jitter (see the sketch below).
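A minimal sketch of such a pipeline using torchvision transforms; the exact resolution and augmentations are assumptions that depend on your model and dataset:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),                  # match the model's input resolution
    transforms.RandomHorizontalFlip(p=0.5),       # simple augmentation
    transforms.ToTensor(),                        # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),    # maps [0, 1] to [-1, 1]
])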

Text Preprocessing

  • Cleaning: removes special characters, extra whitespace, and noise;

import re

text = "Example text — with symbols!"
cleaned = re.sub(r"[^\w\s]", "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
  1. r"[^\w\s]":

    • \w: matches any alphanumeric character (letters A-Z, a-z, digits 0-9) and underscore _;

    • \s: matches any whitespace character (spaces, tabs, newlines);

    • [^...]: a negated character class—matches anything not listed inside;

    • Meaning: this pattern matches all characters except letters, digits, underscores, and whitespace, so the substitution removes punctuation and symbols (such as the dash and ! in the example above).

  2. r"\s+":

    • \s: matches any whitespace character;

    • +: matches one or more of the preceding token;

    • Meaning: this replaces multiple consecutive whitespace characters with a single space.

  3. .strip(): removes leading and trailing whitespace from the final cleaned string.

For more information on regex syntax, refer to the Python re module documentation.

  • Lowercasing: standardizes text to lowercase for consistency. Apply it selectively, since some models (e.g., BERT) ship in separate cased and uncased variants;

text = "This Is A Sentence."
print(text.lower())
  • Tokenization: splits text into tokens or subwords for modeling;

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
encodings = tokenizer(
    ["Example text."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
  • Stemming/Lemmatization: rare in deep learning pipelines but used in traditional NLP or pretraining filters;

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # "run"
  • Padding/Truncation: pads or truncates sequences to a fixed length, controlled by max_length as in the tokenization example above and the snippet below.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
inputs = tokenizer(
    "Short text.",
    max_length=10,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
Note

Build modular preprocessing pipelines with reproducibility in mind. Use DVC, Weights & Biases artifacts, or the Hugging Face datasets library with streaming and caching.
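As a minimal sketch of the streaming approach with the Hugging Face datasets library (the dataset identifier allenai/c4 is used purely for illustration):

from datasets import load_dataset

# Stream examples without downloading the full corpus to disk
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])
    if i == 2:
        break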

Data Splitting Strategies

Effective data splitting is essential for generalization, especially in generative modeling where overfitting to modes or memorization is common.

Train/Validation/Test Split

  • Conventional ratios: 80/10/10 or 70/15/15 depending on dataset size;

  • Content-aware splitting: stratify splits by class (vision) or topic (text);

  • Role of each split:

    • Training: drives model optimization;

    • Validation: guides checkpointing, early stopping, and metric tuning (e.g., FID);

    • Test: held back completely until final model benchmarking.

Example using train_test_split:
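A minimal sketch with scikit-learn's train_test_split, producing an 80/10/10 split in two steps (the data list is a placeholder for real samples):

from sklearn.model_selection import train_test_split

data = list(range(1000))  # placeholder samples

# Hold out 10% for test, then 1/9 of the remainder (10% of the original) for validation
train_data, test_data = train_test_split(data, test_size=0.1, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=1/9, random_state=42)
print(len(train_data), len(val_data), len(test_data))  # 800 100 100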

For Hugging Face Datasets:
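A sketch using the built-in train_test_split method of a Hugging Face Dataset (the imdb dataset name is illustrative):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))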

Cross-Validation and Bootstrapping

  • For low-resource or zero-shot domains, use K-fold CV (e.g., K=5 or 10);

  • In diffusion models, use bootstrapped FID/LPIPS to evaluate generation stability;

  • Visual or perceptual inspection should accompany numerical validation.

Example K-fold setup:
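A minimal K-fold sketch with scikit-learn, using K=5 and integer indices in place of real samples:

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100)  # placeholder sample indices
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")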

Commonly Used Datasets

Choosing the right dataset depends on modality, data scale, licensing, and the specific generative objective (e.g., unconditional generation, conditional synthesis, or style transfer).

Computer Vision Datasets

  • CIFAR-10: 60,000 low-resolution 32×32 RGB images in 10 classes. Lightweight, ideal for rapid prototyping, unit testing, and benchmarking training loops for image GANs;

  • CelebA: 200K+ aligned celebrity faces annotated with 40 binary attributes. Frequently used in attribute-conditioned generation, identity-preserving face editing, and encoder-decoder models;

  • LSUN: large-scale scene dataset containing millions of images in categories like bedrooms, churches, and dining rooms. Essential for high-resolution synthesis and progressive GAN training;

  • ImageNet: over 14M high-quality images labeled across 20K classes. Used primarily for transfer learning, diffusion model pretraining, and as a base dataset for style-guided generation.
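For quick experimentation, several of these datasets ship with standard loaders; below is a minimal sketch downloading CIFAR-10 via torchvision (the ./data path is a placeholder):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale to [-1, 1]
])
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
print(len(cifar10))  # 50,000 training images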

Text Datasets

  • WikiText: clean Wikipedia articles (WikiText-2: 2M tokens, WikiText-103: 100M+). Valuable for evaluating language modeling and fine-tuning decoder-only models like GPT;

  • BookCorpus: over 11,000 free novels. Critical for narrative-style generation, long-context transformers, and pretraining of foundational models (e.g., BERT, GPT-2);

  • Common Crawl / C4: petabyte-scale multilingual web data. C4 is a deduplicated, filtered variant curated for high-quality language model training (e.g., T5);

  • The Pile: 825GB of diverse data (books, ArXiv, StackExchange, GitHub, etc.). Designed to train GPT-style models competitively with OpenAI’s LLMs.
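Similarly, many of the text corpora above are published on the Hugging Face Hub; a sketch loading WikiText-2 (assuming the commonly published wikitext-2-raw-v1 configuration):

from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(wikitext[0]["text"][:100])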

Summary

  • Choose datasets based on quality, licensing, scale, and alignment with generative goals;

  • Apply preprocessing pipelines tailored to each modality using robust, production-grade tools;

  • Ensure rigorous splitting strategies to support reproducibility, avoid leakage, and enable fair evaluation.

1. Why is data quality more important than quantity in training generative AI models?

2. What is one common challenge when collecting diverse data for training generative models?

3. What is the primary goal of data augmentation in the context of generative AI training?

