Understanding Sampling | Probability & Statistics
Mathematics for Data Science

Understanding Sampling

Sampling is a foundational concept in data science. Since collecting data from an entire population is often impractical or impossible, we use sampling to gather meaningful insights from smaller subsets.
The quality of your analysis depends heavily on the sampling method you choose.

Simple Random Sampling

Every member of the population has an equal chance of being selected.
This is like drawing names out of a hat.

$$P(\text{a given individual is chosen on a single draw}) = \frac{1}{N}$$

Where:

  • $N$ = population size.

Example 1:

You have a class of 30 students. You want to randomly select 5 for a survey.

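Solution: Use a random number generator to select 5 unique numbers between 1 and 30. On any single draw each student has a $\tfrac{1}{30}$ chance of being picked, and the chance of ending up among the 5 selected is $\tfrac{5}{30} = \tfrac{1}{6}$.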
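The draw in Example 1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the helper name `simple_random_sample`, the numeric student IDs, and the seed are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical labels: students are numbered 1..30.
students = list(range(1, 31))
chosen = simple_random_sample(students, 5, seed=42)
print(sorted(chosen))
```

`random.sample` draws without replacement, which is what "5 unique numbers" requires.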

Example 2:

You have a class of 30 students and want to select 5 to participate in a survey.

  • Total population: $N = 30$
  • Sample size: $n = 5$

What is the probability that Alice and Bob are both selected?

Total number of ways to choose 5 students from 30:

$$\binom{30}{5}$$

Number of favorable samples containing both Alice and Bob:
Fix Alice and Bob, then choose 3 more from the remaining 28:

$$\binom{28}{3}$$

So the probability is:

$$P = \frac{\binom{28}{3}}{\binom{30}{5}}$$
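To turn the ratio above into a number, the two binomial coefficients can be evaluated directly. The sketch below uses Python's standard `math.comb` and `fractions.Fraction`; the variable names are chosen only for this illustration.

```python
from fractions import Fraction
from math import comb

total = comb(30, 5)      # all ways to choose 5 of 30 students
favorable = comb(28, 3)  # Alice and Bob fixed, 3 more chosen from the other 28
p_both = Fraction(favorable, total)

print(total)      # 142506
print(favorable)  # 3276
print(p_both)     # 2/87
print(float(p_both))
```

So the chance that both are selected is $\tfrac{2}{87}$, a little over 2%.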

Stratified Sampling

The population is divided into meaningful subgroups (strata), and random samples are taken from each.

$$n_h = \frac{N_h}{N} \times n$$

Where:

  • $N_h$ - size of subgroup $h$;
  • $N$ - total population size;
  • $n$ - total sample size;
  • $n_h$ - sample size from subgroup $h$.

Example:

A class has 30 students: 18 males and 12 females. You want to sample 10 students proportionally:

  • From males: $\tfrac{18}{30} \times 10 = 6$
  • From females: $\tfrac{12}{30} \times 10 = 4$

Why it's good: Ensures representation of key subgroups.
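The proportional allocation $n_h = \frac{N_h}{N} \times n$ and the per-stratum draw might be sketched as follows. The function names and the student labels (`M1`, `F1`, ...) are hypothetical. Rounding each $n_h$ to the nearest integer is an assumption of this sketch; with other stratum sizes the rounded counts may not add up to exactly $n$.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Return n_h = N_h / N * n per stratum, rounded to the nearest integer."""
    total = sum(strata_sizes.values())
    return {name: round(size * n / total) for name, size in strata_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample of size n_h inside each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

# Hypothetical class roster: 18 males and 12 females.
strata = {
    "male": [f"M{i}" for i in range(1, 19)],
    "female": [f"F{i}" for i in range(1, 13)],
}
allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, 10)
print(allocation)  # {'male': 6, 'female': 4}

sample = stratified_sample(strata, 10, seed=0)
print(sample)
```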

Cluster Sampling

The population is split into groups (clusters), and entire clusters are randomly selected.

$$c = \text{number of clusters to sample}$$

Where:

  • Clusters are pre-existing groups (e.g., classrooms, teams).
  • You randomly pick entire clusters, not individuals.

Example 1:

Your school has 5 classrooms. You want a sample of 25 students, but surveying individuals is too time-consuming.

Solution: Randomly select 1 classroom (since each has ~25 students) and survey all.

Example 2:

A university has 20 dorm buildings, each housing 50 students. You randomly select 4 dorms and survey everyone inside.

  • Number of clusters: $N = 20$
  • Selected clusters: $n = 4$
  • Students per dorm: $M = 50$
  • Total students sampled: $n \times M = 200$

What's the probability that a specific student (e.g., Sarah) is included?
It equals the probability that her dorm is selected:

$$P(\text{Sarah selected}) = \frac{4}{20} = 0.2$$
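The dorm example can be sketched in Python as below. The `cluster_sample` helper, the dorm numbering, and the student labels are assumptions for illustration. The last lines check the inclusion probability by counting: of the $\binom{20}{4}$ equally likely sets of dorms, $\binom{19}{3}$ contain Sarah's dorm.

```python
import random
from fractions import Fraction
from math import comb

def cluster_sample(clusters, c, seed=None):
    """Pick c whole clusters at random; return the chosen ids and all their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), c)
    members = [m for cid in chosen for m in clusters[cid]]
    return chosen, members

# Hypothetical data: 20 dorms (ids 0..19) with 50 students each.
dorms = {d: [f"dorm{d}-student{s}" for s in range(50)] for d in range(20)}
chosen_dorms, surveyed = cluster_sample(dorms, 4, seed=1)
print(chosen_dorms)
print(len(surveyed))  # 200

# Inclusion probability for a student in one specific dorm.
p_sarah = Fraction(comb(19, 3), comb(20, 4))
print(p_sarah)  # 1/5
```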

Complex case:
If 10 dorms have 30 students and 10 have 70 students, and you select 4 dorms randomly, what's the expected sample size?

Let:

  • $D_{30} = 10$ dorms with 30 students,
  • $D_{70} = 10$ dorms with 70 students.

Expected sample size:

$$E = \frac{10}{20} \cdot (4 \times 30) + \frac{10}{20} \cdot (4 \times 70) = 200$$

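So even though the clusters differ in size, the expected sample size is still 200, because the average dorm size is still 50. The actual sample size, however, now varies between 120 (four small dorms) and 280 (four large dorms).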
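As a cross-check on the expected value, the sketch below enumerates how many of the 4 selected dorms are 30-student dorms and weights each outcome by its exact probability. It assumes the 4 dorms are drawn without replacement, so the count follows a hypergeometric distribution; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

small, large = 10, 10  # dorms with 30 and with 70 students
draws = 4              # dorms selected without replacement
total_ways = comb(small + large, draws)

expected = Fraction(0)
sizes = []
for k in range(draws + 1):  # k = number of 30-student dorms among the 4
    prob = Fraction(comb(small, k) * comb(large, draws - k), total_ways)
    size = 30 * k + 70 * (draws - k)
    expected += prob * size
    sizes.append(size)

print(expected)                # 200
print(min(sizes), max(sizes))  # 120 280
```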

Systematic Sampling

Select every $k$-th item from a list.

$$k = \frac{N}{n}$$

Where:

  • $N$ - total population;
  • $n$ - sample size desired;
  • $k$ - sampling interval.

Example:

A list of 1000 customers. You want a sample of 100. So:

$$k = \frac{1000}{100} = 10$$

Pick a random start point (e.g., 7), then select every 10th customer: 7, 17, 27, etc.

Why it's good: Easy to implement, and it spreads the sample evenly across the list.
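A minimal sketch of the customer example, assuming $N$ is an exact multiple of $n$ (as with 1000 and 100) and that customers are simply numbered 1 to 1000. The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(items, n, seed=None):
    """Take every k-th item, with k = N // n and a random start in the first interval."""
    k = len(items) // n
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based offset into the first k items
    return items[start::k][:n]

customers = list(range(1, 1001))  # hypothetical customer numbers 1..1000
sample = systematic_sample(customers, 100, seed=3)
print(len(sample))  # 100

# With a fixed start at customer 7 (index 6), the first picks match the example.
print(customers[6::10][:3])  # [7, 17, 27]
```

Note that the only random choice is the starting point; once it is fixed, the rest of the sample is determined.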

All Methods Applied to One Problem

Problem Setup:
You're studying cafeteria satisfaction at a school with 300 students across 10 classrooms (30 per room). You want a sample of 30 students:

  • Simple Random: randomly pick 30 names from the full list;
  • Stratified: if 60% are boys and 40% girls, sample 18 boys and 12 girls;
  • Cluster: randomly select 1 class (30 students) and survey all;
  • Systematic: with $k = 300/30 = 10$, pick a random start among the first 10 students on an ordered list, then take every 10th student.
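The four approaches can be compared side by side on a made-up roster. Everything about the data here is an assumption for illustration: student IDs 0 to 299, rooms assigned in blocks of 30, and a repeating boy/girl pattern that yields the 60/40 split.

```python
import random

rng = random.Random(0)

# Hypothetical roster: 300 students, 10 rooms of 30, 60% boys and 40% girls.
students = [
    {"id": i, "room": i // 30, "sex": "boy" if i % 5 < 3 else "girl"}
    for i in range(300)
]

# Simple random: 30 students from the full list.
simple = rng.sample(students, 30)

# Stratified: 18 boys and 12 girls, drawn separately.
boys = [s for s in students if s["sex"] == "boy"]
girls = [s for s in students if s["sex"] == "girl"]
stratified = rng.sample(boys, 18) + rng.sample(girls, 12)

# Cluster: one whole classroom.
room = rng.randrange(10)
cluster = [s for s in students if s["room"] == room]

# Systematic: random start in the first 10, then every 10th student.
start = rng.randrange(10)
systematic = students[start::10]

for name, chosen in [("simple", simple), ("stratified", stratified),
                     ("cluster", cluster), ("systematic", systematic)]:
    print(name, len(chosen))
```

Each method returns 30 students, but only the stratified draw guarantees the 18/12 split, and only the cluster draw confines the survey to a single room.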

Summary

  • Sampling reduces data collection effort while allowing generalization;
  • Simple random and stratified sampling generally give the most representative samples;
  • Cluster sampling is efficient but works best when clusters are similar;
  • Systematic sampling is simple and practical;
  • Convenience sampling (surveying whoever is easiest to reach) is risky and should be avoided when possible;
  • Always document your sampling method in real-world analysis.

Quiz

1. Which method ensures every individual has an equal chance of selection?

2. In stratified sampling, you divide the population into:

3. Cluster sampling selects:

4. If there are 200 students and you want 20 via systematic sampling, $k = ?$

5. Which method might skip natural group divisions?

6. Cluster sampling works best when:

