Understanding Sampling
Sampling is a foundational concept in data science. Since collecting data from an entire population is often impractical or impossible, we use sampling to gather meaningful insights from smaller subsets.
The quality of your analysis depends heavily on the sampling method you choose.
Simple Random Sampling
Every member of the population has an equal chance of being selected.
This is like drawing names out of a hat.
On any single draw, the probability of selecting a given member is:
$P = \frac{1}{N}$
Where:
- $N$ = population size.
Example 1:
You have a class of 30 students. You want to randomly select 5 for a survey.
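Solution: Use a random number generator to select 5 unique numbers between 1 and 30. On any single draw each student has a $\frac{1}{30}$ chance of being picked, so each student's chance of ending up in the sample of 5 is $\frac{5}{30} = \frac{1}{6}$.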
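The draw described in this example can be sketched in Python with the standard library. The function name `simple_random_sample` and the `seed` parameter are illustrative choices, not part of the lesson; the seed only makes the draw repeatable.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unique IDs from 1..population_size, each equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no ID repeats.
    return rng.sample(range(1, population_size + 1), sample_size)

# 30 students numbered 1..30, pick 5 for the survey.
chosen = simple_random_sample(30, 5, seed=42)
print(sorted(chosen))
```

Sampling without replacement matters here: drawing five independent random numbers could pick the same student twice.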
Example 2:
You have a class of 30 students and want to select 5 to participate in a survey.
- Total population: N=30
- Sample size: n=5
What is the probability that Alice and Bob are both selected?
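Total number of ways to choose 5 students from 30:
$\binom{30}{5} = 142506$
Number of favorable samples containing both Alice and Bob:
Fix Alice and Bob, then choose 3 more from the remaining 28:
$\binom{28}{3} = 3276$
So the probability is:
$P = \frac{\binom{28}{3}}{\binom{30}{5}} = \frac{3276}{142506} = \frac{2}{87} \approx 0.023$
Stratified Sampling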
The population is divided into meaningful subgroups (strata), and random samples are taken from each.
$n_h = \frac{N_h}{N} \times n$
Where:
- $N_h$ - size of subgroup $h$;
- $N$ - total population size;
- $n$ - total sample size;
- $n_h$ - sample size from subgroup $h$.
Example:
A class has 30 students: 18 males and 12 females. You want to sample 10 students proportionally:
- From males: $\frac{18}{30} \times 10 = 6$
- From females: $\frac{12}{30} \times 10 = 4$
Why it's good: Ensures representation of key subgroups.
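A minimal sketch of proportional allocation followed by a random draw inside each stratum. The helper names and the student labels (`M1`, `F1`, ...) are hypothetical. The largest-remainder rounding is an added assumption for cases where $\frac{N_h}{N} \times n$ is not a whole number; the lesson's example divides evenly and does not need it.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Allocate n sample slots across strata in proportion to their size.

    Integer division plus largest-remainder rounding keeps the total at n.
    """
    total = sum(strata_sizes.values())
    parts = {h: divmod(size * n, total) for h, size in strata_sizes.items()}
    alloc = {h: q for h, (q, _) in parts.items()}
    leftover = n - sum(alloc.values())
    # Hand any leftover slots to the strata with the largest remainders.
    for h in sorted(parts, key=lambda h: parts[h][1], reverse=True)[:leftover]:
        alloc[h] += 1
    return alloc

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    alloc = proportional_allocation({h: len(m) for h, m in strata.items()}, n)
    return {h: rng.sample(members, alloc[h]) for h, members in strata.items()}

# The class from the example: 18 males and 12 females, sample of 10.
students = {
    "male": [f"M{i}" for i in range(1, 19)],
    "female": [f"F{i}" for i in range(1, 13)],
}
print(proportional_allocation({"male": 18, "female": 12}, 10))  # {'male': 6, 'female': 4}
sample = stratified_sample(students, 10, seed=1)
```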
Cluster Sampling
The population is split into groups (clusters), and entire clusters are randomly selected.
$c$ = number of clusters to sample
Where:
- Clusters are pre-existing groups (e.g., classrooms, teams).
- You randomly pick entire clusters, not individuals.
Example 1:
Your school has 5 classrooms. You want a sample of 25 students, but surveying individuals is too time-consuming.
Solution: Randomly select 1 classroom (since each has ~25 students) and survey all.
Example 2:
A university has 20 dorm buildings, each housing 50 students. You randomly select 4 dorms and survey everyone inside.
- Number of clusters: N=20
- Selected clusters: n=4
- Students per dorm: M=50
- Total students sampled: n×M=200
What's the probability that a specific student (e.g., Sarah) is included?
It equals the probability that her dorm is selected:
$P = \frac{n}{N} = \frac{4}{20} = 0.2$
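The inclusion probability for Sarah's dorm can be checked by counting samples, in the same way as the Alice and Bob example: fix her dorm, then choose the remaining dorms from the others. The function name below is illustrative, and `Fraction` is used only to keep the result exact.

```python
from fractions import Fraction
from math import comb

def cluster_inclusion_probability(total_clusters, selected_clusters):
    """Probability that one specific cluster is among those selected."""
    # Samples containing the given cluster: fix it, choose the rest from the others.
    favorable = comb(total_clusters - 1, selected_clusters - 1)
    possible = comb(total_clusters, selected_clusters)
    return Fraction(favorable, possible)

# 20 dorms, 4 selected.
p_sarah = cluster_inclusion_probability(20, 4)
print(p_sarah, float(p_sarah))  # 1/5 0.2
```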
Complex case:
If 10 dorms have 30 students and 10 have 70 students, and you select 4 dorms randomly, what's the expected sample size?
Let:
- D30=10 dorms with 30 students,
- D70=10 dorms with 70 students.
Expected sample size:
$E = \frac{10}{20} \cdot (4 \times 30) + \frac{10}{20} \cdot (4 \times 70) = 200$
So even if clusters differ in size, the expected sample size remains the same if dorm types are balanced.
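As a cross-check on the expected sample size, the sketch below enumerates every possible split between small and large dorms among the 4 selected and weights each outcome by its probability. This exact enumeration is a different route from the shortcut formula in the text, and the function and parameter names are hypothetical.

```python
from fractions import Fraction
from math import comb

def expected_cluster_sample_size(small, large, n_small, n_large, picks):
    """Exact expected head count when `picks` clusters are drawn without
    replacement from n_small clusters of size `small` and n_large clusters
    of size `large`."""
    total = comb(n_small + n_large, picks)
    expectation = Fraction(0)
    for k in range(picks + 1):  # k = number of small clusters drawn
        ways = comb(n_small, k) * comb(n_large, picks - k)
        expectation += Fraction(ways, total) * (k * small + (picks - k) * large)
    return expectation

# 10 dorms of 30 students, 10 dorms of 70 students, 4 dorms selected.
print(expected_cluster_sample_size(30, 70, 10, 10, 4))  # 200
```

The expectation is 200, but any single draw can yield anywhere from 120 students (four small dorms) to 280 (four large dorms).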
Systematic Sampling
Select every k-th item from a list.
$k = \frac{N}{n}$
Where:
- N - total population;
- n - sample size desired;
- k - sampling interval.
Example:
A list of 1000 customers. You want a sample of 100. So:
$k = \frac{1000}{100} = 10$
Pick a random start point between 1 and 10 (e.g., 7), then select every 10th customer: 7, 17, 27, etc.
Why it's good: Easy to implement, and the sample is spread evenly across the list.
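A short sketch of the procedure, assuming the population size is a multiple of the sample size, as in the example. The function name and the 1-based `start` argument are illustrative choices.

```python
import random

def systematic_sample(items, n, start=None, seed=None):
    """Take every k-th item, where k = len(items) // n.

    `start` is a 1-based position in the first interval; if omitted,
    it is chosen at random between 1 and k.
    """
    k = len(items) // n
    if start is None:
        start = random.Random(seed).randint(1, k)
    # Slice from the start position in steps of k, capped at n items.
    return items[start - 1::k][:n]

customers = list(range(1, 1001))          # customers numbered 1..1000
sample = systematic_sample(customers, 100, start=7)
print(sample[:3], len(sample))  # [7, 17, 27] 100
```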
All Methods Applied to One Problem
Problem Setup:
You're studying cafeteria satisfaction at a school with 300 students across 10 classrooms (30 per room). You want a sample of 30 students:
- Simple Random: randomly pick 30 names from the full list;
- Stratified: if 60% are boys and 40% girls, sample 18 boys and 12 girls;
- Cluster: randomly select 1 class (30 students) and survey all;
- Systematic: pick every 10th student from an ordered list, starting at a random position between 1 and 10.
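The four approaches above can be sketched side by side on a made-up roster. Everything about the roster is a hypothetical assumption for illustration: the student IDs, the classroom assignment, and the rule that marks exactly 60% of students as boys.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical roster: 300 students, 10 classrooms of 30, 60% boys.
roster = [
    {
        "id": i,
        "classroom": (i - 1) // 30 + 1,
        "gender": "boy" if i % 5 < 3 else "girl",
    }
    for i in range(1, 301)
]

# Simple random: 30 names from the full list.
simple = rng.sample(roster, 30)

# Stratified: 18 boys and 12 girls, drawn separately.
boys = [s for s in roster if s["gender"] == "boy"]
girls = [s for s in roster if s["gender"] == "girl"]
stratified = rng.sample(boys, 18) + rng.sample(girls, 12)

# Cluster: one whole classroom.
room = rng.randint(1, 10)
cluster = [s for s in roster if s["classroom"] == room]

# Systematic: every 10th student from a random start in 1..10.
k = len(roster) // 30
start = rng.randint(1, k)
systematic = roster[start - 1::k]

for name, chosen in [("simple", simple), ("stratified", stratified),
                     ("cluster", cluster), ("systematic", systematic)]:
    print(name, len(chosen))
```

All four yield 30 students, but only the stratified draw guarantees the 18/12 split, and only the cluster draw confines the survey to a single room.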
Summary
- Sampling reduces data collection effort while allowing generalization;
- Simple random and stratified sampling generally give the most representative samples;
- Cluster sampling is efficient but works best when clusters are similar;
- Systematic sampling is simple and practical;
- Convenience sampling (surveying whoever is easiest to reach) is risky and should be avoided when possible;
- Always document your sampling method in real-world analysis.
Quiz
1. Which method ensures every individual has an equal chance of selection?
2. In stratified sampling, you divide the population into:
3. Cluster sampling selects:
4. If there are 200 students and you want 20 via systematic sampling, k=?
5. Which method might skip natural group divisions?
6. Cluster sampling works best when: