Sampling
When we think about probability, we usually think about cases in which we know how common an event is, and we want to know the odds it will happen in a particular instance.
For example, we know that a coin toss will turn up heads 50% of the time. So, when we flip a coin, we have a 50% chance of getting heads.
Or, we know that there are 4 aces in a deck of cards, so if we pull a random card from a deck, we have a $\dfrac{4}{52}$ chance of pulling an ace.
We can also use probability to work backwards: we can use how often a random event occurs in a sample to figure out how often it occurs in a larger population.
For example, if we take 10 random jelly beans from a huge jar and 6 of the jelly beans are pink, we can estimate that about 60% of the jelly beans in the jar are pink.
Or, if we randomly select 100 kids from a school and 10 of them have red hair, we can estimate that about 10% of the kids at the school have red hair.
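To make the "working backwards" idea concrete, here is a minimal Python sketch that draws a random sample from a pretend jar and uses the sample to estimate the whole jar. The jar size (10,000 beans), its true share of pink beans (60%), the sample size of 100, and the helper name `estimate_proportion` are all assumptions made up for illustration.

```python
import random

def estimate_proportion(population, sample_size, is_match, rng):
    """Draw a simple random sample and return the share of items that match."""
    sample = rng.sample(population, sample_size)
    matches = sum(1 for item in sample if is_match(item))
    return matches / sample_size

# Hypothetical jar: 10,000 jelly beans, 60% of them pink (made-up numbers).
jar = ["pink"] * 6000 + ["other"] * 4000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Look at only 100 beans, then use that share as the estimate for the whole jar.
estimate = estimate_proportion(jar, 100, lambda bean: bean == "pink", rng)
print(f"Estimated share of pink jelly beans: {estimate:.0%}")
```

The estimate will usually land near the true share, but rarely exactly on it, which is why the results of a sample are best read as "about" a certain percent.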
This is very helpful. If you want to know something about a population of people or things, you probably don't want to ask or look at everyone in the population (if you have 700 rocks to analyze under a microscope, that will take a LONG time!). Instead, you can study a sample (a small group) of the population.
So, you can take a random sample of those people or objects (let's study 100 random rocks from that population of 700). The proportions you learn about that sample (10% contain mica; 50% are sedimentary; 2% contain gold) should be close to the proportions in the entire population of rocks. And, the process is much faster!
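As a rough sketch of the rock example, the Python below picks 100 of the 700 rocks at random and then scales the sample percentages up to estimated counts for the whole population. The rock ID numbers and the variable names are assumptions for illustration; the percentages are the ones from the example above.

```python
import random

POPULATION_SIZE = 700  # all the rocks we care about
SAMPLE_SIZE = 100      # the rocks we actually put under the microscope

# Assumption: each rock has an ID number from 1 to 700.
rock_ids = list(range(1, POPULATION_SIZE + 1))
rng = random.Random(42)  # fixed seed so the run is repeatable

# random.sample picks without repeats, so every rock has an equal chance.
sampled_ids = rng.sample(rock_ids, SAMPLE_SIZE)

# Proportions found in the sample (from the example in the text).
sample_proportions = {
    "contains mica": 0.10,
    "sedimentary": 0.50,
    "contains gold": 0.02,
}

# Apply each sample proportion to the full population of 700 rocks.
estimated_counts = {
    trait: round(share * POPULATION_SIZE)
    for trait, share in sample_proportions.items()
}

for trait, count in estimated_counts.items():
    print(f"Estimated rocks in the population ({trait}): {count}")
```

These counts are estimates, not exact tallies: a different random sample of 100 rocks could give slightly different percentages.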
The terms we should know before we go on:
- Population: the entire group of people or items you are interested in. This can be a group of people, rocks, animals, fabric samples, etc.
- Sample: a small subset of the population.
- Random: chosen without any order, pattern, or bias. (For samples to statistically represent populations, they must be random.)
- Theoretical probability: the statistical odds that something should happen (note: this is not the same as what actually happens, just what should happen according to statistical rules of probability).
Some things to know about samples:
- Larger samples are more statistically precise than smaller samples.
- Think about it: if you flip a coin 100 times, you'll probably get close to 50 heads and 50 tails. But, if you only flip it twice, you could easily get two heads or two tails. Estimates are most precise with larger samples.
- Results in samples should be proportional to results in populations.
- Whatever percents or proportions you find in a large enough random sample SHOULD apply to the entire population.
- Researchers should choose samples that generalize to the population they want to talk about.
- Before you choose your sample, you need to figure out what population you are interested in. Results from a sample can only apply to the population it was taken from. So, for instance, if you randomly sample 11th graders and find that 30% of them love math, you can say that 30% of 11th graders love math, but not that 30% of high school students love math. If you only sample rocks from the desert, you can only generalize to rocks from the desert. To draw conclusions about a particular population, make sure you sample from that population! Here are some more examples of failure to create a generalizable sample:
- If you want to talk to 11th graders, but only talk to 11th graders who are hanging around campus at 2:00, you miss out on the 11th graders who have class at 2:00 (perhaps the more studious ones), the ones who have jobs (perhaps the more motivated ones), and the ones who play sports (the more athletic ones). You can't talk about all 11th graders if you only talk to a particular group of them.
- If you want to survey people, but ask for volunteers to participate, then you can only generalize to the type of people who would want to participate in a survey!
- If you want to study dogs, but look for dogs at dog parks, then you can only generalize to dogs who go to dog parks. This might be particularly important if you were studying aggression in dogs. Aggressive dogs are less likely to go to a dog park! So, by studying dogs at the dog park, you might conclude that dogs are not that aggressive, but in truth, you only learned that dogs who go to the dog park are not that aggressive.
- Selection bias can make samples useless for statistical predictions.
- It is critical that samples be selected randomly if you want to make predictions about populations. Many studies go awry because there is selection bias in the sample (in other words, the sample was not random). If the sample is not random, the rules of theoretical probability DO NOT apply. What does selection bias look like?
- Volunteer sampling (asking people if they want to participate), snowball sampling (asking people who take part in the study to ask other people they know to take part in the study), and convenience sampling (asking people you know or see a lot) all produce non-random samples. Non-random samples can tell you something about the actual people you talk to, but they cannot tell you about any bigger population. They do not statistically represent the entire population.
- If you randomly recruit people, but not all of them participate, it also introduces selection bias. Let's say you want to see how well 11th graders at your school will do on the SAT. You use school rosters to randomly select 20 students to take an SAT test on Saturday. So far, so good. But, only 10 students come on Saturday. You can learn how those students score. But, you cannot generalize to all of the 11th graders. It could be that the students who would have scored lower are the ones who did not come. Your sample is not random. You cannot predict about the larger population.
- Sometimes we recruit items or people for a controlled experiment, in which case people must be assigned to the treatment groups randomly. Let's say that you want to see if an acne cream works on students at your school. You randomly recruit 100 students from your school. You will give half of them the real acne cream and half of them the fake acne cream to put on their faces. If the kids who get the real acne cream have less acne after a month, you will tell your fellow students that it works! So far, you have a great experiment set up. But, you must assign students randomly to the real vs. fake acne cream groups for the study to work. If you allow all of your friends into the real acne cream group, then your study is no longer random. Even if the acne cream works perfectly, you cannot generalize the result to the population of your school.
- Note: in many ways, selection bias and generalizability are related. And, if you change the group that you want to generalize to, you can eliminate some forms of selection bias. For instance, let's say that you only want to see if the acne cream works on people who want to get rid of their acne. Then, you could ask for volunteers for the study. As long as you randomly assign those volunteers to either real or fake acne cream, your study can be valid. But, remember, you can only generalize to people who want to get rid of their acne (you cannot generalize to the entire school). That might be fine. Who cares if acne cream works on people who want to keep their acne? The important thing to remember is: who do you want to generalize to? Grab a random sample of THEM.
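Two of the points in the list above can be tried out with a small Python simulation: larger samples give more precise estimates, and a random sample stops being random when some of the people you picked do not show up. This is only a sketch. The population of 10,000, the 250 made-up test scores, and the assumption that the 10 highest scorers are the ones who show up are all invented for illustration (real no-shows would not split this neatly).

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# --- Part 1: larger samples give more precise estimates ---
# Hypothetical population: 10,000 items, half with a trait (1), half without (0).
population = [1] * 5000 + [0] * 5000

def spread_of_estimates(sample_size, trials=500):
    """Take many random samples and measure how much their estimates vary."""
    estimates = [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

small_spread = spread_of_estimates(10)
large_spread = spread_of_estimates(1000)
print(f"Spread of estimates with samples of 10:   {small_spread:.3f}")
print(f"Spread of estimates with samples of 1000: {large_spread:.3f}")

# --- Part 2: no-shows turn a random sample into a biased one ---
# Hypothetical 11th-grade class: 250 made-up test scores.
scores = [rng.gauss(1000, 150) for _ in range(250)]

# Randomly invite 20 students (so far, so good).
invited = rng.sample(scores, 20)

# Assumption for illustration: only the 10 highest scorers show up.
showed_up = sorted(invited)[10:]

print(f"Average score of all 20 invited students: {statistics.mean(invited):.0f}")
print(f"Average score of the 10 who showed up:    {statistics.mean(showed_up):.0f}")
```

In the first part, the estimates from samples of 1,000 cluster much more tightly around the true 50% than the estimates from samples of 10. In the second part, the students who showed up score higher on average than the full invited group, so their results would overstate how the whole class would do.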
Practice Problems:
Sampling
You want to learn what types of books the students in your school like. There are 1,000 students at your school, 250 in each class. You decide to sample 100 sophomores at your school. You use a computer program to select the sophomores randomly. You find that 25% like fantasy books, 10% like mysteries, 15% like romances, 5% like historical books, and the rest don't like to read at all.
- What is the population of our study?
- How many people are in the population?
- How many people are in your sample?
- What kind of sample is it?
- Based on your results, what percent of the sophomores don't like to read at all?
- Based on your results, how many sophomores like romances?
- Based on your results, how many juniors like mysteries?
- Based on your results, what percent of students at your school like fantasy books?
A town is trying to find ways to improve its traffic problem. It wonders if offering a service where people can pick up and rent bikes around the town would help. Alternatively, it is considering a service where people can pick up and rent scooters around the town. Or, it could spend money to expand one of the main roads by one lane. There are 10,000 people living in the town, and it has a large spring festival every year, where many townspeople come to the main park and participate in a range of activities (listening to speakers and concerts, playing games, eating from food trucks, etc.). The town had a booth at the festival and asked adult town residents to fill out a survey about ways to solve the town's traffic problem. By the end of the festival, 567 people had filled out the survey. It found that 50% of people wanted a bike system, 30% of people wanted a scooter system, and 20% of people wanted an expanded road.
- What population did the town want to survey?
- What population did the town survey?
- How many people were in the sample?
- What kind of sample was it?
- Should the town decide to start a bike program?
- Why, or why not, should the town follow the results of this study?