From the course: Statistics Foundations 3: Using Data Sets

The importance of sample size

- A sample is a group of units drawn from a population, and the sample size is the number of units drawn and measured for that particular sample. The total population itself may be very large, or perhaps immeasurable, so a sample looks at just a slice of the population in the hope of giving us a representative picture of the whole. As you might guess, the larger the sample size, the more accurate our measurement, or at least the more confidence we have that our sample is a good representative of the entire population. But just how important is sample size?

Well, let's first establish how an experiment might look. Let's say we own a machine that manufactures forks, and each fork it produces is judged as either acceptable or defective. This is a magic fork manufacturing machine: over its entire existence, it will manufacture exactly 90% good forks and exactly 10% defective forks. The machine won't break down and it won't get tired; as I said, it's a magic fork machine. So the value of p for this machine is 0.90. Remember, p is the proportion of good forks this machine produces: 90% of all the forks this machine will ever produce will be good forks. Look, we don't actually know that right now; the machine can't tell us. We can't know the actual true p until the machine has produced a lifetime of forks. But if enough samples are collected over time, we would find that the average of the proportions across all the samples measured would approach the true p value of 0.90. In our Central Limit Theorem video, I'll show you how this happens, but for now, let's just accept that it's true.

Anyway, if we had five forks in our sample, and four were good and one was bad, then p̂, the acceptable proportion for this particular sample, would be four good forks divided by five total forks: p̂ = 0.80.
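The p̂ calculation above can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the course; the function name `sample_proportion` and the True/False labeling of forks are my own choices.

```python
# Hypothetical sketch: computing p-hat, the proportion of good forks
# in a single sample, where each fork is labeled True (good) or
# False (defective).

def sample_proportion(sample):
    """Return p-hat: the fraction of good forks in the sample."""
    return sum(sample) / len(sample)

# The five-fork sample from the transcript: four good, one defective.
forks = [True, True, True, True, False]
p_hat = sample_proportion(forks)
print(p_hat)  # 0.8
```

With four good forks out of five, p̂ comes out to 4 / 5 = 0.80, matching the example.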
That's just one sample. You'd collect a bunch of samples, average their p̂ values, and hope the samples were pointing you toward the true p. I know, I know, you're wondering, "What does this have to do with sample size?" Well, without actually knowing the true p of 0.90, using only the average of all our previous p̂ values, what we'd like to know is: for any given sample, how likely is it that we're close to the real p of 0.90?

Let's assume our p̂ values are normally distributed; in other words, they take on the shape of the famous bell curve. If that's true, we can calculate the standard deviation of our p̂ values and expect that 68% of them will fall within one standard deviation of the true value of p. And this is where sample size becomes important. Why? Well, let's take a look at the formula for the standard deviation of a sample proportion: the square root of p(1 − p) / n. In this formula, p would be 0.90; that would not change. But we want to see the impact of having a larger sample size, and n represents the sample size.

For this example, if n equals five, one standard deviation would be 0.134. That's 13.4% from 90% in either direction, which means that with a sample size of five, we would expect 68% of all of our samples to have between 76.6% and 100% good forks. (100% is the maximum, since we can't have more than 100% good forks.) 76.6% to 100% is a pretty large range.

Let's try some larger sample sizes. If n equals 25, one standard deviation would be 6%. If n equals 100, one standard deviation would be 3%. And n equals 400 gives us a standard deviation of 1.5%. At n equals 400, we would expect 68% of all our samples collected to have between 88.5% and 91.5% good forks, much closer to our true p. The bigger the sample size, the smaller our standard deviation. And the bigger the sample size, the more confident we are that our sample's p̂ is close to the population's actual p.
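The standard deviation values above can be checked with a short Python sketch of the formula sqrt(p(1 − p) / n). This is hypothetical illustration code, not part of the course; the function name `proportion_sd` is my own.

```python
import math

def proportion_sd(p, n):
    """Standard deviation of the sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.90  # the true proportion of good forks
for n in [5, 25, 100, 400]:
    sd = proportion_sd(p, n)
    # 68% of samples fall within one standard deviation of p,
    # capped at 100% since p-hat cannot exceed 1.
    low, high = max(0.0, p - sd), min(1.0, p + sd)
    print(f"n={n:3d}  sd={sd:.3f}  68% of samples in [{low:.3f}, {high:.3f}]")
```

Running this reproduces the transcript's numbers: one standard deviation is about 0.134 at n = 5, 0.06 at n = 25, 0.03 at n = 100, and 0.015 at n = 400, so the interval shrinks from [0.766, 1.000] down to [0.885, 0.915] as n grows.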