Hypothesis Testing on Large sample

date posted: 2020-06-07




TMI of the day

  1. If I post more than one blog a day, I've got nothing to say sorry.


What is considered a large sample?

In the previous example on small-sample hypothesis testing our sample size was 15. This time we set our sample size to n >= 30 and call it large. Why?

We are using the binomial distribution. When n = 20 and p = 0.5 the binomial distribution already looks like a normal distribution, and the resemblance improves as n gets larger. The value of p decides the skewness, and the general rule is that when np > 5 and nq > 5 (where q = 1 - p) we can use the normal distribution to approximate the values we would get from the binomial distribution formula. In our case p = 0.9 and q = 0.1, so to take advantage of the normal distribution let's set n = 100, which gives np = 90 and nq = 10, both greater than 5.
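
As a quick sanity check, the rule of thumb above can be written as a small Python helper. This is only a sketch: the function name normal_approx_ok and the threshold parameter are made up for illustration and are not from any library.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb from the text: both np and nq must exceed the threshold."""
    q = 1 - p
    return n * p > threshold and n * q > threshold


# n = 100, p = 0.9 gives np = 90 and nq = 10, so the rule is satisfied.
print(normal_approx_ok(100, 0.9))

# The earlier small sample (n = 15) gives nq = 1.5, so the rule fails.
print(normal_approx_ok(15, 0.9))
```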


Hypothesis testing on a large sample

In this example, let's say we sampled 100 people: 80 were cured and 20 were not.

Steps 1 and 3 will be the same as before with the small sample. The only difference is the test statistic: instead of using the binomial distribution formula we can take advantage of the normal distribution.

Note that we could also use the binomial distribution formula as before, but it would be cumbersome and prone to error, since the formula goes like this:
P(X <= 80) = 1 - P(X >= 81)
P(X <= 80) = 1 - ([(100C81) * (0.1^19) * (0.9^81)] + [(100C82) * (0.1^18) * (0.9^82)] + ... + [(100C100) * (0.1^0) * (0.9^100)])
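
The sum is tedious by hand but short in code. Here is a minimal sketch in plain Python (standard library only) that adds up the binomial terms directly. The helper name binom_cdf is my own, and the result will not match the normal approximation exactly, since that is only an approximation.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    q = 1 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k + 1))


# Exact probability of 80 or fewer cured out of 100 when p = 0.9.
exact = binom_cdf(80, 100, 0.9)
print(exact)
```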

You can read chapter 9 of Head First Statistics or simply search on Google to learn about using the normal distribution to approximate the binomial distribution. References are posted at the very end of the blog.

A normal random variable is written as X ~ N(mean, variance). When we use the normal distribution to approximate the binomial distribution, we take E(X) = np as the expected mean and Var(X) = npq as the expected variance.
Plugging these in, we get X ~ N(np, npq) = N(90, 9).
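
To double-check the arithmetic, here is a small Python sketch that computes the parameters of the approximating normal distribution from n and p. The function name normal_approx_params is hypothetical. Because of floating-point rounding, the printed variance may differ from 9 in the last decimal places.

```python
from math import sqrt


def normal_approx_params(n, p):
    """Mean, variance and standard deviation of the normal approximation."""
    q = 1 - p
    mean = n * p          # E(X) = np
    variance = n * p * q  # Var(X) = npq
    return mean, variance, sqrt(variance)


mean, variance, sd = normal_approx_params(100, 0.9)
print(mean, variance, sd)
```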

Step 4: Find Z-score

This part is different from the previous small-sample example. Since we are assuming a normal distribution we can calculate a Z-score, which measures how many standard deviations our outcome (80 people cured) lies from the mean (the expected mean in this case).
Z = (X - mu) / standard_deviation, where mu = np = expected mean.
Z = (X - 90) / sqrt(9) = (80 - 90) / 3 = -10/3 = -3.3333

Looking up P(Z <= -3.33) in a standard normal table we get 0.0004, which is our p-value. Note that in this case the Z-score is our test statistic, since it is what we use to decide whether or not to reject the null hypothesis.
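
If you don't have a Z-table handy, the same lookup can be sketched with Python's standard library. statistics.NormalDist() with no arguments is the standard normal distribution, and its cdf method gives P(Z <= z). The variable names are my own. The result should agree with the table value up to rounding, since the table rounds Z to two decimals.

```python
from statistics import NormalDist

n, p = 100, 0.9
cured = 80

mean = n * p                    # expected mean, np
sd = (n * p * (1 - p)) ** 0.5   # standard deviation, sqrt(npq)

z = (cured - mean) / sd         # Z-score of the observed outcome
p_value = NormalDist().cdf(z)   # lower-tail probability P(Z <= z)

print(z, p_value)
```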

Step 5: Conclusion

We set alpha = 0.05 and found a p-value of 0.0004. Since 0.0004 < 0.05, we reject the null hypothesis in favor of the alternative hypothesis. A p-value of 0.0004 means that, assuming the null hypothesis is true (the drug is effective 90% of the time), the probability of seeing 80 or fewer cured out of 100 is about 0.04%. That is very unlikely, so the data give strong evidence against the null. Keep in mind that the p-value is not the probability that the null hypothesis is true, so it should not be read as how confident we are in our decision.