Hypothesis Testing: Two Proportions (aka Difference of Proportions)
Introduction
Two Proportions hypothesis tests are used when...
- You are comparing two different populations
- You have TWO proportions from TWO INDEPENDENT random samples
For example, as a researcher, you might want to know if there is a difference in the proportion of males who use Facebook and the proportion of females who use Facebook. A quality control specialist might also want to know if there is a difference in the percentage of defective items produced by two different machines.
A few symbols need to be defined before we dive in:
- \(\hat{p_1}\) and \(\hat{p_2}\) refer to the sample proportions, which serve as the evidence you weigh against the null hypothesis.
- \(\hat{p_c}\) refers to the combined proportion (formula down below ↓ )
- \(\hat{q_c}\) refers to 1 minus the combined proportion, i.e. \(1 - \hat{p_c}\)
- \(n_1\) and \(n_2\) refer to the sample sizes.
Example
A columnist claims that women are more safety-conscious than men when it comes to driving. A recent survey on use of seatbelts was done among a random sample of 150 men and 250 women. Based on the results, 105 men said they always wear seatbelts when driving and 186 women said the same. Using a 0.05 level of significance, do the results of the survey support the columnist’s claim?
Step 1: Name Test: 2-Proportions / Difference of Proportions
Step 2: Define Test:
The null hypothesis is that the two population proportions are equal: \(H_0: p_1 = p_2\). With this null hypothesis, the options for the alternative hypothesis are as follows:
Left-Sided Test | Two-Sided Test | Right-Sided Test |
\(H_0: p_1 = p_2\) | \(H_0: p_1 = p_2\) | \(H_0: p_1 = p_2\) |
\(H_A: p_1 < p_2\) | \(H_A: p_1 \neq p_2\) | \(H_A: p_1 > p_2\) |
In this case, let's call the proportion of men who always wear seatbelts \(p_M\) and the proportion of women who always wear seatbelts \(p_W\). The columnist's claim that women are more safety-conscious than men becomes the alternative hypothesis: women should have a higher rate of seatbelt use, so \(p_W > p_M\).
\(H_0 : p_W = p_M\)
\(H_A : p_W > p_M\)
Step 3: Assume \(H_0\) is true and define its normal distribution. Then check the conditions.
1. The data is drawn from TWO independent random samples.
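2a. From Sample 1: \(N_1 ≥ 10n_1\), where \(N_1\) is the size of the population that Sample 1 was drawn from
2b. From Sample 2: \(N_2 ≥ 10n_2\), where \(N_2\) is the size of the population that Sample 2 was drawn from
3a. From Sample 1: \(n_1 \hat{p_1} ≥ 10\) and \(n_1 \hat{q_1} ≥ 10\), where \(\hat{q_1} = 1 - \hat{p_1}\)
3b. From Sample 2: \(n_2 \hat{p_2} ≥ 10\) and \(n_2 \hat{q_2} ≥ 10\), where \(\hat{q_2} = 1 - \hat{p_2}\)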
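Conditions 3a and 3b can be checked for the seatbelt example with a short Python sketch. The function name `check_conditions` and its `threshold` parameter are illustrative assumptions, not part of the original method. Since \(n \hat{p}\) is just the number of successes and \(n \hat{q}\) the number of failures, the check reduces to comparing four counts against 10. Conditions 2a and 2b depend on population sizes that the example does not state, so the sketch leaves them out.

```python
def check_conditions(x1, n1, x2, n2, threshold=10):
    """Return the success/failure counts and whether all of them reach the threshold."""
    counts = {
        "sample 1 successes": x1,       # n1 * p1_hat
        "sample 1 failures": n1 - x1,   # n1 * q1_hat
        "sample 2 successes": x2,       # n2 * p2_hat
        "sample 2 failures": n2 - x2,   # n2 * q2_hat
    }
    return counts, all(c >= threshold for c in counts.values())


# Seatbelt survey: 105 of 150 men and 186 of 250 women always wear seatbelts
counts, ok = check_conditions(105, 150, 186, 250)
for label, count in counts.items():
    print(f"{label}: {count}")
print("Conditions 3a and 3b met:", ok)
```

All four counts (105, 45, 186, and 64) are at least 10, so the normal approximation is reasonable for this example.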
Step 4: Using the normal distribution, calculate the test statistic and p-value.
Although the full formula is \(z = {( \hat{p_1} - \hat{p_2} ) - ( p_1 - p_2) \over \sqrt { {\hat{p_c}\hat{q_c} \over n_1} + {\hat{p_c}\hat{q_c} \over n_2} } }\), it can be simplified. Recall that the null is \(H_0 : p_1 = p_2\). Thus, \(p_1 - p_2 = 0\). This leaves us with the formula below:
\(z = {\hat{p_1} - \hat{p_2} \over \sqrt { {\hat{p_c}\hat{q_c} \over n_1} + {\hat{p_c}\hat{q_c} \over n_2} } }\)
Now, let's consider how to calculate the combined proportion \(\hat{p_c}\). Recall that the proportion \(\hat{p}\) of a sample having a certain attribute is given by \(\hat{p} = {x \over n}\), where \(x\) is the number of elements in the sample possessing that attribute and \(n\) is the sample size. The combined proportion pools both samples, so with \(x_1\) and \(x_2\) as the number of elements possessing the attribute in each sample, \(\hat{p_c}\) is calculated as follows:
\(\hat{p_c} = {x_1 + x_2 \over n_1 + n_2}\)
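As a minimal Python sketch of the combined proportion, the total number of successes is divided by the total sample size. The function name `pooled_proportion` and the argument names `x1` and `x2` (success counts in each sample) are assumptions introduced here for illustration.

```python
def pooled_proportion(x1, n1, x2, n2):
    """Combine both samples: total successes over total sample size."""
    return (x1 + x2) / (n1 + n2)


# Seatbelt survey: 186 of 250 women and 105 of 150 men
p_c = pooled_proportion(186, 250, 105, 150)
q_c = 1 - p_c
print(round(p_c, 4), round(q_c, 4))
```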
Test Statistic:
\(\hat{p_W} = {186 \over 250} = 0.744\) and \(\hat{p_M} = {105 \over 150} = 0.70\)
\(\hat{p_c} = {186 + 105 \over 250 + 150} = 0.7275\)
→ \(z = {( 0.744 - 0.70 ) \over \sqrt { {(0.7275)(0.2725) \over 250} + {(0.7275)(0.2725) \over 150} } }\) → \(z = {0.044 \over 0.04598} = 0.957\)
P-Value:
Because this is a right-sided test, the p-value is the area under the standard normal curve to the right of \(z\). It can be found by using the normal cdf function on your calculator:
- lower limit: \(z\)
- upper limit: 999
- distribution center: 0
- standard deviation: 1
- All together, it looks like this: normalcdf (\(z\), 999, 0, 1)
*Note: If this were a left-sided test, the lower limit would be -999 and the upper limit would be the test statistic (\(z\)).
In this case, we do normalcdf (0.957, 999, 0, 1) to get a p-value of approximately 0.17.
Step 5: Analyze your results and determine if they are statistically significant.
We calculated a p-value of approximately 0.17. This p-value is greater than the significance level of 0.05. Therefore, we FAIL to reject the null hypothesis. The data do NOT provide sufficient evidence to support the columnist’s claim that a higher proportion of women than men always wear seatbelts when driving.
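Steps 4 and 5 can be double-checked with a short Python sketch that uses only the standard library. The function name `two_proportion_z_test` is an assumption made for illustration. The right-tail area is computed with the complementary error function, which plays the role of normalcdf(\(z\), 999, 0, 1) on a calculator. The sketch handles only the right-sided alternative used in this example.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Right-sided pooled z-test of H0: p1 = p2 against HA: p1 > p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)                    # combined proportion
    se = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))    # area to the right of z
    return z, p_value


# Sample 1 = women (186 of 250), Sample 2 = men (105 of 150)
z, p_value = two_proportion_z_test(186, 250, 105, 150)
alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The standard error is written in factored form, \(\sqrt{\hat{p_c}\hat{q_c}\left({1 \over n_1} + {1 \over n_2}\right)}\), which is algebraically the same as the denominator in the formula from Step 4.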