Chi-Square Test of Independence
What is the Chi-square test of independence?
The Chi-square test of independence is a statistical hypothesis test used to determine whether two categorical or nominal variables are likely to be related or not.
When can I use the test?
You can use the test when you have counts of values for two categorical variables.
Can I use the test if I have frequency counts in a table?
Yes. If you have only a table of values that shows frequency counts, you can use the test.
Using the Chi-square test of independence
The Chi-square test of independence checks whether two variables are likely to be related or not. We have counts for two categorical or nominal variables. We also have an idea that the two variables are not related. The test gives us a way to decide if our idea is plausible or not.
The sections below discuss what we need for the test, how to do the test, understanding results, statistical details and understanding p-values.
What do we need?
For the Chi-square test of independence, we need two variables. Our idea is that the variables are not related. Here are a couple of examples:
- We have a list of movie genres; this is our first variable. Our second variable is whether or not the patrons of those genres bought snacks at the theater. Our idea (or, in statistical terms, our null hypothesis) is that the type of movie and whether or not people bought snacks are unrelated. The owner of the movie theater wants to estimate how many snacks to buy. If movie type and snack purchases are unrelated, estimating will be simpler than if the movie types impact snack sales.
- A veterinary clinic has a list of dog breeds they see as patients. The second variable is whether owners feed dry food, canned food or a mixture. Our idea is that the dog breed and types of food are unrelated. If this is true, then the clinic can order food based only on the total number of dogs, without consideration for the breeds.
For a valid test, we need:
- Data values that are a simple random sample from the population of interest.
- Two categorical or nominal variables. Don't use the independence test with continuous variables that define the category combinations. The counts for the combinations of the two categorical variables are numeric, however.
- For each combination of the levels of the two variables, an expected count of at least five. When the expected count is below five for any one combination, the test results are not reliable.
Chi-square test of independence example
Let’s take a closer look at the movie snacks example. Suppose we collect data for 600 people at our theater. For each person, we know the type of movie they saw and whether or not they bought snacks.
Let’s start by answering: Is the Chi-square test of independence an appropriate method to evaluate the relationship between movie type and snack purchases?
- We have a simple random sample of 600 people who saw a movie at our theater. We meet this requirement.
- Our variables are the movie type and whether or not snacks were purchased. Both variables are categorical. We meet this requirement.
- The last requirement is an expected count of at least five for each combination of the two variables. To confirm this, we need to know the total counts for each type of movie and the total counts for whether snacks were bought or not. For now, we assume we meet this requirement and will check it later.
It appears we have indeed selected a valid method. (We still need to check that the expected count is at least five for each combination.)
The expected counts for each Movie-Snack combination are based on the row and column totals. We multiply the row total by the column total and then divide by the grand total. This gives us the expected count for each cell in the table. For example, for the Action-Snacks cell, we have:
$ \frac{125\times310}{600} = \frac{38{,}750}{600} \approx 64.58 \approx 65 $
We rounded the answer to the nearest whole number. If there is no relationship between movie type and snack purchasing, we would expect about 65 people to have watched an action film with snacks.
When using software, these calculated values will be labeled as “expected values,” “expected cell counts” or some similar term.
All of the expected counts for our data are larger than five, so we meet the requirement for applying the independence test.
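The expected-count calculation can be sketched in a few lines of Python. This is a minimal illustration rather than the output of any particular statistics package. The variable names are our own, and the observed counts are read from the per-cell calculations for this example (the Action-Snacks count of 50 is inferred from the Action row total of 125).

```python
# Observed counts per movie type: (bought snacks, did not buy snacks).
# Assumption: values are taken from this example's per-cell calculations.
observed = {
    "Action": (50, 75),
    "Comedy": (125, 175),
    "Family": (90, 30),
    "Horror": (45, 10),
}

row_totals = {movie: sum(counts) for movie, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(col_totals)

# Expected count = row total * column total / grand total.
expected = {
    movie: [row_totals[movie] * col_total / grand_total for col_total in col_totals]
    for movie in observed
}

for movie, counts in expected.items():
    print(movie, [round(count, 2) for count in counts])

# The test is considered reliable only if every expected count is at least five.
all_large_enough = all(
    count >= 5 for counts in expected.values() for count in counts
)
print("All expected counts >= 5:", all_large_enough)
```

Keeping the row and column totals as separate variables makes it clear that every expected count depends on both totals, not only on the grand total.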
Before calculating the test statistic, let’s look at the contingency table again. The expected counts use the row and column totals. If we look at each of the cells, we can see that some expected counts are close to the actual counts but most are not. If there is no relationship between the movie type and snack purchases, the actual and expected counts will be similar. If there is a relationship, the actual and expected counts will be different.
A common mistake with expected counts is to simply divide the grand total by the number of cells. For our movie data, this is 600 / 8 = 75. This is not correct. We know the row totals and column totals. These are fixed and cannot change for our data. The expected values are based on the row and column totals, not just on the grand total.
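In symbols, using notation that is our own rather than the source's: let $R_i$ be the total for row $i$, $C_j$ the total for column $j$, and $N$ the grand total. The expected count for the cell in row $i$ and column $j$ is then:

$ E_{ij} = \frac{R_i \times C_j}{N} $

For the Horror row of the movie data, the shortcut 600 / 8 = 75 would be well off. With a row total of 55, the expected counts are 55 × 310 / 600 ≈ 28.42 and 55 × 290 / 600 ≈ 26.58.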
Performing the test
The basic idea in calculating the test statistic is to compare actual and expected values, given the row and column totals that we have in the data. First, we calculate the difference between the actual and expected values for each Movie-Snacks combination. Next, we square that difference. Squaring gives the same importance to combinations with fewer actual values than expected and combinations with more actual values than expected. Then we divide the squared difference by the expected value for the combination. Finally, we add up these values across all Movie-Snacks combinations. This gives us our test statistic.
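Written as a formula, with $O_{ij}$ for the actual count and $E_{ij}$ for the expected count in row $i$ and column $j$ (this notation is ours and does not appear elsewhere in the text), these steps amount to:

$ \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $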
This is much easier to follow using the data from our example. Table 4 below shows the calculations for each Movie-Snacks combination carried out to two decimal places.
| Movie Type | Snacks? | Actual | Expected | Difference | Squared Difference | Divide by Expected |
|---|---|---|---|---|---|---|
| Action | Snacks | 50 | 64.58 | 50 - 64.58 = -14.58 | 212.67 | 212.67/64.58 = 3.29 |
| Action | No Snacks | 75 | 60.42 | 75 - 60.42 = 14.58 | 212.67 | 212.67/60.42 = 3.52 |
| Comedy | Snacks | 125 | 155 | 125 - 155 = -30 | 900 | 900/155 = 5.81 |
| Comedy | No Snacks | 175 | 145 | 175 - 145 = 30 | 900 | 900/145 = 6.21 |
| Family | Snacks | 90 | 62 | 90 - 62 = 28 | 784 | 784/62 = 12.65 |
| Family | No Snacks | 30 | 58 | 30 - 58 = -28 | 784 | 784/58 = 13.52 |
| Horror | Snacks | 45 | 28.42 | 45 - 28.42 = 16.58 | 275.01 | 275.01/28.42 = 9.68 |
| Horror | No Snacks | 10 | 26.58 | 10 - 26.58 = -16.58 | 275.01 | 275.01/26.58 = 10.35 |

Lastly, to get our test statistic, we add the values in the last column, one for each Movie-Snacks combination:
$ 3.29 + 3.52 + 5.81 + 6.21 + 12.65 + 13.52 + 9.68 + 10.35 = 65.03 $
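As a cross-check, the same sum can be computed in Python directly from the observed counts. This is a sketch with names of our own choosing, and the observed counts are those implied by the calculations above. Because it keeps full precision instead of rounding each term to two decimal places, it gives about 65.01 rather than 65.03; the small difference is due to rounding only.

```python
# Observed counts: rows are Action, Comedy, Family, Horror;
# columns are (snacks, no snacks). Assumed from the worked example.
observed = [
    [50, 75],
    [125, 175],
    [90, 30],
    [45, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, actual in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        # Squared difference, scaled by the expected count for this cell.
        statistic += (actual - expected) ** 2 / expected

print(round(statistic, 2))
```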
To make our decision, we compare the test statistic to a value from the Chi-square distribution. This activity involves five steps:
- We decide on the risk we are willing to take of concluding that the two variables are not independent when in fact they are. For the movie data, we had decided prior to our data collection that we are willing to take a 5% risk of saying that the two variables – Movie Type and Snack Purchase – are not independent when they really are independent. In statistics-speak, we set the significance level, α, to 0.05.
- We calculate a test statistic. As shown above, our test statistic is 65.03.
- We find the critical value from the Chi-square distribution based on our degrees of freedom and our significance level. If the two variables are independent, the test statistic exceeds this value only 5% of the time (for α = 0.05).
- The degrees of freedom depend on how many rows and how many columns we have. The degrees of freedom (df) are calculated as:
  $ \text{df} = (r-1)\times(c-1) $
  In the formula, r is the number of rows, and c is the number of columns in our contingency table. From our example, with Movie Type as the rows and Snack Purchase as the columns, we have:
  $ \text{df} = (4-1)\times(2-1) = 3\times1 = 3 $
  The Chi-square value with α = 0.05 and three degrees of freedom is 7.815.
- We compare the value of our test statistic (65.03) to the Chi-square value. Since 65.03 > 7.815, we reject the idea that movie type and snack purchases are independent.
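The comparison in these steps can also be sketched in Python using only the standard library. The helper functions `chi2_sf` and `chi2_critical` are hypothetical names written for this illustration, not part of any package; statistical software would normally supply the critical value directly. The sketch assumes integer degrees of freedom and finds the critical value by bisection.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, half)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical(alpha, df, lo=0.0, hi=1000.0):
    """Value whose upper-tail probability equals alpha, found by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


rows, cols = 4, 2                  # movie types by snack purchase
df = (rows - 1) * (cols - 1)
alpha = 0.05
statistic = 65.03                  # test statistic from the worked example

critical = chi2_critical(alpha, df)
reject_independence = statistic > critical
print(f"df = {df}, critical value = {critical:.3f}")
print("Reject independence:", reject_independence)
```

Calling `chi2_sf(65.03, 3)` gives the upper-tail probability for the observed statistic, which is far below 0.05 and is consistent with rejecting independence.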
We conclude that there is some relationship between movie type and snack purchases. The owner of the movie theater cannot estimate how many snacks to buy without regard to the type of movies being shown. Instead, the owner must think about the type of movies being shown when estimating snack purchases.
It's important to note that we cannot conclude that the type of movie causes a snack purchase. The independence test tells us only whether there is a relationship or not; it does not tell us that one variable causes the other.
Figure 1: Bar chart showing the expected and actual counts for the different movie types