06. Chi-square analysis
Module items
R Script file
Copy the code below ➜ Paste into [[RStudio console]] ➜ Hit Enter
source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R"));
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/06-chisquare.R", "06-chisquare.R");
file.edit("06-chisquare.R")
Lab assignment
Sample lab assignment
Learning outcomes
- Learn chi-square analysis
- Learn how to run chi-square analysis
- Learn statistical significance
Suggested reading
- 📖 Barnes, Sally, and Cathy Lewin. 2006. “An Introduction to Inferential Statistics: Testing for Differences and Relationships.” Pp. 226–35 in Research Methods in the Social Sciences, edited by B. Somekh and C. Lewin. London: Sage.
Chi-square basics
Relationship between preferred pet and respondents’ age group
Do you think there is a relationship between preferred pet and respondents’ age group?
In other words, do you think the preferred pet is influenced by respondents’ age group?
| Age group | Cat | Dog | Total |
|---|---|---|---|
| Younger | 207 (41.4%) | 293 (58.6%) | 500 (100%) |
| Older | 267 (53.4%) | 233 (46.6%) | 500 (100%) |
| Total | 474 (47.4%) | 526 (52.6%) | 1000 (100%) |
Show the answer
- While the table shows differences in preferences between age groups (e.g., younger people prefer dogs more often than older people do), we CANNOT conclude from the percentages alone that the differences are statistically significant.
- To assess whether these differences are due to chance or reflect a real association, we need a statistical test such as the chi-square test.
- The [[chi-square]] test is used to discover whether there is a relationship between two [[categorical]] variables.
- The chi-square test can only compare categorical variables.
- It cannot make comparisons between continuous variables or between categorical and continuous variables.
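As a hedged illustration, the counts in the pet table can be passed to base R's `chisq.test()`. This is a minimal sketch, not this module's working code: the object names `pets` and `pet_test` are made up for the example, and `correct = FALSE` is an assumption that requests the plain Pearson statistic instead of the continuity correction R applies to 2×2 tables by default.

```r
# Counts copied from the table above (rows: age group, columns: preferred pet).
# `pets` is a hypothetical object name used only for this sketch.
pets <- matrix(
  c(207, 267,   # Cat: Younger, Older
    293, 233),  # Dog: Younger, Older
  nrow = 2,
  dimnames = list(age_group = c("Younger", "Older"),
                  pet = c("Cat", "Dog"))
)

# Pearson chi-square test of independence, without the continuity correction
pet_test <- chisq.test(pets, correct = FALSE)

pet_test$statistic  # chi-square value
pet_test$parameter  # degrees of freedom
pet_test$p.value    # compare this with 0.05
```

For these counts the p-value comes out well below 0.05, so the difference between younger and older respondents would be treated as statistically significant.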
Functions of variables: Factor variable and outcome variable
- Depending on its function in the analysis (see [[functions of the variable]]), a [[variable]] is either:
- [[Outcome variable]]: the main topic that we investigate. It is the outcome: What is being affected or changed? This is also called the dependent variable.
- [[Factor variable]]: the variable used to explain or understand the outcome variable (main topic). This is also called the independent variable.
- A factor variable is assumed to cause a change in the outcome variable.
- If we consider the "relationship between preferred pet and respondents’ age group" table, we need to ask the following question:
- In this table, what seems to change what?
- Someone's preferred pet will not change their age group. However, the opposite could be the case: someone's age group may influence their preferred pet. Therefore:
- Age group is the [[factor variable]],
- Preferred pet is the [[outcome variable]].
- We need to identify, logically, which variable plays which role, because the code we use depends on this distinction.
- Some variables can function as either a [[factor variable]] or an [[outcome variable]]:
- For example, we could argue that the level of education increases someone’s income. In this case, education is the factor variable, and income is the outcome variable.
- In another study, we could argue the opposite: income increases someone’s level of education. In this case, income is the factor variable, and education is the outcome variable.
- These two research examples are completely different because their factor and outcome variables are defined differently.
- However, some variables cannot realistically be outcome variables:
- Nothing can change a person’s age.
- Similarly, variables such as biological sex (in most survey contexts), place of birth, or past events (e.g., childhood conditions) cannot be outcomes because they are not influenced by other variables in the dataset.
- These variables can only serve as [[factor variable]]s because they may influence other outcomes:
- Age → may influence political attitudes, health status, or technology use.
- Place of birth → may influence language ability or cultural preferences.
- Biological sex → may influence income, occupation, or health outcomes.
[[Statistical significance]] and p-value
- This analysis, and all the analyses on this site, report statistical significance. For example, they show whether age group actually influences preferred pet, that is, whether the percentage differences in the table are negligible or not.
- Statistical significance
- A measure of whether a research finding is unlikely to be due to chance alone. In other words, it indicates whether the factor variable is related to the outcome variable in a statistically significant way. We will determine this using the [[p-value]].
- A [[p-value]] is the probability that a difference at least as large as the one observed could have occurred just by random chance, if there were no real relationship.
- The lower the p-value, the greater the statistical significance of the observed difference.
- All p-values are between 0 and 1.
- P-values very close to 0 provide the strongest evidence of a real difference.
- A p-value of 0.001 means that, if there were no real difference, results like these would occur by chance about 1 time in 1,000.
- A p-value of 0.05 means that, if there were no real difference, results like these would occur by chance about 5 times in 100.
- When a p-value is below 0.05, the result is considered to be "statistically significant."
- We refer to statistical significance as p < 0.05.
How do we check whether a p-value is significant?
- [[Is my p-value less than 0.05?]]
    - To determine [[statistical significance]]:
        - Use this website: www.whichnumberislarger.com.
        - Put the p-value in the first box and 0.05 in the second box.
- [[Check asterisks]] (*)
    - Our tables and figures show the statistical significance of the p-values with asterisks.
    - If we see at least one asterisk (*), we will consider that result statistically significant:
        - `*` means p < 0.05
        - `**` means p < 0.01
        - `***` means p < 0.001
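The same check can be done in R instead of on the website. The helper below is a minimal sketch; `p_stars` is a hypothetical function written for this illustration, not something provided by this module's R script or by any package.

```r
# Hypothetical helper: returns the asterisks that match a p-value,
# following the thresholds listed above.
p_stars <- function(p) {
  if (p < 0.001) {
    "***"
  } else if (p < 0.01) {
    "**"
  } else if (p < 0.05) {
    "*"
  } else {
    ""   # no asterisk: not statistically significant
  }
}

p_stars(0.497)   # ""
p_stars(0.03)    # "*"
p_stars(0.0004)  # "***"
```

An empty result means the p-value is 0.05 or higher, so the result would not be considered statistically significant.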
[[Chi-square]] specifics
- When both [[factor variable]] and [[outcome variable]] are [[categorical]], we conduct chi-square.
- For example, we could wonder if income groups have a statistically significant effect on life satisfaction level:
- Factor variable: Income groups [(1) Low, (2) Medium, (3) High],
- Outcome variable: Life satisfaction level [(1) Not satisfied, (2) Moderately satisfied, (3) Satisfied],
- If the p-value that the chi-square test generates is lower than 0.05, we will conclude that income group has a statistically significant effect on life satisfaction level.
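For reference, the statistic behind this test is the standard Pearson chi-square. The notation here is introduced only for this illustration and is not used elsewhere in the module: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ is the count expected if the two variables were unrelated, $r$ and $c$ are the numbers of rows and columns, and $N$ is the total number of respondents.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
df = (r-1)(c-1)
$$

With three income groups and three life satisfaction levels, the degrees of freedom would be $(3-1)(3-1) = 4$.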
GSS example 1: Significant p-value (degree and health)
Find the variables in Variables in GSS page
- We wonder if respondents' education degree has a statistically significant influence on their perceived personal health quality.
- We want to make sure that `degree` and `health` are categorical variables.
- We check this information in the Variables in GSS page.
- Search (Ctrl+F / Cmd+F) for the variable names, `degree` and `health`. We see that they are categorical variables.

| Variable name | Variable label | Variable type | Question wording and response categories |
|---|---|---|---|
| `degree` | Respondents' education degree | Ordinal | Do you have less than high school, high school, associate/junior college, bachelor's, or graduate degree? (0: Less than high school; 1: High school; 2: Associate/junior college; 3: Bachelor's; 4: Graduate) |
| `health` | Perceived personal health quality | Ordinal ✅ RECODE | Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor? (1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor) |

From: Variables in GSS
[[Chi-square]] #code
- [[Model code]]
- [[Working code]]
- Line 1: We will put `degree` here ➜ `factor_here` and `health` here ➜ `outcome_here`.
- [[Factor]] variable first, [[outcome]] variable second.
- Find the working code in this module's R script file. [[Highlighting and running]] this code will create the output below.
[[Chi-square]] #output
Perceived personal health quality by respondents' education degree

| Respondents' education degree | Excellent | Very Good | Good | Fair | Total |
|---|---|---|---|---|---|
| Less than high school | 39 (11.1%) | 140 (40%) | 134 (38.3%) | 37 (10.6%) | 350 (100%) |
| High school | 242 (13.4%) | 955 (52.8%) | 520 (28.7%) | 93 (5.1%) | 1810 (100%) |
| Associate/junior college | 55 (15.3%) | 211 (58.6%) | 81 (22.5%) | 13 (3.6%) | 360 (100%) |
| Bachelor's | 188 (21.9%) | 504 (58.6%) | 148 (17.2%) | 20 (2.3%) | 860 (100%) |
| Graduate | 139 (24.1%) | 338 (58.6%) | 91 (15.8%) | 9 (1.6%) | 577 (100%) |
| Total | 663 (16.8%) | 2148 (54.3%) | 974 (24.6%) | 172 (4.3%) | 3957 (100%) |

χ² = 201.239, df = 12, Cramer's V = 0.130, p = 0.000***
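The statistics under the table can be cross-checked from the printed counts alone. The sketch below uses base R's `chisq.test()` and is not this module's working code; the object names (`health_by_degree`, `degree_test`, `cramers_v`) are hypothetical, and Cramer's V is computed by hand from its usual formula.

```r
# Counts copied from the output table above (Total row and column omitted).
# `health_by_degree` is a hypothetical object name used only for this sketch.
health_by_degree <- matrix(
  c( 39, 140, 134, 37,
    242, 955, 520, 93,
     55, 211,  81, 13,
    188, 504, 148, 20,
    139, 338,  91,  9),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Less than high school", "High school",
               "Associate/junior college", "Bachelor's", "Graduate"),
    health = c("Excellent", "Very Good", "Good", "Fair")
  )
)

degree_test <- chisq.test(health_by_degree)

# Cramer's V = sqrt(chi-square / (N * (smaller table dimension - 1)))
n <- sum(health_by_degree)
cramers_v <- sqrt(unname(degree_test$statistic) /
                    (n * (min(dim(health_by_degree)) - 1)))

degree_test$statistic  # chi-square value
degree_test$parameter  # degrees of freedom
degree_test$p.value    # p-value
cramers_v              # effect size
```

The values printed by this sketch should be close to the ones reported under the table; small differences can come from rounding.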
[[Chi-square]] #interpretation
Significant (p < 0.05) chi-square #interpretation template
The [[variable label]] of the [[factor]] variable has an effect on the [[variable label]] of the [[outcome]] variable since the p-value is less than 0.05.
We can conclude that [[value label]] 1 of the factor variable and [[value label]] 2 of the factor variable... have/are/feel... substantially different [[variable label]] of the outcome variable.
Significant (p < 0.05) chi-square #interpretation sample
The respondents' education degree variable has an effect on perceived personal health quality since the p-value is less than 0.05.
We can conclude that respondents with less than high school, high school, associate/junior college, bachelor's, and graduate degree have substantially different perceived personal health quality.
GSS example 2: Insignificant p-value (sex and happy)
Find the variables in Variables in GSS page
- We wonder if respondents' sex has a statistically significant influence on their happiness level.
- We want to make sure that `sex` and `happy` are categorical variables.
- We check this information in the Variables in GSS page.
- Search (Ctrl+F / Cmd+F) for the variable names, `sex` and `happy`. We see that they are categorical variables.

| Variable name | Variable label | Variable type | Question wording and response categories |
|---|---|---|---|
| `sex` | Respondents' sex | Binary | What's your sex? (1: Male; 2: Female) |
| `happy` | Happiness level | Ordinal, RECODE | Would you say that you are very happy, pretty happy, or not too happy? (1: Very happy; 2: Pretty happy; 3: Not too happy) |
[[Chi-square]] #code
- [[Model code]]
- [[Working code]]
- Line 1: We will put `sex` here ➜ `factor_here` and `happy` here ➜ `outcome_here`.
- [[Factor]] variable first, [[outcome]] variable second.
- Find the working code in this module's R script file. [[Highlighting and running]] this code will create the output below.
[[Chi-square]] #output
Happiness level by respondents' sex

| Respondents' sex | Very happy | Pretty happy | Not too happy | Total |
|---|---|---|---|---|
| Male | 350 (19.8%) | 1025 (57.8%) | 397 (22.4%) | 1772 (100%) |
| Female | 441 (20.5%) | 1264 (58.7%) | 450 (20.9%) | 2155 (100%) |
| Total | 791 (20.1%) | 2289 (58.3%) | 847 (21.6%) | 3927 (100%) |

χ² = 1.399, df = 2, Cramer's V = 0.019, p = 0.497
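To see where the reported p-value comes from, the chi-square value and degrees of freedom can be converted to a p-value with base R's `pchisq()`. This is a minimal sketch using the numbers printed under the two output tables; the object names are hypothetical.

```r
# Upper-tail probability of the chi-square distribution: the chance of a
# statistic at least this large if there were no relationship.
p_sex_happy <- pchisq(1.399, df = 2, lower.tail = FALSE)
round(p_sex_happy, 3)    # 0.497, above 0.05, so no asterisk

# Same calculation for example 1 (degree and health)
p_degree_health <- pchisq(201.239, df = 12, lower.tail = FALSE)
p_degree_health < 0.001  # TRUE, so that table is marked with asterisks
```

A small chi-square value relative to its degrees of freedom gives a large p-value, which is why the sex and happiness result is not significant.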
[[Chi-square]] #interpretation
Insignificant (p > 0.05) chi-square #interpretation template
The [variable label of the factor variable] variable has no effect on [variable label of the outcome variable] since the p-value is greater than 0.05.
We can conclude that [label 1 of the factor variable] and [label 2 of the factor variable]... have/are/feel... similar [variable label of the outcome variable].
Insignificant (p > 0.05) chi-square #interpretation sample
The respondents' sex variable has no effect on happiness level since the p-value is greater than 0.05.
We can conclude that males and females have similar happiness levels.