06. Chi-square analysis

Module items

R Script file

Copy the code below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/06-chisquare.R", "06-chisquare.R"); 
file.edit("06-chisquare.R")

Lab assignment

Chi-square

Sample lab assignment

Sample: Chi-square

Learning outcomes

  1. Learn chi-square analysis
  2. Learn how to run chi-square analysis
  3. Learn statistical significance

Suggested reading

  • Barnes, Sally, and Cathy Lewin. 2006. “An Introduction to Inferential Statistics: Testing for Differences and Relationships.” Pp. 226–35 in Research Methods in the Social Sciences, edited by B. Somekh and C. Lewin. London: Sage.

Chi-square basics

Relationship between preferred pet and respondents’ age group

Do you think there is a relationship between preferred pet and respondents’ age group?
In other words, do you think the preferred pet is influenced by respondents’ age group?

Age group | Cat | Dog | Total
Younger | 207 (41.4%) | 293 (58.6%) | 500 (100%)
Older | 267 (53.4%) | 233 (46.6%) | 500 (100%)
Total | 474 (47.4%) | 526 (52.6%) | 1000 (100%)
Answer
  • While the table shows differences in preferences between age groups (e.g., younger people prefer dogs more than older people do), we cannot conclude from the percentages alone that the differences are statistically significant.
    • To assess whether these differences are due to chance or reflect a real association, we need a statistical test such as the chi-square test.
  • The [[chi-square]] test is used to discover whether there is a relationship between two [[categorical]] variables.
    • The chi-square test can only compare categorical variables.
    • It cannot make comparisons between continuous variables or between categorical and continuous variables.
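
As a minimal sketch of such a test, the counts from the pet table above can be entered by hand and passed to base R's `chisq.test()`. This is not the `sjt.xtab()` workflow used later in this module. The object names (`pets`, `result`) are illustrative, and `correct = FALSE` is an assumption that turns off the continuity correction R applies to 2×2 tables by default.

```r
# Counts copied from the pet-preference table above
pets <- matrix(
  c(207, 293,   # Younger: Cat, Dog
    267, 233),  # Older:   Cat, Dog
  nrow = 2, byrow = TRUE,
  dimnames = list(age_group = c("Younger", "Older"),
                  pet       = c("Cat", "Dog"))
)

# Pearson chi-square test of independence (no continuity correction)
result <- chisq.test(pets, correct = FALSE)

result$statistic  # chi-square value
result$parameter  # degrees of freedom
result$p.value    # compare against 0.05
```

With these counts the p-value falls well below 0.05, so the difference between the two age groups would be treated as statistically significant.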

Functions of variables: Factor variable and outcome variable

  • A [[variable]]:
    • Depending on its [[functions of the variable]] in the analysis, it is either:
      • [[Outcome variable]]: the main topic that we investigate. It is the outcome: What is being affected or changed? This is also called the dependent variable.
      • [[Factor variable]]: the variable used to explain or understand the outcome variable (main topic). This is also called the independent variable.
        • A factor variable is assumed to cause a change in the outcome variable.
    • If we consider the "relationship between preferred pet and respondents’ age group" table, we need to ask the following question:
      • In this table, what seems to change what?
        • Someone's preferred pet will not change their age group. However, the opposite could be the case: someone's age group may influence their preferred pet. Therefore:
          • Age group is the [[factor variable]],
          • Preferred pet is the [[outcome variable]].
  • We need to identify, logically, which variable plays which role, because the code we use depends on this distinction.
    • Some variables can function as either a [[factor variable]] or an [[outcome variable]]:
      • For example, we could argue that the level of education increases someone’s income. In this case, education is the factor variable, and income is the outcome variable.
      • In another study, we could argue the opposite: income increases someone’s level of education. In this case, income is the factor variable, and education is the outcome variable.
        • These two research examples are completely different because their factor and outcome variables are defined differently.
    • However, some variables cannot realistically be outcome variables:
      • Nothing can change a person’s age.
      • Similarly, variables such as biological sex (in most survey contexts), place of birth, or past events (e.g., childhood conditions) cannot be outcomes because they are not influenced by other variables in the dataset.
      • These variables can only serve as [[factor variable]]s because they may influence other outcomes:
        • Age → may influence political attitudes, health status, or technology use.
        • Place of birth → may influence language ability or cultural preferences.
        • Biological sex → may influence income, occupation, or health outcomes.

[[Statistical significance]] and p-value

  • This analysis, like all the analyses on this site, reports statistical significance: for example, whether age group actually influences the preferred pet, or whether the percentage differences are small enough to be due to chance.
  • Statistical significance
    • A measure of whether the research findings are unlikely to be a product of random chance. In other words, whether the factor variable is related to the outcome variable in a statistically significant way. We will determine this using [[p-value]].
    • A [[p-value]] is the probability of observing a difference at least as large as the one in our data if there were no real relationship and only random chance were at work.
    • The lower the p-value, the greater the statistical significance of the observed difference.
      • All p-values are between 0 and 1;
        • A p-value very close to 0 is the strongest evidence that the difference is not due to chance.
        • A p-value of 0.001 means that, if there were no real difference, a result like ours would occur by chance about 1 time in 1000.
        • A p-value of 0.05 means that such a result would occur by chance about 5 times in 100.
        • When a p-value is below 0.05, the result is considered to be "statistically significant."
          • We refer to statistical significance as p < 0.05.

How to check whether the p-value is significant

  • [[Is my p-value less than 0.05?]]
    • To determine [[statistical significance]]
  • [[Check asterisks]]
    • To determine [[statistical significance]]

      Check asterisks (*)

      • Our tables and figures will show the statistical significance of the p-values with asterisks.
      • If we see at least one asterisk (*), we will consider that result statistically significant.

        (1) * means p < 0.05
        (2) ** means p < 0.01
        (3) *** means p < 0.001
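
The asterisk rule above can be written out as a small function. This is only a sketch: `p_to_stars` is a hypothetical helper introduced here for illustration, not a function from the course scripts or from any package.

```r
# Hypothetical helper: convert p-values to the asterisk notation above
p_to_stars <- function(p) {
  ifelse(p < 0.001, "***",
  ifelse(p < 0.01,  "**",
  ifelse(p < 0.05,  "*", "")))
}

# Three example p-values: highly significant, significant, not significant
p_to_stars(c(0.0004, 0.03, 0.497))
```

An empty string means no asterisk, that is, a result that is not statistically significant.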

[[Chi-square]] specifics

  • When both [[factor variable]] and [[outcome variable]] are [[categorical]], we conduct chi-square.
    • For example, we could ask whether income group has a statistically significant effect on life satisfaction level:
      • Factor variable: Income groups [(1) Low, (2) Medium, (3) High]
      • Outcome variable: Life satisfaction level [(1) Not satisfied, (2) Moderately satisfied, (3) Satisfied]
        • If the p-value that the chi-square test generates is lower than 0.05, we will conclude that someone's income group influences their life satisfaction level.

GSS example 1: Significant p-value (degree and health)

Find the variables in Variables in GSS page

  1. We wonder if respondents' education degree has a statistically significant influence on their perceived personal health quality.
  2. We want to make sure that degree and health are categorical variables.
  3. We check this information in the Variables in GSS page.
  4. Search (Ctrl+F / Cmd+F) for the variable names degree and health. We see that they are categorical variables.

    Variable name | Variable label | Variable type | Question wording and response categories
    degree | Respondents' education degree | Ordinal | Do you have less than high school, high school, associate/junior college, bachelor's, or graduate degree? (0: Less than high school; 1: High school; 2: Associate/junior college; 3: Bachelor's; 4: Graduate)
    health | Perceived personal health quality | Ordinal ✅ RECODE | Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor? (1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor)

    From: Variables in GSS

[[Chi-square]] #code

  • [[Model code]]

    sjt.xtab(gss$factor_here, gss$outcome_here, show.row.prc = TRUE) 
    

  • [[Working code]]

    sjt.xtab(gss$degree, gss$health, show.row.prc = TRUE)
    

    • Line 1: Replace factor_here with degree and outcome_here with health.
      • [[Factor]] variable first, [[outcome]] variable second.
      • Find the working code in this module's R script file. [[Highlighting and running]] this code will create the output below.
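
To see what the argument order and the row percentages amount to, here is a base R sketch that rebuilds the pet-preference table from the start of this module rather than using `gss`. The data frame `pets_df` is constructed from that table's counts purely for illustration, and `table()` and `prop.table()` stand in for `sjt.xtab()`.

```r
# Rebuild one row per respondent from the pet-preference counts
pets_df <- data.frame(
  age_group = factor(rep(c("Younger", "Older"), each = 500),
                     levels = c("Younger", "Older")),
  pet = c(rep(c("Cat", "Dog"), times = c(207, 293)),   # Younger
          rep(c("Cat", "Dog"), times = c(267, 233)))   # Older
)

# Factor variable first (rows), outcome variable second (columns)
counts <- table(pets_df$age_group, pets_df$pet)

# Row percentages: each row sums to 100
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
row_pct
```

Putting the factor variable in the rows means each row percentage answers the question "within this group, what share chose each outcome?"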

[[Chi-square]] #output

Perceived personal health quality by respondents' education degree

Respondents' education degree | Excellent | Very Good | Good | Fair | Total
Less than high school | 39 (11.1%) | 140 (40%) | 134 (38.3%) | 37 (10.6%) | 350 (100%)
High school | 242 (13.4%) | 955 (52.8%) | 520 (28.7%) | 93 (5.1%) | 1810 (100%)
Associate/junior college | 55 (15.3%) | 211 (58.6%) | 81 (22.5%) | 13 (3.6%) | 360 (100%)
Bachelor's | 188 (21.9%) | 504 (58.6%) | 148 (17.2%) | 20 (2.3%) | 860 (100%)
Graduate | 139 (24.1%) | 338 (58.6%) | 91 (15.8%) | 9 (1.6%) | 577 (100%)
Total | 663 (16.8%) | 2148 (54.3%) | 974 (24.6%) | 172 (4.3%) | 3957 (100%)
χ² = 201.239, df = 12, Cramer's V = 0.130, p=0.000***
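
The summary line can be cross-checked in base R by entering the counts from the table above. This is a sketch, not the code `sjt.xtab()` runs internally. The names `health_tab`, `test`, and `cramers_v` are illustrative, and Cramér's V is computed with the standard formula sqrt(χ² / (N × (k − 1))), where k is the smaller of the number of rows and columns.

```r
# Counts from the output table (rows: degree, columns: health)
health_tab <- matrix(
  c( 39, 140, 134, 37,   # Less than high school
    242, 955, 520, 93,   # High school
     55, 211,  81, 13,   # Associate/junior college
    188, 504, 148, 20,   # Bachelor's
    139, 338,  91,  9),  # Graduate
  nrow = 5, byrow = TRUE
)

# Pearson chi-square test of independence
test <- chisq.test(health_tab)

# Cramér's V: effect size for a chi-square test
n <- sum(health_tab)
k <- min(dim(health_tab))
cramers_v <- sqrt(unname(test$statistic) / (n * (k - 1)))

test$statistic       # chi-square
test$parameter       # df = (5 - 1) * (4 - 1) = 12
test$p.value         # far below 0.001
round(cramers_v, 3)  # effect size
```

A p-value this small is displayed as 0.000 after rounding to three decimals; it is not literally zero.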

[[Chi-square]] #interpretation

Significant (p < 0.05) chi-square #interpretation template

The [[variable label]] of the [[factor]] variable has an effect on [[variable label]] of the [[outcome]] variable since the p-value is less than 0.05.

We can conclude that [[value label]] 1 of the factor variable and [[value label]] 2 of the factor variable... have/are/feel... substantially different [[variable label]] of the outcome variable.

Significant (p < 0.05) chi-square #interpretation sample

The respondents' education degree variable has an effect on perceived personal health quality since the p-value is less than 0.05.

We can conclude that respondents with less than high school, high school, associate/junior college, bachelor's, and graduate degrees have substantially different perceived personal health quality.

GSS example 2: Insignificant p-value (sex and happy)

Find the variables in Variables in GSS page

  1. We wonder if respondents' sex has a statistically significant influence on their happiness level.
  2. We want to make sure that sex and happy are categorical variables.
  3. We check this information in the Variables in GSS page.
  4. Search (Ctrl+F / Cmd+F) for the variable names sex and happy. We see that they are categorical variables.

    Variable name | Variable label | Variable type | Question wording and response categories
    sex | Respondents' sex | Binary | What's your sex? (1: Male; 2: Female)
    happy | Happiness level | Ordinal, RECODE | Would you say that you are very happy, pretty happy, or not too happy? (1: Very happy; 2: Pretty happy; 3: Not too happy)

[[Chi-square]] #code

  • [[Model code]]

    sjt.xtab(gss$factor_here, gss$outcome_here, show.row.prc = TRUE) 
    

  • [[Working code]]

    sjt.xtab(gss$sex, gss$happy, show.row.prc = TRUE)
    

    • Line 1: Replace factor_here with sex and outcome_here with happy.
      • [[Factor]] variable first, [[outcome]] variable second.
      • Find the working code in this module's R script file. [[Highlighting and running]] this code will create the output below.

[[Chi-square]] #output

Happiness level by respondents' sex

Respondents' sex | Very happy | Pretty happy | Not too happy | Total
Male | 350 (19.8%) | 1025 (57.8%) | 397 (22.4%) | 1772 (100%)
Female | 441 (20.5%) | 1264 (58.7%) | 450 (20.9%) | 2155 (100%)
Total | 791 (20.1%) | 2289 (58.3%) | 847 (21.6%) | 3927 (100%)
χ² = 1.399, df = 2, Cramer's V = 0.019, p=0.497
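
The reported p-value follows directly from the chi-square statistic and its degrees of freedom. As a sketch, base R's `pchisq()` returns the upper-tail probability. The variable names here are illustrative.

```r
# Values copied from the summary line above
chi_sq <- 1.399
df     <- 2

# Probability of a chi-square value at least this large if there is no association
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(p_value, 3)   # 0.497
p_value < 0.05      # FALSE: not statistically significant
```

Because the p-value is far above 0.05, no asterisk appears next to it in the output.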

[[Chi-square]] #interpretation

Insignificant (p > 0.05) chi-square #interpretation template

The [variable label of the factor variable] variable has no effect on [variable label of the outcome variable] since the p-value is greater than 0.05.

We can conclude that [label 1 of the factor variable] and [label 2 of the factor variable]... have/are/feel... similar [variable label of the outcome variable].

Insignificant (p > 0.05) chi-square #interpretation sample

The respondents' sex variable has no effect on happiness level since the p-value is greater than 0.05.

We can conclude that males and females have similar happiness levels.