10. Correlation analysis⚓︎

Module items⚓︎

R Script file⚓︎

[[Copy the code]] below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/10-correlation.R", "10-correlation.R"); 
file.edit("10-correlation.R")

Lab assignment⚓︎

Correlation

Sample lab assignment⚓︎

Sample: Correlation

Learning outcomes⚓︎

Learn the basic terms and concepts related to correlation analysis
Learn how to conduct and interpret:
1. Bivariate correlation analysis (two variables):
  1. Correlation table
  2. Scatterplot graph
2. Multivariate correlation analysis (more than two variables)
  1. Correlation table matrix
  2. Scatterplot graph matrix

Correlation analysis specifics⚓︎

[[Correlation analysis]]
- Examines the relationship between continuous variables.
- Correlation is not causation: it does not show a causal relationship.
- Correlation yields two values:
  - [[p-value]], showing the significance of the relationship, and
  - [[r-value]], showing the strength and direction of the relationship.
    - r-value (correlation coefficient) is between minus one and plus one (-1 and +1).
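As a minimal sketch of where these two values come from, base R's `cor.test()` returns both for a pair of continuous variables. The data below are simulated for illustration only (the names `height_cm` and `weight_kg` and all numbers are hypothetical, not GSS data), and this is not the `tab_corr()` workflow used later in this module.

```r
# Simulated (hypothetical) data: weight depends partly on height, plus noise.
set.seed(42)
height_cm <- rnorm(200, mean = 170, sd = 10)
weight_kg <- 0.9 * height_cm - 85 + rnorm(200, mean = 0, sd = 8)

# Pearson correlation test: returns the r-value and the p-value together.
result <- cor.test(height_cm, weight_kg)

r_value <- unname(result$estimate)  # correlation coefficient, between -1 and +1
p_value <- result$p.value           # significance of the relationship

cat("r =", round(r_value, 3), " p =", format.pval(p_value, digits = 3), "\n")
```

Because the simulated weight is built from height, the r-value comes out positive and the p-value falls below 0.05.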

Significance of correlation (with p-value)⚓︎

[[Significance of correlation]]
- Using the p-value, we determine if the correlation is:
  - (1) [[Nonsignificant correlation]]: When the p-value is greater than 0.05 (p > 0.05), it's a nonsignificant correlation. This means that there's no meaningful relationship between the two continuous variables, such as height and education.
  - (2) [[Significant correlation]]: When the p-value is less than 0.05 (p < 0.05), it's a significant correlation. This means that there's a meaningful relationship between the two continuous variables, such as height and weight.
    - If we have a significant correlation, we continue with checking the r-value to see the direction and strength of correlation.

Direction of correlation (with r-value)⚓︎

[[Direction of correlation]]
- Using the [[r-value]], we determine the direction of correlation, provided the [[p-value]] is less than 0.05 (p < 0.05):
  - If the r-value is positive (such as r = 0.250), then there's a:
    - [[Positive correlation]]: As one variable increases, so does the other variable.
  - If the r-value is negative (such as r = -0.250), then there's a:
    - [[Negative correlation]]: As one variable increases, the other variable decreases.

Strength of correlation (with r-value)⚓︎

[[Strength of correlation]]
- Using the [[r-value]], we determine the strength of correlation.
  - r-value = less than |0.3| ➜ [[weak correlation]]
  - r-value = higher than |0.3| and less than |0.5| ➜ [[moderate correlation]]
  - r-value = greater than |0.5| ➜ [[strong correlation]]
Strength is judged by the absolute value of the r-value, that is, its size regardless of sign. That means:
- r = -.673 is stronger than r = .567 (negative .673 r-value is stronger than positive .567 r-value)
- r = -.432 is stronger than r = .322 (negative .432 r-value is stronger than positive .322 r-value)
- r = .567 is stronger than r = -.322 (positive .567 r-value is stronger than negative .322 r-value)
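The cut-offs above can be sketched as a small helper. `correlation_strength()` is a hypothetical function written for this illustration, not part of the module's R script. The module does not say how to label an r-value of exactly 0.3 or 0.5, so placing those values in the higher band is an assumption.

```r
# Classify strength from the absolute value of r, using this module's cut-offs.
# Assumption: exactly 0.3 counts as moderate and exactly 0.5 counts as strong.
correlation_strength <- function(r) {
  size <- abs(r)  # ignore the sign; only the size matters for strength
  if (size < 0.3) {
    "weak"
  } else if (size < 0.5) {
    "moderate"
  } else {
    "strong"
  }
}

correlation_strength(-0.673)  # "strong"
correlation_strength(0.322)   # "moderate"
correlation_strength(-0.163)  # "weak"

# Comparing strength means comparing absolute values.
abs(-0.673) > abs(0.567)  # TRUE
abs(-0.432) > abs(0.322)  # TRUE
abs(0.567) > abs(-0.322)  # TRUE
```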

Guessing correlation type exercise⚓︎

We will try guessing the correlation type for each of the following pairs of variables:
1. Height and weight

Is there a correlation between height and weight? If yes, positive or negative?
Show the answer
- The correlation between the height of an individual and their weight tends to be positive.
- In other words, individuals who are taller also tend to weigh more.
2. Time spent watching TV and exam scores

Is there a correlation between time spent watching TV and exam scores? If yes, positive or negative?
Show the answer
- The more time a student spends watching TV, the lower their exam scores tend to be.
- Time spent watching TV and the variable exam score have a negative correlation. As time spent watching TV increases, exam scores decrease.
3. Coffee consumption and intelligence

Is there a correlation between coffee consumption and intelligence? If yes, positive or negative?
Show the answer
- The amount of coffee that individuals consume and their IQ level are unrelated.
- In other words, knowing how much coffee an individual drinks doesn’t give us an idea of what their IQ level might be.
4. Temperature and ice cream sales

Is there a correlation between temperature and ice cream sales? If yes, positive or negative?
Show the answer
- The correlation between the temperature and total ice cream sales is positive.
- In other words, when it’s hotter outside, the total ice cream sales of companies tend to be higher since more people buy ice cream when it’s hot out.
5. Temperature and frequency of sunburn

Is there a correlation between temperature and frequency of sunburn? If yes, positive or negative?
Show the answer
- The correlation between the temperature and the frequency of sunburn is positive.
- In other words, when it’s hotter outside, sunburn is more frequent.
6. Frequency of sunburn and ice cream sales

Is there a correlation between the frequency of sunburn and ice cream sales? If yes, positive or negative?
Show the answer
- The correlation between the frequency of sunburn and total ice cream sales is positive.
- However, ice cream consumption does not cause sunburns, and getting a sunburn doesn’t make someone eat more ice cream.
- Both of these variables, ice cream consumption and sunburn frequency, are higher when it’s hotter outside.

Confounding variables⚓︎

[[Confounding variable]]
- A third variable that affects both of the variables being correlated. In the last exercise, "temperature" affects both "frequency of sunburn" and "ice cream sales".
  - If one is not careful, it can appear that two variables are directly related when both are actually being influenced independently by this third variable, "temperature".
- Other confounding variable examples:
  - Number of schools in a city and number of crimes in a city
    - Confounding Variable: City population
  - Shoe size and reading ability in children
    - Confounding Variable: Age of the child
  - Outdoor exercise frequency and vitamin D levels
    - Confounding Variable: Sunlight exposure
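A small simulation can illustrate the sunburn and ice cream example. Everything below is made up for illustration: the variable names, effect sizes, and sample size are assumptions, and removing the confounder's influence through regression residuals (a partial correlation) is one common approach rather than a method this module teaches.

```r
set.seed(1)
n <- 500
temperature <- rnorm(n, mean = 25, sd = 6)

# Each outcome depends on temperature plus its own independent noise.
sunburns  <- 0.5 * temperature + rnorm(n, sd = 3)
ice_cream <- 2.0 * temperature + rnorm(n, sd = 12)

# Raw correlation: positive, even though neither variable affects the other.
raw_r <- cor(sunburns, ice_cream)

# Remove the part of each variable explained by temperature, then correlate
# what is left (a partial correlation computed from regression residuals).
sunburns_resid  <- resid(lm(sunburns ~ temperature))
ice_cream_resid <- resid(lm(ice_cream ~ temperature))
partial_r <- cor(sunburns_resid, ice_cream_resid)

round(c(raw = raw_r, controlling_for_temperature = partial_r), 3)
```

In this simulation the raw correlation is clearly positive, while the correlation that remains after accounting for temperature is close to zero.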

Bivariate correlation⚓︎

[[Bivariate correlation]]
- Shows the relationship between two continuous variables in a table-graph format:
  - [[Correlation table]] shows this relationship in a table format:
    - Reports the correlation coefficient (r-value) and statistical significance (p-value).
  - [[Scatterplot graph]] shows this relationship in a graph format:
    - In addition to the r-value and p-value, displays the shape of the relationship.

Example 1: Significant and negative correlation⚓︎

Find the variables in Variables in GSS page⚓︎

We may wonder if there's a correlation between the respondents' education in years (educ) and television screen time in hours (tvhours).

Here we do not propose any kind of cause-and-effect.

flowchart LR
subgraph F["Continuous variable"]
    A[Respondents' education in years]
end

subgraph O["Continuous variable"]
    B[Television screen time in hours]
end

A <==>|May have a relationship| B

We want to make sure that educ and tvhours are continuous variables.

Variable name	Variable label	Variable type	Question wording and response categories
`educ`	Respondents' education in years	Continuous	What is the highest year of school you completed?
`tvhours`	Television screen time in hours	Continuous	On the average day, how many hours do you personally watch television?

From: Variables in GSS

[[Correlation table]] #code⚓︎

[[Model code]]

tab_corr (gss[c("variable1_here", "variable2_here")],
p.numeric = T, triangle="lower")

[[Working code]]
tab_corr (gss[c("educ", "tvhours")], p.numeric = T, triangle="lower")
- Line 1: We put educ here ➜ variable1_here and tvhours here ➜ variable2_here.
  - The order of the variables doesn't matter.
  - Make sure there's no space after the variable name, such as "educ ".
  - Find the working code in this module's R script file. [[Highlighting and running]] this code will generate the output below.
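The warning about stray spaces can be checked before running the table code. This is a sketch on a tiny made-up data frame: `toy` is a hypothetical stand-in for `gss`, and its values are invented.

```r
# Hypothetical stand-in for the gss data frame, with two of its column names.
toy <- data.frame(educ = c(12, 16, 14), tvhours = c(3, 1, 2))

wanted <- c("educ", "tvhours ")   # note the accidental trailing space

# Which requested names are NOT columns of the data frame?
setdiff(wanted, names(toy))       # "tvhours "

# trimws() strips the stray space so the names match again.
all(trimws(wanted) %in% names(toy))  # TRUE
```

A name with a trailing space does not match any column, so selecting it with `toy[wanted]` would stop with an "undefined columns selected" error.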

[[Correlation table]] #output⚓︎

	Respondents' education in years	Television screen time in hours
Respondents' education in years
Television screen time in hours	r = -0.163 p = <0.001***

[[Scatterplot graph]] #code⚓︎

[[Model code]]

scatterplot(gss, "variable1_here", "variable2_here")

[[Working code]]
scatterplot(gss, "educ", "tvhours")
- Line 1: We put educ here ➜ variable1_here and tvhours here ➜ variable2_here.
  - The order of the variables doesn't matter.
  - Make sure there's no space after the variable name, such as "tvhours ".
  - Find the working code in this module's R script file. [[Highlighting and running]] this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Significant (p < 0.05) and weak (r < |0.3|) correlation interpretation template

There is a significant correlation between [variable 1 label] and [variable 2 label] since the p-value is less than .05.

This correlation is negative and weak since the r-value is -0.xxx (less than |0.3|).

OR This correlation is negative and moderate since the r-value is -0.xxx (between |0.3| and |0.5|).

OR This correlation is negative and strong since the r-value is -0.xxx (higher than |0.5|).

This means that as [variable 1 label] increases, [variable 2 label] decreases, and vice versa.

Significant (p < 0.05) and weak (r < |0.3|) correlation interpretation sample

There is a significant correlation between respondents' education in years and television screen time in hours since the p-value is less than .05.

This correlation is negative and weak since the r-value is -0.163 (less than |0.3|).

This means that as respondents' education in years increases, television screen time in hours decreases, and vice versa.

We first check the p-value.
- If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value. If it's less than 0.05, it's a significant correlation.
  - This analysis yields a [[significant correlation]] since the p-value < 0.05.
If it's a significant correlation, then we check the r-value.
- Check the direction of the correlation: if there's a minus (-), it's a negative correlation; if there's no minus, it's a positive correlation.
  - This analysis yields a [[negative correlation]], -0.163.
- Check the strength of the correlation. -0.163 is less than |0.3|, so this is a [[weak correlation]].
  - Remember, strength depends on the absolute value of the r-value. Whether it is -0.163 or 0.163, it's still a weak correlation.

Example 2: Significant and positive correlation⚓︎

Find the variables in Variables in GSS page⚓︎

We may wonder if there's a correlation between the respondents' mothers' socio-economic index score (masei10) and fathers' socio-economic index score (pasei10).
- Here we do not propose any kind of cause-and-effect.
```
flowchart LR
subgraph F["Continuous variable"]
    A[Mothers' SES]
end

subgraph O["Continuous variable"]
    B[Fathers' SES]
end

A <==>|May have a relationship| B
```

We want to make sure that masei10 and pasei10 are continuous variables.

Variable name	Variable label	Variable type	Question wording and response categories
`masei10`	Respondents' mothers' socio-economic index score	Continuous	Respondent's mother's socio-economic index score (calculated) (Min: 9, Max: 92.8)
`pasei10`	Respondents' fathers' socio-economic index score	Continuous	Respondent's father's socio-economic index score (calculated) (Min: 9, Max: 93.7)

From: Variables in GSS

[[Correlation table]] #code⚓︎

Model code

tab_corr (gss[c("variable1_here", "variable2_here")],
p.numeric = T, triangle="lower")

Working code
tab_corr (gss[c("masei10", "pasei10")], p.numeric = T, triangle="lower")
- Line 1: We put masei10 here ➜ variable1_here and pasei10 here ➜ variable2_here.
  - The order of the variables doesn't matter.
  - Make sure there's no space after the variable name, such as "masei10 ".
  - Highlighting and running this code will generate the output below.

[[Correlation table]] #output⚓︎

	Respondents' mothers' socio-economic index score	Respondents' fathers' socio-economic index score
Respondents' mothers' socio-economic index score
Respondents' fathers' socio-economic index score	r = 0.364 p = 0.001***

[[Scatterplot graph]] #code⚓︎

Model code

scatterplot(gss, "variable1_here", "variable2_here")

Working code
scatterplot(gss, "masei10", "pasei10")
- Line 1: We put masei10 here ➜ variable1_here and pasei10 here ➜ variable2_here.
  - The order of the variables doesn't matter.
  - Make sure there's no space after the variable name, such as "masei10 ".
  - Highlighting and running this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Significant (p < 0.05) and moderate (|0.3| < | r | < |0.5|) correlation interpretation template

There is a significant correlation between [variable 1 label] and [variable 2 label] since the p-value is less than 0.05.

This correlation is positive and moderate since the r-value is 0.xxx (|0.3| < | r | < |0.5|).

OR This correlation is positive and weak since the r-value is 0.xxx (less than |0.3|).

OR This correlation is positive and strong since the r-value is 0.xxx (higher than |0.5|).

This means that [variable 1 label] and [variable 2 label] increase and decrease together.

Significant (p < 0.05) and moderate (|0.3| < | r | < |0.5|) correlation interpretation sample

There is a significant correlation between respondents' mothers' socio-economic index score and respondents' fathers' socio-economic index score since the p-value is less than .05.

This correlation is positive and moderate since the r-value is 0.364 (|0.3| < | r | < |0.5|).

This means that respondents' mothers' socio-economic index score and respondents' fathers' socio-economic index score increase and decrease together.

We first check the p-value.
- If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value. If it's less than 0.05, it's a significant correlation.
  - This analysis yields a [[significant correlation]] since the p-value < 0.05.
If it's a significant correlation, then we check the r-value.
- Check the direction of the correlation: if there's a minus (-), it's a negative correlation; if there's no minus, it's a positive correlation.
  - This analysis yields a [[positive correlation]], 0.364.
- Check the strength of the correlation. 0.364 is between |0.3| and |0.5|, so this is a [[moderate correlation]].
  - Remember, strength depends on the absolute value of the r-value. Whether it is 0.364 or -0.364, it's still a moderate correlation.

Example 3: Nonsignificant correlation⚓︎

Find the variables in Variables in GSS page⚓︎

We may wonder if there's a correlation between the number of hours respondents worked last week (hrs1) and the number of brothers and sisters respondents have (sibs).
- Here we do not propose any kind of cause-and-effect.
```
flowchart LR
subgraph F["Continuous variable"]
    A[Hours worked last week]
end

subgraph O["Continuous variable"]
    B[# of brothers and sisters]
end

A <==>|May have a relationship| B
```

We want to make sure that hrs1 and sibs are continuous variables.

Variable name	Variable label	Variable type	Question wording and response categories
`hrs1`	Number of hours respondents worked last week	Continuous	How many hours did you work last week, at all jobs?
`sibs`	Number of brothers and sisters respondents have	Continuous	How many brothers and sisters do you have?

From: Variables in GSS

[[Correlation table]] #code⚓︎

Model code

tab_corr (gss[c("variable1_here", "variable2_here")],
p.numeric = T, triangle="lower")

Working code

tab_corr (gss[c("hrs1", "sibs")], 
p.numeric = T, triangle="lower")

Line 1: We put hrs1 here ➜ variable1_here and sibs here ➜ variable2_here.
- The order of the variables doesn't matter.
- Make sure there's no space after the variable name, such as "sibs ".
- Highlighting and running this code will generate the output below.

[[Correlation table]] #output⚓︎

	Number of hours respondents worked last week	Number of brothers and sisters respondents have
Number of hours respondents worked last week
Number of brothers and sisters respondents have	r = -0.039 p = 0.068

[[Scatterplot graph]] #code⚓︎

Model code

scatterplot(gss, "variable1_here", "variable2_here")

Working code
scatterplot(gss, "hrs1", "sibs")
Line 1: We put hrs1 here ➜ variable1_here and sibs here ➜ variable2_here.
- The order of the variables doesn't matter.
- Make sure there's no space after the variable name, such as "sibs ".
- Highlighting and running this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

Scatterplot showing little to no relationship between number of hours worked last week and number of brothers and sisters respondents have. The smoothed line rises from about 2.7 siblings at 0 hours to just above 3.1 around 25 hours, drops to about 2.5 near 45 hours, then fluctuates slightly and ends near 2.6; the annotation reports a very weak, non-significant correlation, r = -0.039, p = 0.068.

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Nonsignificant correlation interpretation template

There is no significant correlation between [variable 1 label] and [variable 2 label] since the p-value is greater than .05.

This means that [variable 1 label] and [variable 2 label] are not related.

Nonsignificant correlation interpretation sample

There is no significant correlation between the number of hours respondents worked last week and the number of brothers and sisters respondents have since the p-value is greater than .05.

This means that the number of hours respondents worked last week and the number of brothers and sisters respondents have are not related.

We first check the p-value. If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value.
- This analysis yields a [[nonsignificant correlation]] since the p-value > 0.05.
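The checking order used in all three examples (p-value first, then direction, then strength) can be sketched as a small helper. `interpret_correlation()` is a hypothetical function written for this illustration, not part of the module's R script. The p-value of 0.0005 is a made-up stand-in for a result reported with three stars, and treating a p-value of exactly 0.05 as nonsignificant is an assumption.

```r
interpret_correlation <- function(r, p, alpha = 0.05) {
  # Step 1: check the p-value; stop here if it is not below the cut-off.
  # Assumption: p exactly equal to alpha is treated as nonsignificant.
  if (p >= alpha) {
    return("nonsignificant")
  }
  # Step 2: direction comes from the sign of r.
  direction <- if (r < 0) "negative" else "positive"
  # Step 3: strength comes from the absolute value of r.
  size <- abs(r)
  strength <- if (size < 0.3) "weak" else if (size < 0.5) "moderate" else "strong"
  paste("significant,", direction, "and", strength)
}

interpret_correlation(r = -0.163, p = 0.0005)  # "significant, negative and weak"
interpret_correlation(r = 0.364,  p = 0.0005)  # "significant, positive and moderate"
interpret_correlation(r = -0.039, p = 0.068)   # "nonsignificant"
```

The r-value is never inspected when the p-value fails the first check, which mirrors the rule that a nonsignificant r-value is not interpreted.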

Multivariate correlation⚓︎

[[Multivariate correlation]]
- Shows the relationships among multiple (more than two) continuous variables in a table-graph format.
  - [[Correlation table matrix]] shows these relationships in a table format:
    - Reports the correlation coefficients (r-value) and statistical significance (p-value).
  - [[Scatterplot graph matrix]] shows these relationships in a graph format:
    - In addition to the r-value and p-value, displays the shape of the relationship.

Example: Correlation table matrix⚓︎

Find the variables in Variables in GSS page⚓︎

We'll use all the variables we've used so far in the previous sections.

Variable name	Variable label	Variable type	Question wording and response categories
`educ`	Respondents' education in years	Continuous	What is the highest year of school you completed?
`tvhours`	Television screen time in hours	Continuous	On the average day, how many hours do you personally watch television?
`masei10`	Respondents' mothers' socio-economic index score	Continuous	Respondent's mother's socio-economic index score (calculated) (Min: 0; Max: 100)
`pasei10`	Respondents' fathers' socio-economic index score	Continuous	Respondent's father's socio-economic index score (calculated) (Min: 0; Max: 100)
`hrs1`	Number of hours respondents worked last week	Continuous	How many hours did you work last week, at all jobs?
`sibs`	Number of brothers and sisters respondents have	Continuous	How many brothers and sisters do you have?

[[Correlation table matrix]] #code⚓︎

Model code

tab_corr (gss[, c("variable1_here", "variable2_here", "variable3_here", "variable4_here", "variable5_here", "variable6_here")],  
p.numeric = T, triangle="lower", na.deletion = "pairwise")

Working code

tab_corr (gss[, c("educ", "tvhours", "masei10", "pasei10", "hrs1", "sibs")], 
p.numeric = T, triangle="lower", na.deletion = "pairwise")

Line 1: We put the previous variables here ➜ variable1_here and variable2_here and so on.
- The order of the variables doesn't matter.
- Highlighting and running this code will generate the table below.
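To see what pairwise deletion means, base R's `cor()` offers an analogous option. This is a sketch on a tiny made-up data frame (`toy` and its values are hypothetical), shown only to illustrate the idea. It does not reproduce the `tab_corr()` output.

```r
# Hypothetical toy data with some missing values (NA).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, NA),
  y = c(2, 1, 4, 3, NA, 6),
  z = c(6, 5, NA, 3, 2, 1)
)

# Pairwise: each pair of variables uses every row where BOTH are present.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise: only rows with no missing value in ANY variable are used.
r_listwise <- cor(toy, use = "complete.obs")

# Show only the lower triangle, similar in spirit to triangle = "lower".
r_lower <- round(r_pairwise, 3)
r_lower[upper.tri(r_lower, diag = TRUE)] <- NA
print(r_lower, na.print = "")
```

Here the x and y correlation is based on four rows under pairwise deletion but only three rows under listwise deletion, so the two matrices differ. Pairwise deletion keeps more of the available data for each pair.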

[[Correlation table matrix]] #output⚓︎

	Respondents' education	TV screen time	Mothers' SES	Fathers' SES	Hours worked last week
Respondents' education
TV screen time	r = -0.163 p = 0.001***
Mothers' SES	r = 0.287 p = 0.001***	r = -0.125 p = 0.001***
Fathers' SES	r = 0.353 p = 0.001***	r = -0.123 p = 0.001***	r = 0.364 p = 0.001***
Hours worked last week	r = 0.026 p = 0.226	r = -0.029 p = 0.265	r = 0.034 p = 0.170	r = -0.024 p = 0.338
Number of siblings	r = -0.208 p = 0.001***	r = 0.104 p = 0.001***	r = -0.221 p = 0.001***	r = -0.187 p = 0.001***	r = -0.039 p = 0.068

[[Scatterplot graph matrix]] #code⚓︎

Model code

scatterplot_matrix <- gss[, c("variable1_here", "variable2_here", "variable3_here", "variable4_here", "variable5_here", "variable6_here")]
pairs_panels_pval(scatterplot_matrix, color = "#15616d")

Working code

scatterplot_matrix <- gss[, c("educ", "tvhours", "masei10", "pasei10", "hrs1", "sibs")]
pairs_panels_pval(scatterplot_matrix, color = "#15616d")

Line 1: We put the previous variables here ➜ variable1_here and variable2_here and so on.
- The order of the variables doesn't matter.
- Highlighting and running this code will generate the output below.

[[Scatterplot graph matrix]] #output⚓︎

(1) Start with the diagonal:
- Find your two variables of interest: here, for example, tvhours and hrs1.
- The diagonal cells (in red) show each variable's distribution as a histogram.
(2) Look upper-right:
- The upper-right cell (in purple) at the intersection of your two variables reports the correlation coefficient (r-value) and statistical significance (p-value). Here, r = -0.029, p = 0.265.
(3) Look lower-left:
- The lower-left cell (in blue) at the same intersection shows the scatterplot.

[[Correlation table matrix]] and [[scatterplot graph matrix]] #interpretation⚓︎

Nonsignificant correlation interpretation template

There is no significant correlation between [variable 1 label] and [variable 2 label] since the p-value is greater than .05.

This means that [variable 1 label] and [variable 2 label] are not related.

Nonsignificant correlation interpretation sample

There is no significant correlation between the television screen time in hours and the number of hours respondents worked last week since the p-value is greater than .05.

This means that the television screen time in hours and the number of hours respondents worked last week are not related.

We first check the p-value. If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value.
- This analysis yields a [[nonsignificant correlation]] since the p-value > 0.05.
