
10. Correlation analysis⚓︎

Module items⚓︎

R Script file⚓︎

[[Copy the code]] below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/10-correlation.R", "10-correlation.R"); 
file.edit("10-correlation.R")

Lab assignment⚓︎

Correlation

Sample lab assignment⚓︎

Sample: Correlation

Learning outcomes⚓︎

  1. Learn the basic terms and concepts related to the correlation analysis
  2. Learn how to conduct and interpret:
    1. Bivariate correlation analysis (two variables):
      1. Correlation table
      2. Scatterplot graph
    2. Multivariate correlation analysis (more than two variables)
      1. Correlation table matrix
      2. Scatterplot graph matrix

Suggested reading⚓︎

  • 📖
    Weinstein, Jay A. 2010. “Correlation Description and Induction in Comparative Sociology.” Pp. 297–324 in Applying social statistics: An introduction to quantitative reasoning in sociology. Lanham: Rowman & Littlefield.

Correlation analysis specifics⚓︎

  • [[Correlation analysis]]
    • Examines the relationship between continuous variables.
    • Correlation is not causation: it does not show a causal relationship.
    • Correlation yields two values:
      • [[p-value]], showing the significance of the relationship, and
      • [[r-value]], showing the strength and direction of the relationship.
        • r-value (correlation coefficient) is between minus one and plus one (-1 and +1).
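As a minimal sketch of where the two values come from, base R's `cor.test()` returns both the r-value and the p-value for a pair of numeric vectors. The `height` and `weight` numbers below are invented for illustration and are not taken from the course data.

```r
# Toy data (invented for illustration): height in cm and weight in kg
height <- c(150, 155, 160, 165, 170, 175, 180, 185)
weight <- c(52, 58, 57, 63, 68, 70, 79, 82)

# cor.test() runs a Pearson correlation test by default
result <- cor.test(height, weight)

r_value <- unname(result$estimate)  # strength and direction, between -1 and +1
p_value <- result$p.value           # significance of the relationship

print(round(r_value, 3))
print(p_value < 0.05)  # TRUE means the correlation is significant at the 0.05 level
```

The course code below uses `tab_corr()` instead; this sketch only shows that the same two numbers are available from a plain base R test.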

Significance of correlation (with p-value)⚓︎

  • [[Significance of correlation]]
    • Using the p-value, we determine if the correlation is:

      • (1) [[Nonsignificant correlation]]: When the p-value is greater than 0.05 (p > 0.05), the correlation is nonsignificant. This means that there's no meaningful relationship between the two continuous variables, such as height and education.

        Figure showing no correlation: points scattered without a clear trend.

      • (2) [[Significant correlation]]: When the p-value is less than 0.05 (p < 0.05), the correlation is significant. This means that there's a meaningful relationship between the two continuous variables, such as height and weight.

        • If we have a significant correlation, we continue with checking the r-value to see the direction and strength of correlation.

Direction of correlation (with r-value)⚓︎

  • [[Direction of correlation]]
    • Using the [[r-value]], we determine the direction of the correlation, provided the [[p-value]] is less than 0.05 (p < 0.05):
      • If the r-value is positive (such as r = 0.250), there is a:
        • [[Positive correlation]]: As one variable increases, so does the other variable.
      • If the r-value is negative (such as r = -0.250), there is a:
        • [[Negative correlation]]: As one variable increases, the other variable decreases.

          Two scatterplots illustrate types of correlation: positive correlation shows points trending upward from left to right, negative correlation shows points trending downward.

Strength of correlation (with r-value)⚓︎

  • [[Strength of correlation]]

    • Using [[r-value]], we determine the strength of correlation.
      • r-value = less than |0.3| ➜ [[weak correlation]]
      • r-value = higher than |0.3| and less than |0.5| ➜ [[moderate correlation]]
      • r-value = greater than |0.5| ➜ [[strong correlation]]

    Grid of six example scatterplots showing positive and negative correlations at weak, moderate, and strong levels. The top row shows positive correlations and the bottom row shows negative correlations, with point patterns becoming tighter around the trend line as the relationship gets stronger.

    The strength of a correlation depends on the absolute value of r, not on its sign. That means:

    • r = -.673 is stronger than r = .567 (a negative .673 r-value is stronger than a positive .567 r-value)
    • r = -.432 is stronger than r = .322 (a negative .432 r-value is stronger than a positive .322 r-value)
    • r = .567 is stronger than r = -.322 (a positive .567 r-value is stronger than a negative .322 r-value)
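The cutoffs above can be sketched as a small helper that looks only at the absolute value of r. The function name `classify_strength` is hypothetical, and treating exactly 0.3 as moderate and exactly 0.5 as strong is an assumption, since the cutoffs above do not say which band the boundary values belong to.

```r
# Hypothetical helper: label the strength of a correlation from its r-value.
# Assumption: exactly 0.3 counts as moderate and exactly 0.5 counts as strong.
classify_strength <- function(r) {
  size <- abs(r)  # drop the sign; it only carries the direction
  if (size < 0.3) {
    "weak"
  } else if (size < 0.5) {
    "moderate"
  } else {
    "strong"
  }
}

print(classify_strength(-0.163))  # "weak"
print(classify_strength(0.364))   # "moderate"
print(classify_strength(-0.673))  # "strong"

# Comparing strength means comparing absolute values
print(abs(-0.673) > abs(0.567))   # TRUE
```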

Guessing correlation type exercise⚓︎

  • We will try guessing the type of correlation for each of the following pairs of variables:

    1. Height and weight

    Is there a correlation between height and weight? If yes, positive or negative?

    Show the answer
    • The correlation between the height of an individual and their weight tends to be positive.
    • In other words, individuals who are taller also tend to weigh more.

    A scatterplot shows individual data points with height on the horizontal axis and weight on the vertical axis. The points form an upward trend, indicating that higher height values are associated with higher weight values.

    2. Time spent watching TV and exam scores

    Is there a correlation between time spent watching TV and exam scores? If yes, positive or negative?

    Show the answer
    • The more time a student spends watching TV, the lower their exam scores tend to be.
    • Time spent watching TV and the variable exam score have a negative correlation. As time spent watching TV increases, exam scores decrease.

    A scatterplot shows exam score on the vertical axis and time spent watching TV on the horizontal axis. The points trend downward, indicating that higher amounts of TV watching are associated with lower exam scores.

    3. Coffee consumption and intelligence

    Is there a correlation between coffee consumption and intelligence? If yes, positive or negative?

    Show the answer
    • The amount of coffee that individuals consume and their IQ level are unrelated.
    • In other words, knowing how much coffee an individual drinks doesn’t give us an idea of what their IQ level might be.

    A scatterplot shows IQ on the vertical axis and coffee consumption on the horizontal axis. The points are widely scattered with no clear upward or downward pattern, indicating no apparent relationship.

    4. Temperature and ice cream sales

    Is there a correlation between temperature and ice cream sales? If yes, positive or negative?

    Show the answer
    • The correlation between the temperature and total ice cream sales is positive.
    • In other words, when it’s hotter outside, total ice cream sales tend to be higher since more people buy ice cream when it’s hot out.

    A scatterplot shows ice cream sales on the vertical axis and temperature on the horizontal axis. The points trend upward, indicating that higher temperatures are associated with higher ice cream sales.

    5. Temperature and frequency of sunburn

    Is there a correlation between temperature and frequency of sunburn? If yes, positive or negative?

    Show the answer
    • The correlation between the temperature and the frequency of sunburn is positive.
    • In other words, when it’s hotter outside, sunburns tend to be more frequent.

    A scatterplot shows frequency of sunburn on the vertical axis and temperature on the horizontal axis. The points form an upward trend, indicating that sunburn frequency increases as temperature increases.

    6. Frequency of sunburn and ice cream sales

    Is there a correlation between the frequency of sunburn and ice cream sales? If yes, positive or negative?

    Show the answer
    • The correlation between the frequency of sunburn and total ice cream sales is positive.
    • However, ice cream consumption does not cause sunburns, and getting a sunburn does not make someone eat more ice cream.
    • Both of these variables, ice cream consumption and sunburn frequency, are higher when it’s hotter outside.

    A scatterplot shows frequency of sunburn on the vertical axis and ice cream sales on the horizontal axis. The points trend upward, showing that higher ice cream sales are associated with higher sunburn frequency.

Confounding variables⚓︎

  • [[Confounding variable]]

    • A third variable that affects both variables of interest. Here, "temperature" affects both "frequency of sunburn" and "ice cream sales".

      • If one is not careful, a confounding variable can make it appear that two variables are directly related when both are actually being influenced independently by this third variable, "temperature".

        A simple diagram shows a thermometer at the top with arrows pointing to an ice cream icon on one side and a running person on the other, indicating that temperature influences both ice cream consumption and outdoor activity.

    • Other confounding variable examples:

      • Number of schools in a city and number of crimes in a city
        • Confounding Variable: City population
      • Shoe size and reading ability in children
        • Confounding Variable: Age of the child
      • Outdoor exercise frequency and vitamin D levels
        • Confounding Variable: Sunlight exposure
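A small simulation can illustrate the idea, under stated assumptions: the data below are randomly generated, not real, and the coefficients are arbitrary. Temperature drives both ice cream sales and sunburn frequency, so the two correlate strongly. Once the part explained by temperature is removed from each variable (a partial correlation, which this module does not otherwise cover), very little relationship should remain.

```r
set.seed(42)  # fixed seed so the simulation is repeatable
n <- 500

# Simulated data: both outcomes depend on temperature plus independent noise
temperature <- rnorm(n, mean = 25, sd = 5)
ice_cream   <- 10 * temperature + rnorm(n, sd = 20)
sunburn     <- 0.5 * temperature + rnorm(n, sd = 1)

# Raw correlation between the two outcomes
raw_r <- cor(ice_cream, sunburn)

# Remove what temperature explains from each outcome, then correlate the leftovers
ice_resid  <- resid(lm(ice_cream ~ temperature))
burn_resid <- resid(lm(sunburn ~ temperature))
adjusted_r <- cor(ice_resid, burn_resid)

print(round(c(raw = raw_r, adjusted = adjusted_r), 3))
```

The raw correlation comes out strongly positive, while the adjusted one sits close to zero, which is the pattern a confounding variable produces.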

Bivariate correlation⚓︎

  • [[Bivariate correlation]]
    • Shows the relationship between two continuous variables in a table-graph format:
      • [[Correlation table]] shows this relationship in a table format:
        • Reports the correlation coefficient (r-value) and statistical significance (p-value).
      • [[Scatterplot graph]] shows this relationship in a graph format:
        • In addition to the r-value and p-value, displays the shape of the relationship.

Example 1: Significant and negative correlation⚓︎

Find the variables in Variables in GSS page⚓︎

  • We may wonder if there's a correlation between the respondents' education in years (educ) and television screen time in hours (tvhours).

    • Here we do not propose any kind of cause-and-effect.

      flowchart LR
      subgraph F["Continuous variable"]
          A[Respondents' education in years]
      end
      
      subgraph O["Continuous variable"]
          B[Television screen time in hours]
      end
      
      A <==>|May have a relationship| B
  • We want to make sure that educ and tvhours are continuous variables.

    | Variable name | Variable label | Variable type | Question wording and response categories |
    |---|---|---|---|
    | educ | Respondents' education in years | Continuous | What is the highest year of school you completed? |
    | tvhours | Television screen time in hours | Continuous | On the average day, how many hours do you personally watch television? |

    From: Variables in GSS

[[Correlation table]] #code⚓︎

  • [[Model code]]

    tab_corr (gss[c("variable1_here", "variable2_here")],
    p.numeric = T, triangle="lower")
    

  • [[Working code]]

    tab_corr (gss[c("educ", "tvhours")],
    p.numeric = T, triangle="lower")
    

    • Line 1: We put educ here ➜ variable1_here and tvhours here ➜ variable2_here.
      • The order of the variables doesn't matter.
      • Make sure there's no space after the variable name, such as "educ ".
      • Find the working code in this module's R script file. [[Highlighting and running]] this code will generate the output below.

[[Correlation table]] #output⚓︎

|  | Respondents' education in years | Television screen time in hours |
|---|---|---|
| Respondents' education in years |  |  |
| Television screen time in hours | r = -0.163, p = <0.001*** |  |
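To make the cell format easier to read, here is a hedged sketch of how an r-value and p-value could be turned into the kind of text shown in the table. The helper names `p_stars` and `format_cell` are hypothetical, and the star cutoffs (`*` below 0.05, `**` below 0.01, `***` below 0.001) are the common convention, assumed here rather than confirmed from the `tab_corr()` source.

```r
# Hypothetical helper: significance stars using the common convention (assumed)
p_stars <- function(p) {
  if (p < 0.001) "***" else if (p < 0.01) "**" else if (p < 0.05) "*" else ""
}

# Hypothetical helper: build the text of one table cell
format_cell <- function(r, p) {
  # Very small p-values are shown as "<0.001" instead of a long decimal
  p_text <- if (p < 0.001) "<0.001" else sprintf("%.3f", p)
  sprintf("r = %.3f, p = %s%s", r, p_text, p_stars(p))
}

print(format_cell(-0.163, 0.00001))  # "r = -0.163, p = <0.001***"
print(format_cell(-0.039, 0.068))    # "r = -0.039, p = 0.068"
```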

[[Scatterplot graph]] #code⚓︎

  • [[Model code]]

    scatterplot(gss, "variable1_here", "variable2_here")
    

  • [[Working code]]

    scatterplot(gss, "educ", "tvhours")
    

    • Line 1: We put educ here ➜ variable1_here and tvhours here ➜ variable2_here.
      • The order of the variables doesn't matter.
      • Make sure there's no space after the variable name, such as "tvhours ".
      • Find the working code in this module's R script file. [[Highlighting and running]] this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

Scatterplot showing a negative relationship between education and television screen time. The smoothed trend line starts near 4 hours at 0 years of education, rises slightly above 4.2 hours around 7 to 10 years, then declines steadily after about 12 years to roughly 2.3 hours at 20 years of education; the annotation reports a weak but statistically significant correlation, r = -0.163, p = 0.000***.
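The `scatterplot()` function used above comes from the course script, and its internals are not shown here. A roughly comparable figure can be sketched in base R, assuming a smoothed trend line from `lowess()` and an annotation built from `cor.test()`. The data below are invented and only mimic a negative relationship.

```r
# Toy data (invented): years of education and daily TV hours
educ    <- c(8, 10, 12, 12, 14, 16, 16, 18, 20, 12, 14, 11)
tvhours <- c(5, 4, 4, 3, 3, 2, 3, 2, 1, 5, 2, 4)

# r-value and p-value for the annotation
test <- cor.test(educ, tvhours)

# Scatterplot of the raw points
plot(educ, tvhours,
     xlab = "Education in years",
     ylab = "TV hours per day",
     main = "Education and TV screen time (toy data)")

# Smoothed trend line showing the shape of the relationship
lines(lowess(educ, tvhours), col = "#15616d", lwd = 2)

# Annotation with the r-value and p-value
legend("topright", bty = "n",
       legend = sprintf("r = %.3f, p = %.3f", test$estimate, test$p.value))
```

When run with `Rscript`, the plot is written to the default graphics device; in RStudio it appears in the Plots pane.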

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Significant (p < 0.05) and weak (r < |0.3|) correlation interpretation template

There is a significant correlation between [variable 1 label] and [variable 2 label] since the p-value is less than .05.

This correlation is negative and weak since the r-value is -0.xxx (less than |0.3|).

OR This correlation is negative and moderate since the r-value is -0.xxx (between |0.3| and |0.5|).

OR This correlation is negative and strong since the r-value is -0.xxx (higher than |0.5|).

This means that as [variable 1 label] increases, [variable 2 label] decreases, and vice versa.

Significant (p < 0.05) and weak (r < |0.3|) correlation interpretation sample

There is a significant correlation between respondents' education in years and television screen time in hours since the p-value is less than .05.

This correlation is negative and weak since the r-value is -0.163 (less than |0.3|).

This means that as respondents' education in years increases, television screen time in hours decreases, and vice versa.

  • We first check the p-value.
    • If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value. If it's less than 0.05, it's a significant correlation.
      • This analysis yields a [[significant correlation]] since the p-value < 0.05.
  • If it's a significant correlation, then we check the r-value.
    • Check the direction of the correlation: if there's a minus (-), it's a negative correlation; if there's no minus, it's a positive correlation.
      • This analysis yields a [[negative correlation]], -0.163.
    • Check the strength of the correlation. -0.163 is less than |0.3|, so this is a [[weak correlation]].
      • Remember, strength is judged by the absolute value of r. Whether it is -0.163 or 0.163, it is still a weak correlation.

Example 2: Significant and positive correlation⚓︎

Find the variables in Variables in GSS page⚓︎

  • We may wonder if there's a correlation between the respondents' mothers' socio-economic index score (masei10) and fathers' socio-economic index score (pasei10).

    • Here we do not propose any kind of cause-and-effect.

      flowchart LR
      subgraph F["Continuous variable"]
          A[Mothers' SES]
      end
      
      subgraph O["Continuous variable"]
          B[Fathers' SES]
      end
      
      A <==>|May have a relationship| B
  • We want to make sure that masei10 and pasei10 are continuous variables.

    | Variable name | Variable label | Variable type | Question wording and response categories |
    |---|---|---|---|
    | masei10 | Respondents' mothers' socio-economic index score | Continuous | Respondent's mother's socio-economic index score (calculated) (Min: 9, Max: 92.8) |
    | pasei10 | Respondents' fathers' socio-economic index score | Continuous | Respondent's father's socio-economic index score (calculated) (Min: 9, Max: 93.7) |

    From: Variables in GSS

[[Correlation table]] #code⚓︎

  • Model code
    tab_corr (gss[c("variable1_here", "variable2_here")],
    p.numeric = T, triangle="lower")
    
  • Working code

    tab_corr (gss[c("masei10", "pasei10")],
    p.numeric = T, triangle="lower")
    

    • Line 1: We put masei10 here ➜ variable1_here and pasei10 here ➜ variable2_here.
      • The order of the variables doesn't matter.
      • Make sure there's no space after the variable name, such as "masei10 ".
      • Highlighting and running this code will generate the output below.

[[Correlation table]] #output⚓︎

|  | Respondents' mothers' socio-economic index score | Respondents' fathers' socio-economic index score |
|---|---|---|
| Respondents' mothers' socio-economic index score |  |  |
| Respondents' fathers' socio-economic index score | r = 0.364, p = 0.001*** |  |

[[Scatterplot graph]] #code⚓︎

  • Model code

    scatterplot(gss, "variable1_here", "variable2_here")
    

  • Working code

    scatterplot(gss, "masei10", "pasei10")
    

    • Line 1: We put masei10 here ➜ variable1_here and pasei10 here ➜ variable2_here.
      • The order of the variables doesn't matter.
      • Make sure there's no space after the variable name, such as "masei10 ".
      • Highlighting and running this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

Scatterplot showing a positive relationship between respondents’ mothers’ socio-economic index score and respondents’ fathers’ socio-economic index score. The smoothed trend line rises from about 30 when mothers’ score is low, to around 50 in the middle range, and to the mid-60s at the highest mothers’ scores; the annotation reports a moderate, statistically significant correlation, r = 0.364, p = 0.000***.

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Significant (p < 0.05) and moderate (|0.3| < | r | < |0.5|) correlation interpretation template

There is a significant correlation between [variable 1 label] and [variable 2 label] since the p-value is less than 0.05.

This correlation is positive and moderate since the r-value is 0.xxx (|0.3| < | r | < |0.5|).

OR This correlation is positive and weak since the r-value is 0.xxx (less than |0.3|).

OR This correlation is positive and strong since the r-value is 0.xxx (higher than |0.5|).

This means that [variable 1 label] and [variable 2 label] increase and decrease together.

Significant (p < 0.05) and moderate (|0.3| < | r | < |0.5|) correlation interpretation sample

There is a significant correlation between respondents' mothers' socio-economic index score and respondents' fathers' socio-economic index score since the p-value is less than .05.

This correlation is positive and moderate since the r-value is 0.364 (|0.3| < | r | < |0.5|).

This means that respondents' mothers' socio-economic index score and respondents' fathers' socio-economic index score increase and decrease together.

  • We first check the p-value.
    • If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value. If it's less than 0.05, it's a significant correlation.
      • This analysis yields a [[significant correlation]] since the p-value < 0.05.
  • If it's a significant correlation, then we check the r-value.
    • Check the direction of the correlation: if there's a minus (-), it's a negative correlation; if there's no minus, it's a positive correlation.
      • This analysis yields a [[positive correlation]], 0.364.
    • Check the strength of the correlation. 0.364 is between |0.3| and |0.5|, so this is a [[moderate correlation]].
      • Remember, strength is judged by the absolute value of r. Whether it is 0.364 or -0.364, it is still a moderate correlation.

Example 3: Nonsignificant correlation⚓︎

Find the variables in Variables in GSS page⚓︎

  • We may wonder if there's a correlation between the number of hours respondents worked last week (hrs1) and the number of brothers and sisters respondents have (sibs).

    • Here we do not propose any kind of cause-and-effect.

      flowchart LR
      subgraph F["Continuous variable"]
          A[Hours worked last week]
      end
      
      subgraph O["Continuous variable"]
          B[# of brothers and sisters]
      end
      
      A <==>|May have a relationship| B
  • We want to make sure that hrs1 and sibs are continuous variables.

    | Variable name | Variable label | Variable type | Question wording and response categories |
    |---|---|---|---|
    | hrs1 | Number of hours respondents worked last week | Continuous | How many hours did you work last week, at all jobs? |
    | sibs | Number of brothers and sisters respondents have | Continuous | How many brothers and sisters do you have? |

    From: Variables in GSS

[[Correlation table]] #code⚓︎

  • Model code

    tab_corr (gss[c("variable1_here", "variable2_here")],
    p.numeric = T, triangle="lower")
    

  • Working code

    tab_corr (gss[c("hrs1", "sibs")], 
    p.numeric = T, triangle="lower")
    

  • Line 1: We put hrs1 here ➜ variable1_here and sibs here ➜ variable2_here.

    • The order of the variables doesn't matter.
    • Make sure there's no space after the variable name, such as "sibs ".
    • Highlighting and running this code will generate the output below.

[[Correlation table]] #output⚓︎

|  | Number of hours respondents worked last week | Number of brothers and sisters respondents have |
|---|---|---|
| Number of hours respondents worked last week |  |  |
| Number of brothers and sisters respondents have | r = -0.039, p = 0.068 |  |

[[Scatterplot graph]] #code⚓︎

  • Model code

    scatterplot(gss, "variable1_here", "variable2_here")
    

  • Working code

    scatterplot(gss, "hrs1", "sibs")
    

  • Line 1: We put hrs1 here ➜ variable1_here and sibs here ➜ variable2_here.

    • The order of the variables doesn't matter.
    • Make sure there's no space after the variable name, such as "sibs ".
    • Highlighting and running this code will generate the output below.

[[Scatterplot graph]] #output⚓︎

Scatterplot showing little to no relationship between number of hours worked last week and number of brothers and sisters respondents have. The smoothed line rises from about 2.7 siblings at 0 hours to just above 3.1 around 25 hours, drops to about 2.5 near 45 hours, then fluctuates slightly and ends near 2.6; the annotation reports a very weak, non-significant correlation, r = -0.039, p = 0.068.

[[Correlation table]] and [[scatterplot graph]] #interpretation⚓︎

Nonsignificant correlation interpretation template

There is no significant correlation between [variable 1 label] and [variable 2 label] since the p-value is greater than .05.

This means that [variable 1 label] and [variable 2 label] are not related.

Nonsignificant correlation interpretation sample

There is no significant correlation between the number of hours respondents worked last week and the number of brothers and sisters respondents have since the p-value is greater than .05.

This means that the number of hours respondents worked last week and the number of brothers and sisters respondents have are not related.

  • We first check the p-value. If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value.
    • This analysis yields a [[nonsignificant correlation]] since the p-value > 0.05.
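The decision steps used in the three examples (check the p-value first, then the direction, then the strength) can be sketched as one function. The name `interpret_correlation` is hypothetical, the wording is adapted from the templates above, and treating p exactly equal to 0.05 as nonsignificant and the 0.3 and 0.5 boundary values as the stronger band are assumptions.

```r
# Hypothetical helper: write an interpretation from two labels, r, and p.
# Assumptions: p of exactly 0.05 counts as nonsignificant;
# |r| of exactly 0.3 counts as moderate and exactly 0.5 as strong.
interpret_correlation <- function(label1, label2, r, p) {
  # Step 1: check the p-value; stop here if the correlation is nonsignificant
  if (p >= 0.05) {
    return(sprintf(
      "There is no significant correlation between %s and %s since the p-value is greater than .05.",
      label1, label2
    ))
  }

  # Step 2: direction comes from the sign of r
  direction <- if (r < 0) "negative" else "positive"

  # Step 3: strength comes from the absolute value of r
  size <- abs(r)
  strength <- if (size < 0.3) "weak" else if (size < 0.5) "moderate" else "strong"

  sprintf(
    "There is a significant correlation between %s and %s since the p-value is less than .05. This correlation is %s and %s since the r-value is %.3f.",
    label1, label2, direction, strength, r
  )
}

print(interpret_correlation("education in years", "television screen time in hours", -0.163, 0.0001))
print(interpret_correlation("mothers' SEI score", "fathers' SEI score", 0.364, 0.0001))
print(interpret_correlation("hours worked last week", "number of siblings", -0.039, 0.068))
```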

Multivariate correlation⚓︎

  • [[Multivariate correlation]]
    • Shows the relationships among multiple (more than two) continuous variables in a table-graph format.
      • [[Correlation table matrix]] shows these relationships in a table format:
        • Reports the correlation coefficients (r-value) and statistical significance (p-value).
      • [[Scatterplot graph matrix]] shows these relationships in a graph format:
        • In addition to the r-value and p-value, displays the shape of the relationship.

Example: Correlation table matrix⚓︎

Find the variables in Variables in GSS page⚓︎

  • We'll use all the variables we've used so far in the previous sections.

    | Variable name | Variable label | Variable type | Question wording and response categories |
    |---|---|---|---|
    | educ | Respondents' education in years | Continuous | What is the highest year of school you completed? |
    | tvhours | Television screen time in hours | Continuous | On the average day, how many hours do you personally watch television? |
    | masei10 | Respondents' mothers' socio-economic index score | Continuous | Respondent's mother's socio-economic index score (calculated) (Min: 0; Max: 100) |
    | pasei10 | Respondents' fathers' socio-economic index score | Continuous | Respondent's father's socio-economic index score (calculated) (Min: 0; Max: 100) |
    | hrs1 | Number of hours respondents worked last week | Continuous | How many hours did you work last week, at all jobs? |
    | sibs | Number of brothers and sisters respondents have | Continuous | How many brothers and sisters do you have? |

[[Correlation table matrix]] #code⚓︎

  • Model code

    tab_corr (gss[, c("variable1_here", "variable2_here", "variable3_here", "variable4_here", "variable5_here", "variable6_here")],  
    p.numeric = T, triangle="lower", na.deletion = "pairwise")
    

  • Working code

    tab_corr (gss[, c("educ", "tvhours", "masei10", "pasei10", "hrs1", "sibs")], 
    p.numeric = T, triangle="lower", na.deletion = "pairwise")
    

  • Line 1: We put the previous variables here ➜ variable1_here and variable2_here and so on.

    • The order of the variables doesn't matter.
    • Highlighting and running this code will generate the table below.

[[Correlation table matrix]] #output⚓︎

|  | Respondents' education | TV screen time | Mothers' SES | Fathers' SES | Hours worked last week | Number of siblings |
|---|---|---|---|---|---|---|
| Respondents' education |  |  |  |  |  |  |
| TV screen time | r = -0.163, p = 0.001*** |  |  |  |  |  |
| Mothers' SES | r = 0.287, p = 0.001*** | r = -0.125, p = 0.001*** |  |  |  |  |
| Fathers' SES | r = 0.353, p = 0.001*** | r = -0.123, p = 0.001*** | r = 0.364, p = 0.001*** |  |  |  |
| Hours worked last week | r = 0.026, p = 0.226 | r = -0.029, p = 0.265 | r = 0.034, p = 0.170 | r = -0.024, p = 0.338 |  |  |
| Number of siblings | r = -0.208, p = 0.001*** | r = 0.104, p = 0.001*** | r = -0.221, p = 0.001*** | r = -0.187, p = 0.001*** | r = -0.039, p = 0.068 |  |
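The lower-triangle layout and the pairwise handling of missing values can be sketched in base R with `cor()` and `use = "pairwise.complete.obs"`. This assumes that `na.deletion = "pairwise"` in `tab_corr()` behaves the same way, meaning each pair of variables uses every row where both values are present. The `toy` data frame is invented for illustration.

```r
# Toy data (invented) with a few missing values
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, 8, 10),
  c = c(5, 3, NA, 2, 1)
)

# Pairwise deletion: each r uses all rows where both variables are present
full <- cor(toy, use = "pairwise.complete.obs")

# Keep only the lower triangle, like triangle = "lower" in the course code
lower <- full
lower[upper.tri(lower, diag = TRUE)] <- NA

print(round(lower, 3))
```

Here `a` and `b` are compared on the first four rows only, while `b` and `c` are compared on rows 1, 2, 4, and 5, so no row is dropped for a missing value in a variable that is not part of the pair.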

[[Scatterplot graph matrix]] #code⚓︎

  • Model code

    scatterplot_matrix <- gss[, c("variable1_here", "variable2_here", "variable3_here", "variable4_here", "variable5_here", "variable6_here")]
    pairs_panels_pval(scatterplot_matrix, color = "#15616d")

  • Working code

    scatterplot_matrix <- gss[, c("educ", "tvhours", "masei10", "pasei10", "hrs1", "sibs")]
    pairs_panels_pval(scatterplot_matrix, color = "#15616d")

  • Line 1: We put the previous variables here ➜ variable1_here and variable2_here and so on.
    • The order of the variables doesn't matter.
    • Highlighting and running this code will generate the output below.

[[Scatterplot graph matrix]] #output⚓︎

A scatterplot matrix showing correlations among six variables: educ, tvhours, masei10, pasei10, hrs1, and sibs. The diagonal displays histograms for each variable. The upper-right triangle shows r-values and p-values for each variable pair. The lower-left triangle shows scatterplots. Annotations highlight how to locate two variables of interest (tvhours and hrs1): find their histograms on the diagonal (1), their correlation in the upper-right intersection (2, r=-0.029, p=0.265), and their scatterplot in the lower-left intersection (3).

  • (1) Start with the diagonal:
    • Find your two variables of interest: here, for example, tvhours and hrs1.
    • The diagonal cells (in red) show each variable's distribution as a histogram.
  • (2) Look upper-right:
    • The upper-right cell (in purple) at the intersection of your two variables reports the correlation coefficient (r-value) and statistical significance (p-value). Here, r=-0.029, p=0.265.
  • (3) Look lower-left:
    • The lower-left cell (in blue) at the same intersection shows the scatterplot.
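The three-part layout described above (histograms on the diagonal, r and p in the upper-right, scatterplots in the lower-left) can be approximated with base R's `pairs()`. This is only a sketch: `pairs_panels_pval()` comes from the course script and may be built differently, the panel function names `panel_hist` and `panel_cor` are hypothetical, and the data are randomly generated.

```r
set.seed(1)  # fixed seed so the toy data are repeatable
n <- 100
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = -0.5 * x1 + rnorm(n),  # built to correlate negatively with x1
  x3 = rnorm(n)               # unrelated noise
)

# Diagonal panels: histogram of each variable
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))           # restore plot coordinates afterwards
  par(usr = c(usr[1:2], 0, 1.5))    # keep x range, rescale y for bar heights
  h <- hist(x, plot = FALSE)
  n_breaks <- length(h$breaks)
  rect(h$breaks[-n_breaks], 0, h$breaks[-1], h$counts / max(h$counts),
       col = "grey80")
}

# Upper-right panels: r-value and p-value for each pair
panel_cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))          # unit square so text can be centered
  test <- cor.test(x, y)
  text(0.5, 0.5, sprintf("r = %.3f\np = %.3f", test$estimate, test$p.value))
}

# Lower-left panels: scatterplots with a smoothed line (panel.smooth is built in)
pairs(toy,
      diag.panel  = panel_hist,
      upper.panel = panel_cor,
      lower.panel = panel.smooth)
```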

[[Correlation table matrix]] and [[scatterplot graph matrix]] #interpretation⚓︎

Nonsignificant correlation interpretation template

There is no significant correlation between [variable 1 label] and [variable 2 label] since the p-value is greater than .05.

This means that [variable 1 label] and [variable 2 label] are not related.

Nonsignificant correlation interpretation sample

There is no significant correlation between the television screen time in hours and the number of hours respondents worked last week since the p-value is greater than .05.

This means that the television screen time in hours and the number of hours respondents worked last week are not related.

  • We first check the p-value. If it's greater than 0.05, it's a nonsignificant correlation and we do not interpret the r-value.
    • This analysis yields a [[nonsignificant correlation]] since the p-value > 0.05.