11. Linear regression basics⚓︎

Module items⚓︎

R Script file⚓︎

[[Copy the code]] below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/11-linear-regression.R", "11-linear-regression.R"); 
file.edit("11-linear-regression.R")

Lab assignment⚓︎

Linear regression

Sample lab assignment⚓︎

Sample: Linear regression

Learning outcomes⚓︎

Define linear regression analysis
Identify situations in which linear regression is appropriate
Differentiate between explanatory modeling, descriptive modeling, and predictive modeling
Predict outcome variable based on factor variables
Identify the effect of confounding variables on modeling
Interpret coefficients, standardized betas, and adjusted R-squared values

Regression definition⚓︎

[[Regression]] is the most widely used statistical technique; it is a way to model a relationship between variables.
- In regression, we explain the effects of [[factor variable]]s on the [[outcome variable]] by estimating how changes in the factor variables are associated with changes in the outcome variable.
- Unlike [[correlation analysis]], which does NOT imply a causal relationship, regression does imply one and requires the specification of outcome and factor variables. A correlation table as an example:
```
tab_corr (gss[c("childs", "educ")],  
p.numeric = T, triangle="lower")
```
Respondents' education in years Number of children respondents have

Respondents' education in years

Number of children respondents have r = -0.182
p = 0.001***

Significant (p < 0.05) and weak (r < |0.3|) correlation interpretation sample

There is a significant correlation between respondents' education in years and the number of children respondents have since the p-value is less than .05.

This correlation is negative and weak since the r-value is -0.182 (less than |0.3|).

This means that as respondents' education in years increases the number of children respondents have decreases, and vice versa.
In a regression analysis, respondents' education in years (factor variable) may decrease the number of children respondents have (outcome variable), not the other way around.
- Otherwise, we might argue that having children decrease respondents' education.

[[Types of regression modeling based on outcome variable]]⚓︎

There are two types of regression modeling based on outcome variable:
- (1) [[Linear regression]]
  - In linear regression:
    - The [[outcome variable]] is
      - [[Continuous]] and,
    - There are more than one [[factor variable]] (any).
  - This module is about linear regression.
- (2) [[Logistic regression]]
  - In logistic regression:
    - The [[outcome variable]] is
      - [[Binary]] (more specifically a dummy variable - which we'll learn later) and,
    - There are more than one [[factor variable]] (any).

[[Types of regression modeling based on modeling type]]⚓︎

There are three types of regression modeling based on outcome variable, often not used mutually exclusive:
- (1) [[Explanatory modeling|ref]]
  - In such models, a set of factor variables are assumed to cause an effect on the outcome variable. The literature should tell us which variables should be used in models.
    - They are theory-driven.
    - Explanatory models are about understanding the causal relationships and the why behind them.
- (2) [[Predictive modeling|ref]]
  - The goal is to predict the outcome variable by using the values of factor variables. One’s income, for example, could be predicted in this modeling.
    - Predictive models are about accurately forecasting the outcome variable and the what will happen in the future.
- (3) [[Descriptive modeling|ref]]
  - Unlike explanatory modeling, in descriptive modeling the reliance on an underlying causal theory is absent or incorporated in a less formal way.
    - We construct logical connections between the variables to summarize the data.
    - Descriptive models are about summarizing and describing the data to reveal patterns, without necessarily making predictions or explanations.
- We will see some examples of regression analysis and outputs in the following section.

Example 1: Hospitalization days⚓︎

[[Predictive modeling]] and [[explanatory modeling]] will be used identify variables that predict the duration of hospital stays. This method involves [[explanatory modeling]] as well, since it examines cause-and-effect relationships among the variables that are theory-driven.
- Imagine working in a hospital, calculating hospitalization days to arrange available beds for the next month.
- Hospitalization days are calculated using regression analysis, which includes the following variables, among many others:
  - Seriousness of the operation: A continuous variable that quantifies the complexity and inherent risks of the operation on a scale (0: not serious; 100: very serious).
  - Age: The age of the patient in years.
  - Healthy lifestyle: A continuous variable that provides a measure of health lifestyle (0: extremely poor; 100: extremely healthy).
  - Previous hospitalizations: The total number of days spent in the hospital in the past due to medical procedures or illnesses.

Here's what such regression analysis table looks like:

Hospitalization days

Factors	Coeff.	std. Coeff	p
(Intercept)	-2.50 (0.85)	0.00 (0.03)	0.001***
Seriousness of the operation	0.30 (0.05)	0.05 (0.03)	0.002**
Age	0.12 (0.48)	0.17 (0.03)	0.001***
Healthy lifestyle	-0.22 (0.35)	-0.13 (0.03)	0.009**
Previous hospitalizations	0.25 (0.24)	0.34 (0.21)	0.043*
Observations	7995
R² / R² adjusted	0.648 / 0.659

Formula:
- Predicted hospitalization days = -2.50 + (0.30 × seriousness of the operation) + (0.12 × age) + (-0.22 × healthy lifestyle) + (0.25 × previous hospitalizations)
- The formula starts at -2.50 days (the intercept — the baseline before any patient information is added), then each variable nudges the prediction up or down.

Example calculation:

Say a patient has:

Seriousness of the operation = 40, Age = 88, Healthy lifestyle = 54, Previous hospitalizations = 6

Plug each value into the formula:

Step	Calculation	Result
Start with the baseline		-2.50 days
Seriousness of the operation (0.30 × 5)	adds	+12 days
Age (0.12 × 60)	adds	+10.56 days
Healthy lifestyle (-0.22 × 54)	subtracts	-11.88 days
Previous hospitalizations (0.25 × 2)	adds	+3 days
Predicted hospitalization days		= 11.18 days

Each factor shifts the predicted hospitalization days up or down:
Every 1-unit increase in seriousness of the operation adds 0.30 days
Every 1 year of age adds 0.12 days
Every 1-unit increase in healthy lifestyle reduces days by 0.22
Each previous hospitalization adds 0.25 days

Warning

In practice, this calculation is done for all 7,995 patients at once; not manually one by one.
This exercise simply shows the functions and effects of the coefficients.

Example 2: Car insurance rates⚓︎

[[Predictive modeling]] and [[explanatory modeling]] will be used to determine the variables that predict car insurance rates. This approach also involves [[explanatory modeling]], as it involves identifying cause-and-effect relationships that are theory-driven.
- Why do car insurance rates vary between different states and even between cities within the same state?
Car insurance rates are calculated using regression analysis, which includes the following variables, among many others:
- Population of the city: Larger populations may lead to higher accident rates due to increased traffic.
- Accidents per 1000 people: Reflects the overall risk associated with the city.
- Quality of roads: A city-wide average rating of road conditions.
- Crime rate: Higher crime rates, especially theft and vandalism, lead to higher insurance premiums.
- Weather patterns: Cities with severe weather conditions could experience more weather-related accidents (0:bad, 100:ideal).
- Traffic congestion levels: Cities with higher levels of congestion experience more fender benders.
- Average commute time: Longer commutes can increase the probability of accidents occurring.

Here's what such regression analysis table looks like:

Car insurance rate

Factors	Coeff.	std. Beta	p
(Intercept)	1452.87 (0.01)	0.00 (0.03)	0.001**
Population of the city	0.0015 (0.01)	0.412 (0.01)	0.001***
Accidents per 1000 people	2.35 (0.48)	0.238 (0.03)	0.001***
Quality of roads	-1.78 (0.35)	-0.213 (0.02)	0.004**
Crime rate	0.97 (0.24)	0.169 (0.02)	0.012*
Weather patterns	-0.045 (0.02)	-0.121 (0.04)	0.048*
Traffic congestion levels	0.215 (0.06)	0.172 (0.02)	0.021*
Average commute time	0.28 (0.07)	0.158 (0.04)	0.002**
Observations	19495
R² / R² adjusted	0.452 / 0.468

Formula:
- Predicted rate = 1452.87 + (0.0015 × population) + (2.35 × accidents per 1000) + (-1.78 × road quality) + (0.97 × crime rate) + (-0.045 × weather patterns) + (0.215 × traffic congestion) + (0.28 × average commute time)
- The formula starts at $1,452.87 (the baseline rate before any city characteristics are added), then each variable nudges the prediction up or down.

Example calculation:

Say a city has:
Population = 500,000, Accidents per 1000 = 8, Road quality = 60, Crime rate = 12, Weather patterns = 45, Traffic congestion = 7, Average commute time = 30

Plug each value into the formula:

Step	Calculation	Result
Start with the baseline		$1,452.87
Population (0.0015 × 500,000)	adds	+$750.00
Accidents per 1000 (2.35 × 8)	adds	+$18.80
Road quality (-1.78 × 60)	subtracts	-$106.80
Crime rate (0.97 × 12)	adds	+$11.64
Weather patterns (-0.045 × 45)	subtracts	-$2.03
Traffic congestion (0.215 × 7)	adds	+$1.51
Average commute time (0.28 × 30)	adds	+$8.40
Predicted insurance rate		= $2,134.39

What each variable does to the prediction:
Every 1 person increase in population adds $0.0015 (larger cities mean higher rates)
Every 1 additional accident per 1000 people adds $2.35
Every 1-unit improvement in road quality reduces the rate by $1.78 (better roads mean fewer claims)
Every 1-unit increase in crime rate adds $0.97
Every 1-unit improvement in weather patterns reduces the rate by $0.045 (milder weather means fewer accidents)
Every 1-unit increase in traffic congestion adds $0.215
Every 1 additional minute of average commute time adds $0.28

Linear regression specifics⚓︎

[[Linear regression]] table provides:
- [[Coefficient]]: The raw effect of each factor on the outcome in its original unit, e.g., every 1 additional accident per 1000 people adds $2.35 to the insurance rate.
  - [[Standard error]]: The margin of uncertainty around each coefficient; smaller standard errors mean more precise estimates. Standard errors are right under the coefficients in parenthesis.
- [[Standardized coefficient]]: The coefficient rescaled to allow comparison across factors regardless of their units. The larger the absolute value, the stronger the factor's relative importance. Standardized errors are right under the standardized beta in parenthesis.
  - This is useful because different variables are measured differently.
    - Age is measured in years.
    - Education is measured in years.
    - Income may be measured in dollars.
    - Occupational prestige may be measured with a score.
- [[p-value]]: The probability that the observed coefficient is due to chance. A p-value less than 0.05 is considered statistically significant.
- [[Adjusted R-square]]: The adjusted R-square value shows whether adding additional factor variables improve the explanatory power of regression model or not. The adjusted R-squared should be reported as a percentage.

GSS example: Predicting personal income (`conrinc`)⚓︎

[[Predictive modeling]] and [[descriptive modeling]] will be used to predict respondents' personal income based on the population density of residence during their adolescence years, age, and occupational prestige score, and their education in years.
- This modeling is descriptive in nature as well, since the variables have not been derived from existing literature but are based on logically constructed connections.
- We will demonstrate the possible effects of the factor variables on the outcome variable, (if any, also, the strength of the effects). They are not theory-driven.
- Regression analysis will answer the following question:
  - Do factor variables significantly determine respondents’ personal income? In other words, do they significantly increase or decrease respondents’ personal income?
    - If so, what is the contribution of each variable on respondents’ personal income?
    - If so, what is the strength of each variable’s contribution on respondents’ personal income?

Step 1: Find the variables in Variables in GSS page⚓︎

In this analysis, we propose a cause-and-effect relationship in which these factors may affect (increase or decrease) personal income.

flowchart LR
subgraph F["Continuous factor variables"]
    A[Population density]
end

subgraph F["Continuous variable"]
    B[Age]
end

subgraph F["Continuous variable"]
    C[Occupational prestige]
end

subgraph F["Continuous variable"]
    D[Education]
end

subgraph O["Continuous outcome variable"]
    E[Personal income]
end

A -.->|May affect| E
B -.->|May affect| E
C -.->|May affect| E
D -.->|May affect| E

We want to make sure that conrinc and res16, age, prestg10, and educ are continuous variables or usable ordinal variables (Ordinal ✅)

Variable name	Variable label	Variable type	Question wording and response categories
`conrinc`	Respondents' personal income	Continuous	What is your income in dollars? (Min: $281.5; Max, $123,761.9)
`res16`	Population density of residence during adolescence years	Ordinal ✅	Which of the categories on this card comes closest to the type of place you were living in when you were 16 years old? (1: Country, nonfarm; 2: Farm; 3: Town less than 50K; 4: 50K to 250K; 5: Big city, suburb; 6: City greater than 250K)
`age`	Respondents' age	Continuous	What is your age? (Min: 18, Max: 89)
`prestg10`	Respondents' occupational prestige score	Continuous	Respondent's occupational prestige score (calculated) (Min: 16, Max: 80)
`educ` From: Variables in GSS	Respondents' education in years	Continuous	What is the highest year of school you completed? (Min: 0, Max: 20)

[[Linear regression]] with 1 factor #code⚓︎

Model code

model1 <- lm(outcome_here ~ factor1_here, data = gss)
tab_model(model1, show.std = T, show.ci = F, collapse.se = T)

Working code
1 2
model1 <- lm(conrinc ~ res16, data = gss) tab_model(model1, show.std = T, show.ci = F, collapse.se = T)
- Note that we'll introduce the factor variables one by one.
- Line 1: We put conrinc here ➜ outcome_here and res16 here ➜ factor1_here.
  - Outcome variable variable first; factor variable second.
- Line 2: Check the first argument: model1. If this is model1, then we should use model1 here.
  - This needs to be model1, otherwise this code won't work.

[[Linear regression]] with 1 factor #output⚓︎

Respondents' personal income

Factors	Coeff.	std. Coeff	p
(Intercept)	31243.32 (1929.26)	-0.00 (0.02)	0.001***
Population density of residence during adolescence years	1449.97 (457.75)	0.07 (0.02)	0.002**
Observations	2340
R² / R² adjusted	0.004 / 0.004

Population density of residence during adolescence years is statistically significant factor of personal income (p < 0.05).
Looks like residing in 6: City greater than 250K, instead of 5: Big city increases personal income by $1,449,
- Alternatively speaking, residing in 2: Farm; instead of 3: Town less than 50K decreases personal income by $1,449.
Now, let's add the second factor variable, age

[[Linear regression]] with 2 factors #code⚓︎

Model code

model2 <- lm(outcome_here ~ factor1_here + factor2_here, data = gss)
tab_model(model2, show.std = T, show.ci = F, collapse.se = T)

Working code
1 2
model2 <- lm(conrinc ~ res16 + age, data = gss) # (1)! tab_model(model2, show.std = T, show.ci = F, collapse.se = T) # (2)!
- Line 1: We put conrinc here ➜ outcome_here; res16 here ➜ factor1_here; age here ➜ factor2_here.
  - Outcome variable variable first; factor variable second.
- Line 2: Check the first argument: model2. If this is model2, then we should use model2 here.
  - This needs to be model2, otherwise this code won't work.

[[Linear regression]] with 2 factors #output⚓︎

Respondents' personal income

Factors	Coeff.	std. Coeff	p
(Intercept)	17269.65 (2839.75)	-0.00 (0.02)	0.001***
Population density of residence during adolescence years	1515.16 (456.70)	0.07 (0.02)	0.001***
Respondents' age	308.91 (45.49)	0.14 (0.02)	0.001***
Observations	2303
R² / R² adjusted	0.023 / 0.023

Population density of residence during adolescence years and respondents' age variables are statistically significant factor of personal income (p < 0.05).
Looks like, for example, residing in 6: City greater than 250K, instead of 5: Big city increases personal income by $1,515,
One year increase in age increases personal income by $308. For this model, a 40-year-old would make $616 more compared to a 30-year-old.
Now, let's add the third factor variable, prestg10

[[Linear regression]] with 3 factors #code⚓︎

Model code

model3 <- lm(outcome_here ~ factor1_here + factor2_here + factor3_here, data = gss)
tab_model(model3, show.std = T, show.ci = F, collapse.se = T)

Working code

model3 <- lm(conrinc ~ res16 + age + prestg10, data = gss) # (1)! 
tab_model(model3, show.std = T, show.ci = F, collapse.se = T) # (2)! 

Line 1: We put conrinc here ➜ outcome_here and res16 here ➜ factor1_here; age here ➜ factor2_here; prestg10 here ➜ factor3_here.
- Outcome variable variable first; then, factor variables separated by plus (+).
- Check the first argument: model3. If this is model3, then we should use model3 here.
Line 2: Check the first argument: model3. If this is model3, then we should use model3 here.
- This needs to be model3, otherwise this code won't work.

[[Linear regression]] with 3 factors #output⚓︎

Respondents' personal income

Factors	Coeff.	std. Coeff	p
(Intercept)	-16380.05 (3167.04)	-0.00 (0.02)	0.001***
Population density of residence during adolescence years	938.46 (424.60)	0.04 (0.02)	0.027*
Respondents' age	224.50 (42.34)	0.10 (0.02)	0.001***
Respondents' occupational prestige score	881.86 (45.69)	0.37 (0.02)	0.001***
Observations	2278
R² / R² adjusted	0.160 / 0.159

Population density of residence during adolescence years, respondents' age, and respondents' occupational prestige score variables are statistically significant factor of personal income (p < 0.05).
- However, we see that the effect of population density of residence during adolescence years has dramatically decreased.
One year increase in age increases personal income by $308. For this model, for example, a 40-year-old would make $616 more compared to a 30-year-old.
One score increase in occupational prestige score increases personal income by $881. For this model, for example, a respondent with 80 occupational prestige score would make $8,881 more compared to a respondent with 70.
Finally, let's add our fourth, and final factor variable, educ.

[[Linear regression]] with 4 factors #code⚓︎

Model code

model4 <- lm(outcome_here ~ factor1_here + factor2_here + factor3_here + factor4_here, data = gss)
tab_model(model4, show.std = T, show.ci = F, collapse.se = T)

Working code
1 2
model4 <- lm(conrinc ~ res16 + age + prestg10 + educ, data = gss) tab_model(model4, show.std = T, show.ci = F, collapse.se = T)
Code explanation
- Line 1: We put conrinc here ➜ outcome_here; res16 here ➜ factor1_here; age here ➜ factor2_here; prestg10 here ➜ factor3_here; educ here ➜ factor4_here.
  - Outcome variable variable first; then, factor variables separated by plus (+).
- Line 2: Check the first argument: model4. If this is model1, then we should use model1 here
  - This needs to be model4, otherwise this code won't work.

[[Linear regression]] with 4 factors #output⚓︎

Respondents' personal income

Factors	Coeff.	std. Coeff	p
(Intercept)	-41471.83 (3964.93)	-0.00 (0.02)	0.001***
Population density of residence during adolescence years	499.80 (417.99)	0.02 (0.02)	0.232
Respondents' age	196.07 (41.51)	0.09 (0.02)	0.001***
Respondents' occupational prestige score	649.05 (50.12)	0.27 (0.02)	0.001***
Respondents' education in years	2624.02 (256.54)	0.22 (0.02)	0.001***
Observations	2272
R² / R² adjusted	0.197 / 0.196

One year increase in age increases personal income by $196. For this model, for example, a 40-year-old would make $196 more compared to a 30-year-old.
One score increase in occupational prestige score increases personal income by $649. For this model, for example, a respondent with 80 occupational prestige score would make $6,490 more compared to a respondent with 70.
One year increase in education increases personal income by $2,624. For this model, for example, a respondent with 10 years of schooling would make $26,240 less compared to a to a respondent with 20.
Population density of residence during adolescence years is NO LONGER statistically significant (p = 0.232).
What changed?: The presence of [[confounding variable]]
- In earlier models, population density appeared to have a significant positive effect on income.
  - But after adding education, this effect disappears.
What does this mean?
- This suggests that the earlier relationship between population density and income was not a direct effect.
  - Instead, it was influenced by another variable.
Confounding variable: Population density of residence during adolescence years
- Population density of residence during adolescence years is acting as a confounding variable here. A confounding variable is a variable that:
  - is related to the factor variable (education), and
  - also affects the outcome variable (income).
What does this mean, again?
- People who grow up in more densely populated areas (e.g., large cities) are more likely to have higher levels of education.
  - Higher education, in turn: increases personal income.
- So what looked like this at first: Population density → higher income
  - Actually looks like this: Population density → higher education → higher income
- Once we control for education, the direct effect of population density disappears.
  - That is why its coefficient becomes smaller (from ~1500 to ~500) and loses statistical significance.
- Without including education in the model, we would have made a misleading conclusion about population density.
  - This is why regression analysis is useful:
    - It allows us to control for confounding variables and identify more accurate relationships.
In short:
- Population density looked important at first, but its effect was actually explained by education.
- Education is the variable that clarifies the relationship.
```
flowchart LR
A[Confounder factor <br/> Population density]
B[Education]
C[Income]

A -->|Increases| B
B -->|Increases| C
A -.->|No direct effect| C
```

[[Linear regression interpretation specifics]]⚓︎

[[Reporting of coefficients]]⚓︎

When reporting the [[coefficient]]s of the factor variables, we ensure that the sentence includes the unit of analysis (one unit, score, year, dollars, etc.) of both the factor variable and the outcome variable.
- Factor variable: (health - Perceived personal health quality - (1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor))
- Outcome variable: (sei10 - Respondents' socio-economic index score - (Min: 9, Max: 92.8))
  - A one unit increase in perceived personal health quality increases respondents' socio-economic index score by 12.45 points.
- Factor variable: (age - Respondents' age - (Min: 18, Max: 89))
- Outcome variable: (polviews - Respondents' conservatism level - (1: extremely liberal; 7: extremely conservative))
  - A year increase in respondents' age increases respondents' conservatism level by 1.78 points.
- Factor variable: (maeduc - Respondents' mothers' education in years - (Min: 0, Max: 20))
- Outcome variable: (conrinc - Respondents' personal income - (Min: $281.5; Max, $123,761.9))
  - A year increase in respondents' mothers' education increases respondents’ personal income by $2,857.
In previous regression analysis, each factor variable (population density, age, occupational prestige score, and education) shows how they increase income with a one-unit increase, e.g.,
- How does one more unit increase in population density increase income?
  - It has no effect.
- How does one more year of age increase income?
  - A year increase in respondents' age increases personal income by $171.
- How does one more point of occupational prestige score increase income?
  - A score increase in occupational prestige score increases personal income by $649.
- How does one more year of education increase income?
  - A year increase in education increases personal income by $2,676.
It looks like the a year increase in education has the most dramatic effect.
- Is it?

Education in years (min: 0; max: 20)

The range of education is 0-21; there are 21 different responses.

flowchart LR
    A1["0"] -->|$2,676| A2["1"] --> A3["..."] --> A4["12"] -->|$2,676| A5["13"] --> A6["..."] --> A7["20"]

Occupational prestige score (min: 16; max: 80)
- The range of occupational prestige score is 16-80; there are 65 different responses.
```
flowchart LR
    A1["16"] -->|$649| A2["17"] --> A3["..."] --> A4["66"] -->|$649| A5["67"] --> A6["..."] --> A7["80"]
```
The contribution of variables with fewer responses to the outcome variable seems greater.
If we used parental income in the model, there would be 100,000 different responses ($1 range).
- The effect of one more dollar of parental income would contribute to our own income minimally.
- Therefore, coefficients are not comparable.
  - We CAN’T directly say "education increases income more than occupational prestige score does.”

[[Reporting of standardized coefficients]] (std. Coeff.)⚓︎

To compare the relative strength of factor variables, we use [[standardized coefficient]].
Standardized coefficients put all variables on the same scale.
- This way, we can compare variables even if they are measured differently and with different ranges.
  - Population density is measured with categories (1-6).
  - Age is measured in years (18-89).
  - Occupational prestige is measured with a prestige score (16-80).
  - Education is measured in years (0-20).
Therefore, standardized coefficients answer a slightly different question:
- Not “How much does one unit increase income?”
- But “Which factor variable has the strongest relative contribution to income?”
Based on standardized coefficients In the final model:
- "0.27" ➜ Occupational prestige score has the strongest relative contribution to personal income.
- "0.22" ➜ Education has the second strongest relative contribution.
- "0.09" ➜ Age has a smaller contribution.
- "0.02" ➜ Population density has almost no contribution. And, actually, it's statistically nonsignificant.
This is why we should not compare coefficients directly.
- Education has a larger raw coefficient than occupational prestige:
  - Education coefficient = 2,624
  - Occupational prestige coefficient = 649
- But the standardized coefficient of occupational prestige is larger:
  - Occupational prestige std. Coeff. = 0.27
  - Education std. Coeff. = 0.22
- This means that occupational prestige has a stronger relative contribution in this model.
We saw a similar issue in the car insurance rate example.
- The coefficient for population was very small:
  - Population coefficient = 0.0015
  - At first, this looks like population has almost no effect.
    - However, the standardized coefficient for population was the highest:
      - Population std. Beta = 0.412
- This happens because population has a very large range.
  - A one-person increase changes the insurance rate only a little.
  - But the difference between a small city and a large city can be very large.
    - Therefore, even when the raw coefficient is very small, the standardized coefficient can show that the variable is one of the strongest factors.
In short:
- [[Coefficient]]s show the effect of a one-unit increase.
- [[Standardized coefficient]]s show the relative strength of factor variables.
  - To compare which factor variable matters more in the model, we should look at standardized coefficients, not raw coefficients.
  - Then, this is how we'd sort the factor variables based on their relative strength.
    - The strongest factor of respondents’ personal income is respondents' occupational prestige score (std. Coeff=0.27), followed by respondents' education in years (std. Coeff=0.22), and respondents' age (std. Coeff=0.09).

[[Reporting of adjusted R-square]] (R² adjusted)⚓︎

The [[adjusted R-square]] value shows whether adding additional factor variables improves the explanatory power of the regression model or not.
- In other words, it shows how much of the outcome variable is explained by the factor variables in the model.
- The adjusted R-square value should be reported as a percentage.
Here's a shortcut for converting a number with decimals to a percentage:
- Move the dot two times to the right:
  - 0.004 ➜ 0.4%
  - 0.023 ➜ 2.3%
  - 0.159 ➜ 15.9%
  - 0.196 ➜ 19.6%
In our regression models:
- Model 1 adjusted R-square: 0.004 ➜ 0.4%
- Model 2 adjusted R-square: 0.023 ➜ 2.3%
- Model 3 adjusted R-square: 0.159 ➜ 15.9%
- Model 4 adjusted R-square: 0.196 ➜ 19.6%
This means that every additional factor variable improved the subsequent regression models.
- Model 2 explains more of personal income than Model 1.
- Model 3 explains more of personal income than Model 2.
- Model 4 explains more of personal income than Model 3.
In the final model, the adjusted R-square is 0.196.
- This means that population density, age, occupational prestige score, and education explain 19.6% of the variation in respondents' personal income.
- The remaining 80.4% is explained by other factors that are not included in this model.

[[Linear regression]] #interpretation⚓︎

Respondents' personal income

Factors	Coeff.	std. Coeff	p
(Intercept)	-41471.83 (3964.93)	-0.00 (0.02)	0.001***
Population density of residence during adolescence years	499.80 (417.99)	0.02 (0.02)	0.232
Respondents' age	196.07 (41.51)	0.09 (0.02)	0.001***
Respondents' occupational prestige score	649.05 (50.12)	0.27 (0.02)	0.001***
Respondents' education in years	2624.02 (256.54)	0.22 (0.02)	0.001***
Observations	2272
R² / R² adjusted	0.197 / 0.196

Linear regression interpretation template. Click to expand

First section: The significance levels
Mention which variables [variable labels] are statistically significant, and which variables are statistically nonsignificant (if any). Variables with at least one asterisk (*) are statistically significant.

[Variable label of significant factor variable 1], [Variable label of significant factor variable 2], [Variable label of significant factor variable 3]... are statistically significant factors of [Variable label of outcome variable] since the p values are less than 0.05. [If any]: [Variable label of significant factor variable 4], [Variable label of significant factor variable 5]... is(are) not statistically significant factor(s) of [Variable label of outcome variable] since the p value(s) is(are) greater than 0.05.

Second section: The explanation of coefficients
Mention how significant factor variables increase or decrease the value of the outcome variable, using "Coefficients" (Coeff. column). When reporting the coefficients, ensure that the sentence includes the unit of analysis (one unit, a day, a score, a year, a dollar, etc.) of both the factor variables and the outcome variable.

A [unit/day/score,year,dollar (unit of analysis of factor variable1)] increase in [Variable label of significant factor variable1] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].

A [unit/day/score,year,dollar (unit of analysis of factor variable2)] increase in [Variable label of significant factor variable2] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].

A [unit/day/score,year,dollar (unit of analysis of factor variable3)] increase in [Variable label of significant factor variable3] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].

Note: Do not mention nonsignificant variables here.

Third section: The explanation of standardized coefficients
Mention the strongest factor variables of the outcome variable using the "Standardized coefficients" (Std. Coeff. column) in order. Only mention the statistically significant ones. "Standardized coefficient" is an absolute number, which means -.56 is stronger than .45.

The strongest factor of [Variable label of outcome variable] is the [Variable label of first strongest factor variable] (std. Coeff=0.xx), followed by [Variable label of second strongest factor variable] (std. Coeff=0.xx), and [Variable label of third strongest factor variable] (std. Coeff=0.xx).

Note: Do not mention nonsignificant variables here.

Fourth section: The explanation of adjusted R-squared
Report the adjusted R-squared value as a percentage with the statistically significant variables.

The adjusted R squared value indicates that [adjusted R squared value] of the variation in [Variable label of outcome variable] can be explained by [Variable label of significant factor variable1], [Variable label of significant factor variable2], [Variable label of significant factor variable3]...

Note: Do not mention nonsignificant variables here.

Linear regression interpretation sample

First section: The significance levels
Respondents' age, respondents' occupational prestige score, and respondents' education in years are statistically significant factors of respondents' personal income since the p values are less than 0.05. Population density of residence during adolescence years is not a statistically significant factor of respondents' personal income since the p value is greater than 0.05.

Second section: The explanation of coefficients
One year increase in respondents' age increases respondents' personal income by $196.

A score increase in respondents' occupational prestige score increases respondents' personal income by $649.

A year increase in respondents' education in years increases respondents' personal income by $2,624.

Third section: The explanation of standardized coefficients
The strongest factor of respondents’ personal income is respondents' occupational prestige score (std. Coeff=0.27), followed by respondents' education in years (std. Coeff=0.22), and respondents' age (std. Coeff=0.09).

Fourth section: The explanation of adjusted R-squared
The adjusted R squared value indicates that 19.6% of the variation in respondents' personal income can be explained by Respondents' age, respondents' occupational prestige score, and respondents' education in years.

	Respondents' education in years	Number of children respondents have
Respondents' education in years
Number of children respondents have	r = -0.182 p = 0.001***

11. Linear regression basics⚓︎

Module items⚓︎

R Script file⚓︎

Lab assignment⚓︎

Sample lab assignment⚓︎

Learning outcomes⚓︎

Suggested reading⚓︎

Regression definition⚓︎

[[Types of regression modeling based on outcome variable]]⚓︎

[[Types of regression modeling based on modeling type]]⚓︎

Example 1: Hospitalization days⚓︎

Example 2: Car insurance rates⚓︎

Linear regression specifics⚓︎

GSS example: Predicting personal income (conrinc)⚓︎

Step 1: Find the variables in Variables in GSS page⚓︎

[[Linear regression]] with 1 factor #code⚓︎

[[Linear regression]] with 1 factor #output⚓︎

[[Linear regression]] with 2 factors #code⚓︎

[[Linear regression]] with 2 factors #output⚓︎

[[Linear regression]] with 3 factors #code⚓︎

[[Linear regression]] with 3 factors #output⚓︎

[[Linear regression]] with 4 factors #code⚓︎

[[Linear regression]] with 4 factors #output⚓︎

[[Linear regression interpretation specifics]]⚓︎

[[Reporting of coefficients]]⚓︎

[[Reporting of standardized coefficients]] (std. Coeff.)⚓︎

[[Reporting of adjusted R-square]] (R² adjusted)⚓︎

[[Linear regression]] #interpretation⚓︎

GSS example: Predicting personal income (`conrinc`)⚓︎