12. Dummy variables⚓︎
Module items⚓︎
R Script file⚓︎
Copy the code below ➜ Paste into [[RStudio console]] ➜ Hit Enter
source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R"));
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/12-dummy.R", "12-dummy.R");
file.edit("12-dummy.R")
Lab assignment⚓︎
Sample lab assignment⚓︎
Learning outcomes⚓︎
- Learn how to create dummy variables for categorical and continuous variables
- Learn how to use dummy variables in a linear regression model
- Learn how to interpret the coefficients of dummy variables
Suggested reading⚓︎
- 📖
Allen, Michael Patrick. 1997. “Regression Analysis with Dummy Variables.” Pp. 128–32 in Understanding Regression Analysis. Boston, MA: Springer.
Dummy variables definition⚓︎
- Linear regression treats every variable in the model as a number, so it works most naturally with continuous variables.
- However, in social sciences we often need to work with categorical variables in which the different values have no real numerical relationship with each other.
- A [[dummy variable]] is a numerical variable used in regression analysis to represent the effect of categorical factor variables on the outcome variable.
- Specifically, dummy variables are used to compare categories, such as “the effect of being female on income compared to being male.”
Examples⚓︎
Example: Rent prices⚓︎
- Imagine a linear regression analysis where we aim to demonstrate the effects of “population of city” (a continuous variable) and “house type” (categorical: 1 for Townhouse, 2 for Studio) on rent.
- In such a model, we would expect a larger city population to increase the rent. We would also expect living in a townhouse to increase the rent compared to living in a studio apartment.

[Figure: flowchart. Factor variables: population of city (continuous; min: 5K, max: 10M) and house type (categorical; 1: Townhouse, 2: Studio). Each may affect the outcome variable, rent.]

- If we include the original house type variable in our model as it is, the linear regression model will incorrectly assume that living in a “Studio” (2) is somehow twice the value of living in a “Townhouse” (1).
    - This would treat house type like age, where a 40-year-old respondent really does have twice the value of a 20-year-old respondent.
- The solution is to use [[dummy variable]]s: variables with only two values, zero and one.
    - In dummy variable 1 (townhouse), one indicates people who live in a townhouse, and zero indicates people who do not. So, the townhouse dummy variable will show the effect of “living in a townhouse” on the rent.
    - In dummy variable 2 (studio), one indicates people who live in a studio, and zero indicates people who do not. So, the studio dummy variable will show the effect of “living in a studio” on the rent.

| Respondent | Original variable: housetype | Dummy variable 1: townhouse | Dummy variable 2: studio |
|---|---|---|---|
| 1 | 2 (studio) | 0 | 1 |
| 2 | 1 (townhouse) | 1 | 0 |
| 3 | 2 (studio) | 0 | 1 |
| 4 | 1 (townhouse) | 1 | 0 |
| 5 | 2 (studio) | 0 | 1 |
| 6 | 1 (townhouse) | 1 | 0 |
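
The table above can be reproduced with a few lines of base R. This is only a minimal sketch: the data frame `rent` and its six rows are made up to mirror the table and are not part of the module's dataset.

```r
# Toy data mirroring the table above; the values are illustrative only.
rent <- data.frame(
  respondent = 1:6,
  housetype  = c(2, 1, 2, 1, 2, 1)  # 1 = townhouse, 2 = studio
)

# One dummy per category: 1 if the respondent is in the category, 0 otherwise.
rent$townhouse <- ifelse(rent$housetype == 1, 1, 0)
rent$studio    <- ifelse(rent$housetype == 2, 1, 0)

print(rent)
```

Each respondent has a 1 in exactly one of the two dummy columns, which is what makes one of them redundant once both are known.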
GSS example: Sex⚓︎
- If we have a variable for sex with two responses (1=male and 2=female), we can't use the original values of 1 and 2 and interpret them as meaning that being female is somehow twice the value of being male.
- The solution is to use dummy variables: variables with only two values, zero and one.
- It does make sense to create a variable called “male” and interpret it as meaning that someone assigned a 1 on this variable is male and someone assigned a 0 is not.
- Following this procedure, we also create a separate “female” dummy variable to show the effects of being female on the outcome variable, such as “personal income.”

| Respondent | Original variable: sex | Dummy variable 1: male | Dummy variable 2: female |
|---|---|---|---|
| 1 | 1 (male) | 1 | 0 |
| 2 | 1 (male) | 1 | 0 |
| 3 | 2 (female) | 0 | 1 |
| 4 | 1 (male) | 1 | 0 |
| 5 | 2 (female) | 0 | 1 |
| 6 | 2 (female) | 0 | 1 |
Dummy variable coding structure⚓︎
- If there are two categories in the variable, we create two dummy variables, one for each category.

| Variable name | Variable label | Variable type | Question wording and response categories |
|---|---|---|---|
| sex | Respondents' sex | Binary | What's your sex? (1: Male; 2: Female) |

- Since we create new variables with the dummy variable codes below, we also need to write the variable labels of these new dummy variables ourselves. For example:
    - Respondents' sex (1: Male; 2: Female)
        - Being male
        - Being female
    - Respondents' immigrant status (1: Yes; 2: No)
        - Being nonimmigrant
        - Being immigrant
    - Belief in life after death (1: Yes; 2: No)
        - Believing in life after death
        - Not believing in life after death
    - Level of finding life exciting (1: Exciting; 2: Routine; 3: Dull)
        - Finding life exciting
        - Finding life routine
        - Finding life dull
    - Confidence level in education (1: A great deal; 2: Only some; 3: Hardly any)
        - Having a great deal of confidence in education
        - Having only some confidence in education
        - Having hardly any confidence in education
    - Home ownership status (1: Own; 2: Rent)
        - Owning a house
            - Or: Having a house
        - Not owning a house
            - Or: Not having a house
-
Model code
-
Working code
- Line 1: We put the new variable name here: `male` here ➜ `dummyvariable1_name`
    - It's better to write the value label here, thus `male`. Something simple and memorable.
- Line 2: We put the original variable from which we want to create a dummy variable: `sex` here ➜ `orig_var`
    - Since `1` is male in the GSS dataset, we put `1` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if sex is 1, create a new variable called “male”, assign them “1”, and assign the rest “0”".
- Line 3: We write this new dummy variable's variable label here: "Being male"
- Line 5: We put the new variable name here: `female` here ➜ `dummyvariable1_name`
    - It's better to write the value label here, thus `female`. Something simple and memorable.
- Line 6: We put the original variable from which we want to create a dummy variable: `sex` here ➜ `orig_var`
    - Since `2` is female in the GSS dataset, we put `2` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if sex is 2, create a new variable called “female”, assign them “1”, and assign the rest “0”".
- Line 7: We write this new dummy variable's variable label here: "Being female"
- Creating a dummy variable is in a way recoding a variable, and thus creating a new variable.
- Running these codes will create two more variables in the GSS dataset, `male` and `female`.
    - That's it: two more variables. Do not expect to see an output.
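
The steps described above can be sketched in base R as follows. This is not the module's own model code. It assumes a small stand-in data frame called `gss_toy` (hypothetical) and assumes that variable labels are kept in the `"label"` attribute, which is where labelling packages such as labelled store them; the module may use a helper function for the label line instead.

```r
# Hypothetical stand-in for the GSS data frame; only `sex` is needed here.
gss_toy <- data.frame(sex = c(1, 1, 2, 1, 2, 2))  # 1 = male, 2 = female

# "If sex is 1, assign 1; assign the rest 0", then attach the variable label.
gss_toy$male <- ifelse(gss_toy$sex == 1, 1, 0)
attr(gss_toy$male, "label") <- "Being male"

# Same idea for the second category.
gss_toy$female <- ifelse(gss_toy$sex == 2, 1, 0)
attr(gss_toy$female, "label") <- "Being female"

# The lines above print nothing; a cross-tabulation confirms the recode.
table(sex = gss_toy$sex, male = gss_toy$male)
```

Cross-tabulating the original variable against a new dummy is a quick way to check that every `1` landed where it was meant to.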
Adding dummy variable to a regression model⚓︎
- Let's add the dummy variables we have just created to the model we used in the Linear regression basics module.
-
Model code
-
Working code
- Line 1: We put `conrinc` here ➜ `outcome_here`; `res16` here ➜ `factor_var1`; `age` here ➜ `factor_var2`; `prestg10` here ➜ `factor_var3`; `educ` here ➜ `factor_var4`; `male` here ➜ `factor_var5`; `female` here ➜ `factor_var6`.
    - Outcome variable first; then, factor variables separated by plus (+).
- Line 2: Check the first argument: `model5`. If the model in Line 1 is model5, then we should use model5 here as well.
    - This needs to be `model5`, otherwise this code won't work.
Respondents' personal income

| Factors | Coefficients | std. Beta | p |
|---|---|---|---|
| (Intercept) | -48487.47 (3955.96) | -0.00 (0.02) | 0.001*** |
| Population density of residence during adolescence years | 570.97 (409.78) | 0.03 (0.02) | 0.164 |
| Respondents' age | 191.36 (40.68) | 0.09 (0.02) | 0.001*** |
| Respondents' occupational prestige score | 624.94 (49.18) | 0.26 (0.02) | 0.001*** |
| Respondents' education in years | 2786.02 (252.43) | 0.23 (0.02) | 0.001*** |
| male | 11787.50 (1187.55) | 0.18 (0.02) | 0.001*** |
| Observations | 2268 | | |
| R² / R² adjusted | 0.231 / 0.229 | | |

- RStudio created the table, but where's `female`?
- Actually, the RStudio console shows: "Parameters `female` were not estimable"
- First, let’s start with the interpretation of the “male” dummy variable:
    - “Being male increases personal income by $11,787 compared to being female.”
        - We inherently know that “being female decreases personal income by $11,787 compared to being male.” Therefore, adding both dummy variables is redundant.
            - If we use `female` instead of `male` in the model, the coefficient of female will be negative, -11,787.
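
The redundancy described above can be reproduced on simulated data. The sketch below does not use the GSS data: the data frame `toy`, the income values, and the 10,000 gap are all invented for illustration, and it uses only base R's `lm()`.

```r
set.seed(1)
n   <- 200
sex <- sample(c(1, 2), n, replace = TRUE)  # 1 = male, 2 = female

toy <- data.frame(
  male   = ifelse(sex == 1, 1, 0),
  female = ifelse(sex == 2, 1, 0)
)
# Simulated outcome (not GSS data): males earn 10,000 more on average.
toy$income <- 30000 + 10000 * toy$male + rnorm(n, sd = 5000)

both        <- lm(income ~ male + female, data = toy)
only_male   <- lm(income ~ male, data = toy)
only_female <- lm(income ~ female, data = toy)

coef(both)         # the `female` coefficient is NA: it cannot be estimated
coef(only_male)    # effect of being male, compared to being female
coef(only_female)  # same size, opposite sign
```

Because `female` is always `1 - male`, the second dummy adds no new information once the first is in the model, which is why the last one entered is dropped.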
[[Omitting dummy variable]]⚓︎
- Then, which dummy variable(s) to include and which one(s) to omit?
    - We include “male” if we want to discuss the effect of being male, as compared to female, on personal income;
    - We include “female” if we want to discuss the effect of being female, as compared to male, on personal income.
        - This decision won’t change the model's overall fit; only the sign of the coefficient and the comparison category change.
            - However, we need to decide which one to include and which one to omit.
                - Otherwise, RStudio will remove the last added dummy variable.
- Omitting a dummy variable doesn't mean we discard it:
    - We actually attribute an important meaning to the omitted dummy variable: it becomes our [[comparison dummy variable]], and it's used in the interpretation as well.

Interpretation of dummy variables

“Being male (included dummy variable) increases personal income by $11,787 compared to being female (omitted - comparison dummy variable).”

OR

“Being female (included dummy variable) decreases personal income by $11,787 compared to being male (omitted - comparison dummy variable).”

- Likewise,
    - if we create two dummy variables using a single original variable:
        - We need to include one, and omit one.
    - if we create three dummy variables using a single original variable:
        - We need to include two, and omit one.
    - if we create four dummy variables using a single original variable:
        - We need to include three, and omit one.
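
The "include all but one" rule above follows from the fact that the dummies built from one variable always add up to 1. A minimal sketch, assuming a made-up three-category variable `x` (not a GSS variable) and base R only:

```r
# Hypothetical three-category variable (values 1, 2, 3), not taken from the GSS.
x  <- c(1, 2, 3, 1, 2, 3, 3, 1)
d1 <- ifelse(x == 1, 1, 0)
d2 <- ifelse(x == 2, 1, 0)
d3 <- ifelse(x == 3, 1, 0)

# Every respondent has exactly one 1, so the three dummies always sum to 1 ...
all(d1 + d2 + d3 == 1)

# ... which means any one dummy is fully determined by the other two.
all(d3 == 1 - d1 - d2)

# Together with the intercept column, these 4 columns have rank 3:
# only three of them carry independent information.
design <- cbind(intercept = 1, d1, d2, d3)
qr(design)$rank
```

The omitted dummy is the one whose information is absorbed by the intercept, which is why it serves as the comparison category.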
GSS example: Predicting personal income (conrinc)⚓︎
- We'll add dummy variables to the model we used in Linear regression basics module.
Find the variables in Variables in GSS page⚓︎
- In this analysis, we propose a cause-and-effect relationship in which these factor variables may affect (increase or decrease) personal income.

[Figure: flowchart. Continuous factor variables: population density, age, occupational prestige, education. Dummy factor variables: immigration status (being nonimmigrant, being immigrant), marital status (being married, being formerly in union, being single), socio-economic status (low SES, moderate SES, high SES). Each may affect the outcome variable, personal income.]

- We want to make sure that `conrinc` and `res16`, `age`, `prestg10`, and `educ` are continuous variables or usable ordinal variables (Ordinal ✅), and we need to see the values of the categorical variables to write the dummy variable codes.

| Variable name | Variable label | Variable type | Question wording and response categories |
|---|---|---|---|
| conrinc | Respondents' personal income | Continuous | What is your income in dollars? (Min: $281.5; Max: $123,761.9) |
| res16 | Population density of residence during adolescence years | Ordinal ✅ | Which of the categories on this card comes closest to the type of place you were living in when you were 16 years old? (1: Country, nonfarm; 2: Farm; 3: Town less than 50K; 4: 50K to 250K; 5: Big city, suburb; 6: City greater than 250K) |
| age | Respondents' age | Continuous | What is your age? (Min: 18, Max: 89) |
| prestg10 | Respondents' occupational prestige score | Continuous | Respondent's occupational prestige score (calculated) (Min: 16, Max: 80) |
| educ | Respondents' education in years | Continuous | What is the highest year of school you completed? (Min: 0, Max: 20) |
| born | Respondents' immigrant status | Binary | Were you born in this country? (1: Yes; 2: No) |
| marital | Respondents' marital status | Nominal | Are you currently married, widowed, divorced, separated, or have you never been married? (1: Married; 2: Widowed; 3: Divorced; 4: Separated; 5: Never married) |
| sei10 | Respondents' socio-economic index score | Continuous | Socio-economic index score of the respondent (calculated) (Min: 9, Max: 92.8) |

From: Variables in GSS

[[Dummy variable]]: Categorical (binary) #code⚓︎
-
Model code
-
Working code
- Line 1: We put the new variable name here: `nonimmigrant` here ➜ `dummyvar1`
    - It's better to write something simple and memorable here, such as the value label, thus `nonimmigrant`.
- Line 2: We put the original variable from which we want to create a dummy variable: `born` here ➜ `orig_var`
    - Since `1` is "yes to being born in this country" in the GSS dataset, we put `1` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if born is 1, create a new variable called nonimmigrant, assign them “1”, and assign the rest “0”".
- Line 3: We write this new dummy variable's variable label here: "Being nonimmigrant"
- Line 5: We put the new variable name here: `immigrant` here ➜ `dummyvar2`
    - It's better to write something simple and memorable here, thus `immigrant`.
- Line 6: We put the original variable from which we want to create a dummy variable: `born` here ➜ `orig_var`
    - Since `2` is "no to being born in this country" in the GSS dataset, we put `2` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if born is 2, create a new variable called immigrant, assign them “1”, and assign the rest “0”".
- Line 7: We write this new dummy variable's variable label here: "Being immigrant"
- Creating a dummy variable is in a way recoding a variable, and thus creating a new variable.
- Running these codes will create two more variables in the GSS dataset, `nonimmigrant` and `immigrant`.
    - That's it: two more variables. Do not expect to see an output.
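
A small sketch of the same recode, this time with a missing answer included to show what `ifelse()` does with it. The data frame `gss_toy` is a hypothetical stand-in, not the GSS data, and the module's own code may differ in how it attaches labels.

```r
# Hypothetical stand-in for the GSS data frame; NA marks a missing answer.
gss_toy <- data.frame(born = c(1, 2, 1, NA, 2, 1))  # 1 = yes, 2 = no

gss_toy$nonimmigrant <- ifelse(gss_toy$born == 1, 1, 0)
gss_toy$immigrant    <- ifelse(gss_toy$born == 2, 1, 0)

# A missing answer stays missing in both dummies instead of being coded 0.
print(gss_toy)
table(born = gss_toy$born, nonimmigrant = gss_toy$nonimmigrant, useNA = "ifany")
```

This matters for interpretation: "the rest" that receives 0 means respondents with a different valid answer, not respondents with no answer.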
[[Dummy variable]]: Categorical (nominal/ordinal) #code⚓︎
-
Model code
-
Working code
- The codes are the same as the dummy variable codes for categorical (binary).
    - The exception is the second dummy variable, in which we merge some categories.
- Line 5: We put the new variable name here: `formerlyunion` here ➜ `dummyvar1`
    - It's better to write something simple and memorable, thus `formerlyunion`.
- Line 6: We put the original variable from which we want to create a dummy variable: `marital` here ➜ `orig_var`
    - Since we want to merge respondents who are `2` "Widowed", `3` "Divorced", and `4` "Separated", we put `gss$marital == 2 | gss$marital == 3 | gss$marital == 4, 1, 0` here ➜ `gss$orig_var == value | gss$orig_var == value | gss$orig_var == value, 1, 0`
    - The whole code with the last `1, 0` means:
        - "if marital is 2 or 3 or 4, create a new variable called formerlyunion, assign them “1”, and assign the rest “0”".
- Line 7: We write this new dummy variable's variable label here: "Being formerly in union"
- Creating a dummy variable is in a way recoding a variable, and thus creating a new variable.
- Running these codes will create three more variables in the GSS dataset, `married`, `formerlyunion`, and `single`.
    - That's it: three more variables. Do not expect to see an output.
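
The merging step can be sketched as follows. The data frame `gss_toy` is a hypothetical stand-in, and the sketch assumes, based on the response categories of `marital`, that `married` is built from value 1 and `single` from value 5 (never married). The `%in%` line is an alternative spelling added here for comparison, not something the module uses.

```r
# Hypothetical stand-in for the GSS data frame.
# 1: Married; 2: Widowed; 3: Divorced; 4: Separated; 5: Never married
gss_toy <- data.frame(marital = c(1, 2, 3, 4, 5, 1, 5))

gss_toy$married <- ifelse(gss_toy$marital == 1, 1, 0)

# `|` means "or": categories 2, 3 and 4 are merged into one dummy.
gss_toy$formerlyunion <- ifelse(
  gss_toy$marital == 2 | gss_toy$marital == 3 | gss_toy$marital == 4, 1, 0
)

gss_toy$single <- ifelse(gss_toy$marital == 5, 1, 0)

# Shorter spelling of the same merge. With no missing values the result is
# identical; note that %in% codes a missing answer as 0 rather than NA.
alt <- ifelse(gss_toy$marital %in% c(2, 3, 4), 1, 0)

print(gss_toy)
```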
[[Dummy variable]]: Continuous #code⚓︎
-
Model code
-
Working code
Clarification on codes: Click to expand
- Creating dummy variables using continuous variables requires a different coding structure.
    - This is because continuous variables have many possible values.
- `sei10` is the socio-economic index score of the respondent, with a minimum value of 9 and a maximum value of 92.8.
    - We will create three separate categories for this continuous variable:
        - 1: Less than or equal to 40
            - This means respondents with a `sei10` score of 40 or lower.
            - Code: `ifelse(gss$sei10 <= 40, 1, 0)`
        - 2: Between 41 and 75
            - This means respondents with a `sei10` score from 41 to 75.
            - Code: `ifelse(gss$sei10 >= 41 & gss$sei10 <= 75, 1, 0)`
        - 3: Greater than or equal to 76
            - This means respondents with a `sei10` score of 76 or higher.
            - Code: `ifelse(gss$sei10 >= 76, 1, 0)`
- In these codes:
    - `<=` means less than or equal to.
        - Example: `sei10 <= 40`
        - This includes 40 and all values below 40.
    - `>=` means greater than or equal to.
        - Example: `sei10 >= 76`
        - This includes 76 and all values above 76.
    - `&` means and.
        - We use `&` when both conditions must be true at the same time.
        - This means:
            - The `sei10` score must be 41 or higher
            - and
            - The `sei10` score must be 75 or lower
Code explanation: Click to expand
- Line 1: We put the new variable name here: `lowses` here ➜ `dummyvar1`
    - It's better to write something simple and memorable here, thus `lowses`.
- Line 2: We put the original variable from which we want to create a dummy variable: `sei10` here ➜ `orig_var`
    - Since we define low SES as "40 and all values below 40", we put `40` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if sei10 is 40 or below, create a new variable called lowses, assign them “1”, and assign the rest “0”".
- Line 3: We write this new dummy variable's variable label here: "Having low socio-economic status"
- Line 5: We put the new variable name here: `moderateses` here ➜ `dummyvar2`
    - It's better to write something simple and memorable here, thus `moderateses`.
- Line 6: We put the original variable from which we want to create a dummy variable: `sei10` here ➜ `orig_var`
    - Since we define moderate SES as "41 or higher and 75 or lower" (all the values between 41 and 75), we put `41` here ➜ `lowest_value` and `75` here ➜ `highest_value`
    - The whole code with the last `1, 0` means:
        - "if sei10 is between 41 and 75, create a new variable called moderateses, assign them “1”, and assign the rest “0”".
- Line 7: We write this new dummy variable's variable label here: "Having moderate socio-economic status"
- Line 9: We put the new variable name here: `highses` here ➜ `dummyvar2`
    - It's better to write something simple and memorable here, thus `highses`.
- Line 10: We put the original variable from which we want to create a dummy variable: `sei10` here ➜ `orig_var`
    - Since we define high SES as "76 or higher", we put `76` here ➜ `value`
    - The whole code with the last `1, 0` means:
        - "if sei10 is 76 or higher, create a new variable called highses, assign them “1”, and assign the rest “0”".
- Line 11: We write this new dummy variable's variable label here: "Having high socio-economic status"
- Creating a dummy variable is in a way recoding a variable, and thus creating a new variable.
- Running these codes will create three more variables in the GSS dataset, `lowses`, `moderateses`, and `highses`.
    - That's it: three more variables. Do not expect to see an output.
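
The three threshold conditions can be tried out on a handful of scores. The data frame `gss_toy` and its values are made up for illustration; the `ifelse()` conditions are the ones quoted in the clarification above. One thing the toy data makes visible is that a fractional score such as 40.5 satisfies none of the three conditions as written, so it is coded 0 on all three dummies; whether such scores occur in the real data is worth checking before modelling.

```r
# Hypothetical scores chosen to sit on and around the cut points.
gss_toy <- data.frame(sei10 = c(9, 40, 40.5, 41, 75, 76, 92.8))

gss_toy$lowses      <- ifelse(gss_toy$sei10 <= 40, 1, 0)
gss_toy$moderateses <- ifelse(gss_toy$sei10 >= 41 & gss_toy$sei10 <= 75, 1, 0)
gss_toy$highses     <- ifelse(gss_toy$sei10 >= 76, 1, 0)

print(gss_toy)

# Rows that fall between the cut points and belong to no category.
no_category <- gss_toy$lowses + gss_toy$moderateses + gss_toy$highses == 0
gss_toy[no_category, ]
```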
[[Linear regression]] with [[dummy variables]] #code⚓︎
-
Model code
-
Working code
[[Linear regression]] with [[dummy variables]] #output⚓︎
Respondents' personal income

| Factors | Coefficients | std. Beta | p |
|---|---|---|---|
| (Intercept) | 4314.45 (6534.17) | -0.00 (0.02) | 0.509 |
| Population density of residence during adolescence years | 439.36 (400.21) | 0.02 (0.02) | 0.272 |
| Respondents' age | 68.35 (44.21) | 0.03 (0.02) | 0.122 |
| Respondents' occupational prestige score | 126.77 (68.75) | 0.05 (0.03) | 0.065 |
| Respondents' education in years | 1956.08 (253.45) | 0.16 (0.02) | 0.001*** |
| Being male | 10499.06 (1153.09) | 0.16 (0.02) | 0.001*** |
| Being immigrant | 2058.30 (1609.74) | 0.02 (0.02) | 0.201 |
| Being married | 6128.15 (1582.65) | 0.09 (0.02) | 0.001*** |
| Being single | -5621.85 (1763.87) | -0.08 (0.03) | 0.001** |
| Having low socio-economic status | -22276.87 (2563.88) | -0.35 (0.04) | 0.001*** |
| Having moderate socio-economic status | -6606.39 (1969.99) | -0.10 (0.03) | 0.001** |
| Observations | 2260 | | |
| R² / R² adjusted | 0.291 / 0.288 | | |

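
One way to read the coefficient column is as an equation: the intercept plus each coefficient multiplied by the respondent's value on that variable. The sketch below copies the coefficients from the table above; the respondent profile `person`, the short variable names, and the helper `predict_income()` are hypothetical and only serve the illustration.

```r
# Coefficients copied from the output table above.
b <- c(
  intercept   = 4314.45,
  res16       = 439.36,
  age         = 68.35,
  prestg10    = 126.77,
  educ        = 1956.08,
  male        = 10499.06,
  immigrant   = 2058.30,
  married     = 6128.15,
  single      = -5621.85,
  lowses      = -22276.87,
  moderateses = -6606.39
)

# Hypothetical respondent: married, male, nonimmigrant, high SES.
# The omitted dummies (female, nonimmigrant, formerlyunion, highses) do not
# appear; they are the comparison categories.
person <- c(res16 = 3, age = 40, prestg10 = 50, educ = 16,
            male = 1, immigrant = 0, married = 1, single = 0,
            lowses = 0, moderateses = 0)

predict_income <- function(x) {
  x <- c(intercept = 1, x)  # the intercept is always multiplied by 1
  sum(b[names(x)] * x)
}

predict_income(person)

# Switching only `male` from 1 to 0 changes the prediction by exactly
# the coefficient of the dummy variable.
same_but_female <- person
same_but_female["male"] <- 0
predict_income(person) - predict_income(same_but_female)
```

This is what "compared to being female" means in practice: everything else held equal, the two predictions differ by the dummy's coefficient.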
[[Linear regression]] with dummy variables #interpretation⚓︎
Linear regression with dummy variables interpretation template. Click to expand
First section: The significance levels
Mention which variables [variable labels] are statistically significant, and which variables are statistically nonsignificant (if any). Variables with at least one asterisk (*) are statistically significant.
[Variable label of significant factor variable 1], [Variable label of significant factor variable 2], [Variable label of significant factor variable 3]... are statistically significant factors of [Variable label of outcome variable] since the p values are less than 0.05. [If any]: [Variable label of nonsignificant factor variable 4], [Variable label of nonsignificant factor variable 5]... is(are) not statistically significant factor(s) of [Variable label of outcome variable] since the p value(s) is(are) greater than 0.05.
Second section: The explanation of coefficients
Mention how significant factor variables increase or decrease the value of the outcome variable, using the "Coefficients" column. When reporting the coefficients of continuous variables, ensure that the sentence includes the unit of analysis (one unit, a day, a score, a year, a dollar, etc.) of both the factor variables and the outcome variable. When reporting the coefficients of dummy variables, ensure that the sentence includes the omitted (comparison) category.
A [unit/day/score,year,dollar (unit of analysis of factor variable1)] increase in [Variable label of significant factor variable1] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].
A [unit/day/score,year,dollar (unit of analysis of factor variable2)] increase in [Variable label of significant factor variable2] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].
A [unit/day/score,year,dollar (unit of analysis of factor variable3)] increase in [Variable label of significant factor variable3] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)].
[Variable label of included dummy variable] increases [Variable label of outcome variable] by [coefficient + unit/day/score,year,dollar (unit of analysis of outcome variable)] compared to [Variable label of omitted - comparison dummy variable].
Note: Do not mention nonsignificant variables here.
Third section: The explanation of standardized coefficients
Mention the strongest factor variables of the outcome variable using the standardized coefficients ("std. Beta" column) in order. Only mention the statistically significant ones. Standardized coefficients are compared by their absolute value, which means -.56 is stronger than .45.
The strongest factor of [Variable label of outcome variable] is the [Variable label of first strongest factor variable] (std. Coeff=0.xx), followed by [Variable label of second strongest factor variable] (std. Coeff=0.xx), and [Variable label of third strongest factor variable] (std. Coeff=0.xx).
Note: Do not mention nonsignificant variables here.
Fourth section: The explanation of adjusted R-squared
Report the adjusted R-squared value as a percentage with the statistically significant variables.
The adjusted R squared value indicates that [adjusted R squared value] of the variation in [Variable label of outcome variable] can be explained by [Variable label of significant factor variable1], [Variable label of significant factor variable2], [Variable label of significant factor variable3]...
Note: Do not mention nonsignificant variables here.
Linear regression interpretation sample
First section: The significance levels
Respondents' education in years, being male, being married, being single, having low socio-economic status, and having moderate socio-economic status are statistically significant factors of respondents' personal income since the p values are less than 0.05. Population density of residence during adolescence years, respondents' age, respondents' occupational prestige score, and being immigrant are not statistically significant factors of respondents' personal income since the p values are greater than 0.05.
Second section: The explanation of coefficients
A year increase in respondents' education in years increases respondents' personal income by $1,956.
Being male increases respondents' personal income by $10,499 compared to being female.
Being married increases respondents' personal income by $6,128 compared to being formerly in union.
Being single decreases respondents' personal income by $5,621 compared to being formerly in union.
Having low socio-economic status decreases respondents' personal income by $22,276 compared to having high socio-economic status.
Having moderate socio-economic status decreases respondents' personal income by $6,606 compared to having high socio-economic status.
Third section: The explanation of standardized coefficients
The strongest factor of respondents’ personal income is having low socio-economic status (std. Coeff=-0.35), followed by respondents' education in years (std. Coeff=0.16), being male (std. Coeff=0.16), having moderate socio-economic status (std. Coeff=-0.10), being married (std. Coeff=0.09), and being single (std. Coeff=-0.08).
Fourth section: The explanation of adjusted R-squared
The adjusted R squared value indicates that 28.8% of the variation in respondents' personal income can be explained by having low socio-economic status, respondents' education in years, being male, having moderate socio-economic status, being married, and being single.
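
The ordering used in the third section and the percentage in the fourth can be checked with a short sketch. The standardized coefficients and the adjusted R² are copied from the output table; the object names `std_beta` and `ranked` are made up for the illustration.

```r
# Standardized coefficients of the statistically significant factors,
# copied from the "std. Beta" column of the output table.
std_beta <- c(
  "Respondents' education in years"       = 0.16,
  "Being male"                            = 0.16,
  "Being married"                         = 0.09,
  "Being single"                          = -0.08,
  "Having low socio-economic status"      = -0.35,
  "Having moderate socio-economic status" = -0.10
)

# Strength is judged by absolute value, so the sign is ignored when ranking.
ranked <- std_beta[order(abs(std_beta), decreasing = TRUE)]
ranked

# Adjusted R-squared reported as a percentage.
adj_r2_pct <- sprintf("%.1f%%", 0.288 * 100)
adj_r2_pct
```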