04. Recoding variables

Module items

R Script file

Copy the code below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/04-recoding.R", "04-recoding.R"); 
file.edit("04-recoding.R")

Lab assignment

Recoding

Sample lab assignment

Sample: Recoding

Learning outcomes

  1. Learn the definition of recoding
  2. Learn the reasons for recoding:
    1. Merging values
    2. Reversing values
    3. Transforming continuous variables into groups
  3. Identify the common recoding issues

Suggested reading

  • 📖
    Van Tubergen, Frank. 2006. “Occupational Status of Immigrants in Cross-National Perspective: A Multilevel Analysis of Seventeen Western Societies.” Pp. 147–71 in Immigration and the Transformation of Europe, edited by C. A. Parsons and T. M. Smeeding. Cambridge University Press. (Specifically, "Dependent and independent variables," pp. 153–155.)

[[Recoding]] definition

  • It is rare that we use variables as they are in our analyses.
    • Instead, we often customize the values of variables for our needs.
  • Recoding means creating a new variable using the values of an original variable.
    • After recoding (creating a new variable), the data will include one more variable.

[[Reasons for recoding]]

  • There are three reasons for recoding:
    1. [[Merging values]]
    2. [[Reversing values]]
    3. [[Transforming continuous variables into groups]]

[[Merging values]]

  • For our analysis, we may want to merge the values of variables and create a new variable.

    • Merging values is for [[categorical]] variables.
    • Take the marital variable in the GSS:

      Variable name | Variable label | Variable type | Question wording and response categories
      marital (from: Variables in GSS) | Respondents' marital status | Nominal | Are you currently married, widowed, divorced, separated, or have you never been married? (1: Married; 2: Widowed; 3: Divorced; 4: Separated; 5: Never married)
  • For our analysis, imagine we are interested in the income level of 1: married, 2: formerly in union, and 3: never married respondents.

    • Then, we will merge values and create a new variable by recoding.

      Merging values: marital ➜ maritalgroups

      • 1: married ➜ 1: married
      • 2: widowed, 3: divorced, 4: separated ➜ 2: formerly in union
      • 5: never married ➜ 3: never married
    • After recoding the original marital variable, which has 5 responses, our dataset will include one more variable called maritalgroups with 3 responses.

      respondent id marital maritalgroups
      1 1 (married) 1 (married)
      2 2 (widowed) 2 (formerly in union)
      3 1 (married) 1 (married)
      4 5 (never married) 3 (never married)
      5 3 (divorced) 2 (formerly in union)
      6 4 (separated) 2 (formerly in union)
      7 5 (never married) 3 (never married)
      8 4 (separated) 2 (formerly in union)
      9 2 (widowed) 2 (formerly in union)
      10 1 (married) 1 (married)
  • We need to inform RStudio about which numbers should be replaced with which numbers in our recoded (new) variable.

    • We use comma (,) to merge the values of categorical variables in the code:

      Merging values: marital ➜ maritalgroups

      • 1: married ➜ 1: married (in code: 1 = 1)
      • 2: widowed, 3: divorced, 4: separated ➜ 2: formerly in union (in code: 2, 3, 4 = 2)
      • 5: never married ➜ 3: never married (in code: 5 = 3)
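As a rough illustration of what the mapping above does, the sketch below applies it to the ten example respondents using only base R. It does not use the rec() function from this module's script; the names merge_map and maritalgroups_labelled are made up for this example, and the data are the toy values from the table above, not real GSS records.

```r
# Toy data: marital codes of the ten example respondents (not real GSS records).
marital <- c(1, 2, 1, 5, 3, 4, 5, 4, 2, 1)

# Lookup vector: the position is the original value, the entry is the new value.
# 1 -> 1; 2, 3, 4 -> 2; 5 -> 3
merge_map <- c(1, 2, 2, 2, 3)

# Indexing the lookup vector by the original codes performs the merge.
maritalgroups <- merge_map[marital]

# Attach the new labels so that tables are readable.
maritalgroups_labelled <- factor(
  maritalgroups,
  levels = 1:3,
  labels = c("Married", "Formerly in union", "Never married")
)

# Cross-tabulate original codes against the recoded labels to check the mapping.
print(table(marital, maritalgroups_labelled))
```

In the cross-tabulation, every original code should fall into exactly one new group; a row with counts in two columns would point to a mistake in the mapping.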

[[Merging values]] - coding steps

  1. Before recoding the marital variable by merging values, note that the data has 980 variables in total. After recoding, there will be 981 variables. Remember: recoding creates a new variable.

    RStudio data view: there are 980 variables before recoding.

  2. [[Merging values]] #code structure

    • [[Model code]]
      gss$new_variable_here <-
      rec(gss$original_variable_here, rec =
      "number1, number2 = 1 [label1];
      number3, number4 = 2 [label2];
      number5, number6 = 3 [label3]",
      var.label = "Recoded variable label")
      
    • [[Working code]]

      gss$maritalgroups <- 
      rec(gss$marital, rec = 
      "1 = 1 [Married]; 
      2, 3, 4 = 2 [Formerly in union];
      5 = 3 [Never married]",
      var.label = "Recoded respondents' marital status")
      

      Code explanation
      • maritalgroups: The name of the new, recoded variable. It will be added to the GSS dataset.
        • We choose this name ourselves. Use no spaces and no special characters. Add “groups”, “reversed”, or “recoded” to the end of the original variable name, or use any name that reminds us what the variable is.
      • marital: The original variable we want to recode. The new variable is created from the original variable's values.
      • [Married], [Formerly in union], and [Never married]: New labels for the new values. These will appear in the table.
      • var.label: The last line sets the variable label of the new variable, here "Recoded respondents' marital status".
      • Line 1: We put the name of the new recoded variable here, maritalgroups.
      • Line 2: We put the original variable we want to recode here, marital.
      • Lines 3-5: We merge values in these lines. The text in brackets ([...]) gives the new labels for the new values. These will appear in our outputs.
      • Line 6: We write the new variable's variable label here, "Recoded respondents' marital status".
  3. After [[highlighting and running]] the code above, the GSS dataset will include one more variable, because we have just created the maritalgroups variable.

    RStudio data view: there are 981 variables after recoding.

  4. [[Frequency table]] #code for the original variable (marital)

    • [[Model code]]
      frq(gss$variable_here, out = "v")
      
    • [[Working code]]

      frq(gss$marital, out = "v")
      

      • Line 1: We put marital here ➜ variable_here.
        • Find the working code in this module's R script file.
        • [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  5. [[Frequency table]] #output for the original variable (marital)

    Respondents' marital status (x)

    val label frq raw.prc valid.prc cum.prc
    1 Married 1659 41.62 41.78 41.78
    2 Widowed 269 6.75 6.77 48.55
    3 Divorced 579 14.53 14.58 63.13
    4 Separated 130 3.26 3.27 66.41
    5 Never married 1334 33.47 33.59 100.00
    NA NA 15 0.38 NA NA
  6. [[Frequency table]] #interpretation for the original variable (marital)

    Frequency table interpretation template

    The [variable label] variable shows that xx.xx% of the respondents are / have / feel / think / said / reported [label 1], xx.xx% of the respondents are / have / feel / said / reported [label 2]...

    • After the [variable label], we add the word "variable" in our interpretation:
      • "The respondents' marital status variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
      • For example, "41.78% of the respondents are married" etc.
    • We interpret the valid percentage column (valid.prc).

    Frequency table interpretation sample

    The respondents' marital status variable shows that 41.78% of the respondents are married; 6.77% of the respondents are widowed; 14.58% of the respondents are divorced; 3.27% of the respondents are separated; and 33.59% of the respondents are never married.

  7. [[Frequency table for recoded variable]] #code (maritalgroups)

    • [[Model code]]
      frq(gss$variable_here, out = "v")
      
    • [[Working code]]

      frq(gss$maritalgroups, out = "v")
      

      • Line 1: We put maritalgroups here ➜ variable_here.
        • Find the working code in this module's R script file.
        • [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  8. [[Frequency table for recoded variable]] #output (maritalgroups)

    Recoded respondents' marital status (x)

    val label frq raw.prc valid.prc cum.prc
    1 Married 1659 41.62 41.78 41.78
    2 Formerly in union 978 24.54 24.63 66.41
    3 Never married 1334 33.47 33.59 100.00
    NA NA 15 0.38 NA NA
  9. [[Frequency table for recoded variable]] #interpretation (maritalgroups)

    Frequency table for recoded variables interpretation template

    The [recoded variable label] variable shows that xx.xx% of the respondents are / have / feel / think / said / reported [label 1], xx.xx% of the respondents are / have / feel / said / reported [label 2]...

    • Before the [variable label], we add the word "recoded".
    • After the [variable label], we add the word "variable" in our interpretation:
      • "The recoded respondents' marital status variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
      • For example, "24.63% of the respondents are formerly in union" etc.
    • We interpret the valid percentage column (valid.prc).

    Frequency table for recoded variables interpretation sample

    The recoded respondents' marital status variable shows that 41.78% of the respondents are married; 24.63% of the respondents are formerly in union; and 33.59% of the respondents are never married.
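A quick arithmetic check, sketched in base R, can confirm that merging only adds counts together. The frequencies are copied from the marital output in step 5; the object names are made up for this example.

```r
# Frequencies copied from the frequency table of the original marital variable.
marital_frq <- c(Married = 1659, Widowed = 269, Divorced = 579,
                 Separated = 130, "Never married" = 1334)

# Merging 2, 3, and 4 should add their counts together.
formerly_in_union <- sum(marital_frq[c("Widowed", "Divorced", "Separated")])
print(formerly_in_union)

# Valid percentage: share among non-missing responses (the 15 NA are excluded).
valid_n <- sum(marital_frq)
formerly_valid_prc <- round(100 * formerly_in_union / valid_n, 2)
print(formerly_valid_prc)
```

The two printed numbers should match the "Formerly in union" row of the maritalgroups table (frq and valid.prc). The Married and Never married rows are unchanged because their values were not merged with anything.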

[[Reversing values]]

  • For some variables, reversing their values is necessary to ensure that higher values represent higher levels of what they measure.

    • Reversing values is for [[ordinal]] variables.
    • Take the satjob variable in the GSS:

      Variable name | Variable label | Variable type | Question wording and response categories
      satjob (from: Variables in GSS) | Level of work satisfaction | Ordinal, RECODE | On the whole, how satisfied are you with the work you do? (1: Very satisfied; 2: Moderately satisfied; 3: A little dissatisfied; 4: Very dissatisfied)
    • Imagine respondent A, who is very dissatisfied with their work and responded with 4 (Very dissatisfied), and respondent B, who is very satisfied and responded with 1 (Very satisfied).

      • If we use this variable as is in our analysis, the work satisfaction score of respondent A will be higher than that of respondent B. It should be the opposite.
    • Then, we reverse the values and create a new variable by recoding.

      Reversing values: satjob ➜ satjobreversed

      • 1: Very satisfied ➜ 4: Very satisfied
      • 2: Moderately satisfied ➜ 3: Moderately satisfied
      • 3: A little dissatisfied ➜ 2: A little dissatisfied
      • 4: Very dissatisfied ➜ 1: Very dissatisfied
    • After recoding the original satjob variable, which has 4 responses, our dataset will include one more variable called satjobreversed with 4 responses.

    • The only difference between these two variables is that the responses are reversed, so higher numbers indicate higher levels of work satisfaction.

      respondent id satjob satjobreversed
      1 1 (Very satisfied) 4 (Very satisfied)
      2 2 (Moderately satisfied) 3 (Moderately satisfied)
      3 4 (Very dissatisfied) 1 (Very dissatisfied)
      4 3 (A little dissatisfied) 2 (A little dissatisfied)
      5 2 (Moderately satisfied) 3 (Moderately satisfied)
      6 4 (Very dissatisfied) 1 (Very dissatisfied)
      7 1 (Very satisfied) 4 (Very satisfied)
      8 3 (A little dissatisfied) 2 (A little dissatisfied)
      9 2 (Moderately satisfied) 3 (Moderately satisfied)
      10 4 (Very dissatisfied) 1 (Very dissatisfied)
  • We need to inform RStudio about which numbers should be replaced with which numbers in our new variable (the recoded variable). We simply reverse the values, NO merging, so NO comma (,).

  • We map each original value to its reversed value with an equals sign (=) in the code:

    Reversing values: satjob ➜ satjobreversed

    • 1: Very satisfied ➜ 4: Very satisfied (in code: 1 = 4)
    • 2: Moderately satisfied ➜ 3: Moderately satisfied (in code: 2 = 3)
    • 3: A little dissatisfied ➜ 2: A little dissatisfied (in code: 3 = 2)
    • 4: Very dissatisfied ➜ 1: Very dissatisfied (in code: 4 = 1)
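For a scale coded 1 to 4, reversing amounts to subtracting each value from 5 (the lowest code plus the highest code). The base R sketch below shows this on the ten example respondents. It is an illustration only, not the rec() approach used in this module, and the object names are made up.

```r
# Toy data: satjob codes of the ten example respondents (not real GSS records).
satjob <- c(1, 2, 4, 3, 2, 4, 1, 3, 2, 4)

# Reverse a scale by subtracting each value from (lowest + highest).
lowest <- 1
highest <- 4
satjobreversed <- (lowest + highest) - satjob

# Labels follow the new order: higher numbers now mean higher satisfaction.
satjobreversed_labelled <- factor(
  satjobreversed,
  levels = 1:4,
  labels = c("Very dissatisfied", "A little dissatisfied",
             "Moderately satisfied", "Very satisfied"),
  ordered = TRUE
)

print(data.frame(satjob, satjobreversed, label = satjobreversed_labelled))
```

Each respondent keeps the same answer; only the number attached to that answer changes.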

[[Reversing values]] - coding steps

  1. Before recoding the satjob variable by [[reversing values]], note that we have 981 variables in total: the original 980 variables plus the one we created above, maritalgroups.

    RStudio data view: there are 981 variables before recoding.

  2. [[Reversing values]] #code structure:

    • [[Model code]]

      gss$new_variable_here <-
      rec(gss$original_variable_here, rec =
      "1 = 4 [label1]; 
      2 = 3 [label2];
      3 = 2 [label3];
      4 = 1 [label4]",
      var.label = "Recoded variable label")
      

    • [[Working code]]

      gss$satjobreversed <- 
      rec(gss$satjob, rec = 
      "1 = 4 [Very satisfied]; 
      2 = 3 [Moderately satisfied];
      3 = 2 [A little dissatisfied];
      4 = 1 [Very dissatisfied]",
      var.label = "Recoded level of work satisfaction")
      

    Code explanation
    • satjobreversed: The name of the new, recoded variable. It will be added to the GSS dataset.
      • We choose this name ourselves. Use no spaces and no special characters. Add “groups”, “reversed”, or “recoded” to the end of the original variable name, or use any name that reminds us what the variable is.
    • satjob: The original variable we want to recode. The new variable is created from the original variable's values.
    • [Very satisfied], [Moderately satisfied], [A little dissatisfied], and [Very dissatisfied]: The existing labels, now attached to the new values. These will appear in the table.
    • var.label: The last line sets the variable label of the new variable, here "Recoded level of work satisfaction".
    • Line 1: We put the name of the new recoded variable here, satjobreversed.
    • Line 2: We put the original variable we want to recode here, satjob.
    • Lines 3-6: We reverse values in these lines. The text in brackets ([...]) gives the labels for the new values. These will appear in our outputs.
    • Line 7: We write the new variable's variable label here, "Recoded level of work satisfaction".
  3. After highlighting and running the code above, the GSS dataset will include one more variable, because we have just created the satjobreversed variable.

    RStudio data view: there are 982 variables after recoding.

  4. [[Frequency table]] #code for the original variable (satjob)

    • [[Model code]]
      frq(gss$variable_here, out = "v")
      
    • [[Working code]]

      frq(gss$satjob, out = "v")
      

      • Line 1: We put satjob here ➜ variable_here.
        • Find the working code in this module's R script file.
        • [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  5. [[Frequency table]] #output for the original variable (satjob)

    Level of work satisfaction (x)

    val label frq raw.prc valid.prc cum.prc
    1 Very satisfied 1162 29.15 41.77 41.77
    2 Moderately satisfied 1188 29.80 42.70 84.47
    3 A little dissatisfied 294 7.38 10.57 95.04
    4 Very dissatisfied 138 3.46 4.96 100.00
    NA NA 1204 30.21 NA NA
  6. [[Frequency table]] #interpretation for the original variable (satjob)

    Frequency table interpretation template

    The [variable label] variable shows that xx.xx% of the respondents are / have / feel / think / said / reported [label 1], xx.xx% of the respondents are / have / feel / said / reported [label 2]...

    • After the [variable label], we add the word "variable" in our interpretation:
      • "The level of work satisfaction variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
      • For example, "41.77% of the respondents are very satisfied" etc.
    • We interpret the valid percentage column (valid.prc).

    Frequency table interpretation sample

    The level of work satisfaction variable shows that 41.77% of the respondents are very satisfied; 42.70% of the respondents are moderately satisfied; 10.57% of the respondents are a little dissatisfied; and 4.96% of the respondents are very dissatisfied with the work they do.

  7. [[Frequency table for recoded variable]] #code (satjobreversed)

    • [[Model code]]
      frq(gss$variable_here, out = "v")
      
    • [[Working code]]

      frq(gss$satjobreversed, out = "v")
      

      • Line 1: We put satjobreversed here ➜ variable_here.
        • Find the working code in this module's R script file.
        • [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  8. [[Frequency table for recoded variable]] #output (satjobreversed)

    Recoded level of work satisfaction (x)

    val label frq raw.prc valid.prc cum.prc
    1 Very dissatisfied 138 3.46 4.96 4.96
    2 A little dissatisfied 294 7.38 10.57 15.53
    3 Moderately satisfied 1188 29.80 42.70 58.23
    4 Very satisfied 1162 29.15 41.77 100.00
    NA NA 1204 30.21 NA NA
  9. [[Frequency table for recoded variable]] #interpretation (satjobreversed)

    Frequency table for recoded variables interpretation template

    The [recoded variable label] variable shows that xx.xx% of the respondents are / have / feel / think / said / reported [label 1], xx.xx% of the respondents are / have / feel / said / reported [label 2]...

    • Before the [variable label], we add the word "recoded".
    • After the [variable label], we add the word "variable" in our interpretation:
      • "The recoded level of work satisfaction variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
      • For example, "4.96% of the respondents are very dissatisfied" etc.
    • We interpret the valid percentage column (valid.prc).

    Recoded frequency table interpretation sample

    The recoded level of work satisfaction variable shows that 4.96% of the respondents are very dissatisfied; 10.57% of the respondents are a little dissatisfied; 42.70% of the respondents are moderately satisfied; and 41.77% of the respondents are very satisfied with the work they do.

[[Transforming continuous variables into groups]]

  • For our analysis, we may want to recode [[continuous]] variables and create [[categorical]] groups. This is also, in a way, merging the values.

    • Take the educ variable in the GSS (“What is the highest year of school you completed?”).

      Variable name | Variable label | Variable type | Question wording and response categories
      educ (from: Variables in GSS) | Respondents' education in years | Continuous | What is the highest year of school you completed?
    • The responses range from 0 (no schooling) to 20 (20 years of schooling). Each number from 0 to 20 is an actual quantity, a count of years (continuous variable).

    • For our analysis, imagine we’re interested in the income level of respondents with (1) Low level of education, (2) Moderate level of education, and (3) High level of education.
    • Then, we will merge some values and create a new variable by recoding. In the new variable, 1, 2, and 3 are category codes, not quantities (categorical variable).

      Transforming values: educ ➜ educgroups

      • 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ➜ 1: Low level of education
      • 12, 13, 14, 15 ➜ 2: Moderate level of education
      • 16, 17, 18, 19, 20 ➜ 3: High level of education
    • After recoding the continuous educ variable, which originally had 21 responses (from 0 to 20 years of schooling), our dataset will include one more variable called educgroups with 3 responses.

    • While educ is a continuous variable, educgroups is a categorical variable.

      respondent id educ educgroups
      1 16 3 (High level of education)
      2 3 1 (Low level of education)
      3 4 1 (Low level of education)
      4 13 2 (Moderate level of education)
      5 20 3 (High level of education)
      6 9 1 (Low level of education)
      7 15 2 (Moderate level of education)
      8 18 3 (High level of education)
      9 17 3 (High level of education)
      10 9 1 (Low level of education)
  • We need to inform RStudio about which numbers should be replaced with which numbers in our new variable (the recoded variable).

  • We use colon (:) to merge responses of continuous variables:

    Transforming values: educ ➜ educgroups

    • 0 : 11 = 1 [meaning: From 0 years of education to 11 years of education = 1 (Low level of education)]
    • 12 : 15 = 2 [meaning: From 12 years of education to 15 years of education = 2 (Moderate level of education)]
    • 16 : 20 = 3 [meaning: From 16 years of education to 20 years of education = 3 (High level of education)]
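The three ranges above can also be sketched in base R with cut(), which sorts a numeric variable into intervals. This is an illustration on the ten example respondents, not the rec() code used in this module; the object names are made up.

```r
# Toy data: years of education of the ten example respondents (not real GSS records).
educ <- c(16, 3, 4, 13, 20, 9, 15, 18, 17, 9)

# Breaks at 0, 11, 15, 20 give the groups 0-11, 12-15, and 16-20 for whole years.
# include.lowest = TRUE keeps 0 inside the first group.
educgroups <- cut(
  educ,
  breaks = c(0, 11, 15, 20),
  include.lowest = TRUE,
  labels = c("Low level of education",
             "Moderate level of education",
             "High level of education")
)

# The underlying codes are 1, 2, 3, as in the recoded variable.
print(data.frame(educ, code = as.integer(educgroups), educgroups))
```

Note that the result is a categorical variable (a factor in R), even though it was built from a continuous one.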

[[Transforming continuous variables into groups]] - coding steps

  1. Before recoding the educ variable by transforming continuous variables into groups, note that we have 982 variables in total: the original 980 variables plus maritalgroups and satjobreversed.

    RStudio data view: there are 982 variables before recoding.

  2. [[Transforming continuous variables into groups]] #code structure:

    • [[Model code]]

      gss$new_variable_here <-
      rec(gss$original_variable_here, rec =
      "number(from) : number(to) = 1 [Label1];
      number(from) : number(to) = 2 [Label2];
      number(from) : number(to) = 3 [Label3]",
      var.label = "Recoded variable label")
      

    • [[Working code]]

      gss$educgroups <- 
      rec(gss$educ, rec =
      "0 : 11 = 1 [Low level of education]; 
      12 : 15 = 2 [Moderate level of education];
      16 : 20 = 3 [High level of education]",
      var.label = "Recoded respondents' education in years")
      

    Code explanation
    • educgroups: The name of the new, recoded variable. It will be added to the GSS dataset.
      • We choose this name ourselves. Use no spaces and no special characters. Add “groups”, “reversed”, or “recoded” to the end of the original variable name, or use any name that reminds us what the variable is.
    • educ: The original variable we want to recode. The new variable is created from the original variable's values.
    • [Low level of education], [Moderate level of education], and [High level of education]: The new labels for the new values. These will appear in the table.
    • var.label: The last line sets the variable label of the new variable, here "Recoded respondents' education in years".
    • Line 1: We put the name of the new recoded variable here, educgroups.
    • Line 2: We put the original variable we want to recode here, educ.
    • Lines 3-5: We merge values in these lines, using a colon (:). The text in brackets ([...]) gives the new labels for the new values. These will appear in our outputs.
    • Line 6: We write the new variable's variable label here, "Recoded respondents' education in years".
  3. After highlighting and running the code above, the GSS dataset will include one more variable, because we have just created the educgroups variable.

    RStudio data view: there are 983 variables after recoding.

  4. [[Descriptive table]] #code for the original variable (educ)

    • [[Model code]]

      descr(gss$variable_here, out = "v", show = "short")
      

    • [[Working code]]

      descr(gss$educ, out = "v", show = "short") 
      

      • Line 1: We put educ here ➜ variable_here. [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  5. [[Descriptive table]] #output for the original variable (educ)

    Basic descriptive statistics

    Variable Label N Missings (%) Mean SD
    educ Respondents' education in years 3952 0.85 14.24 2.92
  6. [[Descriptive table]] #interpretation for the original variable (educ)

    Descriptive table interpretation template

    The [variable label] variable shows the average [variable label] of the respondents is [mean], with standard deviation [SD].

    • After the [variable label], we add the word "variable" in our interpretation:
    • "The respondents' education in years variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
    • For example, "the average years of education is...", "the average weeks of working is..." etc.
    • We use the mean (Mean column) and standard deviation (SD column) in our interpretation.

    Descriptive table interpretation sample

    The respondents' education in years variable shows that the average years of education is 14.24, with standard deviation 2.92.

  7. [[Frequency table for recoded variable]] #code (educgroups)

    • Model code
      frq(gss$variable_here, out = "v")
      
    • Working code

      frq(gss$educgroups, out = "v")
      

      • Line 1: We put educgroups here ➜ variable_here.
        • Find the working code in this module's R script file.
        • [[Highlighting and running]] this code will generate the output below (which will appear in the viewer part of RStudio).
  8. [[Frequency table for recoded variable]] #output (educgroups)

    Recoded respondents' education in years

    val label frq raw.prc valid.prc cum.prc
    1 Low level of education 337 8.45 8.53 8.53
    2 Moderate level of education 2059 51.66 52.10 60.63
    3 High level of education 1556 39.04 39.37 100.00
    NA NA 34 0.85 NA NA
  9. [[Frequency table for recoded variable]] #interpretation (educgroups)

    Frequency table for recoded variables interpretation template

    The [recoded variable label] variable shows that xx.xx% of the respondents are / have / feel / think / said / reported [label 1], xx.xx% of the respondents are / have / feel / said / reported [label 2]...

    • Before the [variable label], we add the word "recoded".
    • After the [variable label], we add the word "variable" in our interpretation:
      • "The recoded respondents' education in years variable shows that..."
    • Depending on the variable, we need to tweak some parts of the interpretation.
      • For example, "8.53% of the respondents have low level of education" etc.
    • We interpret the valid percentage column (valid.prc).

    Frequency table for recoded variables interpretation sample

    The recoded respondents' education in years variable shows that 8.53% of the respondents have low level of education; 52.10% of the respondents have moderate level of education; and 39.37% of the respondents have high level of education.
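The descriptive table in step 5 and the frequency table in step 8 describe the same respondents, so their numbers should agree. The base R sketch below checks this using figures copied from those two outputs; the object names are made up for this example.

```r
# Counts copied from the educgroups frequency table.
educgroups_frq <- c(Low = 337, Moderate = 2059, High = 1556)
n_missing <- 34

# The valid counts should add up to N in the descriptive table of educ.
n_valid <- sum(educgroups_frq)
n_total <- n_valid + n_missing

# "Missings (%)" in the descriptive table: missing cases out of all cases.
missing_prc <- round(100 * n_missing / n_total, 2)

# valid.prc in the frequency table: each count out of the valid cases only.
valid_prc <- round(100 * educgroups_frq / n_valid, 2)

print(n_valid)
print(missing_prc)
print(valid_prc)
```

If recoding had dropped or duplicated respondents, the valid counts would no longer add up to the N reported for educ.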

How to work with recoding codes

Step (1): Determine what kind of recoding you need

  • [[Merging values]] (categorical to categorical),
    • For example: We have merged the values of marital and created maritalgroups.
  • [[Reversing values]] (categorical to categorical),
    • For example: We have reversed the values of satjob and created satjobreversed.
  • [[Transforming continuous variables into groups]] (continuous to categorical),
    • For example: We have transformed the values of educ (continuous variable) and created educgroups

Step (2): Determine how many values you will need in your recoded (new) variable

  • We needed 3 values for maritalgroups,
  • We needed 4 values for satjobreversed,
  • We needed 3 values for educgroups.

Step (3): Find [[recoding model codes]]

  • At the very bottom of this page, there are recoding model codes with every type of recoding and every kind of value possibility.

[[Common recoding issues and troubleshooting]]

[[Different recoding codes for different variables]]

  • Recoding a categorical variable and a continuous variable requires slightly different codes.

    gss$maritalgroups <- 
    rec(gss$marital, rec =
    "1 = 1 [Married];
    2, 3, 4 = 2 [Formerly in union];
    5 = 3 [Never married]",
    var.label = "Recoded respondents' marital status")
    
    gss$educgroups <- 
    rec(gss$educ, rec =
    "0 : 11 = 1 [Low level of education];
    12 : 15 = 2 [Moderate level of education]; 
    16 : 20 = 3 [High level of education]",
    var.label = "Recoded respondents' education in years")
    
    • Line 4: For categorical variables, we use a comma (,) between the values to merge them. In line 4, the commas mean “merge 2, 3, and 4.”
    • Line 10: For continuous variables, we use a colon (:) between the values. In line 10, the colon means “merge all numbers from 0 to 11.”

      Troubleshooting

      • We need to check the "variable type" column (on the "Variables in GSS" page) for the variable we recode.
        • If we recode a categorical variable, we use comma (,) between the values for merging.
        • If we recode a continuous variable, we use colon (:) between the values.

[[Use the recoded (new) variable in analyses]]

  • When we want to display, for example, the frequency table of a recoded (new) variable, we must use the recoded (new) variable’s name in the frequency code.
  • This is because, for our analysis, the original variable is no longer relevant. We recoded the original variable and created a new one for our analysis needs.

    frq(gss$marital, out = "v")
    
    frq(gss$maritalgroups, out = "v")
    
    • Line 1: Wrong!
    • Line 3: Correct!

      Troubleshooting

      • After the recoding process, use the recoded (new) variable in analyses. Make sure you do not use the original variable name in analyses.

[[Recoded variables are always categorical]]

  • When we recode a continuous variable, the new (recoded) variable is no longer continuous.
  • It becomes CATEGORICAL because we have merged the real numbers, and they no longer remain as real numbers.
  • Therefore, for example, we use the frq code to see the frequency distribution.

    descr(gss$educgroups, out = "v", show = "short")
    
    frq(gss$educgroups, out = "v")
    
    • Line 1: Wrong!
    • Line 3: Correct!

      Troubleshooting

      • Recoded variables are always categorical. Therefore, they should be treated as categorical in every analysis in which they are used.
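To see why a frequency table is the right summary here, the base R sketch below summarizes a toy grouped variable with counts and percentages. The data are the group codes of the ten example respondents from the educgroups table earlier on this page, and the short labels are made up for this example.

```r
# Toy data: group codes of the ten example respondents (not real GSS records).
educgroups <- factor(
  c(3, 1, 1, 2, 3, 1, 2, 3, 3, 1),
  levels = 1:3,
  labels = c("Low", "Moderate", "High")
)

# For a categorical variable we count cases per category ...
counts <- table(educgroups)

# ... and report each category's share, rather than a mean of the codes.
percentages <- round(100 * prop.table(counts), 2)

print(counts)
print(percentages)
```

A mean of the codes 1, 2, and 3 would be a number without a unit, which is why the descriptive table is not used for recoded variables.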

[[Use a model code]]

  • We will likely make mistakes if we type the code manually instead of starting from a [[model code]] and comparing it with our [[working code]]. The code below has two issues:
    • Check line 3. A semicolon (;) is missing at the end. The [[RStudio console]] shows this error: Error: ?Syntax error in argument "18:29=130:45=2[Middle"
    • Now imagine there was a semicolon at the end of line 3. Line 4 still lacks a closing bracket (]). This is a much more problematic error, because R will still create agegroups, and when we run the frequency table code, the table will show -Inf and 2[Middle labels.

      gss$agegroups <-
      rec(gss$age, rec =
      "18 : 29 = 1 [Young]
      30 : 45 = 2 [Middle; 
      46 : 60 = 3 [Middle-old]; 
      61 : 89 = 4 [Old]",
      var.label = "Recoded respondents' age")
      
      frq(gss$agegroups, out = "v") 
      

      Recoded Respondents' age (x)

      val label frq raw.prc valid.prc cum.prc
      -Inf 127 3.19 3.19 3.19
      1 606 15.20 15.20 18.39
      2[Middle 1196 30.01 30.01 48.39
      3 874 21.93 21.93 70.32
      4 1183 29.68 29.68 100.00
      NA NA 0 0.00 NA NA
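
For comparison, here is a sketch of the same recoding with both issues repaired: every line except the last ends with a semicolon, and every label has an opening and a closing bracket. It assumes rec and frq come from the sjmisc package, and it runs on a hypothetical toy data frame rather than on the GSS data.

```r
library(sjmisc)  # assumed source of rec() and frq()

# Hypothetical ages standing in for gss$age
toy <- data.frame(age = c(18, 25, 30, 45, 46, 60, 61, 89))

toy$agegroups <-
rec(toy$age, rec =
"18 : 29 = 1 [Young];
30 : 45 = 2 [Middle];
46 : 60 = 3 [Middle-old];
61 : 89 = 4 [Old]",
var.label = "Recoded respondents' age")

# Each toy age should now fall into one of the four groups
frq(toy$agegroups)
```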

      Troubleshooting

[[Refresh GSS data if variables are misplaced]]⚓︎

  • If variables are misplaced in the codes and the original values have been overwritten, we need to load the original GSS data again, because the values of the original variable are lost.
    • From time to time, we may accidentally change the values of original variables (especially when we recode variables).
    • When this happens, we go to the very top of the R script file, and highlight and run the "Refresh data and packages" code. If we created new variables previously, we will need to run those codes under our working space again, in order, since this will be fresh data.

      1. Imagine we accidentally place the original variable, educ, whose values we want to change, in the first part of the code. We ran the code, and the original values of the educ variable are now lost.
      2. Note that in addition to the 980 original variables, we have created 3 more variables so far, which adds up to 983 variables in total.
      3. We need to go to the top of the R script file, and highlight and run the "Refresh data and packages" code. That line is there for this exact reason. We normally do not run that code in our sessions.
      4. Note that now we have 980 variables. The 3 variables we created are gone. We will run the codes under our working space again, in order, since this is new, fresh data.

        A two-part screenshot sequence in RStudio shows how refreshing the GSS data restores overwritten variables. In the top image, the code gss$educ <- rec(gss$educ, ...) is highlighted, with callout 1 pointing to gss$educ and callout 2 showing the Environment pane listing 983 variables. In the bottom image, callout 3 highlights the source(url(...)) line used to refresh the data, and callout 4 shows the Environment pane now listing 980 variables, indicating the dataset has been reset to a fresh version.

        Troubleshooting

        • Mistakes happen. For example, we could place the original variable name in the first part of the code, where the new variable name belongs. When this happens, the values of the original variable are lost.
          • Therefore, we should highlight and run the "Refresh data and packages" code.
          • We should then run again, in order, every code that came before the wrong code, because the variables those codes created are also lost.
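
A quick way to confirm that a recoding code added a variable instead of overwriting one is to count the variables and compare the original values before and after. The sketch below assumes rec comes from the sjmisc package; the toy data frame, the educ_before copy, and the labels are hypothetical illustrations, not part of the course script.

```r
library(sjmisc)  # assumed source of rec()

# Hypothetical stand-in for the GSS data
toy <- data.frame(educ = c(8, 12, 14, 16, 20))

n_before <- ncol(toy)      # number of variables before recoding
educ_before <- toy$educ    # copy of the original values, for comparison only

# Correct placement: NEW name on the left, ORIGINAL variable inside rec().
# The mistake described above would be writing toy$educ on the left instead.
toy$educgroups <-
rec(toy$educ, rec =
"0 : 11 = 1 [Low];
12 = 2 [Medium];
13 : 20 = 3 [High]",
var.label = "Recoded respondents' education")

ncol(toy) - n_before               # how many variables were added
identical(toy$educ, educ_before)   # TRUE when the original is untouched
```

If the count does not go up by one, or the comparison is not TRUE, the original variable was probably overwritten and the data should be refreshed.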

[[Run the recoding codes to create a new variable]]⚓︎

  • Let’s say we want to recode an existing variable and therefore create a new variable. Then we want to create a frequency table of the new (recoded) variable.
  • Preparing the recoding code does not mean we created a new variable. We need to run the recoding code so the frequency code can work. They need to be run in order.

    • For example, the frq(gss$maritalgroups, out = "v") code didn’t work below, and it yielded an unknown or uninitialised column: ‘maritalgroups’ error.
    • Even though the recoding code that generates the maritalgroups variable exists, we didn’t highlight and run it, so the data doesn’t actually include maritalgroups yet.

      A screenshot shows only the frq(gss$maritalgroups, out = "v") line highlighted and run, without first running the recoding code above it. In the Console, the output is NULL followed by a warning message stating “Unknown or uninitialised column: maritalgroups,” showing that the new variable does not exist yet because the recoding code was not run first.

    • Below, it works because:

      1. We did highlight and run the recoding code, and
      2. We did highlight and run the frequency code. They need to be run in order. Alternatively, we could highlight both and run.

      A screenshot shows recoding code highlighted and run before the frequency command. Callout 1 marks the block of code that creates the new variable gss$maritalgroups and assigns it a label, and callout 2 marks the frq(gss$maritalgroups, out = "v") command. On the right, the resulting frequency table for “Recoded respondents' marital status” appears, confirming the new variable was created successfully.

      Troubleshooting

      • Always run the recoding codes before running the frequency codes, or any other codes including the new (recoded) variable.
      • If we do not remember whether we ran it before, we run it again.
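
When we are unsure whether a recoding code has been run, one option is to ask R whether the new variable is already in the data. This is a minimal sketch using base R's names() on a hypothetical toy data frame; rec and frq are assumed to come from the sjmisc package, and the group labels are illustrative.

```r
library(sjmisc)  # assumed source of rec() and frq()

# Hypothetical stand-in for the GSS data (1 = married ... 5 = never married)
toy <- data.frame(marital = c(1, 2, 3, 4, 5, 1, 5))

# Before the recoding code is run, the new variable does not exist
"maritalgroups" %in% names(toy)

# Step 1: run the recoding code
toy$maritalgroups <-
rec(toy$marital, rec =
"1 = 1 [Married];
2, 3, 4 = 2 [Previously married];
5 = 3 [Never married]",
var.label = "Recoded respondents' marital status")

# Now the variable exists, so step 2 (the frequency code) can work
"maritalgroups" %in% names(toy)
frq(toy$maritalgroups)
```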

[[Recoding model codes]]⚓︎

Merging values model codes⚓︎

[[Merging values]] #code with 2 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3 = 2 [label2]",
var.label = "Recoded variable label")

[[Merging values]] #code with 3 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3 = 2 [label2];
number4, number5 = 3 [label3]",
var.label = "Recoded variable label")

[[Merging values]] #code with 4 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3, number4 = 2 [label2];
number5 = 3 [label3];
number6, number7 = 4 [label4]",
var.label = "Recoded variable label") 

[[Merging values]] #code with 5 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3, number4 = 2 [label2];
number5, number6 = 3 [label3];
number7, number8 = 4 [label4];
number9, number10 = 5 [label5]",
var.label = "Recoded variable label") 

[[Merging values]] #code with 6 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3, number4 = 2 [label2];
number5, number6 = 3 [label3];
number7, number8, number9 = 4 [label4];
number10 = 5 [label5];
number11, number12 = 6 [label6]",
var.label = "Recoded variable label") 

[[Merging values]] #code with 7 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number1, number2 = 1 [label1]; 
number3, number4 = 2 [label2];
number5, number6 = 3 [label3];
number7, number8 = 4 [label4];
number9, number10 = 5 [label5];
number11, number12 = 6 [label6];
number13, number14 = 7 [label7]",
var.label = "Recoded variable label") 
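
As an illustration of how a merging model code is filled in and then checked, here is a sketch on made-up data. It assumes rec comes from the sjmisc package; the health variable, its four categories, and the labels are hypothetical and do not describe a GSS item. The cross-tabulation at the end is a base R check added for this sketch.

```r
library(sjmisc)  # assumed source of rec()

# Hypothetical 4-category variable (1 = Excellent ... 4 = Poor)
toy <- data.frame(health = c(1, 2, 2, 3, 4, 4, 1))

# Model code with 2 values, filled in
toy$healthgroups <-
rec(toy$health, rec =
"1, 2 = 1 [Good];
3, 4 = 2 [Poor]",
var.label = "Recoded respondents' health")

# Rows are original values, columns are new values.
# Each original value should have counts in exactly one column.
table(original = toy$health, recoded = toy$healthgroups)
```

Reading the table row by row shows which original values ended up in which new group, which makes a misplaced number easy to spot.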

Reversing values model codes⚓︎

[[Reversing values]] #code with 2 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"1 = 2 [label1]; 
2 = 1 [label2]",
var.label = "Recoded variable label")

[[Reversing values]] #code with 3 values⚓︎

gss$new_variable_here <- 
rec(gss$original_variable_here, rec =
"1 = 3 [label1];
2 = 2 [label2]; 
3 = 1 [label3]",
var.label = "Recoded variable label")

[[Reversing values]] #code with 4 values⚓︎

gss$new_variable_here <- 
rec(gss$original_variable_here, rec =
"1 = 4 [label1]; 
2 = 3 [label2];
3 = 2 [label3];
4 = 1 [label4]",
var.label = "Recoded variable label")

[[Reversing values]] #code with 5 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"1 = 5 [label1]; 
2 = 4 [label2];
3 = 3 [label3];
4 = 2 [label4];
5 = 1 [label5]",
var.label = "Recoded variable label")

[[Reversing values]] #code with 6 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"1 = 6 [label1]; 
2 = 5 [label2];
3 = 4 [label3];
4 = 3 [label4];
5 = 2 [label5];
6 = 1 [label6]",
var.label = "Recoded variable label")

[[Reversing values]] #code with 7 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"1 = 7 [label1]; 
2 = 6 [label2];
3 = 5 [label3];
4 = 4 [label4];
5 = 3 [label5];
6 = 2 [label6];
7 = 1 [label7]",
var.label = "Recoded variable label")
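
A reversing model code can be checked with simple arithmetic: on a scale with k values, each original value plus its reversed value should equal k + 1. The sketch below applies the 5-value model code to made-up data. It assumes rec comes from the sjmisc package; the agree variable and its labels are hypothetical.

```r
library(sjmisc)  # assumed source of rec()

# Hypothetical 5-point item (1 = Strongly agree ... 5 = Strongly disagree)
toy <- data.frame(agree = c(1, 2, 3, 4, 5, 5, 2))

# Reverse the values so that higher numbers mean more agreement
toy$agree_rev <-
rec(toy$agree, rec =
"1 = 5 [Strongly agree];
2 = 4 [Agree];
3 = 3 [Neither];
4 = 2 [Disagree];
5 = 1 [Strongly disagree]",
var.label = "Recoded agreement")

# With 5 values, original + reversed should be 6 for every case
all(toy$agree + toy$agree_rev == 6)
```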

Transforming continuous variables into groups model codes⚓︎

[[Transforming continuous variables into groups]] #code with 2 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2]",
var.label = "Recoded variable label")

[[Transforming continuous variables into groups]] #code with 3 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2];
number(from) : number(to) = 3 [Label3]",
var.label = "Recoded variable label")

[[Transforming continuous variables into groups]] #code with 4 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2];
number(from) : number(to) = 3 [Label3];
number(from) : number(to) = 4 [Label4]",
var.label = "Recoded variable label")

[[Transforming continuous variables into groups]] #code with 5 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2];
number(from) : number(to) = 3 [Label3];
number(from) : number(to) = 4 [Label4];
number(from) : number(to) = 5 [Label5]",
var.label = "Recoded variable label")

[[Transforming continuous variables into groups]] #code with 6 values⚓︎

gss$new_variable_here <- 
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2];
number(from) : number(to) = 3 [Label3];
number(from) : number(to) = 4 [Label4];
number(from) : number(to) = 5 [Label5];
number(from) : number(to) = 6 [Label6]",
var.label = "Recoded variable label")

[[Transforming continuous variables into groups]] #code with 7 values⚓︎

gss$new_variable_here <-
rec(gss$original_variable_here, rec =
"number(from) : number(to) = 1 [Label1]; 
number(from) : number(to) = 2 [Label2];
number(from) : number(to) = 3 [Label3];
number(from) : number(to) = 4 [Label4];
number(from) : number(to) = 5 [Label5];
number(from) : number(to) = 6 [Label6];
number(from) : number(to) = 7 [Label7]",
var.label = "Recoded variable label")
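
When filling in the number(from) : number(to) parts, it can help to look at the lowest and highest values of the original variable first, so that the ranges cover every case. This is a sketch under stated assumptions: rec comes from the sjmisc package, and the hours variable, its values, and the labels are hypothetical. The range() and table() checks are base R additions for illustration.

```r
library(sjmisc)  # assumed source of rec()

# Hypothetical continuous variable: weekly working hours
toy <- data.frame(hours = c(5, 20, 21, 38, 40, 41, 60))

# Lowest and highest values: the first range should start at or below the
# minimum, and the last range should end at or above the maximum
range(toy$hours, na.rm = TRUE)

# Model code with 3 values, filled in
toy$hoursgroups <-
rec(toy$hours, rec =
"0 : 20 = 1 [Part-time];
21 : 40 = 2 [Full-time];
41 : 80 = 3 [Overtime]",
var.label = "Recoded working hours")

# Count cases per group, showing missing values if there are any
table(toy$hoursgroups, useNA = "ifany")
```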