02. Introduction to data and scripting⚓︎

Module items⚓︎

R Script file⚓︎

Copy the code below ➜ Paste into [[RStudio console]] ➜ Hit Enter

source(url("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/0_packages_data.R")); 
download.file("https://raw.githubusercontent.com/ttezcann/ssric-reg/refs/heads/main/regression/docs/assets/r-scripts/02-intro-scripting.R", "02-intro-scripting.R"); 
file.edit("02-intro-scripting.R")

Lab assignment⚓︎

Keyboard shortcuts and scripting

Sample lab assignment⚓︎

Sample: Keyboard shortcuts and scripting

Learning outcomes⚓︎

  1. Learn the key terminologies of empirical research (questionnaire, respondents, data)
  2. Learn the key terminologies of data (variable name, values, labels, response set)
  3. Learn what data science is and how it works
  4. Learn scripting and using R script files
  5. Learn keyboard and mouse shortcuts

Suggested reading⚓︎

  • 📖
    Sturgis, Patrick, and Rebekah Luff. 2021. “The Demise of the Survey? A Research Note on Trends in the Use of Survey Data in the Social Sciences, 1939 to 2015.” International Journal of Social Research Methodology 24(6):691–96. doi:10.1080/13645579.2020.1844896

[[Terminologies]]⚓︎

[[Survey terminology]]⚓︎

A flow diagram shows the survey process: questionnaire, respondents, data, then dataset. The caption says: “We ask a set of questions to a group of people, then record their responses and create a dataset file.” A flow diagram shows how survey answers become a dataset. It moves from a questionnaire with coded response options, to respondents identified as 8,634 self-identified Latino/Hispanic residents of the US, to selected responses marked on the questionnaire, and finally to a dataset with columns named DFIRED, DBADPOLC, and DHOUSING.

  • [[Questionnaire]]: A set of written questions used for collecting information from respondents.
  • [[Respondents]]: Individuals who respond to the questions in a questionnaire.
  • [[Data]]: The information collected from respondents. The numbers to be analyzed.
  • [[Dataset]]: The file in which the recorded responses are stored, with one column per question (for example, DFIRED, DBADPOLC, and DHOUSING).

[[Data terminology]]⚓︎

A diagram links parts of a questionnaire to a dataset. Arrows identify the question wording and the variable label. The short code, such as DFIRED, becomes the variable name; the numbered choices 1, 2, and 3 are the values stored in the dataset; and the text choices Yes, No, and DK/NA are the value labels. A value and its value label together are called a response category.

  • The information above is provided on the Variables in GSS lab resources page.
  • We'll be using this page for all modules.

A diagram uses a table and frequency output to identify four terms: variable name, variable label, value, and value label. In the example, the variable name is happy, the variable label is Happiness level, the value is 3, and the value label is Not too happy, with the question wording shown as “Would you say that you are very happy, pretty happy, or not too happy?”

  • [[Question wording|ref]]: The exact text of a question as it appears in the questionnaire.
  • [[Variable name|ref]]: A unique short name assigned to each question. We use variable names in data analysis software.
  • [[Variable label|ref]]: Explains what the question is about. We use variable labels in our interpretations.
  • [[Value|ref]]: Numbers such as 1, 2, 3, etc., that appear in the dataset representing specific responses.
  • [[Value label|ref]]: What those values (numbers) mean, e.g., 1: yes, 2: no, etc.
  • [[Response category|ref]]: The combination of values and their corresponding labels.
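To see how these terms fit together in R, here is a minimal sketch built on the happy example above. It uses base R only and a made-up toy data frame (`gss_toy` is a hypothetical name, not the course dataset). The labels for values 1 and 2 are assumed from the order of the question wording; only 3 = "Not too happy" is shown in the diagram.

```r
# Hypothetical toy data: four made-up respondents, not real GSS records.
gss_toy <- data.frame(happy = c(1, 3, 2, 3))

# Variable name: the column name we use in code.
names(gss_toy)

# Variable label: a short description of what the question is about.
attr(gss_toy$happy, "label") <- "Happiness level"

# Values and value labels: the stored numbers and what they mean.
# (Labels for 1 and 2 are assumed from the question wording.)
value_labels <- c("Very happy" = 1, "Pretty happy" = 2, "Not too happy" = 3)

# Response categories: each value paired with its value label.
happy_labelled <- factor(gss_toy$happy,
                         levels = value_labels,
                         labels = names(value_labels))

# A simple count of each response category.
table(happy_labelled)
```

In this sketch, the second respondent has the value 3, which the value labels translate to "Not too happy."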

What is data science?⚓︎

  • Data science is a discipline that allows you to turn raw data into understanding, insight, and knowledge.
  • How does data science work? A workflow diagram shows the data science process: Import, Tidy, and Transform lead into an iterative cycle of Visualize, Model, and Transform, labeled Understand, and then continue to Communicate.

    1. Importing data means that you take data stored in a file and load it into a data frame in R.
    2. Tidying your data means each column is a variable, and each row is an observation.
    3. Transformation includes narrowing in on observations of interest.
    4. Visualization will show you things that you did not expect.
    5. Models are a mathematical or computational tool.
    6. Communication means sharing your results with others.
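The six steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the inline text stands in for a data file, and the column names (`id`, `age`, `hours`) and all numbers are made up.

```r
# 1. Import: load data into a data frame.
#    (A hypothetical inline CSV stands in for a file on disk.)
raw_csv <- "id,age,hours\n1,25,40\n2,37,45\n3,52,38\n4,44,50\n5,NA,20"
survey <- read.csv(text = raw_csv)

# 2. Tidy: check that each column is a variable and each row an observation.
str(survey)

# 3. Transform: narrow in on observations of interest
#    (here, respondents with a recorded age).
complete <- subset(survey, !is.na(age))

# 4. Visualize: look at the data before modeling.
plot(complete$age, complete$hours, xlab = "Age", ylab = "Hours worked")

# 5. Model: fit a simple linear regression.
fit <- lm(hours ~ age, data = complete)

# 6. Communicate: report the estimated coefficients.
print(coef(fit))
```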

[[Using R script files]]⚓︎

  • We will follow certain workflows when it comes to using R script files.
    • An [[R script file]] is simply a text file containing a set of codes and notes. The script can be saved and used later to re-execute the saved codes. The script can also be edited so you can execute a modified version of the codes.
      • Reproducibility: The ability to re-create a past analysis.
      • Automation: The ability to rapidly re-create an analysis when data changes.
      • Communication: Code is just text, so it is easy to communicate.

[[Highlighting and running]]⚓︎

In the script pane, a line of code is selected. The numbered callouts show the sequence: 1) select the code, 2) click Run, and 3) view the output table in the lower-right pane.

  1. We highlight the codes
  2. And, click “Run”
  3. Clicking “Run” generates the analysis (a frequency table for this example)

    Highlighting and running

    • As R script files are simply text files, we need to highlight the codes and run. Without highlighting and running, the codes will not work.

[[Outline view]]⚓︎

 RStudio shows the document outline feature. Callout 1 shows a heading, callout 2 marks the Outline button in the script pane, and callout 3 shows the outline panel listing the script’s section headings so you can jump to different parts of the file.

  1. The R script files in the modules use comments as headings and subheadings to introduce the type of analysis; we always read these before running the code.
    1. Following these headings, ---- is used so that the heading levels are displayed with appropriate indentations in the [[outline view]].
  2. Click on the menu icon to open the outline view.
  3. Click on the headings in the outline view to see them in the R script file.
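As a rough sketch of what such headings can look like, the script below uses comment lines ending in ---- as headings. The heading names and toy data are made up for illustration; the number of # signs at the start is assumed to set the heading level shown in the outline view.

```r
# Frequency tables ----
# Top-level heading: names the type of analysis in this part of the script.

## One variable ----
# Subheading: expected to appear indented under "Frequency tables".
answers <- c("yes", "no", "yes")  # hypothetical toy responses
answer_counts <- table(answers)
answer_counts

# WORKING SPACE ----
# Assignment code goes under this heading.
```

The headings are ordinary comments, so R ignores them when the code runs; only RStudio uses them to build the outline.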

[[Commenting]]⚓︎

  • Commenting on R script files is important to help you remember exactly what you did and why you made specific choices when you revisit the file months or years later.
  • A well-annotated R script file allows your colleagues (or your future self) to easily understand, trace, and recreate your analytical steps.

RStudio screen showing how to add comments in an R script with the hash symbol (#). The numbered callouts show: 1, a correctly written comment after code; 2, text entered without # that produces an error; and 3, the red error symbol marking that line.

  1. To write a comment, type the hashtag symbol (#) followed by your text.
  2. R is programmed to completely ignore any text that comes after a hashtag on a given line.
    1. Because R does not have a built-in feature for large blocks of text, you must place a # at the beginning of every single line if your comment spans multiple lines.
  3. When a hashtag is not used, R treats the text as code and shows an error.
    1. Look at the red cross on line 50. When there is a red cross on the left side of the line number, there is something wrong with our codes.
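The commenting rules above can be tried out with a short sketch. The object names are made up, and `parse()` is used here only to check whether a line is valid R without stopping the script.

```r
# A comment on its own line: R skips everything after the hash.
x <- 2 + 3  # A comment can also follow code on the same line.

# A longer note needs a hash
# at the start of every line.

# Without a hash, the note is read as code and cannot be parsed.
# try() catches the error so the rest of the script keeps running.
bad_line <- try(parse(text = "this is a note without a hash"), silent = TRUE)

# TRUE means R could not make sense of the line.
inherits(bad_line, "try-error")
```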

[[Saving R script files]]⚓︎

  • Regularly saving R script files is another step. RStudio script pane shows how to save an R script file. Callout 1 marks the script tab for the open file, and callout 2 marks the Save button.
  1. When we make any changes, the font of the file name will be red with an asterisk (*)
  2. To save the R script file, click “Save.”
    1. A file name in black with no asterisk means there are no unsaved changes.

[[Working space|ref]]⚓︎

  • Working space is the designated section at the bottom of the R script where we will edit codes for assignments. RStudio shows the Outline button and the WORKING SPACE section in the script. Callout 1 points to the WORKING SPACE heading in the outline panel, and callout 2 marks the matching WORKING SPACE section in the script, with a note stating: “Do not edit or change anything on R script files except under ‘WORKING SPACE’.”
  1. For easy navigation, click [[outline view]] to see the headings and subheadings, then click “working space.” Alternatively, scroll down in the R script file.
  2. The codes for assignments go under the “working space.” We do not edit or change anything in R script files except under the “working space.” Anything above the “working space” is teaching material!

Pasting variable names⚓︎

  • [[Pasting variable names]] is one of the most important workflow steps. It is very common to mistype codes, forget commas, etc.

    • Therefore, we only change the variable names inside the codes.
    • We NEVER type variable names or codes.
      • We always copy (Ctrl+C / Cmd+C) the variable names (from the templates page or assignments), and
      • paste (Ctrl+V / Cmd+V) them into our codes.
  • For example, the variable is called “marital,” not “maritaal.”

    • RStudio warns us that “maritaal” is “unknown.” We copy and paste variable names to avoid this possibility.

RStudio shows a variable name pasted into code and the resulting warning in the Console. The code uses maritaal, and the Console says: “Unknown or uninitialised column: maritaal.”
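A small sketch of why a mistyped name is a problem, using a made-up toy data frame (`gss_toy` is hypothetical, not the course dataset). Note one assumption: a plain base R data frame returns NULL silently for an unknown column, whereas the warning quoted above comes from the data object used in the course.

```r
# Hypothetical toy data with a correctly named column.
gss_toy <- data.frame(marital = c(1, 5, 1))

# The mistyped name does not match any column, so nothing comes back.
typo_result <- gss_toy$maritaal
is.null(typo_result)

# Checking the name against the column names makes the typo visible.
"maritaal" %in% names(gss_toy)  # FALSE: no such variable
"marital" %in% names(gss_toy)   # TRUE: the real variable name
```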

[[Keyboard shortcuts]]⚓︎

  • The most frequently used keyboard shortcuts are copy-paste-undo.

    • Do not use mouse right click for these functions.

    Windows:

    Copy:  Ctrl+C
    Paste:  Ctrl+V
    Undo:  Ctrl+Z

    Mac:

    Copy:  Cmd+C
    Paste:  Cmd+V
    Undo:  Cmd+Z

[[Hand and finger positions]]⚓︎

  • When using the keyboard shortcuts, do not use both hands. The ideal hand and finger positions are shown below:

    Windows:

    1. Little finger is on Ctrl and index or middle finger on letters: C - V - Z
    2. Do not use both hands. Your other hand should be on the mouse (or trackpad).
      A side-by-side image about using keyboard shortcuts correctly. On the left, photo shows one hand on a keyboard, with the little finger pressing the Control key and the index finger pressing a letter key, demonstrating the correct hand position. On the right, photo shows both hands on the keyboard with a large red “X” over the image, indicating incorrect technique.

    Mac:

    1. Thumb is on Cmd and index or middle finger on letters: C - V - Z
    2. Do not use both hands. Your other hand should be on the mouse (or trackpad).
      A side-by-side image about using keyboard shortcuts correctly. On the left, photo shows one hand on a keyboard, with the thumb pressing the Command key and the index finger pressing a letter key, demonstrating the correct hand position. On the right, photo shows both hands on the keyboard with a large red “X” over the image, indicating incorrect technique.

[[Mouse shortcuts]]⚓︎

  • When it comes to copying, pasting, or replacing variables or codes, we use the following mouse/trackpad shortcuts:

    • Do not highlight the existing variable name to replace it with a new variable.

      • DOUBLE CLICK on it with your mouse/trackpad.

    • [Single line] Do not highlight the whole line to copy or run the code.

      • TRIPLE CLICK with your mouse (click three times really fast).

    • [Multiple lines] Highlight with your mouse, carefully.

[[How to work with codes]]?⚓︎

  • We never type the codes or variables inside the codes. Instead, we use model code and working code:
    • (1) [[Model code | ref]]:
      • Model code is a template that shows the correct code structure without being tied to a specific variable. It is a working line of code that serves as a reference and is never edited directly.
    • (2) [[Working code | ref]]:
      • Working code is a copy of the model code edited to include an actual variable from the dataset.
  • Imagine we need a frequency table for the sex variable.

    1. Find the frequency table code from the R script file or on module pages, and copy.
    2. Paste it under the “[[working space]]” of our R script file.
    3. Hit Enter to add a blank line.
    4. Paste the model code again.
    5. The first code is the model code, and the second code is the working code that we will edit.

      1
      2
      3
      4
      5
      # WORKING SPACE
      
      frq(gss$variable_here, out = "v")
      
      frq(gss$variable_here, out = "v")
      

      • Line 3: This is the model code. We copied it from the R script file or the module pages and pasted it into the R script file twice.
      • Line 5: This is the working code that we'll edit. Next, we will replace the variable_here part with sex.
    6. Copy sex and paste it over variable_here to replace it.

      1
      2
      3
      4
      5
      # WORKING SPACE
      
      frq(gss$variable_here, out = "v")
      
      frq(gss$sex, out = "v")
      
      • Line 3: This is the model code. We copied it from the R script file or the module pages and pasted it into the R script file twice.
        • If our working code doesn't work, we compare it to the model code to troubleshoot. Maybe we accidentally deleted the comma.
      • Line 5: This is the working code. We replaced the variable_here part with sex. [[Highlighting and running]] this code will generate the output.
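To illustrate the troubleshooting idea, the sketch below stores the model code and two versions of the working code as text and checks whether each one is valid R. The helper `parses()` is a hypothetical name introduced here; it only checks the syntax and does not run `frq()`, so neither the package nor the gss data is needed.

```r
# The model code, a correct working code, and a broken working code
# (the comma after the variable name was accidentally deleted).
model_code   <- 'frq(gss$variable_here, out = "v")'
working_code <- 'frq(gss$sex, out = "v")'
broken_code  <- 'frq(gss$sex out = "v")'

# Hypothetical helper: TRUE if the text is valid R syntax.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

parses(model_code)    # the model code is valid
parses(working_code)  # so is the correctly edited working code
parses(broken_code)   # the missing comma makes this one invalid

# The only difference between model and working code is the variable name.
identical(sub("variable_here", "sex", model_code, fixed = TRUE), working_code)
```

Comparing the broken line with the model code side by side shows exactly where the comma went missing.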