This section introduces the R environment and some of the most basic funcionality aspects of R that are used through the remainder of the class. This section assumes little to no experience with statistical computing with R. We will introduce the R statistical computing environment, RStudio, and the dataset that we will work with for the remainder of the lesson. We will cover very basic functionality in R, including variables, functions, and importing/inspecting data frames.

Make sure you complete the setup here prior to the class.

RStudio

Let’s start by learning about RStudio. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier.

  • Options: First, let’s change a few options. We’ll only have to do this once. Under Tools… Global Options…:
    • Under General: Uncheck “Restore most recently opened project at startup”
    • Under General: Uncheck “Restore .RData into workspace at startup”
    • Under General: Set “Save workspace to .RData on exit:” to Never.
    • Under General: Set “Save workspace to .RData on exit:” to Never.
    • Under R Markdown: Uncheck “Show output inline for all R Markdown documents”
  • Projects: first, start a new project in a new folder somewhere easy to remember. When we start reading in data it’ll be important that the code and the data are in the same place. Creating a project creates an Rproj file that opens R running in that folder. This way, when you want to read in dataset whatever.txt, you just tell it the filename rather than a full path. This is critical for reproducibility, and we’ll talk about that more later.
  • Code that you type into the console is code that R executes. From here forward we will use the editor window to write a script that we can save to a file and run it again whenever we want to. We usually give it a .R extension, but it’s just a plain text file. If you want to send commands from your editor to the console, use CMD+Enter (Ctrl+Enter on Windows).
  • Anything after a # sign is a comment. Use them liberally to comment your code.

Basic operations

R can be used as a glorified calculator. Try typing this in directly into the console. Make sure you’re typing into into the editor, not the console, and save your script. Use the run button, or press CMD+Enter (Ctrl+Enter on Windows).

2+2
5*4
2^3

R Knows order of operations and scientific notation.

2+3*4/(5+3)*15/2^2+3*4^2
5e4

However, to do useful and interesting things, we need to assign values to objects. To create objects, we need to give it a name followed by the assignment operator <- and the value we want to give it:

weight_kg <- 55

<- is the assignment operator. Assigns values on the right to objects on the left, it is like an arrow that points from the value to the object. Mostly similar to = but not always. Learn to use <- as it is good programming practice. Using = in place of <- can lead to issues down the line. The keyboard shortcut for inserting the <- operator is Alt-dash.

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg). There are some names that cannot be used because they represent the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names, which we’ll get into shortly (e.g., c, T, mean, data, df, weights). In doubt check the help to see if the name is already in use. It’s also best to avoid dots (.) within a variable name as in my.dataset. It is also recommended to use nouns for variable names, and verbs for function names.

When assigning a value to an object, R does not print anything. You can force to print the value by typing the name:

weight_kg

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg).

2.2 * weight_kg

We can also change a variable’s value by assigning it a new one:

weight_kg <- 57.5
2.2 * weight_kg

This means that assigning a value to one variable does not change the values of other variables. For example, let’s store the animal’s weight in pounds in a variable.

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 220?

You can see what objects (variables) are stored by viewing the Environment tab in Rstudio. You can also use the ls() function. You can remove objects (variables) with the rm() function. You can do this one at a time or remove several objects at once. You can also use the little broom button in your environment pane to remove everything from your environment.

ls()
rm(weight_lb, weight_kg)
ls()
weight_lb # oops! you should get an error because weight_lb no longer exists!

EXERCISE 1

What are the values after each statement in the following?

mass <- 50              # mass?
age  <- 30              # age?
mass <- mass * 2        # mass?
age  <- age - 10        # age?
mass_index <- mass/age  # massIndex?

Functions

R has built-in functions.

# Notice that this is a comment.
# Anything behind a # is "commented out" and is not run.
sqrt(144)
log(1000)

Get help by typing a question mark in front of the function’s name, or help(functionname):

help(log)
?log

Note syntax highlighting when typing this into the editor. Also note how we pass arguments to functions. The base= part inside the parentheses is called an argument, and most functions use arguments. Arguments modify the behavior of the function. Functions some input (e.g., some data, an object) and other options to change what the function will return, or how to treat the data provided. Finally, see how you can next one function inside of another (here taking the square root of the log-base-10 of 1000).

log(1000)
log(1000, base=10)
log(1000, 10)
sqrt(log(1000, base=10))

EXERCISE 2

See ?abs and calculate the square root of the log-base-10 of the absolute value of -4*(2550-50). Answer should be 2.


Data Frames

There are lots of different basic data structures in R. If you take any kind of longer introduction to R you’ll probably learn about arrays, lists, matrices, etc. We are going to skip straight to the data structure you’ll probably use most – the data frame. We use data frames to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be different classes (on variable, e.g. age, can be numeric, while another, e.g., cause of death, can be text).

Let’s move on to learning about data frames.