1 Lab instructions

The goal of this lab is to introduce you to R and RStudio (which you’ll be using throughout the course). To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.

We will also continue to build upon your understanding of vectors in R, and to learn some functions that will give you insight into the vectors you are working with.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface and some basic commands.

In today’s lab, you will:

  • First, download both R and RStudio (see the first two lecture videos for instructions).
  • Open the R script file Lab-1A.R in RStudio.
  • Read these instructions, as well as the comments in the R script, carefully.
  • Complete the lab and submit what is required through the Canvas submission form.

2 Review of the R interface

2.1 The Console vs The RScript

2.1.1 The Console

As stated above, R is the programming language and RStudio is a convenient interface to it. Within RStudio, the R language actually runs in the (R) Console. The console window (in RStudio, the bottom left panel) is where R waits for you to tell it what to do, and where it shows the results of a command.

To get you started, enter the following command at the R prompt (i.e. right after > on the console). You can either type it in manually or copy and paste it from this document.

print("Hello World!!!")
## [1] "Hello World!!!"

Now let’s type multiple lines. You can either type them in manually or copy and paste them from this document.

print("Hello World!!!")
print("My name is _YOUR_NAME_HERE_")
print("I am coding in the console :)")

If we want to edit the code to instead be:

print("Howdy World!!!")
print("I am coding in the console :)")
print("Mi nombre es _YOUR_NAME_HERE_")

we would have to:

  • Place our cursor in the console prompt, >
  • Press \(\uparrow\) on our keyboard to cycle through the history.
  • Make changes and run the result one line at a time.

This is not easy or intuitive. That is why most of our coding will be done in Rscript files.

2.1.2 The Rscript File

The above example illustrates the limitations of the console. You can type commands directly into the console, but they will be forgotten when you close the session. It is better to enter the commands in the script editor and save the script. This way, you have a complete record of what you did, you can easily show others how you did it, and you can do it again later if needed. While you can copy and paste directly into the R console, the RStudio script editor allows you to ‘send’ the current line or the currently selected text to the R console using the Ctrl+Enter shortcut (Cmd+Enter on a Mac).

If you have not opened the Lab-1A.R file yet, please do so. Then, at the top, you can either type or paste the following code.

print("Hello World!!!")
print("My name is _YOUR_NAME_HERE_")
print("I am coding in the console :)")

The .R file is simply a text document and its contents can be sent down into the console. You can:

  • Click on a line of code, then click the Run button at the top of the window.
  • Highlight all of these lines, then click the Run button at the top of the window.
  • Click on a line of code, then hold down Ctrl (or Cmd on a Mac) and press Enter repeatedly until each line has been submitted to the console.
  • Alternatively, highlight all of these lines, then hold down Ctrl (or Cmd) and press Enter just once.

Now, if we want to make changes to our code, it’s as easy as changing an essay written in MS Word or GoogleDocs; we simply rearrange the lines with cut-and-paste, edit the lines, and submit the final version of the new lines to our console! We can then save our script and reopen it at a later date.

In your Lab-1A.R file, make changes to the code and re-run it. Then continue to the next section.


3 Basic Data Types

There are several data types in R:

  • real numbers
  • integers
  • complex numbers of the form \(a + bi\)
  • logicals
  • characters/strings

We will explore each of these (with the exception of complex numbers) throughout this lab.


4 Variable assignment

As mentioned in the video, we can assign values in one of two ways: Either with the = sign, or with <-.

There are practical reasons to favor the <- assignment operator (the = sign is also used to pass arguments to functions), as well as more advanced reasons that are beyond the scope of this course, so try to stick with <- in your code.
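As a small illustration of the practical reason mentioned above, the sketch below uses <- to create variables and = only to name an argument inside a function call. The variable names `scores`, `sorted_desc`, and `made_variable` are made up for this example and are not part of the lab script.

```r
# `<-` creates (or overwrites) a variable in your workspace
scores <- c(3, 1, 2)

# Inside a function call, `=` only names an argument;
# it does not create a variable called `decreasing`
sorted_desc <- sort(scores, decreasing = TRUE)
sorted_desc
## [1] 3 2 1

# Check whether a variable named `decreasing` exists
# (assumes you have not created one yourself earlier in the session)
made_variable <- exists("decreasing")
made_variable
```

Keeping <- for assignment and = for arguments makes it easier to tell the two roles apart when reading code.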

# Assign the value of `10` to the variable `x`
x <- 10
# Simply run a line with only the variable's name to PRINT TO CHECK
x
## [1] 10

This assignment operation can be chained when you need to assign the same value to multiple variables, as illustrated below.

Find this chunk of code in your Rscript and run each line by clicking anywhere on the first line of code, then holding down Ctrl and pressing Enter until each line has been submitted to the console. Alternatively, you can highlight all of these lines, then hold down Ctrl and press Enter just once.

# Assign the value of `10` to the variables `x`, `y`, `z`, and `n`
x <- y <- z <- n <- 10

#Print to check
x
## [1] 10
y
## [1] 10
z
## [1] 10
n
## [1] 10

These variables can even be used to modify themselves using the assignment operator.

y

y <- y + 1
y

R provides helper functions to allow us to inspect these variables. For instance, we may wish to inspect what type/class our variable is.

class(x)

4.1 Assign to save changes

Assignment of a value to a variable is very useful, as it allows us to store a given value in an abstract form, which can be used for later calculations.

Take note, however, that performing an action on a variable DOES NOT MODIFY THAT OBJECT, even if a modified version is printed to the console.

To illustrate, suppose we are performing a \(t\)-test to determine whether the true mean GPA for a given class is 3.0. We have a sample of \(n=10\) students, a sample mean of \(\bar{x} = 3.666\), and a sample standard deviation of \(s = 1.212\). The equation for \(t\) is:

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{3.666 - 3}{1.212 / \sqrt{10}} = 1.7377\]

A novice coder might do the following:

xbar <- 3.666
xbar

m0 <- 3
m0

stddev <- 1.212
stddev

# Store the sample size
n <- 10
# Print to check
n
# Take the square root of the sample size
sqrt(n)

t_stat <- (xbar - m0) / (stddev / n)
t_stat

So to review:

  • we've assigned all of the values appropriately
  • then took the square root of the sample size
  • and finally used all of the values in the correct equation

What value did you get for t_stat??

t_stat

However, is the value correct?? NO!!!

To see why, let’s double check the value of \(n\) now that we’ve applied the square root function to it.

n
## [1] 10

The variable n is still 10! This is because applying a function to a variable returns the result, but it does not modify the variable passed to it!

n
sqrt(n)
n

In order to use the return value of ANY function, we need to assign that return value to its own variable.

sqrt_n <- sqrt(n)

t_stat <- (xbar - m0) / (stddev / sqrt_n)
t_stat

Now the value of our \(t\) statistic is correct.
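An equivalent approach, sketched here, skips the intermediate variable and nests the call to sqrt() directly inside the formula. The name `t_nested` is chosen for this example only and does not appear in the lab script.

```r
# Values from the GPA example
xbar   <- 3.666   # sample mean
m0     <- 3       # hypothesized mean
stddev <- 1.212   # sample standard deviation
n      <- 10      # sample size

# sqrt(n) is evaluated first, and its return value is used immediately
t_nested <- (xbar - m0) / (stddev / sqrt(n))
round(t_nested, 4)
## [1] 1.7377
```

Either style works. Storing sqrt(n) in its own variable is handy when you need the value more than once; nesting keeps short calculations compact.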

4.2 Type conversion

R also provides useful functions to “coerce” variables from one type to another. These all start with as. followed by the desired class.

Take for example the variable x with the numeric value of 10. We can convert it to a character.

class(x)
## [1] "numeric"
x_str <- as.character(x)
x_str
## [1] "10"
class(x_str)
## [1] "character"

Now that we have a character version of the number 10, we cannot perform mathematical operations on it unless we coerce it back to numeric.

# This line produces an error: non-numeric argument to binary operator
x_str^2

as.numeric(x_str)^2

R also provides checker functions to assess whether a variable is of a certain class, as illustrated here.

is.integer(x)
## [1] FALSE
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
is.numeric(x_str)
## [1] FALSE
is.character(x_str)
## [1] TRUE
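Coercion does not always succeed. The sketch below, which uses made-up example values, shows what the as. functions return when text cannot be read as a number, and how as.integer() handles decimals. The names `num_ok`, `num_bad`, and `int_part` are hypothetical.

```r
# Text that looks like a number converts cleanly
num_ok <- as.numeric("10")
num_ok
## [1] 10

# Text that is not a number becomes NA, a missing value.
# R also prints a warning, silenced here with suppressWarnings()
num_bad <- suppressWarnings(as.numeric("ten"))
num_bad
## [1] NA

# as.integer() drops the decimal part rather than rounding
int_part <- as.integer(3.9)
int_part
## [1] 3
is.integer(int_part)
## [1] TRUE
```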

5 Vectors

So far we have only dealt with single numbers (referred to as “scalars” in mathematics). However, in R even a single value is stored as a vector; it just has a length of 1.

5.1 Numeric vectors

Numeric type vectors come in two basic classes: numeric and integer. The only difference is that integers don’t allow decimal places, and therefore save a bit of memory. However, we will not concern ourselves with this type of memory savings.

decimal_nums <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
decimal_nums
##  [1]  1  2  3  4  5  6  7  8  9 10
integer_nums <- 1:10
integer_nums
##  [1]  1  2  3  4  5  6  7  8  9 10
class(decimal_nums)
## [1] "numeric"
class(integer_nums)
## [1] "integer"

R will automatically convert types if necessary, such as when dividing an integer by 2.

integer_nums / 2
##  [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
class(integer_nums / 2)
## [1] "numeric"
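If you want a single whole number stored as an integer, one option is the L suffix. This sketch (with example names of our own, `whole` and `mixed_sum`) shows the suffix and the automatic conversion that happens when integers and decimals are mixed.

```r
whole <- 5L                # the L suffix asks R for an integer
class(whole)
## [1] "integer"

class(5)                   # without the suffix, R stores a numeric
## [1] "numeric"

mixed_sum <- whole + 0.5   # integer + numeric gives numeric
mixed_sum
## [1] 5.5
class(mixed_sum)
## [1] "numeric"
```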

5.1.1 Summary functions

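Working with numeric vectors, we are able to obtain descriptive statistics on the vector. Using the code below, work out what each function does.

# If `mixed_nums` is not already defined in your session, this line recreates the values printed below
mixed_nums <- c(9, 4, 7, 1, 2, 5, 3, 10, 6, 8)
mixed_nums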
##  [1]  9  4  7  1  2  5  3 10  6  8
min(mixed_nums)
## [1] 1
max(mixed_nums)
## [1] 10
range(mixed_nums)
## [1]  1 10
head(mixed_nums)
## [1] 9 4 7 1 2 5
tail(mixed_nums)
## [1]  2  5  3 10  6  8
sort(mixed_nums)
##  [1]  1  2  3  4  5  6  7  8  9 10
sort(mixed_nums, decreasing = TRUE)
##  [1] 10  9  8  7  6  5  4  3  2  1

Recall that simply applying the sort function to our variable DOES NOT MODIFY THE VARIABLE (though it will print the result to the console). Even though we applied the sort function to mixed_nums, the variable itself is not sorted.

mixed_nums
##  [1]  9  4  7  1  2  5  3 10  6  8

Instead, if we want to keep the sorted result, we need to store it in its own variable.

In your R script file, replace the ... placeholder so that it matches the code below.

decending_nums <- sort(mixed_nums, decreasing = TRUE)
decending_nums
##  [1] 10  9  8  7  6  5  4  3  2  1
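If you no longer need the original order, one alternative is to assign the sorted result back to the same variable name. A minimal sketch, using a throwaway vector `temps` that is not part of the lab script:

```r
temps <- c(72, 65, 80, 68)

sort(temps)            # prints sorted values; `temps` is unchanged
## [1] 65 68 72 80
temps
## [1] 72 65 80 68

temps <- sort(temps)   # overwrite `temps` with its sorted version
temps
## [1] 65 68 72 80
```

Overwriting is convenient, but the original order is lost, so use a new variable name whenever you might need both versions.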

5.1.2 More summary functions

To introduce you to additional summary functions, run the associated code (shown below) in your own Rscript file. Similar to before, learn what each function is doing to the vector x. Note, the first 2 lines are generating data. You will learn more about data generation in the course.

set.seed(2021)
x <- c(rbinom(20, 10, 0.7), rbinom(10, 10, 0.3))
x
##  [1]  7  6  6  8  7  6  7  8  6  4 10  6  7  7  6  8  7  5  4  7  1  6  3  2  3
## [26]  5  5  5  5  4
x_sorted <- sort(x)
x_sorted
##  [1]  1  2  3  3  4  4  4  5  5  5  5  5  6  6  6  6  6  6  6  7  7  7  7  7  7
## [26]  7  8  8  8 10
min(x)
## [1] 1
max(x)
## [1] 10
range(x)
## [1]  1 10
sum(x)
## [1] 171
length(x)
## [1] 30
total <- sum(x)
n <- length(x)
avg <- total / n
avg
## [1] 5.7
mean(x)
## [1] 5.7
median(x)
## [1] 6
unique(x)
## [1]  7  6  8  4 10  5  1  3  2
table(x)
## x
##  1  2  3  4  5  6  7  8 10 
##  1  1  2  3  5  7  7  3  1
freq_dist <- table(x)
freq_dist
## x
##  1  2  3  4  5  6  7  8 10 
##  1  1  2  3  5  7  7  3  1
class(freq_dist)
## [1] "table"
quantile(x, 0.5)
## 50% 
##   6
quantile(x, 0.25)
## 25% 
##   5
quantile(x, 0.75)
## 75% 
##   7
var_x <- var(x)
var_x
## [1] 3.734483
sqrt(var_x)
## [1] 1.932481
sd(x)
## [1] 1.932481
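To see what var() and sd() are computing, the sketch below reproduces the sample variance by hand, \(s^2 = \sum (x_i - \bar{x})^2 / (n - 1)\), on a small made-up vector `vals`. All variable names here are chosen for illustration and are not in the lab script.

```r
vals <- c(2, 4, 4, 4, 5, 5, 7, 9)

n_vals     <- length(vals)     # sample size
avg_vals   <- mean(vals)       # sample mean
deviations <- vals - avg_vals  # distance of each value from the mean

# Sum of squared deviations, divided by (n - 1)
var_manual <- sum(deviations^2) / (n_vals - 1)
var_manual
## [1] 4.571429

# The hand calculation agrees with the built-in functions
all.equal(var_manual, var(vals))
## [1] TRUE
all.equal(sqrt(var_manual), sd(vals))
## [1] TRUE
```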

5.1.3 Missing values

There are several kinds of missing or special values, each with its own meaning.

  • NULL is an object that is returned when an expression or function results in an undefined value; it has zero length. In R, NULL (capital letters) is a reserved word and can also be the product of importing data with an unknown data type.

  • NA stands for “Not Available”. It is a logical constant of length 1 and is the indicator for a missing value. NA (capital letters) is a reserved word; it can be coerced to any other vector type (except raw) and can also be produced when importing data. NA and “NA” (a character string) are not interchangeable.

  • NaN stands for “Not a Number”. It is a numeric constant of length 1 and applies to numeric values, as well as to the real and imaginary parts of complex values, but not to integer vectors. NaN is a reserved word.

  • Inf and -Inf stand for positive and negative infinity. They result from a number too large for R to store or from dividing a nonzero number by zero. Inf is a reserved word; in most cases it is produced by computations in R and only rarely by data import. An infinite value is not missing: it is still treated as a number.

0 / 0
## [1] NaN
1 / 0
## [1] Inf
10^10 / Inf
## [1] 0
sin(Inf)
## Warning in sin(Inf): NaNs produced
## [1] NaN
cos(-Inf)
## Warning in cos(-Inf): NaNs produced
## [1] NaN
tan(Inf)
## Warning in tan(Inf): NaNs produced
## [1] NaN
args(mean)
## function (x, ...) 
## NULL

With respect to numeric operations, one must be aware of how missing values may affect results. For instance, a summary such as the mean of a vector that contains a missing value will itself be a missing value.

x
##  [1]  7  6  6  8  7  6  7  8  6  4 10  6  7  7  6  8  7  5  4  7  1  6  3  2  3
## [26]  5  5  5  5  4
mean(x)
## [1] 5.7
x[3] <- NA
x
##  [1]  7  6 NA  8  7  6  7  8  6  4 10  6  7  7  6  8  7  5  4  7  1  6  3  2  3
## [26]  5  5  5  5  4
mean(x)
## [1] NA

However, R does provide OPTIONS to allow you to calculate the mean while omitting the missing values. We will cover these function options/arguments in future lessons.

?mean
mean(x, na.rm = TRUE)
## [1] 5.689655
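The na.rm argument is not specific to mean(). As a sketch, here is the same idea applied to a few other summary functions on a small made-up vector `v` (the variable names are our own).

```r
v <- c(4, NA, 6)

sum(v)                             # the NA spreads to the result
## [1] NA

total_v <- sum(v, na.rm = TRUE)    # drop the NA, then add
total_v
## [1] 10

largest_v <- max(v, na.rm = TRUE)
largest_v
## [1] 6

avg_v <- mean(v, na.rm = TRUE)     # average of the 2 remaining values
avg_v
## [1] 5
```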

Similar strangeness occurs with logical comparisons: if we use an NA in a logical comparison, we will get an NA in return.

miss <- NA
miss
## [1] NA
miss == NA
## [1] NA

For this reason, all four null/missing data types have accompanying checker functions available in base R, each returning TRUE / FALSE values:

  • is.null()
  • is.na()
  • is.nan()
  • is.infinite()

You may even notice that RStudio offers a little warning in the form of a yellow triangle next to the line miss == NA. If you hover over the triangle, it will tell you to use the is.na() function instead.

is.na(miss)
## [1] TRUE
is.na(x)
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is.na(x))
## [1] 1
!is.na(x)
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
x[!is.na(x)]
##  [1]  7  6  8  7  6  7  8  6  4 10  6  7  7  6  8  7  5  4  7  1  6  3  2  3  5
## [26]  5  5  5  4
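The four checker functions listed earlier can be compared side by side. This sketch stores each result in a variable with a made-up name so the answers can be inspected afterwards; it also shows one overlap worth knowing about, namely that is.na() is TRUE for NaN as well.

```r
# Each checker applied to the value it is designed to detect
null_found <- is.null(NULL)
na_found   <- is.na(NA)
nan_found  <- is.nan(0 / 0)
inf_found  <- is.infinite(-1 / 0)
c(null_found, na_found, nan_found, inf_found)
## [1] TRUE TRUE TRUE TRUE

# Overlap: NaN counts as missing, but NA is not NaN
nan_is_na <- is.na(NaN)
nan_is_na
## [1] TRUE
na_is_nan <- is.nan(NA)
na_is_nan
## [1] FALSE

# NULL has no length at all, while NA occupies one position
length(NULL)
## [1] 0
length(NA)
## [1] 1
```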

5.2 Character Vectors

In computer science, a “string” is a term used for variables that are made up of characters, and can include letters, words, phrases, or symbols. They are typically recognized by being wrapped in quotation marks.

Let’s think back to the days of early elementary school. You probably learned about the color wheel with red, blue, and yellow as the primary colors, and purple, green, and orange as secondary colors. But did you know that you were lied to?! The actual color wheel is composed of magenta, yellow, and cyan, with red, green, and blue as secondary colors. Google it.

ANYway… let’s take these primary and secondary colors, and randomly sample them (like putting one of each color into a hat, drawing one out, recording the result, then putting the color back, mixing and repeating) 25 times.

# Define the basic colors to be sampled
cmyk <- c("magenta", "red", "yellow", "green", "cyan", "blue", "black")
# Print to check
cmyk
## [1] "magenta" "red"     "yellow"  "green"   "cyan"    "blue"    "black"
# Sample the colors, with replacement
colorful <- sample(cmyk, 25, replace = TRUE)
# Print to check
colorful
##  [1] "cyan"    "blue"    "black"   "red"     "yellow"  "green"   "cyan"   
##  [8] "blue"    "cyan"    "magenta" "black"   "blue"    "red"     "yellow" 
## [15] "yellow"  "red"     "blue"    "blue"    "blue"    "black"   "black"  
## [22] "red"     "cyan"    "blue"    "yellow"
# Check the variable's class
class(colorful)
## [1] "character"

Remember, applying a function to a variable does NOT alter the variable.

# PRINT the sorted colors
sort(colorful)
##  [1] "black"   "black"   "black"   "black"   "blue"    "blue"    "blue"   
##  [8] "blue"    "blue"    "blue"    "blue"    "cyan"    "cyan"    "cyan"   
## [15] "cyan"    "green"   "magenta" "red"     "red"     "red"     "red"    
## [22] "yellow"  "yellow"  "yellow"  "yellow"
# Yet the variable `colorful` is not itself sorted
colorful
##  [1] "cyan"    "blue"    "black"   "red"     "yellow"  "green"   "cyan"   
##  [8] "blue"    "cyan"    "magenta" "black"   "blue"    "red"     "yellow" 
## [15] "yellow"  "red"     "blue"    "blue"    "blue"    "black"   "black"  
## [22] "red"     "cyan"    "blue"    "yellow"
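Several of the functions you used on numeric vectors also work on character vectors, and a few are specific to text. The sketch below uses a small made-up vector `fruits`; the variable names are chosen for this example only.

```r
fruits <- c("apple", "kiwi", "apple", "fig")

length(fruits)              # how many elements
## [1] 4

distinct_fruits <- unique(fruits)   # each value once, in order of appearance
length(distinct_fruits)
## [1] 3

letters_each <- nchar(fruits)       # number of characters in each element
letters_each
## [1] 5 4 5 3

shouted <- toupper(fruits)          # returns an upper-case copy
shouted[1]
## [1] "APPLE"

# Count how many elements are exactly "apple"
apple_count <- sum(fruits == "apple")
apple_count
## [1] 2
```

As with sort(), toupper() returns a modified copy; `fruits` itself is unchanged unless you assign the result back.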

5.2.1 Factor variables

Factor variables are similar to character variables, except they have their own unique ordering. These are often used in statistical experiments where we are dealing with categorical (or “nominal”) data.

This unique ordering can be illustrated with the colors we just used! Notice how the sort function automatically went to alphabetical sorting, whereas the color spectrum has its own meaningful ordering.

Let’s start by converting the variable from a character variable to a factor variable.

# Convert the variable by coercing to factor and storing the result in a new variable
colorful_fctr <- as.factor(colorful)
class(colorful_fctr)
## [1] "factor"

Note below that when we print the factor version of the sampled colors, we can tell it is a factor variable because it ends with a new field, Levels:, which tells us the ordering for the vector.

colorful_fctr
##  [1] cyan    blue    black   red     yellow  green   cyan    blue    cyan   
## [10] magenta black   blue    red     yellow  yellow  red     blue    blue   
## [19] blue    black   black   red     cyan    blue    yellow 
## Levels: black blue cyan green magenta red yellow

We can see that, by default, as.factor() uses alphabetical order for the levels, so sorting the whole vector is also alphabetical.

levels(colorful_fctr)
## [1] "black"   "blue"    "cyan"    "green"   "magenta" "red"     "yellow"
sort(colorful_fctr)
##  [1] black   black   black   black   blue    blue    blue    blue    blue   
## [10] blue    blue    cyan    cyan    cyan    cyan    green   magenta red    
## [19] red     red     red     yellow  yellow  yellow  yellow 
## Levels: black blue cyan green magenta red yellow

If we want to choose our own custom ordering, we have to build our factor using the factor() function, and tell it what order we want to impose (in this case, the order given in cmyk).

# Print out `cmyk` just to look at the desired ordering
cmyk
## [1] "magenta" "red"     "yellow"  "green"   "cyan"    "blue"    "black"
# Create the factor variable
colorful_fctr <- 
    factor(colorful, 
           levels = cmyk)
# Check the class
class(colorful_fctr)
## [1] "factor"
# Print to check
colorful_fctr
##  [1] cyan    blue    black   red     yellow  green   cyan    blue    cyan   
## [10] magenta black   blue    red     yellow  yellow  red     blue    blue   
## [19] blue    black   black   red     cyan    blue    yellow 
## Levels: magenta red yellow green cyan blue black
# See the levels
levels(colorful_fctr)
## [1] "magenta" "red"     "yellow"  "green"   "cyan"    "blue"    "black"
# See how the order of these new levels influences how the data is sorted.
sort(colorful_fctr)
##  [1] magenta red     red     red     red     yellow  yellow  yellow  yellow 
## [10] green   cyan    cyan    cyan    cyan    blue    blue    blue    blue   
## [19] blue    blue    blue    black   black   black   black  
## Levels: magenta red yellow green cyan blue black

As you can see above, the result is now sorted according to the order WE defined!
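One way to see why the levels control the sort order is to look at the integer codes R stores behind a factor. The sketch below uses a made-up vector of clothing sizes; `sizes`, `sizes_fctr`, and the other names are hypothetical and not part of the lab script.

```r
sizes <- c("medium", "small", "large", "small")

# Impose a custom order: small < medium < large
sizes_fctr <- factor(sizes, levels = c("small", "medium", "large"))

# Each element is stored as the position of its level
size_codes <- as.integer(sizes_fctr)
size_codes
## [1] 2 1 3 1

nlevels(sizes_fctr)
## [1] 3

# Sorting the factor follows the codes: small, small, medium, large
sorted_sizes <- as.character(sort(sizes_fctr))

# Sorting the plain character vector is alphabetical: large, medium, small, small
sorted_alpha <- sort(sizes)
```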

We can also tabulate a factor OR character vector to see how often a value occurs within.

color_counts <- 
    table(colorful)
color_counts
## colorful
##   black    blue    cyan   green magenta     red  yellow 
##       4       7       4       1       1       4       4
color_counts <- 
    table(colorful_fctr)
color_counts
## colorful_fctr
## magenta     red  yellow   green    cyan    blue   black 
##       1       4       4       1       4       7       4
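Once a table is stored in a variable, you can pull individual counts out of it. The following sketch works on a small made-up vector `votes`; all names are chosen for illustration.

```r
votes <- c("a", "b", "a", "c", "a", "b")
vote_counts <- table(votes)

# Count for a single category, looked up by name
a_count <- vote_counts[["a"]]
a_count
## [1] 3

# The categories themselves
names(vote_counts)
## [1] "a" "b" "c"

# The most frequent category
top_vote <- names(vote_counts)[which.max(vote_counts)]
top_vote
## [1] "a"

# Proportions instead of counts
vote_props <- prop.table(vote_counts)
```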

6 # Comments

It is difficult to overstate the importance of comments in one’s code.

Code commenting is the practice of sprinkling short, typically single-line notes throughout your code. These notes are called comments. They explain how your program works, and your intentions behind it.

Comments don’t have any effect on your program, but they are invaluable for people reading your code. Here’s an example of code commenting in action:

# Any and all lines beginning with a hashtag will be
# readable by YOU, but will be completely ignored by R

# Note that the variable `x` will not be printed to the console

# x
y
## [1] 10
z
## [1] 10
n
## [1] 30

Please scroll up and see where I have and have not included comments in my code. Comparing the code WITH comments to the code without, which do YOU think is easier to understand and follow?

Code comments are useful for several purposes. A code comment can:

  • Explain what a particular function does
  • Explain something which might not be obvious to the reader
  • Clarify your intention behind a certain line or block of code
  • Serve as a reminder to change something in the future.

6.0.1 Commenting out code

Code commenting also becomes invaluable when you want to ‘comment out’ code. This is when you turn a block of code into a comment because you don’t want that code to be run, but you still want to keep it on hand in case you want to re-add it. We will cover examples of this type in the future.
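As a brief preview, here is a minimal sketch of what commenting out a line might look like. The circle-area calculation and the variable names are invented for this example.

```r
radius <- 2

# The older, less precise version is kept on hand but will not run:
# area <- 3.14 * radius^2

# The current version uses R's built-in constant `pi`
area <- pi * radius^2
area
## [1] 12.56637
```

Removing the leading # from the commented line would make it run again.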


Congratulations! You’ve completed Lab 1A. For grading purposes, please submit a screenshot of RStudio with the corrected value of t_stat printed out in the console.