Instructions

The goal of this assignment is to continue our exploration of matrices and data frames. You will need the following files for this lab: ca_pop.csv, ca_edu.xlsx, and ca_med_inc.sav. You will also need the following R packages: palmerpenguins and rmarkdown. Use the RMD file FIRST_LAST_HW_02.Rmd to answer the questions, deleting code or answer chunks as necessary. Submit both the RMD file and the HTML file produced by knitting it.

Note: you may need to search the internet to find answers for some of the questions.

You will be graded as follows:

  • Do your R chunks run (some errors are acceptable in this assignment)?
  • Have you completed the assignment in its entirety?
  • Have you followed the instructions carefully?
  • Have you responded to the questions correctly?

Grading

  • (20 pts) Have you completed the assignment in its entirety?

  • (20 pts) Are your responses correct for the subset of randomly graded questions?

  • (5 pts) CODE DOCUMENTATION. Add comments to nearly every line explaining what the line of code is doing.

  • (5 pts) Submitted Knitted Document


Questions

Part 1

Q1

Part A

Part A is to be completed in Lab 2A. Below is a description of what the lab requires.

  • Load the following data sets in R: ca_pop.csv, ca_edu.xlsx, ca_med_inc.sav.

  • Merge the data frames into one giant data frame.

  • Create an indicator variable for counties located in Southern California, using the following vector of county names:

# County names (as they appear in the data) that count as Southern California
socal <- c("San Luis Obispo County, California", "Kern County, California",
           "San Bernardino County, California", "Santa Barbara County, California",
           "Ventura County, California", "Los Angeles County, California",
           "Orange County, California", "Riverside County, California",
           "San Diego County, California", "Imperial County, California")
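The merge and indicator steps described above can be sketched on toy data. This is only an illustration: the data frames, the key column name `county`, and the values are all made up, and the real files may use different column names.

```r
# Toy data frames; names and values are hypothetical, not from the lab files.
pop <- data.frame(county = c("A County", "B County", "C County"),
                  population = c(1000, 2500, 400))
inc <- data.frame(county = c("A County", "B County", "C County"),
                  med_income = c(52000, 61000, 47000))

# Merge the two data frames on their shared key column.
combined <- merge(pop, inc, by = "county")

# Indicator variable: TRUE when the county name appears in the lookup vector.
south <- c("A County", "C County")
combined$is_south <- combined$county %in% south

print(combined)
```

To combine more than two data frames, `merge` can be applied repeatedly, merging the result with the next data frame each time.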

Part B

1.

Create a new variable indicating how many individuals in each county have received a Bachelor’s degree.

2.

How many individuals have a Bachelor’s degree in Southern California?

3.

What is the average proportion of individuals in Southern California who have a Bachelor’s Degree?

4.

Correlation measures the strength of the linear association between two numeric variables. It takes a value between -1 and 1. The closer the value is to 0, the weaker the linear association. The closer the value is to 1 or -1, the stronger the positive or negative association, respectively.

You can find the correlation between two vectors using the cor function. Pass the two variables as arguments, separated by a comma.
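As a small illustration of how cor behaves at the extremes, here is a sketch on made-up vectors (not the lab data). One vector increases exactly linearly with x and the other decreases exactly linearly.

```r
# Toy vectors for illustration only.
x     <- c(1, 2, 3, 4, 5)
y_pos <- c(2, 4, 6, 8, 10)   # increases linearly with x
y_neg <- c(10, 8, 6, 4, 2)   # decreases linearly with x

r_pos <- cor(x, y_pos)  # perfect positive linear association
r_neg <- cor(x, y_neg)  # perfect negative linear association

print(c(r_pos, r_neg))
```

If either vector contains missing values, cor returns NA by default; the `use` argument (for example `use = "complete.obs"`) controls how missing values are handled.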

Find the correlation between median income and the proportion of individuals with a Bachelor’s degree across all counties.

5.

What is the correlation between median income and the proportion of individuals with a Bachelor’s degree in Southern California?

Part 2

Q2

Recreate the matrices below. Please use the same names as those shown.

sqr_even_mat
##      [,1] [,2] [,3] [,4]
## [1,]    2    4    6    8
## [2,]   10   12   14   16
## [3,]   18   20   22   24
## [4,]   26   28   30   32
two_col_mat
##      [,1] [,2]
## [1,]    8    6
## [2,]   16   14
## [3,]   24   22
## [4,]   32   30

Q3

Multiply the two matrices two_col_mat and sqr_even_mat. Store the result in a variable named prod_mat. Print the matrix to check.
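As a reminder of how matrix multiplication works in R, here is a sketch on two small made-up matrices (not the ones from Q2). The `%*%` operator performs matrix multiplication and requires the number of columns of the left matrix to equal the number of rows of the right matrix, whereas `*` multiplies element by element.

```r
# Toy matrices for illustration only; matrix() fills column by column.
a <- matrix(1:6, nrow = 2)   # 2 x 3
b <- matrix(1:6, nrow = 3)   # 3 x 2

# Check conformability before multiplying.
conformable <- ncol(a) == nrow(b)

# Matrix product: (2 x 3) times (3 x 2) gives a 2 x 2 matrix.
ab <- a %*% b

print(dim(ab))
print(ab)
```

Checking `dim()` on each matrix first is a quick way to decide which order of multiplication is valid.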

Q4

Let’s return to the two original matrices, sqr_even_mat and two_col_mat. Bind the columns together, resulting in a matrix with 6 columns. Place sqr_even_mat on the left and two_col_mat on the right. Store the result as six_col_mat.

Q5

Next we want to bind the rows. What is wrong with the following code?

rbind(sqr_even_mat, two_col_mat)
## Error in rbind(sqr_even_mat, two_col_mat): number of columns of matrices must match (see arg 2)

Part 3

The next questions will involve the penguins data set from the palmerpenguins package.

Q6

For each penguin species, find the mean bill_length_mm. Remember to eliminate the missing observations.

Q7

For each species and sex, find the mean bill_length_mm.

Q8

Create two new variables indicating whether each penguin is at or above the \(50^{th}\) percentile for bill_length_mm and for bill_depth_mm, respectively. Generate a \(2\times 2\) contingency table of the two indicators.

Q9

Using the new variables, find the mean body_mass_g for each level of the \(50^{th}\) percentile indicator variables for bill_length_mm and bill_depth_mm.

Q10

Find the ratio of the standardized bill_length_mm and standardized bill_depth_mm for each island: \(\frac{std(bill\_length\_mm)}{std(bill\_depth\_mm)}\). To standardize a variable, use the z-score formula: \[ z = \frac{x-mean(x)}{sd(x)}. \]
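The z-score formula can be sketched on a made-up vector (not the penguins data). After standardizing, the values have mean 0 and standard deviation 1.

```r
# Toy vector for illustration only.
x <- c(2, 4, 6, 8, 10)

# z-score: subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

print(z)
print(mean(z))  # approximately 0
print(sd(z))    # approximately 1
```

Base R's scale function applies the same centering and scaling by default. If a variable contains missing values, mean and sd return NA unless `na.rm = TRUE` is supplied.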

Part 4

Q11

The readr package has two separate functions to read data that are separated by white space (space, tab, etc…): read_table and read_table2. Read the help documentation for the functions and explain why you may want to use one function over the other function.

Q12

Both the lapply and sapply functions belong to the same family of *apply functions, in which a user-specified function is applied to an R object. What is the difference between the sapply and lapply functions?

Q13

In linear algebra, the Kronecker product creates a block matrix from two matrices. The Kronecker product multiplies each element of the first matrix by the entire second matrix. What is the function in R that computes the Kronecker product?

Q14

What function do you use to view what files are in your working directory?

Q15

Fix the code below:

tapply(penguins$bill_length_mm, list(penguins$island), mean)
##    Biscoe     Dream Torgersen 
##        NA  44.16774        NA