The goal of this assignment is to continue our exploration of matrices and data frames. You will need the following files for this lab: ca_pop.csv, ca_edu.xlsx, ca_med_inc.sav. You will need the following R packages: palmerpenguins and rmarkdown. Please use the RMD file FIRST_LAST_HW_02.Rmd to answer your questions. Delete code or answer chunks as necessary. You will submit both the RMD file and the corresponding HTML file from the knitted document.
Note: you may need to search the internet to find answers for some of the questions.
You will be graded as follows:
(20 pts) Have you completed the assignment in its entirety?
(20 pts) Are your responses correct for the subset of randomly graded questions?
(5 pts) CODE DOCUMENTATION. Add comments to nearly every line explaining what the line of code is doing.
(5 pts) Submitted Knitted Document
This part is to be completed in Lab 2A. Below is a description of what the lab requires.
Load the following data sets in R: ca_pop.csv, ca_edu.xlsx, ca_med_inc.sav.
Merge the data frames into one giant data frame.
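As a rough illustration of merging, the sketch below joins two made-up data frames on a shared key with base R's merge function. The data frames pop_toy and inc_toy and the key column county are hypothetical; the real files may use different column names, so check them first.

```r
# Two hypothetical data frames that share a key column named `county`
pop_toy <- data.frame(county = c("A", "B", "C"),
                      pop    = c(100, 250, 400))
inc_toy <- data.frame(county = c("A", "B", "C"),
                      income = c(50000, 62000, 71000))

# merge() matches rows on the shared key; `by` names the key column
merged_toy <- merge(pop_toy, inc_toy, by = "county")

# Print the result: one row per county, with columns from both inputs
print(merged_toy)
```

To combine three data frames, one option is to merge the first two and then merge that result with the third.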
Create an indicator variable for counties located in Southern California, using the vector of county names below.
socal <- c("San Luis Obispo County, California", "Kern County, California",
"San Bernardino County, California", "Santa Barbara County, California",
"Ventura County, California","Los Angeles County, California",
"Orange County, California", "Riverside County, California",
"San Diego County, California", "Imperial County, California")
Create a new variable indicating how many individuals have received a Bachelor’s Degree for each county.
How many individuals have a Bachelor’s degree in Southern California?
What is the average proportion of individuals in Southern California who have a Bachelor’s Degree?
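The indicator and the regional summaries above follow a common pattern: flag rows whose name appears in a set, then subset by the flag. The sketch below shows that pattern on made-up data; toy, region, count, and flag_set are hypothetical names, not columns from the lab files.

```r
# Hypothetical data: four regions with a count for each
toy <- data.frame(region = c("North", "South", "East", "West"),
                  count  = c(10, 20, 30, 40))

# Hypothetical set of names to flag
flag_set <- c("South", "West")

# %in% returns TRUE where an element of `region` appears in `flag_set`
toy$in_set <- toy$region %in% flag_set

# Use the logical indicator to subset before summarizing
total_in_set <- sum(toy$count[toy$in_set])   # 20 + 40 = 60
mean_in_set  <- mean(toy$count[toy$in_set])  # (20 + 40) / 2 = 30

print(toy)
```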
Correlation measures the linear association between two numeric variables. It takes a value between -1 and 1. The closer the value is to 0, the weaker the evidence of a linear association. The closer the value is to 1 or -1, the stronger the evidence of a positive or negative association, respectively. You can find the correlation between two vectors using the cor function. You only need to specify the two variables, separated by a comma, in the function call.
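A minimal sketch of cor on made-up vectors is shown here. The vectors x, y, and z and the logical vector keep are hypothetical and only illustrate the call, including how to restrict the calculation to a subset of observations.

```r
# Hypothetical numeric vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # an exact increasing linear function of x
z <- c(5, 3, 4, 1, 2)   # tends to decrease as x increases

# Perfect positive linear association
cor(x, y)  # 1

# Strong negative association
cor(x, z)  # -0.8

# Restrict to a subset by indexing both vectors with the same logical vector
keep <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
cor(x[keep], z[keep])  # -0.5
```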
Find the correlation between median income and Proportion of Bachelor’s Degree in each county.
What is the correlation between median income and proportion of Bachelor’s degree in Southern California?
Recreate the matrices below. Please use the same names as those shown.
sqr_even_mat
##      [,1] [,2] [,3] [,4]
## [1,]    2    4    6    8
## [2,]   10   12   14   16
## [3,]   18   20   22   24
## [4,]   26   28   30   32
two_col_mat
##      [,1] [,2]
## [1,]    8    6
## [2,]   16   14
## [3,]   24   22
## [4,]   32   30
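As a hedged illustration of the tools involved, the sketch below builds a small matrix row by row and then selects columns by index. The names odd_toy and sub_toy are hypothetical and the values differ from the matrices above on purpose.

```r
# Fill a 2 x 3 matrix row by row with the first six odd numbers
odd_toy <- matrix(seq(1, 11, by = 2), nrow = 2, byrow = TRUE)
print(odd_toy)

# Select columns by index, in any order, to build a smaller matrix
sub_toy <- odd_toy[, c(3, 1)]
print(sub_toy)
```

By default matrix fills column by column, so byrow = TRUE is what makes the values run across each row.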
Multiply the two matrices two_col_mat and sqr_even_mat. Store the result in a variable named prod_mat. Print the matrix to check.
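Matrix multiplication in R uses the %*% operator, and the order of the operands matters because the inner dimensions must agree. The sketch below uses hypothetical matrices a_toy and b_toy to show one order that works and one that does not.

```r
# Hypothetical matrices, filled column by column
a_toy <- matrix(1:6, nrow = 3)  # 3 x 2
b_toy <- matrix(1:9, nrow = 3)  # 3 x 3

# Conformable: (3 x 3) times (3 x 2) gives a 3 x 2 matrix
ba <- b_toy %*% a_toy
dim(ba)

# Not conformable: a_toy has 2 columns but b_toy has 3 rows
result <- try(a_toy %*% b_toy, silent = TRUE)
inherits(result, "try-error")
```

Note that * performs element-wise multiplication instead, which is a different operation.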
Let’s return to the two original matrices, sqr_even_mat and two_col_mat. Bind the columns together, resulting in a matrix with 6 columns. Place sqr_even_mat on the left and two_col_mat on the right. Store the result as six_col_mat.
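Column binding can be sketched with cbind on two hypothetical matrices, left_toy and right_toy, which are not the matrices from this question.

```r
# Hypothetical matrices with the same number of rows
left_toy  <- matrix(1:4, nrow = 2)   # 2 x 2
right_toy <- matrix(5:10, nrow = 2)  # 2 x 3

# cbind() places the matrices side by side, in the order given
wide_toy <- cbind(left_toy, right_toy)

# The result keeps the 2 rows and has 2 + 3 = 5 columns
dim(wide_toy)
```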
Next we want to bind the rows. What is wrong with the following code?
rbind(sqr_even_mat, two_col_mat)
## Error in rbind(sqr_even_mat, two_col_mat): number of columns of matrices must match (see arg 2)
The next questions will involve the penguins data set from the palmerpenguins package.
For each penguin species, find the mean bill_length_mm. Remember to eliminate the missing observations.
For each species and sex, find the mean bill_length_mm.
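One possible way to compute group means is base R's aggregate with a formula, sketched below on a made-up data frame. The names toy, group, type, and value are hypothetical; with the formula interface, rows with missing values are dropped by default.

```r
# Hypothetical data with one missing value
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  type  = c("x", "y", "x", "x", "y"),
  value = c(1, 3, NA, 4, 6)
)

# Mean of `value` within each level of `group`
by_group <- aggregate(value ~ group, data = toy, FUN = mean)
print(by_group)

# Mean of `value` within each combination of `group` and `type`
by_both <- aggregate(value ~ group + type, data = toy, FUN = mean)
print(by_both)
```

Other approaches, such as tapply, work as well; this is only one option.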
Create two new variables indicating whether each penguin is at or above the \(50^{th}\) percentile for the variables bill_length_mm and bill_depth_mm. Generate a \(2\times 2\) contingency table from the two new variables.
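A percentile indicator and a contingency table can be sketched as follows on made-up vectors. The names len_toy, dep_toy, len_top, and dep_top are hypothetical, and the \(50^{th}\) percentile is taken to be the median.

```r
# Hypothetical measurements, one of them missing
len_toy <- c(10, 20, 30, 40, NA)
dep_toy <- c(4, 1, 3, 2, 5)

# TRUE when a value is at or above the median; NA stays NA
len_top <- len_toy >= median(len_toy, na.rm = TRUE)  # median is 25
dep_top <- dep_toy >= median(dep_toy, na.rm = TRUE)  # median is 3

# Cross-tabulate the two indicators; pairs containing NA are dropped by default
tab <- table(len_top, dep_top)
print(tab)
```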
Using the new variables, find the mean body_mass_g for each level of the \(50^{th}\) percentile indicator variables for bill_length_mm and bill_depth_mm.
Find the ratio of the standardized bill_length_mm and standardized bill_depth_mm for each island: \(\frac{std(bill\_length\_mm)}{std(bill\_depth\_mm)}\). To standardize a variable, use the z-score formula: \[
z = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}.
\]
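The z-score formula translates directly into R. The sketch below applies it to a hypothetical vector x_toy and compares the result with base R's scale function, which centers and scales by default.

```r
# Hypothetical numeric vector
x_toy <- c(2, 4, 6, 8)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (x_toy - mean(x_toy)) / sd(x_toy)

# scale() returns a one-column matrix, so convert it back to a plain vector
z_scale <- as.numeric(scale(x_toy))

# Both approaches should agree up to floating-point error
all.equal(z_manual, z_scale)
```

If the variable contains missing values, mean and sd need na.rm = TRUE for the manual version to return non-missing results.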
The readr package has two separate functions to read data that are separated by white space (space, tab, etc…): read_table and read_table2. Read the help documentation for the functions and explain why you may want to use one function over the other.
Both the lapply and sapply functions belong to the same family of *apply functions, where a user-specified function is applied to an R object. What is the difference between the sapply and lapply functions?
In linear algebra, the Kronecker product creates block matrices from two matrices. The Kronecker product multiplies each element of the first matrix by the entire second matrix. What is the function in R to compute the Kronecker product?
What function do you use to view what files are in your working directory?
Fix the code below:
tapply(penguins$bill_length_mm, list(penguins$island), mean)
##    Biscoe     Dream Torgersen
##        NA  44.16774        NA