Data Processing

Isaac Quintanilla Salinas

UC Riverside

4/21/2022

Presentation Online

Presentation:

www.inqs.info/files/hiss_3/hiss_3.html

QMD:

www.inqs.info/files/hiss_3/hiss_3.qmd

Website:

www.inqs.info

Email:

iquin002@ucr.edu

Data Cleaning

dplyr

Known as the Grammar of Data Manipulation
dplyr.tidyverse.org

dplyr Functions

mutate() adds new variables
select() selects variables
filter() filters data
if_else() conditional function that returns one of two values
group_by() groups a dataset by one or more variables
summarise() provides summaries of data

tidyr

Used to create tidy data
tidyr.tidyverse.org

tidyr Functions

pivot_longer() (formerly gather()) transforms the data from wide to long
pivot_wider() (formerly spread()) transforms the data from long to wide
separate() separates one variable into multiple variables
unite() merges multiple variables into one variable
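
unite() is listed above but not demonstrated later, so here is a minimal sketch of it together with separate(). The `dates` tibble and its column names are hypothetical toy data, not part of the presentation:

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (not from the presentation)
dates <- tibble(year = c(2021, 2022), month = c(4, 12))

# unite() pastes the chosen columns into one column, using sep between values
united <- dates %>%
  unite(col = "year_month", year, month, sep = "-")

# separate() reverses the operation; the new columns are character vectors
separated <- united %>%
  separate(col = year_month, into = c("year", "month"), sep = "-")

united
separated
```

By default unite() drops the input columns and separate() drops the column it splits.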

Pipe Operator `%>%`

The pipe operator is the real power of the tidyverse.
It takes the output of one function and uses it as the first argument of the next function.
Tidyverse functions work best when data frames (tibbles) are used as inputs.
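
To make the idea concrete, the sketch below writes the same two-step operation twice: once as a nested call and once with the pipe. The object names `nested` and `piped` are illustrative only:

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- head(select(mtcars, mpg, hp), n = 3)

# Piped call: reads top to bottom, one step per line
piped <- mtcars %>%
  select(mpg, hp) %>%
  head(n = 3)

# Both approaches build the same data frame
identical(nested, piped)
```

The piped version scales better as more steps are added, which is why the examples in this presentation are written that way.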

Data Set

We will work on manipulating the mtcars data set
The code below prints its first three rows:

mtcars %>% 
  head(n=3)

               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

`mutate()`

Adds a new variable to a data frame
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg)) %>% 
  head(n=3)

               mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761

`mutate()`

Each argument adds a new variable
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  head(n=3)

               mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
                log_hp
Mazda RX4     4.700480
Mazda RX4 Wag 4.700480
Datsun 710    4.532599

`select()`

Selects the variables to keep in the data frame
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>% 
  head(n=3)

               mpg  log_mpg  hp   log_hp
Mazda RX4     21.0 3.044522 110 4.700480
Mazda RX4 Wag 21.0 3.044522 110 4.700480
Datsun 710    22.8 3.126761  93 4.532599

`filter()`

Selects observations that satisfy a condition
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>% 
  head(n=3)

               mpg  log_mpg  hp   log_hp
Mazda RX4     21.0 3.044522 110 4.700480
Mazda RX4 Wag 21.0 3.044522 110 4.700480
Datsun 710    22.8 3.126761  93 4.532599

`if_else()`

Returns one value if the condition is met (TRUE) and another value otherwise (FALSE); here the two values are 1 and 0
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  head(n=3)

               mpg  log_mpg  hp   log_hp hilhp
Mazda RX4     21.0 3.044522 110 4.700480     1
Mazda RX4 Wag 21.0 3.044522 110 4.700480     1
Datsun 710    22.8 3.126761  93 4.532599     1
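
The two returned values do not have to be 1 and 0. The sketch below returns text labels instead; the column name `hp_level` and the labels are hypothetical choices for illustration. Unlike base R's ifelse(), if_else() requires both return values to be of the same type:

```r
library(dplyr)

hp_labels <- mtcars %>%
  # Label each car by comparing its horsepower with the overall mean
  mutate(hp_level = if_else(hp > mean(hp), "high", "low")) %>%
  select(hp, hp_level) %>%
  head(n = 3)

hp_labels
```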

`group_by()`

Groups the data frame by one or more variables, so later operations are carried out within each group
Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  group_by(hilhp) %>% 
  head(n=3)

# A tibble: 3 × 5
# Groups:   hilhp [1]
    mpg log_mpg    hp log_hp hilhp
  <dbl>   <dbl> <dbl>  <dbl> <dbl>
1  21      3.04   110   4.70     1
2  21      3.04   110   4.70     1
3  22.8    3.13    93   4.53     1
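
Grouping does not change the rows themselves; it changes how later functions behave. As a minimal sketch (the variable name `mpg_centered` is an assumption made for this example), a mutate() after group_by() computes the mean separately within each group:

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  # mean(mpg) is evaluated within each cyl group, not over the whole data set
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  # Remove the grouping so later steps work on the full data again
  ungroup()

by_cyl %>%
  select(cyl, mpg, mpg_centered) %>%
  head(n = 3)
```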

`summarise()`

Creates summary statistics for variables

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  group_by(hilhp) %>%
  summarise(mean_mpg=mean(mpg),mean_lmpg=mean(log_mpg),
            sd_mpg=sd(mpg),sd_lmpg=sd(log_mpg)) %>%
  head(n=3)

# A tibble: 2 × 5
  hilhp mean_mpg mean_lmpg sd_mpg sd_lmpg
  <dbl>    <dbl>     <dbl>  <dbl>   <dbl>
1     0     29.7      3.38   3.85   0.133
2     1     22.0      3.08   3.46   0.148

Wide to Long Example

Wide to Long Data Example

We will convert data from wide to long using the functions in the tidyr package. For many statistical analyses, long data is necessary.

Load Data

Use read_csv() from the readr package to read data_3_4.csv into an object called data1:

data1 <- read_csv(file="http://www.inqs.info/files/hiss_3/data_3_4.csv")

Wide Data

 [1] "ID1"       "v1/mean"   "v1/sd"     "v1/median" "v2/mean"   "v2/sd"    
 [7] "v2/median" "v3/mean"   "v3/sd"     "v3/median" "v4/mean"   "v4/sd"    
[13] "v4/median"

# A tibble: 6 × 13
  ID1   v1/me…¹ `v1/sd` v1/me…² v2/me…³ `v2/sd` v2/med…⁴ v3/me…⁵ `v3/sd` v3/me…⁶
  <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
1 Ad91…   3.11    2.86     4.50   1.93    3.21   3.27       2.65  -0.383    3.23
2 A9c5…   2.03    2.90     2.08   0.709   2.27   4.13       1.45   2.01     2.84
3 A28a…  -0.415   2.42     2.47   2.38   -0.820  1.22       3.44   1.63     2.10
4 Aaf5…   1.25    2.24     3.71   4.00    0.456  4.32       1.54   0.789    4.08
5 A370…  -0.984   0.972    3.73   2.19   -0.184  2.14       4.32  -0.804    5.38
6 Aea9…   1.42    1.34     2.35   2.77    4.16  -0.00874   -3.02   4.25     6.36
# … with 3 more variables: `v4/mean` <dbl>, `v4/sd` <dbl>, `v4/median` <dbl>,
#   and abbreviated variable names ¹`v1/mean`, ²`v1/median`, ³`v2/mean`,
#   ⁴`v2/median`, ⁵`v3/mean`, ⁶`v3/median`

Long Data

# A tibble: 10 × 5
   ID1       time    mean     sd  median
   <chr>     <chr>  <dbl>  <dbl>   <dbl>
 1 Ad9131ee9 v1     3.11   2.86   4.50  
 2 Ad9131ee9 v2     1.93   3.21   3.27  
 3 Ad9131ee9 v3     2.65  -0.383  3.23  
 4 Ad9131ee9 v4     0.605  0.883  4.65  
 5 A9c5988ea v1     2.03   2.90   2.08  
 6 A9c5988ea v2     0.709  2.27   4.13  
 7 A9c5988ea v3     1.45   2.01   2.84  
 8 A9c5988ea v4     0.710  3.03  -0.0898
 9 A28a5479d v1    -0.415  2.42   2.47  
10 A28a5479d v2     2.38  -0.820  1.22

`pivot_longer()`

The pivot_longer() function gathers the variables that are repeated within an observation and places them in one variable:

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  head()

# A tibble: 6 × 3
  ID1       measurement value
  <chr>     <chr>       <dbl>
1 Ad9131ee9 v1/mean      3.11
2 Ad9131ee9 v1/sd        2.86
3 Ad9131ee9 v1/median    4.50
4 Ad9131ee9 v2/mean      1.93
5 Ad9131ee9 v2/sd        3.21
6 Ad9131ee9 v2/median    3.27

`separate()`

The separate() function separates one variable into multiple variables:

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/") %>% 
  head()

# A tibble: 6 × 4
  ID1       time  stat   value
  <chr>     <chr> <chr>  <dbl>
1 Ad9131ee9 v1    mean    3.11
2 Ad9131ee9 v1    sd      2.86
3 Ad9131ee9 v1    median  4.50
4 Ad9131ee9 v2    mean    1.93
5 Ad9131ee9 v2    sd      3.21
6 Ad9131ee9 v2    median  3.27

`pivot_wider()`

The pivot_wider() function then converts long data to wide data.

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/") %>% 
  pivot_wider(names_from = stat,values_from = value) %>% 
  head()

# A tibble: 6 × 5
  ID1       time   mean     sd median
  <chr>     <chr> <dbl>  <dbl>  <dbl>
1 Ad9131ee9 v1    3.11   2.86    4.50
2 Ad9131ee9 v2    1.93   3.21    3.27
3 Ad9131ee9 v3    2.65  -0.383   3.23
4 Ad9131ee9 v4    0.605  0.883   4.65
5 A9c5988ea v1    2.03   2.90    2.08
6 A9c5988ea v2    0.709  2.27    4.13
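
The three steps above can also be collapsed into a single pivot_longer() call by using the special ".value" name together with names_sep. This is an alternative sketch, not the approach used in the slides, and it runs on a small hypothetical tibble (`wide`) that mimics the naming pattern of data1 so it does not need the download:

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for data1 with the same "time/stat" column names
wide <- tibble(
  ID1 = c("a", "b"),
  `v1/mean` = c(1, 2), `v1/sd` = c(0.1, 0.2),
  `v2/mean` = c(3, 4), `v2/sd` = c(0.3, 0.4)
)

long <- wide %>%
  pivot_longer(
    cols = -ID1,                     # pivot every column except the ID
    names_to = c("time", ".value"),  # first piece -> time, second piece -> column names
    names_sep = "/"                  # split the original names at the slash
  )

long
```

The result has one row per ID and time point, with mean and sd as separate columns, matching the layout produced by the pivot_longer(), separate(), pivot_wider() sequence.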

Graphics

ggplot2

Known as the Grammar of Graphics
ggplot2.tidyverse.org

Basics

ggplot2 creates a plot by layering graphical elements on top of a base plot
A base plot is created with the data
- The data must be a data frame or tibble
Additional layers are added to the base plot with the + sign

Using ggplot2

Create Base Plot
Add Geometric Elements
Customize Plot
Google

Base Plot

A base plot is created using ggplot()
- data: specifies the data frame used to construct the base plot
- mapping: specifies the aesthetic mapping for the plot
  - aes(): creates the aesthetic mapping

base_plot <- ggplot(mtcars, aes(x=mpg))

Base Plot

base_plot

Univariate

Histograms
- geom_histogram()
Density Plots
- geom_density()
QQ Plot
- geom_qq()
- geom_qq_line()

Histograms

base_plot + geom_histogram()
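
By default geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen. A minimal sketch of setting the number of bins explicitly follows; the value 10 is an arbitrary choice for illustration, not a recommendation from the presentation:

```r
library(ggplot2)

# Same base plot as defined earlier in the presentation
base_plot <- ggplot(mtcars, aes(x = mpg))

# bins sets the number of bars; binwidth could be used instead to set their width
hist_plot <- base_plot + geom_histogram(bins = 10)
hist_plot
```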

Density Plot

base_plot + geom_density()

QQ Plot

ggplot(mtcars, aes(sample = mpg)) + 
  geom_qq() + 
  geom_qq_line()

Bivariate

Scatter Plot
- geom_point()
Line Plot
- geom_line()

Bivariate Base Plot

base_plot2 <- ggplot(mtcars, aes(x=mpg, y = hp))
base_plot2

Scatter Plot

base_plot2 + geom_point()

Line Plot

base_plot2 + geom_line()

Line & Scatter Plot

base_plot2 + 
  geom_point() +
  geom_line()

Special Cases

Bivariate

Heat Map
- geom_bin2d()
Contour Map
- geom_density_2d()

Trivariate

Heat Map
- geom_contour_filled()
Contour Map
- geom_contour()

Heat Map

base_plot2 + geom_bin2d()

Contour Map

base_plot2 + 
  geom_density_2d()
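
The trivariate geoms listed earlier need a third variable mapped to z, with x and y forming a regular grid. The sketch below assumes the faithfuld data set that ships with ggplot2; it is not used elsewhere in this presentation, and the object names are illustrative:

```r
library(ggplot2)

# faithfuld: a grid of waiting and eruption times with an estimated density
tri_base <- ggplot(faithfuld, aes(x = waiting, y = eruptions, z = density))

# Contour lines of the third variable
contour_plot <- tri_base + geom_contour()

# Filled bands of the third variable (heat-map style)
filled_plot <- tri_base + geom_contour_filled()

contour_plot
filled_plot
```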

Trend Lines

Regression Line
- geom_smooth(method = "lm")
LOESS
- geom_smooth()

Regression Line

base_plot2 + 
  geom_point() +
  geom_smooth(method = "lm")

LOESS Line

base_plot2 + 
  geom_point() +
  geom_smooth()

Grouping Plots

Faceting: splits the plot into panels, one for each level of a categorical variable
- facet_grid()
- facet_wrap()
Grouping can be done within the mapping function: aes()
- color
- group
- shape

Mapping

ggplot(mtcars, aes(x = mpg, y = hp, col = factor(cyl))) +
  geom_point()
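
Faceting is the other grouping approach. As a minimal sketch, the same scatter plot can be split into one panel per number of cylinders; using cyl as the faceting variable is a choice made for this example:

```r
library(ggplot2)

facet_plot <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  # One panel for each value of cyl; facet_grid(cols = vars(cyl)) gives a similar layout
  facet_wrap(vars(cyl))

facet_plot
```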

Customization

Title
- ggtitle()
Labels
- X Label: xlab()
- Y Label: ylab()

Themes

The theme() function allows you to change any component in the plot
ggplot2 has several prebuilt themes:
theme_bw()
theme_void()
Legends can be adjusted using the scale_XX_YY() functions
XX: the aesthetic being scaled (for example color, fill, or shape)
YY: the type of scale (for example discrete, continuous, or manual)
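
A minimal sketch combining these pieces: a scale function to rename the legend, a prebuilt theme, and theme() to adjust a single component. The legend title and legend position are arbitrary choices for illustration:

```r
library(ggplot2)

themed_plot <- ggplot(mtcars, aes(x = mpg, y = hp, color = factor(cyl))) +
  geom_point() +
  # scale_XX_YY with XX = color and YY = discrete
  scale_color_discrete(name = "Cylinders") +
  # Apply the prebuilt theme first ...
  theme_bw() +
  # ... then change one component, otherwise theme_bw() would overwrite it
  theme(legend.position = "bottom")

themed_plot
```

The order matters: a complete theme such as theme_bw() resets every component, so individual theme() adjustments should come after it.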

Advanced Example

Base Plot
Scatter Plot
Add Regression Line
Split The Plot
Change the Labels
Adjust the Legend
Change the theme

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs)))

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm")

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic")))

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) +
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") +
  ylab("Horse Power")

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) + 
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") + 
  ylab("Horse Power") +
  scale_color_discrete(
    labels = c("V-Shaped", "Straight"),
    name = "")

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) + 
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") + 
  ylab("Horse Power") +
  scale_color_discrete(
    labels = c("V-Shaped", "Straight"),
    name = "") +
  theme_bw()

Final Thoughts

Google is your friend!
Practice!
Read the documentation!
Utilize Cheatsheets!
