dplyr Part 3

Introduction

In this module, we will explore a set of helper functions in order to:

  • extract unique rows
  • rename columns
  • sample data
  • extract columns
  • slice rows
  • arrange rows
  • compare tables
  • extract/mutate data using predicate functions
  • count observations for different levels of a variable

Case Study


Let us look at a case study (e-commerce data) and see how we can use dplyr helper functions to answer questions we have about and to modify/transform the underlying data set.

Data


ecom

Data Dictionary


  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether a visit bounced (exited from landing page)
  • duration: time spent on the website (in seconds)
  • purchase: whether visitor purchased
  • n_visit: number of visits
  • n_pages: number of pages visited/browsed

Data Sanitization


Traffic Sources


Let us ensure that the data is sanitized by checking the sources of traffic and devices used to visit the site. We will use distinct() to examine the values in the referrer column.

Syntax


distinct(ecom, referrer)

Instructions


distinct(ecom, device)

Rename Columns


Rename Columns


Columns can be renamed using rename().

Syntax


rename(data, new_name = current_name)

Instructions


rename(ecom, time_on_site = duration)

Sampling Data


Sampling Data


dplyr offers sampling functions which allow us to specify either the number or percentage of observations. sample_n() allows sampling a specific number of observations.

Syntax


sample_n(ecom, sample_size)

Instructions


sample_n(ecom, 700)

Sampling Data


sample_frac() allows a specific percentage of observations.

Syntax


sample_frac(ecom, sample_percentage)

Instructions


sample_frac(ecom, size = 0.7)

Extract Columns


Extract Columns


dplyr verbs always return a tibble. What if you want to extract a specific column or a bunch of rows but not as a tibble? Use pull to extract columns either by name or position. It will return a vector.

Syntax


pull(data, column_name)

Instructions


pull(ecom_mini, device)

Extract Columns


Let us extract the first column from ecom using column position instead of name.

Syntax


pull(data, column_position)

Instructions


pull(ecom_mini, 1) 

Extract Columns


You can use - before the column position to indicate the position in reverse.

Syntax


pull(data, -column_position)

Instructions


pull(ecom_mini, -1)

Extract Rows


Extract Rows


Let us now look at extracting rows using slice().

Syntax


slice(data, row_numbers)

Instructions


slice(ecom, 1:20)

Instructions


slice(ecom, n())

Tabulate Data


Tabulate Data


Let us now look at the proportion or share of visits driven by different sources of traffic.

Syntax


data %>%
  group_by(column_name) %>%
  tally()

Instructions


ecom %>%
  group_by(referrer) %>%
  tally()

Tabulate Data


Let us look at how many conversions happen across different devices.

Instructions


ecom %>%
  group_by(referrer, purchase) %>%
  tally()

Tabulate Data


Instructions


ecom %>%
  group_by(referrer, purchase) %>%
  tally() %>%
  filter(purchase)

Count


Another way to extract the above information is by using count

Syntax


count(data, column_name)

Instructions


count(ecom, referrer, purchase)

Top Referrers


Between


between() allows us to test if the values in a column lie between two specific values.

Syntax


between(data, lower_value, upper_value)

Instructions


  • check how many visits browsed pages between 5 and 15
ecom_sample %>%
  pull(n_pages) %>%
  between(5, 15) 

Case When


case_when() is an alternative to if else. It allows us to lay down the conditions clearly and makes the code more readable. It is used with mutate() .

Syntax


mutate(data,
  new_column = case_when(
    condition ~ value,
    condition ~ value,
    TRUE ~ value
  )
)

Instructions


  • create a new column repeat_visit from n_visit (the number of previous visits).
ecom %>%
  mutate(
    repeat_visit = case_when(
      n_visit > 0 ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  select(n_visit, repeat_visit)