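In this module, we will explore a set of dplyr helper functions for renaming columns, sampling observations, extracting columns and rows, counting observations, and testing conditions.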
Let us look at a case study (e-commerce data) and see how we can use dplyr helper functions to answer questions about the data and to modify or transform the underlying data set.
Let us ensure that the data is sanitized by checking the sources of traffic and the devices used to visit the site. We will use
distinct() to examine the unique values in the columns that record them, such as referrer.
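As a minimal sketch of this check, the snippet below builds a small stand-in for the ecom data and lists the unique values of two columns. The toy rows are invented for illustration, and the device column name is an assumption; only referrer is confirmed by the examples in this module.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data (values invented for illustration).
ecom <- tibble(
  referrer = c("google", "yahoo", "direct", "google", "bing", "direct"),
  device   = c("laptop", "tablet", "laptop", "mobile", "tablet", "laptop")  # assumed column name
)

# distinct() keeps one row per unique value, in order of first appearance.
traffic_sources <- distinct(ecom, referrer)
devices         <- distinct(ecom, device)

traffic_sources
devices
```

Unexpected entries here (misspellings, mixed case, stray blanks) are a sign that the column needs cleaning before any counting is done.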
Columns can be renamed using rename(). The new name goes on the left of the = sign and the current name on the right:
rename(data, new_name = current_name)
rename(ecom, time_on_site = duration)
dplyr offers sampling functions which allow us to specify either the number or percentage of observations.
sample_n() allows sampling a specific number of observations.
sample_frac() allows sampling a specific fraction of observations; for example, size = 0.7 draws 70% of the rows.
sample_frac(ecom, size = 0.7)
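The two sampling functions can be compared side by side. This is a sketch on a hypothetical ten-row table (not the real ecom data); set.seed() is only there to make the draw repeatable.

```r
library(dplyr)

# Hypothetical ten-row table standing in for ecom.
ecom <- tibble(
  id       = 1:10,
  duration = c(693, 459, 996, 468, 955, 135, 75, 908, 209, 208)
)

set.seed(42)  # make the random draws reproducible

by_number   <- sample_n(ecom, size = 3)       # exactly 3 rows
by_fraction <- sample_frac(ecom, size = 0.7)  # 70% of 10 rows, i.e. 7 rows

nrow(by_number)
nrow(by_fraction)
```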
Most dplyr verbs return a tibble when given one. What if you want to extract a specific column or a bunch of rows, but not as a tibble? Use
pull() to extract a single column either by name or by position. It returns a vector.
Let us extract the first column from
ecom using column position instead of name.
You can use
- before the column position to count from the right instead; for example, -1 refers to the last column.
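The three ways of addressing a column with pull() can be sketched as follows. The table is a hypothetical stand-in for ecom with invented values; only the column names referrer, n_pages and purchase come from this module.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data.
ecom <- tibble(
  referrer = c("google", "yahoo", "direct", "google"),
  n_pages  = c(1, 1, 18, 7),
  purchase = c(FALSE, FALSE, TRUE, FALSE)
)

by_name     <- pull(ecom, referrer)  # column by name
by_position <- pull(ecom, 1)         # first column by position
last_column <- pull(ecom, -1)        # -1 counts from the right: last column

is.vector(by_name)  # a plain vector, not a tibble
```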
Let us now look at extracting rows using slice(), which selects rows by position.
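A minimal sketch of row extraction by position, assuming the function meant here is slice(). The table is a hypothetical stand-in for ecom with invented values.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data.
ecom <- tibble(
  id       = 1:6,
  referrer = c("google", "yahoo", "direct", "google", "bing", "direct")
)

first_three <- slice(ecom, 1:3)  # rows 1 to 3
last_row    <- slice(ecom, n())  # n() is the number of rows, so this is the last row

first_three
last_row
```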
Let us now look at the proportion or share of visits driven by different sources of traffic.
data %>% group_by(column_name) %>% tally()
ecom %>% group_by(referrer) %>% tally()
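tally() returns counts in a column named n. To turn those counts into a share of visits, one option is to divide by the total, as in the sketch below. The share column name and the toy data are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data.
ecom <- tibble(
  referrer = c("google", "yahoo", "direct", "google", "bing", "direct")
)

visits <- ecom %>%
  group_by(referrer) %>%
  tally() %>%                   # one row per referrer, count in column n
  mutate(share = n / sum(n))    # share of all visits (hypothetical column name)

visits
```

After tally() on a single grouping column the result is no longer grouped, so sum(n) is the total across all referrers.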
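Let us look at how many conversions happen across different sources of traffic. We first count visits by referrer and purchase, then keep only the rows where a purchase was made.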
ecom %>% group_by(referrer, purchase) %>% tally()
ecom %>% group_by(referrer, purchase) %>% tally() %>% filter(purchase)
Another way to extract the above information is by using count(), which combines group_by() and tally() in a single call.
count(ecom, referrer, purchase)
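To see that the two approaches agree, the sketch below runs both on a hypothetical stand-in for ecom (values invented for illustration) and compares the counts.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data.
ecom <- tibble(
  referrer = c("google", "yahoo", "direct", "google", "bing", "direct"),
  purchase = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

by_tally <- ecom %>%
  group_by(referrer, purchase) %>%
  tally() %>%
  ungroup()  # tally() leaves the result grouped by referrer; drop that grouping

by_count <- count(ecom, referrer, purchase)

# Conversions only: keep the rows where purchase is TRUE.
conversions <- filter(by_count, purchase)

identical(by_tally$n, by_count$n)
```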
between() allows us to test whether the values in a numeric vector lie between two specific values. Both bounds are inclusive.
between(values, lower_value, upper_value)
ecom_sample %>% pull(n_pages) %>% between(5, 15)
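The sketch below shows what between() returns and how it can be used inside filter(). The data is a hypothetical stand-in for ecom_sample with invented values.

```r
library(dplyr)

# Hypothetical stand-in for ecom_sample.
ecom_sample <- tibble(
  id      = 1:6,
  n_pages = c(1, 5, 18, 7, 15, 12)
)

# One logical value per element; the bounds 5 and 15 are included.
in_range <- between(ecom_sample$n_pages, 5, 15)

# Equivalent to: n_pages >= 5 & n_pages <= 15
browsers <- filter(ecom_sample, between(n_pages, 5, 15))

in_range
browsers
```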
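case_when() is an alternative to nested
if else statements. It lays out each condition next to the value it produces, which makes the code more readable. Conditions are checked in order, and a final TRUE acts as the catch-all. It is typically used inside mutate():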
mutate(data, new_column = case_when( condition ~ value, condition ~ value, TRUE ~ value ) )
Let us add a new column, repeat_visit, which is TRUE when n_visit (the number of previous visits) is greater than zero.
ecom %>% mutate( repeat_visit = case_when( n_visit > 0 ~ TRUE, TRUE ~ FALSE ) ) %>% select(n_visit, repeat_visit)
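case_when() becomes more useful with more than two outcomes. The sketch below is a hypothetical extension of the example above: the visitor_type column, its labels and the cut-off of 5 visits are all invented for illustration, as is the toy data.

```r
library(dplyr)

# Hypothetical stand-in for the ecom data.
ecom <- tibble(
  n_visit = c(10, 9, 0, 3, 0, 5)
)

visitors <- ecom %>%
  mutate(
    visitor_type = case_when(   # hypothetical column and labels
      n_visit == 0 ~ "new",        # checked first
      n_visit <= 5 ~ "returning",  # reached only if the first condition failed
      TRUE         ~ "frequent"    # catch-all for everything else
    )
  )

visitors
```

Because the conditions are evaluated in order, the second one does not need to repeat n_visit > 0.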