dplyr Verbs

## Introduction

According to a survey by CrowdFlower, data scientists spend most of their time cleaning and manipulating data rather than mining or modeling them for insights. As such, it becomes important to have tools that make data manipulation faster and easier. In this tutorial, we introduce you to dplyr, a grammar of data manipulation.

## dplyr Verbs

dplyr provides a set of verbs that help us solve the most common data manipulation challenges while working with tabular data (dataframes, tibbles):

• select: returns subset of columns
• filter: returns a subset of rows
• arrange: re-order or arrange rows according to single/multiple variables
• mutate: create new columns from existing columns
• summarise: reduce data to a single summary

## Case Study

We will explore a dummy data set that we have created to resemble web logs of an online retail company. The data can be downloaded from here. We will use dplyr to answer the following questions:

• what is the average order value by device types?
• what is the average number of pages visited by purchasers and non-purchasers?
• what is the average time on site for purchasers vs non-purchasers?
• what is the average number of pages visited by purchasers and non-purchasers using mobile?

## Data

• click on Run Code to view the data
ecom

## Data Dictionary

Below is the description of the data set:

• id: row id
• referrer: referrer website/search engine
• os: operating system
• browser: browser
• device: device used to visit the website
• n_pages: number of pages visited
• duration: time spent on the website (in seconds)
• repeat: frequency of visits
• country: country of origin
• purchase: whether visitor purchased
• order_value: order value of visitor (in dollars)

## Average Order Value

What is the average order value by device types?

Average value of every order placed over a defined period of time. It is determined using sales per order and not sales per customer. Let us look at the steps to calculate AOV from the ecommerce data set.

## AOV Computation

• Step 1: Filter Purchasers
• Step 2: Select data related to AOV (order value, order items) and device
• Step 3: Group order value and order items by device
• Step 4: Compute total order value and order items for each device
• Step 5: Compute AOV for each device
• Step 6: Select device and AOV data

## AOV by Devices

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) %>%
group_by(device) %>%
summarise_all(funs(sum)) %>%
mutate(
aov = order_value / order_items
) %>%
select(device, aov)

## Filter - Intro

In order to compute the AOV, we must first separate the purchasers from non-purchasers. We will do this by filtering the data related to purchasers using the filter() function. It allows us to filter rows that meet a specific criteria/condition. The first argument is the name of the data frame and the rest of the arguments are expressions for filtering the data. Let us look at a few examples:

## Filter - Example 1

filter(ecom,  == )
filter(ecom, device == "mobile")

## Filter - Example 2

filter(ecom, device == "mobile",      ==     )
filter(ecom, device == "mobile", purchase == "true")

## Filter - Practice

### Instructions

filter(ecom, device == "mobile",       )
filter(ecom, device == "mobile", n_pages > 5)

## Filter - Case Study

### Instructions

filter(ecom,            )
filter(ecom, purchase == "true")

## Select - Intro

After filtering the data, we need to select relevent variables to compute the AOV. Remember, we do not need all the columns in the data to compute a required metric (in our case, AOV). The select() function allows us to select a subset of columns. The first argument is the name of the data frame and the subsequent arguments specify the columns by name or position. Let us look at a few examples:

## Select - Example 1

select(ecom,     ,    )
select(ecom, device, purchase)

## Select - Example 2

select(ecom,         )
select(ecom, device:purchase)

## Select - Example 3

select(ecom,    ,     )
select(ecom, -id, -country)

## Select - Case Study

For our case study, we need to select the columns order value and order items to calculate the AOV. We also need to select the device column as we are computing the AOV for different devices.

### Instructions

select(ecom, device, order_value, order_items)

## Select - Case Study

But we want the above data only for purchasers. We will combine filter() and select() functions to extract data related to purchasers.

### Instructions

ecom1 <- filter(ecom,       )
ecom2 <- select(ecom1,      )
ecom2
ecom1 <- filter(ecom, purchase == "true")
ecom2 <- select(ecom1, device, order_value, order_items)
ecom2

## Group Data - Intro

Since we want to compute the AOV for each device, we need to compute the total order value and total order items for each device. To achieve this, we will group the selected variables by device type. Using the group_by() function, we will group our case study data by device types. The first argument is the name of the data frame and the second argument is the name of the column based on which the data will be split. Let us look at a few examples:

## Group Data - Example 1

group_by(ecom,     )
group_by(ecom, referrer)

## Group Data - Case Study

### Instructions

In the second line in the previous output, you can observe Groups: referrer [5]. The data is split into 5 groups as the referrer variable has 5 distinct values. For our case study, we need to group the data by device type.

ecom3 <- group_by(ecom2,      )
ecom3
ecom3 <- group_by(ecom2, device)
ecom3

## Summarize - Intro

The next step is to compute the total order value and total order items for each device. We will use them to then compute the average order value. Now we need to reduce the order value and order items data to a single summary. We can achieve this using the summarise() function. The first argument is the name of a data frame and the subsequent arguments are functions that can generate a summary. For example, we can use min, max, sum, mean etc.

## Summarise - Practice 1

### Instructions

For our case study, we need the totals of order value and order items. What function can we use to obtain them? The sum() function will generate the sum of the values and hence we will use it inside the summarise() function. Remember, we need to provide a name to the summary being generated.

ecom4 <- summarise(ecom3, total_value = sum(order_value),
total_items = sum(order_items))
ecom4

There you go, we have the total order value and total order items for each device type. Another way to achieve the above result is to use the summarise_all() function. How does that work? It generates the specified summary for all the columns in the data set except for the column based on which the data has been grouped. So we need to ensure that the data frame does not have any irrelevant columns.

## Summarize: Case Study

### Instructions

In our case study, we have split the data based on the device type and we have 2 other columns which are order value and order items. If we use summarise_all() function, it will generate the summary for the two columns based on the function specified. To specify the functions, we need to use another argument funs and it can take any number of valid functions.

ecom4 <- summarise_all(ecom3, funs(sum))
ecom4

## Mutate - Intro

Now that we have the total order value and total order items for each device category, we can compute the AOV. We will create a new column to store the result. To create a new column, we will use the mutate() function. The first argument is the name of the data frame and the subsequent arguments are expressions for creating new columns based out of existing columns.

## Mutate - Case Study

### Instructions

ecom5 <- mutate(ecom4, aov =                          )
ecom5
ecom5 <- mutate(ecom4, aov = order_value / order_items)
ecom5

## AOV by Devices

Let us combine all the code from the previous steps:

ecom1 <- filter(ecom, purchase == "true")
ecom2 <- select(ecom1, device, order_value, order_items)
ecom3 <- group_by(ecom2, device)
ecom4 <- summarise_all(ecom3, funs(sum))
ecom5 <- mutate(ecom4, aov = order_value / order_items)
ecom6 <- select(ecom5, device, aov)
ecom6

## Using Pipe - Filter

ecom %>%
filter(purchase == 'true') 

## Using Pipe - Select

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) 

## Using Pipe - Group Data

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) %>%
group_by(device) 

## Using Pipe - Summarize

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) %>%
group_by(device) %>%
summarise_all(funs(sum)) 

## Using Pipe - Mutate

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) %>%
group_by(device) %>%
summarise_all(funs(sum)) %>%
mutate(
aov = order_value / order_items
) 

## Putting it all together..

If you observe, at each step we create a new variable(data frame) and then use it as an input in the next step i.e. the output from one function becomes the input for another function. Can we achieve the final outcome i.e. ecom6 without creating the intermediate data frames (ecom1 - ecom5)? Yes, we can. We will use the %>% operator to chain the above steps so that we can avoid creating the intermediate data frames. Let us see how to do that.

ecom %>%
filter(purchase == 'true') %>%
select(device, order_value, order_items) %>%
group_by(device) %>%
summarise_all(funs(sum)) %>%
mutate(
aov = order_value / order_items
) %>%
select(device, aov)

In the above code, we take the output from each step and use it as an input for the next step using the pipe %>% operator. It reduces the intermediate data frames and makes the code readable. Take the ecom data frame -> filter the purchasers -> select device, order_value and order_items variables -> group the resulting data frame by device type -> compute the sum of all the variables in the grouped data frames -> compute the average order value -> select device type and aov

It is the same as the steps we wrote at the beginning of this module. Now you realize how powerful dplyr is along with the %>% operator. There are other functions in dplyr but we will cover them in another module. You are encouraged to use the above approach to answer the questions we have listed below.

## Practice Questions

• what is the average number of pages visited by purchasers and non-purchasers?
• what is the average time on site for purchasers vs non-purchasers?
• what is the average number of pages visited by purchasers and non-purchasers using mobile?