dplyr Verbs

Introduction


According to a survey by CrowdFlower, data scientists spend most of their time cleaning and manipulating data rather than mining or modeling them for insights. As such, it becomes important to have tools that make data manipulation faster and easier. In this tutorial, we introduce you to dplyr, a grammar of data manipulation.

dplyr Verbs


dplyr provides a set of verbs that help us solve the most common data manipulation challenges while working with tabular data (dataframes, tibbles):

  • select: returns subset of columns
  • filter: returns a subset of rows
  • arrange: re-order or arrange rows according to single/multiple variables
  • mutate: create new columns from existing columns
  • summarise: reduce data to a single summary

Case Study


We will explore a dummy data set that we have created to resemble web logs of an online retail company. The data can be downloaded from here. We will use dplyr to answer the following questions:

  • what is the average order value by device types?
  • what is the average number of pages visited by purchasers and non-purchasers?
  • what is the average time on site for purchasers vs non-purchasers?
  • what is the average number of pages visited by purchasers and non-purchasers using mobile?

Data


  • click on Run Code to view the data
ecom

Data Dictionary


Below is the description of the data set:

  • id: row id
  • referrer: referrer website/search engine
  • os: operating system
  • browser: browser
  • device: device used to visit the website
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • repeat: frequency of visits
  • country: country of origin
  • purchase: whether visitor purchased
  • order_value: order value of visitor (in dollars)

Average Order Value


What is the average order value by device types?

Average value of every order placed over a defined period of time. It is determined using sales per order and not sales per customer. Let us look at the steps to calculate AOV from the ecommerce data set.

AOV Computation


  • Step 1: Filter Purchasers
  • Step 2: Select data related to AOV (order value, order items) and device
  • Step 3: Group order value and order items by device
  • Step 4: Compute total order value and order items for each device
  • Step 5: Compute AOV for each device
  • Step 6: Select device and AOV data

AOV by Devices


ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) %>%
  group_by(device) %>%
  summarise_all(funs(sum)) %>%
  mutate(
    aov = order_value / order_items
  ) %>%
  select(device, aov)

Filter - Intro


In order to compute the AOV, we must first separate the purchasers from non-purchasers. We will do this by filtering the data related to purchasers using the filter() function. It allows us to filter rows that meet a specific criteria/condition. The first argument is the name of the data frame and the rest of the arguments are expressions for filtering the data. Let us look at a few examples:

Filter - Example 1


filter(ecom,  == )
filter(ecom, device == "mobile")

Filter - Example 2


filter(ecom, device == "mobile",      ==     )
filter(ecom, device == "mobile", purchase == "true")

Filter - Practice

Instructions


filter(ecom, device == "mobile",       )
filter(ecom, device == "mobile", n_pages > 5)

Filter - Case Study


Instructions


filter(ecom,            )
filter(ecom, purchase == "true")

Select - Intro


After filtering the data, we need to select relevent variables to compute the AOV. Remember, we do not need all the columns in the data to compute a required metric (in our case, AOV). The select() function allows us to select a subset of columns. The first argument is the name of the data frame and the subsequent arguments specify the columns by name or position. Let us look at a few examples:

Select - Example 1


select(ecom,     ,    )
select(ecom, device, purchase)

Select - Example 2


select(ecom,         )
select(ecom, device:purchase)

Select - Example 3


select(ecom,    ,     )
select(ecom, -id, -country)

Select - Case Study


For our case study, we need to select the columns order value and order items to calculate the AOV. We also need to select the device column as we are computing the AOV for different devices.

Instructions


select(ecom, device, order_value, order_items)

Select - Case Study


But we want the above data only for purchasers. We will combine filter() and select() functions to extract data related to purchasers.

Instructions


ecom1 <- filter(ecom,       )
ecom2 <- select(ecom1,      )
ecom2
ecom1 <- filter(ecom, purchase == "true")
ecom2 <- select(ecom1, device, order_value, order_items)
ecom2

Group Data - Intro


Since we want to compute the AOV for each device, we need to compute the total order value and total order items for each device. To achieve this, we will group the selected variables by device type. Using the group_by() function, we will group our case study data by device types. The first argument is the name of the data frame and the second argument is the name of the column based on which the data will be split. Let us look at a few examples:

Group Data - Example 1


group_by(ecom,     )
group_by(ecom, referrer)

Group Data - Case Study

Instructions


In the second line in the previous output, you can observe Groups: referrer [5]. The data is split into 5 groups as the referrer variable has 5 distinct values. For our case study, we need to group the data by device type.

ecom3 <- group_by(ecom2,      )
ecom3
ecom3 <- group_by(ecom2, device)
ecom3

Summarize - Intro


The next step is to compute the total order value and total order items for each device. We will use them to then compute the average order value. Now we need to reduce the order value and order items data to a single summary. We can achieve this using the summarise() function. The first argument is the name of a data frame and the subsequent arguments are functions that can generate a summary. For example, we can use min, max, sum, mean etc.

Summarise - Practice 1


Instructions


For our case study, we need the totals of order value and order items. What function can we use to obtain them? The sum() function will generate the sum of the values and hence we will use it inside the summarise() function. Remember, we need to provide a name to the summary being generated.

ecom4 <- summarise(ecom3, total_value = sum(order_value),
          total_items = sum(order_items))
ecom4

There you go, we have the total order value and total order items for each device type. Another way to achieve the above result is to use the summarise_all() function. How does that work? It generates the specified summary for all the columns in the data set except for the column based on which the data has been grouped. So we need to ensure that the data frame does not have any irrelevant columns.

Summarize: Case Study


Instructions


In our case study, we have split the data based on the device type and we have 2 other columns which are order value and order items. If we use summarise_all() function, it will generate the summary for the two columns based on the function specified. To specify the functions, we need to use another argument funs and it can take any number of valid functions.

ecom4 <- summarise_all(ecom3, funs(sum))
ecom4

Mutate - Intro


Now that we have the total order value and total order items for each device category, we can compute the AOV. We will create a new column to store the result. To create a new column, we will use the mutate() function. The first argument is the name of the data frame and the subsequent arguments are expressions for creating new columns based out of existing columns.

Mutate - Example 1


Mutate - Case Study


Instructions


ecom5 <- mutate(ecom4, aov =                          )
ecom5
ecom5 <- mutate(ecom4, aov = order_value / order_items)
ecom5

AOV by Devices


Let us combine all the code from the previous steps:

ecom1 <- filter(ecom, purchase == "true")
ecom2 <- select(ecom1, device, order_value, order_items)
ecom3 <- group_by(ecom2, device)
ecom4 <- summarise_all(ecom3, funs(sum))
ecom5 <- mutate(ecom4, aov = order_value / order_items)
ecom6 <- select(ecom5, device, aov)
ecom6

Using Pipe - Filter


ecom %>%
  filter(purchase == 'true') 

Using Pipe - Select


ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) 

Using Pipe - Group Data


ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) %>%
  group_by(device) 

Using Pipe - Summarize


ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) %>%
  group_by(device) %>%
  summarise_all(funs(sum)) 

Using Pipe - Mutate


ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) %>%
  group_by(device) %>%
  summarise_all(funs(sum)) %>%
  mutate(
    aov = order_value / order_items
  ) 

Putting it all together..


If you observe, at each step we create a new variable(data frame) and then use it as an input in the next step i.e. the output from one function becomes the input for another function. Can we achieve the final outcome i.e. ecom6 without creating the intermediate data frames (ecom1 - ecom5)? Yes, we can. We will use the %>% operator to chain the above steps so that we can avoid creating the intermediate data frames. Let us see how to do that.

ecom %>%
  filter(purchase == 'true') %>%
  select(device, order_value, order_items) %>%
  group_by(device) %>%
  summarise_all(funs(sum)) %>%
  mutate(
    aov = order_value / order_items
  ) %>%
  select(device, aov)

In the above code, we take the output from each step and use it as an input for the next step using the pipe %>% operator. It reduces the intermediate data frames and makes the code readable. Take the ecom data frame -> filter the purchasers -> select device, order_value and order_items variables -> group the resulting data frame by device type -> compute the sum of all the variables in the grouped data frames -> compute the average order value -> select device type and aov

It is the same as the steps we wrote at the beginning of this module. Now you realize how powerful dplyr is along with the %>% operator. There are other functions in dplyr but we will cover them in another module. You are encouraged to use the above approach to answer the questions we have listed below.

Practice Questions


  • what is the average number of pages visited by purchasers and non-purchasers?
  • what is the average time on site for purchasers vs non-purchasers?
  • what is the average number of pages visited by purchasers and non-purchasers using mobile?