dplyr Joins

Introduction

In a previous module, we learnt about dplyr verbs and used them to compute average order value for an e-commerce website data. In this module, we will learn to combine tables using different *_join functions provided in dplyr. Let us assume we have 2 tables x and y.

The following types of join functions are provided in dplyr:

• inner_join()
• left_join()
• right_join()
• semi_join()
• anti_join()
• full_join()

Case Study

Introduction We have two data sets with details of customers and their orders. We will use the above functions to combine the data sets in different ways to glean information about customer orders.

Data

Orders

order

Customers

customer

Example Data Inner Join inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

inner_join(customer, order, by = "id")

Left Join

left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. left_join(customer, order, by = "id")

Right Join

right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. right_join(customer, order, by = "id")

Semi Join

semi_join() returns all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. semi_join(customer, order, by = "id")

Anti Join

anti_join() returns all rows from x where there are not matching values in y, keeping just columns from x. anti_join(customer, order, by = "id")

Full Join

full_join() returns all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. full_join(customer, order, by = "id")

References

• https://dplyr.tidyverse.org/