dplyr Joins

Introduction


In a previous module, we learnt about dplyr verbs and used them to compute average order value for an e-commerce website data. In this module, we will learn to combine tables using different *_join functions provided in dplyr. Let us assume we have 2 tables x and y.

The following types of join functions are provided in dplyr:

  • inner_join()
  • left_join()
  • right_join()
  • semi_join()
  • anti_join()
  • full_join()

Case Study

Introduction


We have two data sets with details of customers and their orders. We will use the above functions to combine the data sets in different ways to glean information about customer orders.

Data

Orders


order

Customers


customer

Example Data


Inner Join


inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

inner_join(customer, order)

Left Join


left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(customer, order)

Right Join


right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(customer, order)

Semi Join


semi_join() returns all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

semi_join(customer, order)

Anti Join


anti_join() returns all rows from x where there are not matching values in y, keeping just columns from x.

anti_join(customer, order)

Full Join


full_join() returns all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(customer, order)