Import Data in R

Agenda

In this module, we will learn to read data from:

  • xls
    • list sheets in an excel file
    • read data from an excel sheet
    • read specific cells
    • read specific rows
    • read specific columns
  • other statistical softwares
    • SAS
    • SPSS
    • STATA

List Sheets


An excel workbook may contain more than 1 sheet. We need to know the name of the sheets in order to read the data. excel_sheets() will return the names of all the sheet in the workbook. For example, the below code will return the names of all the sheets in the file xyx.xls.

excel_sheets('xyz.xls')


Instructions


  • list the sheets in the workbook sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
# list the sheets in sample.xls
excel_sheets('//home//hebbali_aravind//datasets//sample.xls')

Read Sheet


Great! Now that we know how many sheets are present in the sample.xls file and their respective names, it is time to read in some data. read_excel() reads data from .xls and .xlsx files. You may ask how does it know which sheet to read the data from. That’s rigt! In addition to the file name, we have to specify the sheet name or position in the workbook. Below example shows the different ways to specify the sheets in a workbook.

# specify sheet name
read_excel('xyz.xls', sheet = 'first')

# specify sheet position
read_excel('xyz.xls', sheet = 1)


Instructions

  • read the first sheet in the sample.xls file
  • the path of the directory is //home//hebbali_aravind//datasets//

# read the first sheet in sample.xls
read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1)

Read Specific Cells - Part 1


Alright! We now know how to read data from an excel sheet but we may not want to read all the data present in the sheet i.e. sometimes we might want to read a subset of the data (few rows of specific colums). In order to read a subset of data, we need to specify the associated rows and columns. For example, to read data from first 4 rows of columns B and C, we will specify "B1:C4" using the range argument. The input for the range must be enclosed in single/double quotes.



read_excel('xyz.xls', sheet = 1, range = "B1:C4")


Instructions


  • read data from the file sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read the first 5 rows of columns A, B and C
read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, range = "A1:C5")

Read Specific Cells - Part 2


There is another way to specify the subset of data that we might want to read. This approach includes specifying the following:

  • specify a cell to start with (say A3)
  • number of rows from to read including row 3
  • number of columns to read including column A

The range argument is used to specify the cell but in this approach instead of directly specifying the cells, we will use anchored() to specify the initial cell and the number of rows and columns to read. It takes two arguments:

  • anchor: the initial cell (say “A3”)
  • dim: the number of rows and columns as a vector (c(3, 2) for 3 rows and 2 columns)


read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, col_names = FALSE,
  range = anchored("A4", dim = c(3, 2)))

Read Specific Cells - Part 3


Instructions


  • read data from the file sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read data from 4th to 7th row of columns A & B
read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, col_names = FALSE,
  range = anchored("A4", dim = c(3, 2)))

Read Specific Cells - Part 4


There is another way to specify the subset of data that we might want to read. This approach includes specifying the following:

  • specify a cell to start with (say A3)
  • number of rows from to read including row 3
  • number of columns to read including column A

The range argument is used to specify the cell but in this approach instead of directly specifying the cells, we will use cell_limits() to specify the initial cell and the number of rows and columns to read. It takes two arguments:



read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1,
  range = cell_limits(c(1, 1), c(6, 4)))

Read Specific Cells - Part 5


Instructions


  • read data from the file sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read data from rows 1 to 4 and columns 2 to 4
read_excel('sample.xls', sheet = 1,
  range = cell_limits(c(1, 2), c(4, 4)))

Read Specific Rows


So far we have explored reading a subset of data i.e. certain rows and columns. In this section, we will read specific rows of all the columns form the excel sheet. We will continue to use the range argument and use cell_rows() to specify the rows from which to read the data. In the below example, we read data from the first 4 rows.

read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, range = cell_rows(1:4))


Instructions


  • read data from the file sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read data from rows 3 to 5
read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, range = cell_rows(3:5))

Read Specific Columns


Let us look at another scenario where we want to read data all the rows but only specific columns. Use cell_cols() to specify the columns from which the data must be read. In the below example, we use cell_cols() to read data from columns 2 and 3.

read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, range = cell_cols(2:3))


Instructions


  • read data from the file sample.xls
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read data from columns 1 to 4
read_excel('//home//hebbali_aravind//datasets//sample.xls', sheet = 1, range = cell_cols(1:4))

Other Softwares


Sometimes, we might have to read data from files of other statistical packages such as:

  • SAS (.sas7bdat)
  • SPSS (.sav)
  • Stata (.dta)

In R, the following packages allow us to read data from the above file formats:

We will use the haven package in this tutorial.

Stata


Stata files have the extension .dta and can be read using either read_dta() or read_stata().


Instructions


  • read data from airline.dta file
  • the path of the directory is //home//hebbali_aravind//datasets//
# read airline.dta file
read_stata('//home//hebbali_aravind//datasets//airline.dta')  

SPSS


SPSS files have the extension .sav and can be read using either read_sav() or read_spss().


Instructions


  • read data from employee.sav file
  • the path of the directory is //home//hebbali_aravind//datasets//
# read employee.sav file
read_spss('//home//hebbali_aravind//datasets//employee.sav')  

SAS


SAS files have the extension .sas7bdat and can be read using either read_sas().


Instructions


  • read data from airline.sas7bdat file
  • the path of the directory is //home//hebbali_aravind//datasets//
# read airline.sas7bdat file
read_sas('//home//hebbali_aravind//datasets//airline.sas7bdat')