Import Data in R

Agenda


  • read data from flat/delimited files
  • handle column names/file header
  • skip text/info in files
  • specify column/variable data types
  • read subset of columns/variables

File Types



Comma Separated Values




Semicolon Separated Values




Space Separated Values




Tab Separated Values



Read Data


Let us read data from a CSV (comma separated values) file and explore the options listed in the previous section. To read a file named xyz.csv, use read_csv() from the readr package and specify the path to the file, i.e. the folder/directory where the file resides followed by the file name.

read_csv('folder_path/xyz.csv')
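
As a minimal sketch of the call above, the snippet below first writes a tiny CSV to a temporary folder and then reads it back. The file contents and the column names id and score are made up for illustration; only read_csv() comes from the text.

```r
library(readr)

# Hypothetical sample file, created only for this illustration
path <- file.path(tempdir(), "xyz.csv")
writeLines(c("id,score", "1,90", "2,85", "3,77"), path)

# The first line of the file is treated as the header by default
dat <- read_csv(path)
dat
```

The result is a tibble, so printing it shows the column names along with the detected data types.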


Instructions


  • read the file hsb2.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
# read data
read_csv('//home//hebbali_aravind//datasets//hsb2.csv')

Sometimes you may get an error while reading data from a file. In such cases, try the following:

  • check the separator in the file and ensure it is a comma
  • check the path to the directory where the file resides
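
The two checks above can be sketched with base R helpers. This is only an illustration: the file check_me.csv is a hypothetical example written to a temporary folder, and it deliberately uses a semicolon so that the separator check has something to find.

```r
# Hypothetical file that uses ';' instead of ',' as the separator
path <- file.path(tempdir(), "check_me.csv")
writeLines(c("id;score", "1;90", "2;85"), path)

# Check 1: does the path point to an existing file?
file.exists(path)

# Check 2: look at the first lines to see which separator is used
first_lines <- readLines(path, n = 2)
first_lines

# FALSE here, because the header contains no comma
uses_comma <- grepl(",", first_lines[1], fixed = TRUE)
uses_comma
```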

Column Names


Column names are present




Column names are absent




Use col_names to indicate whether the data includes column names. It accepts TRUE or FALSE (and, as shown later, a character vector of names). If set to FALSE, readr generates column names (X1, X2, and so on). The code below reads data from a file that has no column names in the first row.

read_csv('directory_path/file_name.csv', col_names = FALSE)
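
A small sketch of this behaviour, assuming a hypothetical header-less file written to a temporary folder:

```r
library(readr)

# Hypothetical file without a header row
path <- file.path(tempdir(), "no_header.csv")
writeLines(c("1,90", "2,85", "3,77"), path)

# With col_names = FALSE the first line is read as data,
# and readr generates the column names itself
dat <- read_csv(path, col_names = FALSE)
names(dat)
```

If col_names were left at its default of TRUE, the first data row would be consumed as the header and only two rows would remain.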


Instructions


  • read data from hsb3.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
  • open the file to check whether it has column names
  • if column names are absent, indicate it while reading data
# read data
read_csv('//home//hebbali_aravind//datasets//hsb3.csv', col_names = FALSE)


We learnt how to indicate whether a file includes column names. What if we want to specify names for the columns while reading data from the file? col_names can be used for this as well: store the names in a character vector and supply it to col_names. The example below shows how to do it.

column_names <- c('col_1', 'col_2', 'col_3')
read_csv('directory_path/file_name.csv', col_names = column_names)
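
The same idea as a self-contained sketch. The file and the names id and score are hypothetical and exist only for this illustration.

```r
library(readr)

# Hypothetical file without a header row
path <- file.path(tempdir(), "custom_names.csv")
writeLines(c("1,90", "2,85"), path)

# Supply one name per column, in the order the columns appear
column_names <- c("id", "score")
dat <- read_csv(path, col_names = column_names)
names(dat)
```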


Instructions


  • read data from hsb3.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
  • use cnames provided below to specify the column names
# column names
cnames <- c("id", "female", "race", "ses", "schtyp", "prog", "read", "write", "math", "science", "socst")

# read data
read_csv('//home//hebbali_aravind//datasets//hsb3.csv', col_names = cnames)

Skip Lines





If the file has contents other than data in the first few lines, we need to skip them before reading the data. Use skip to specify the number of lines to skip. The example below skips the first 5 lines of a file.

read_csv('directory_path/file_name.csv', skip = 5)
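
A minimal sketch of skip, assuming a hypothetical file whose first two lines are notes rather than data:

```r
library(readr)

# Hypothetical file: two lines of notes, then the header, then the data
path <- file.path(tempdir(), "with_notes.csv")
writeLines(c("Source: sample survey",
             "Exported for illustration",
             "id,score",
             "1,90",
             "2,85"), path)

# Skip the two note lines so that reading starts at the header row
dat <- read_csv(path, skip = 2)
dat
```

The value given to skip should equal the number of lines before the row that holds the column names.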


Instructions


  • read data from hsb4.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • open the file to check whether it has any text/info before the data
  • if any text/info is present before data, count the number of rows covered by the text
  • use skip argument to skip the text/info and read from the row where the column names are present
# read data after skipping the required number of rows
read_csv('//home//hebbali_aravind//datasets//hsb4.csv', skip = 3)

Maximum Lines


Suppose we want to read only 100 rows of data from a file. We can use n_max to specify the maximum number of rows to read and set it equal to 100.

read_csv('directory_path/file_name.csv', n_max = 100)
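
A short sketch of n_max on a hypothetical ten-row file; the file and its values are generated only for this illustration.

```r
library(readr)

# Hypothetical file with a header and ten data rows
path <- file.path(tempdir(), "ten_rows.csv")
writeLines(c("id,score", paste(1:10, 10 * (1:10), sep = ",")), path)

# Read only the first four data rows (the header is not counted)
dat <- read_csv(path, n_max = 4)
nrow(dat)
```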


Instructions


  • read the first 120 rows from the hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • observe the last row in the output and check if it says # ... with 110 more rows
# read the first 120 rows of data
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', n_max = 120)

Column Specification


If you looked at the output when we read data in the previous section, you will have noticed that it includes the data type detected for each column/variable in the file; readr displays this whenever it reads data. To check the data types before reading the data, use spec_csv(). We will learn to specify the column types in the next section.
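
As a rough sketch, spec_csv() can be tried on a small hypothetical file to see what readr would guess for each column without actually loading the data:

```r
library(readr)

# Hypothetical file with one numeric and one text column
path <- file.path(tempdir(), "spec_demo.csv")
writeLines(c("id,name", "1,asha", "2,ravi"), path)

# Returns the guessed column specification instead of the data
sp <- spec_csv(path)
sp
```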


Instructions


  • run the below code and observe the output
spec_csv('//home//hebbali_aravind//datasets//hsb2.csv')

Column Types


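In certain cases, we need to specify the data type of the columns, for example for dates or categorical variables. readr allows us to specify the data types using col_xxx functions, which include:

  • col_logical(): TRUE/FALSE values
  • col_integer(): integers
  • col_double(): real numbers
  • col_character(): text
  • col_factor(): categorical variables
  • col_date(), col_datetime(), col_time(): dates and times
  • col_number(): numbers embedded in other characters
  • col_skip(): do not read the column
  • col_guess(): let readr detect the type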



To specify the data types, we use the col_types argument and supply to it a list indicating the data type (using col_xxx) of each column in the data set.


In the example below, we read data from the mtcars5.csv file while specifying the data types. Keep in mind that we need to specify the data type for each column.

# factor levels are supplied as a character vector, as documented for col_factor()
read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = list(col_double(), col_factor(levels = c("4", "6", "8")),
                          col_double(), col_integer()))
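
To make the effect of col_types visible, here is a small sketch on a hypothetical three-column file. It uses a named list, which is one way of tying each type to its column; the file, column names and levels are assumptions made for this illustration.

```r
library(readr)

# Hypothetical file: integer id, categorical grp, numeric score
path <- file.path(tempdir(), "types_demo.csv")
writeLines(c("id,grp,score", "1,a,90.5", "2,b,85", "3,a,77.25"), path)

dat <- read_csv(path, col_types = list(
  id    = col_integer(),
  grp   = col_factor(levels = c("a", "b")),  # levels as character strings
  score = col_double()
))

# Inspect the resulting column classes
sapply(dat, class)
```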

Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • specify the column types for each column in the file
# read data while specifying column types
# (one entry per column, in file order; factor levels given as character strings)
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = list(
  col_integer(), col_factor(levels = c("0", "1")),
  col_factor(levels = c("1", "2", "3", "4")), col_factor(levels = c("1", "2", "3")),
  col_factor(levels = c("1", "2")), col_factor(levels = c("1", "2", "3")),
  col_integer(), col_integer(), col_integer(), col_integer(),
  col_integer()
))

Specific Columns


We may not always want to read all the columns from a file. In such cases, we can use cols_only() to name the columns to be read along with their data types, and supply the result to the col_types argument. Columns that are not named are not read.

# factor levels are supplied as a character vector
read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = cols_only(mpg = col_double(), 
                               cyl = col_factor(levels = c("4", "6", "8"))))
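
A self-contained sketch of cols_only() on a hypothetical three-column file, keeping two of the columns:

```r
library(readr)

# Hypothetical file with three columns
path <- file.path(tempdir(), "only_demo.csv")
writeLines(c("id,grp,score", "1,a,90", "2,b,85"), path)

# Only the columns named in cols_only() are read; grp is dropped
dat <- read_csv(path, col_types = cols_only(
  id    = col_integer(),
  score = col_double()
))
names(dat)
```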


Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read the following columns only
    • id
    • prog
    • read
# read columns id, prog and read from hsb2.csv
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = cols_only(
  id   = col_integer(),
  prog = col_factor(levels = c("1", "2", "3")),
  read = col_integer()
))

Skip Columns


Sometimes we may want to skip some columns while reading data, e.g. suppose we have 10 columns and want to read only 8 of them. In such cases, we can use col_skip() to mark the columns that must be skipped while reading the data.

# the third column is skipped; factor levels are supplied as a character vector
read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = list(col_double(), col_factor(levels = c("4", "6", "8")),
                          col_skip(), col_integer()))
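
When only a few columns need to be dropped, one option is to name just those columns inside cols() and let readr guess the rest. The sketch below assumes a hypothetical three-column file and skips the grp column.

```r
library(readr)

# Hypothetical file with three columns
path <- file.path(tempdir(), "skip_demo.csv")
writeLines(c("id,grp,score", "1,a,90", "2,b,85"), path)

# grp is skipped; the types of the remaining columns are guessed
dat <- read_csv(path, col_types = cols(grp = col_skip()))
names(dat)
```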


Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • skip the following columns
    • id
    • prog
    • read
# skip columns id, prog and read while reading hsb2.csv
# (the remaining columns are read with guessed types)
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = cols(
  id   = col_skip(),
  prog = col_skip(),
  read = col_skip()
))

Summary




The above table gives an overview of the functions for reading different types of files in readr and Base R. All the functions in readr offer a common set of options which are described below:

  • col_names: whether data includes column names
  • n_max: maximum number of lines/rows to read
  • col_types: data type of the columns
  • skip: number of lines/rows to skip
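
As a rough sketch of how the readr functions line up with the separators listed under File Types, the snippet below writes three tiny hypothetical files and reads each with a function suited to its separator. The file names and contents are made up for this illustration.

```r
library(readr)

dir   <- tempdir()
semi  <- file.path(dir, "demo_semicolon.csv")
tab   <- file.path(dir, "demo_tab.tsv")
space <- file.path(dir, "demo_space.txt")

# Hypothetical files, one per separator
writeLines(c("id;score", "1;90", "2;85"), semi)
writeLines(c("id\tscore", "1\t90", "2\t85"), tab)
writeLines(c("id score", "1 90", "2 85"), space)

d_semi  <- read_csv2(semi)                  # semicolon separated values
d_tab   <- read_tsv(tab)                    # tab separated values
d_space <- read_delim(space, delim = " ")   # any single-character delimiter
```

The options listed above (col_names, n_max, col_types, skip) can be passed to each of these functions in the same way as to read_csv().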

Practice


  • check the separator type in the following files and read them using appropriate read_xxx() function:

    • hsb.csv
    • mtcars.tsv
    • hsb1.csv
    • hsb.txt