Import Data in R

Introduction


Importing or reading data from different sources and file formats is one of the basic and crucial stages of any data analysis project. In this tutorial, you will learn to read data into R, examine the different challenges and address them using the appropriate methods.

What will I learn?


  • read data from flat/delimited files
  • handle column names/file header
  • skip text/info in files
  • specify column/variable data types
  • read subset of columns/variables

File Types


Before we start reading data from different files, let us take a quick look at the different types of delimiters we have to deal with while reading or importing data.


Comma Separated Values




Semicolon Separated Values




Space Separated Values




Tab Separated Values
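As a rough illustration of the four delimiters listed in this section, the sketch below writes the same small table to temporary files, once per delimiter, and reads each one back with the matching readr function. The sample values and the write_sample() helper are made up for this illustration and are not part of the tutorial's data sets.

```r
library(readr)

# Made-up three-column sample; each element is one row of the file.
rows <- list(c("id", "read", "write"), c("1", "57", "52"), c("2", "68", "59"))

# Hypothetical helper: write the rows to a temporary file using `sep`.
write_sample <- function(sep) {
  path <- tempfile()
  writeLines(vapply(rows, paste, character(1), collapse = sep), path)
  path
}

comma_data     <- read_csv(write_sample(","))                 # comma
semicolon_data <- read_csv2(write_sample(";"))                # semicolon
space_data     <- read_delim(write_sample(" "), delim = " ")  # space
tab_data       <- read_tsv(write_sample("\t"))                # tab
```

All four objects should hold the same two rows and three columns; only the function used to read them differs.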



Read Data


Let us read data from a CSV (comma separated values) file and explore the options listed in the previous section. To read a file named xyz.csv, load the readr package with library(readr) and then call read_csv() with the path to the file, i.e. the folder/directory where the file resides followed by the file name.

read_csv('folder_path/xyz.csv')
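A self-contained sketch of the same call is shown below. Since xyz.csv is only a placeholder name, the example first writes a small made-up file to a temporary location; the file contents and the object name sample_data are assumptions for illustration only.

```r
library(readr)

# Create a small CSV file in a temporary location (made-up data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,read,write", "1,57,52", "2,68,59"), csv_path)

# Read the file and store the result so that it can be reused later.
sample_data <- read_csv(csv_path)
sample_data
```

Assigning the result to an object is optional, but it avoids reading the file again every time the data is needed.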


Instructions


  • read the file hsb2.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
# read data
read_csv('//home//hebbali_aravind//datasets//hsb2.csv')

You may sometimes get an error while reading data from a file (most of us do when reading data for the first time). If that happens, work through the following checks:

  • check the separator in the file and ensure it is a comma
  • check the file name
  • check the file path i.e. location of the file
  • ensure that the file name or path is enclosed in single or double quotes
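The checks above can also be done from within R before calling read_csv(). The sketch below uses only base R functions; the temporary file stands in for your own file and is an assumption of the example.

```r
# Stand-in for the file you are trying to read (made-up data).
csv_path <- file.path(tempdir(), "example_check.csv")
writeLines(c("id,read,write", "1,57,52"), csv_path)

# Does a file exist at this path? FALSE points to a wrong name or location.
path_ok <- file.exists(csv_path)

# Which files are actually present in the folder?
files_in_folder <- list.files(dirname(csv_path))

# Look at the first lines of the raw text to see which separator is used.
first_lines <- readLines(csv_path, n = 2)
first_lines
```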

Column Names


In some cases, files do not include column names or headers. If we do not indicate the absence of column names, readr will treat the first row of the data as the column names. It is good practice to take a quick look at the data to check for the presence/absence of column names.

Column names are present




Column names are absent



Column Names


Use col_names to indicate whether the data includes column names. For this purpose it takes TRUE (the default) or FALSE. If set to FALSE, readr will generate column names (X1, X2, X3, ...). The below code reads data from a file which does not have column names present in the first row.

read_csv('directory_path/file_name.csv', col_names = FALSE)
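To see the effect on a concrete file, the sketch below writes a small made-up file without a header row and reads it with col_names = FALSE. The data values are assumptions for illustration.

```r
library(readr)

# Made-up file with no header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("1,57,52", "2,68,59"), csv_path)

# Without col_names = FALSE, the first data row would be used as the header.
no_header <- read_csv(csv_path, col_names = FALSE)
names(no_header)
```

Both data rows are kept, and the columns receive the generated names X1, X2 and X3.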


Instructions


  • read data from hsb3.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
  • open the file and check whether it has column names
  • if column names are absent, indicate it while reading data
# read data
read_csv('//home//hebbali_aravind//datasets//hsb3.csv', col_names = FALSE)

Column Names


We learnt how to indicate whether a file includes column names. What if we want to specify names for the columns while reading data from the file? col_names can be used to specify column names as well: store the names as a character vector and supply it to col_names. The below example shows how to do it.

column_names <- c('col_1', 'col_2', 'col_3')
read_csv('directory_path/file_name.csv', col_names = column_names)
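One case worth sketching: if the file already has a header row and you still supply your own names, the header line is read as a data row unless it is skipped. The example below uses made-up data and combines col_names with the skip argument to replace an existing header; the names in new_names are assumptions for illustration.

```r
library(readr)

# Made-up file whose header (a, b, c) we want to replace.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,57,52", "2,68,59"), csv_path)

new_names <- c("id", "read", "write")

# skip = 1 drops the original header line so it is not read as data.
renamed <- read_csv(csv_path, col_names = new_names, skip = 1)
names(renamed)
```

When the file has no header at all, as in the example above this one, skip is not needed.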


Instructions


  • read data from hsb3.csv
  • the path of the directory is //home//hebbali_aravind//datasets//
  • use cnames provided below to specify the column names
# column names
cnames <- c("id", "female", "race", "ses", "schtyp", "prog", "read", "write", "math", "science", "socst")

# read data
read_csv('//home//hebbali_aravind//datasets//hsb3.csv', col_names = cnames)

Skip Lines


In certain files, you will find information related to the data such as:

  • the data source
  • column names
  • column description
  • copyright etc.

The data will appear after/below such text/information. While reading data from such files, we need to skip all the rows where the text is present. If we do not skip them, readr will consider them as part of the data.



Skip Lines


If the file has contents other than data in the first few lines, we need to skip them before reading the data. Use the skip argument to skip a given number of lines. The below example shows how to skip the first 5 lines of a file.

read_csv('directory_path/file_name.csv', skip = 5)
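A runnable sketch of the same idea follows. It writes a made-up file with three lines of notes above the header and then skips exactly those three lines; the notes and data values are assumptions for illustration.

```r
library(readr)

# Made-up file: three lines of notes, then the header and the data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Source: made-up example data",
  "Collected for illustration only",
  "Copyright: none",
  "id,read,write",
  "1,57,52",
  "2,68,59"
), csv_path)

# Skip the three note lines; the fourth line becomes the header.
skipped <- read_csv(csv_path, skip = 3)
names(skipped)
```

The value given to skip should equal the number of lines above the header, so it is worth counting them in the raw file first.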


Instructions


  • read data from hsb4.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • open the file and check if it has any text/info before the data
  • if any text/info is present before data, count the number of rows covered by the text
  • use skip argument to skip the text/info and read from the row where the column names are present
# read data after skipping the required number of rows
read_csv('//home//hebbali_aravind//datasets//hsb4.csv', skip = 3)

Maximum Lines


Suppose the data file contains several thousands of rows of data and we do not want to read all of it. What can we do in such cases? readr allows us to specify the maximum number of rows to be read using the n_max argument. Suppose we want to read only 100 rows of data from a file, we can set n_max equal to 100.

read_csv('directory_path/file_name.csv', n_max = 100)
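The sketch below shows n_max on a small made-up file of ten rows. The file and its column names are assumptions for illustration.

```r
library(readr)

# Made-up file with a header and ten data rows.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", paste(1:10, 11:20, sep = ",")), csv_path)

# n_max counts data rows; the header line is not included in the count.
first_three <- read_csv(csv_path, n_max = 3)
nrow(first_three)
```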


Instructions


  • read the first 120 rows from the hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • observe the last row in the output and check if it says # ... with 110 more rows
# read the first 120 rows of data
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', n_max = 120)

Column Specification


If you observed the output when we read data in the previous section, it includes the data type detected for each column/variable in the data set. If you want to check the data types before reading the data, use spec_csv(). We will learn to specify the column types in the next section.
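As a small sketch, spec_csv() can be tried on a made-up file as shown below. The file contents are assumptions for illustration; spec() is a related readr function that returns the specification of data that has already been read.

```r
library(readr)

# Made-up file with a number, a text and a TRUE/FALSE column.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,passed", "1,asha,TRUE", "2,ravi,FALSE"), csv_path)

# Inspect the guessed column types without keeping the data.
guessed <- spec_csv(csv_path)
guessed

# After reading, spec() returns the specification that was used.
sample_data <- read_csv(csv_path)
spec(sample_data)
```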


Instructions


  • run the below code and observe the output
spec_csv('//home//hebbali_aravind//datasets//hsb2.csv')

Column Types


If you have observed carefully, when you read data using readr, it displays the column names and column types followed by the first 10 rows of data. readr determines the data type for each column based on the first 1000 rows of data. The data can be of the following types:

  • integer
  • double (decimal point)
  • logical (TRUE/FALSE)
  • character (text/string)
  • factor (categorical/qualitative)
  • date/time
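To see the guessing in action, the sketch below reads a made-up file with one column of each of four types and checks the class of each resulting column. The data is an assumption for illustration. Note that readr does not guess the factor type; a factor column has to be requested explicitly.

```r
library(readr)

# Made-up file: a decimal, a logical, a text and a date column.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "score,passed,name,exam_date",
  "57.5,TRUE,asha,2020-01-15",
  "68.25,FALSE,ravi,2020-02-20"
), csv_path)

# No col_types given, so readr guesses the type of every column.
guessed_data <- read_csv(csv_path)
sapply(guessed_data, class)
```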


Column Types


Before you read data from a file, use spec_csv() to see the data types as determined by readr. If it determines the data types correctly, you can go ahead and read the data. Otherwise, we have to specify the data types ourselves. When the types are supplied as an unnamed list, as in the example below, they are matched to the columns by position, so we have to specify a type for every column in the file and not just for those columns whose data type was wrongly determined by readr.

To specify the data types, we will use the col_types argument and supply it a list of data types. The data types can be specified using:

  • col_integer()
  • col_double()
  • col_factor()
  • col_logical()
  • col_character()
  • col_date()
  • col_time()
  • col_datetime()

While specifying the data types, we also need to specify the categories of each categorical/qualitative variable. To do that, we use the levels argument within col_factor(). readr documents levels as a character vector, so the categories are written as strings. Let us read data from the mtcars5.csv file to understand data type specification.

read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = list(col_double(), col_factor(levels = c("4", "6", "8")),
                          col_double(), col_integer()))
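If only a few columns need a different type, a named specification built with cols() is an alternative to the positional list: columns that are not named are left for readr to guess. The sketch below uses a made-up file; the column names disp and hp and all values are assumptions for illustration and may not match mtcars5.csv.

```r
library(readr)

# Made-up file with four numeric-looking columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "mpg,cyl,disp,hp",
  "21,6,160,110",
  "22.8,4,108,93",
  "18.7,8,360,175"
), csv_path)

# Name only the columns whose type should differ from the guess.
typed <- read_csv(csv_path, col_types = cols(
  cyl = col_factor(levels = c("4", "6", "8")),
  hp  = col_integer()
))
sapply(typed, class)
```

Here cyl becomes a factor and hp an integer, while mpg and disp keep the guessed type.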

Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • specify the column types for each column in the file
# read data while specifying column types
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = list(
  col_integer(), col_factor(levels = c("0", "1")),
  col_factor(levels = c("1", "2", "3", "4")), col_factor(levels = c("1", "2", "3")),
  col_factor(levels = c("1", "2")), col_factor(levels = c("1", "2", "3")),
  col_integer(), col_integer(), col_integer(), col_integer(),
  col_integer()
))

Specific Columns


We may not always want to read all the columns from a file. In such cases, we can specify the columns to be read using col_types argument and supplying to it the names of the columns to be read. We will use cols_only() to specify the column names and their respective data types.

read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = cols_only(mpg = col_double(), 
                               cyl = col_factor(levels = c("4", "6", "8"))))
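A self-contained sketch of cols_only() on a made-up file is given below. The column names disp and hp and all values are assumptions for illustration.

```r
library(readr)

# Made-up file with four columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "mpg,cyl,disp,hp",
  "21,6,160,110",
  "22.8,4,108,93"
), csv_path)

# Only the columns named inside cols_only() are read; the rest are dropped.
subset_data <- read_csv(csv_path, col_types = cols_only(
  mpg = col_double(),
  hp  = col_integer()
))
names(subset_data)
```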


Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • read the following columns only
    • id
    • prog
    • read
# read columns id, prog and read from hsb2.csv
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = cols_only(
  id = col_integer(), prog = col_factor(levels = c("1", "2", "3")), read = col_integer()
))

Skip Columns


Sometimes we may want to skip some columns while reading data, e.g. we have 10 columns and want to read only 8 of them. In such cases, we can use col_skip() to mark the columns that must be skipped while reading the data.

read_csv('//home//hebbali_aravind//datasets//mtcars5.csv', 
         col_types = list(col_double(), col_factor(levels = c("4", "6", "8")),
                          col_skip(), col_integer()))
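col_skip() can also be attached to a column by name inside cols(), which avoids listing a type for every column. The sketch below uses a made-up file; the column names disp and hp and all values are assumptions for illustration.

```r
library(readr)

# Made-up file with four columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "mpg,cyl,disp,hp",
  "21,6,160,110",
  "22.8,4,108,93"
), csv_path)

# Skip disp by name; the remaining columns are guessed as usual.
without_disp <- read_csv(csv_path, col_types = cols(disp = col_skip()))
names(without_disp)
```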


Instructions


  • read data from hsb2.csv file
  • the path of the directory is //home//hebbali_aravind//datasets//
  • skip the following columns
    • id
    • prog
    • read
# skip columns id, prog and read while reading hsb2.csv
# cols() leaves the remaining columns to be guessed by readr
read_csv('//home//hebbali_aravind//datasets//hsb2.csv', col_types = cols(
  id = col_skip(), prog = col_skip(), read = col_skip()
))

Summary




The above table gives an overview of the functions for reading different types of files in readr and Base R. All the functions in readr offer a common set of options which are described below:

  • col_names: whether data includes column names
  • n_max: maximum number of lines/rows to read
  • col_types: data type of the columns
  • skip: number of lines/rows to skip
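As a sketch of how these options carry over to another reader, the example below applies all four of them to read_tsv() on a made-up tab separated file. The file contents and column names are assumptions for illustration.

```r
library(readr)

# Made-up tab separated file: one line of notes, no header, three data rows.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Notes: made-up example data",
  "1\t57\t52",
  "2\t68\t59",
  "3\t44\t33"
), tsv_path)

tsv_data <- read_tsv(
  tsv_path,
  skip = 1,                               # skip the notes line
  col_names = c("id", "read", "write"),   # the file has no header row
  col_types = cols(id = col_integer()),   # remaining columns are guessed
  n_max = 2                               # read the first two rows only
)
tsv_data
```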

Practice


  • check the separator type in the following files and read them using appropriate read_xxx() function:

    • hsb.csv
    • mtcars.tsv
    • hsb1.csv
    • hsb.txt