Hacking strings with stringr

Introduction


In this module, we will learn to work with string data in R using stringr. As in the earlier modules, we will use a case study to explore the features of the stringr package. You can download the data for the case study from here or import it directly using the readr package. Let us begin by installing and loading stringr and the other packages we will be using.
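
The setup might look like the sketch below. The text only names stringr and readr; dplyr is an assumption here, inferred from the later use of slice().

```r
# Install once if needed (commented out so the script can be re-run safely).
# install.packages(c("stringr", "dplyr", "readr"))

library(stringr)  # string manipulation functions (str_detect(), str_split(), ...)
library(dplyr)    # assumed: provides slice(), used later to sample rows
library(readr)    # assumed: used to import the case study data
```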

Case Study

Introduction


  • extract domain name from random email ids
  • extract image type from url
  • extract image dimension from url
  • extract extension from domain name
  • extract http protocol from url
  • extract domain name from url
  • extract extension from url
  • extract file type from url

Data


mockstring

Sample Data


Since the data set has 1000 rows, we will use a smaller sample for better readability.

# keep only the first 10 rows of the full data set
mockdata <- slice(mockstring, 1:10)
mockdata

Detect @


One of the columns in the case study data is email. It contains random email ids. We want to ensure that the email ids adhere to a particular format, i.e.

  • they contain @
  • they contain only one @

Let us first detect if the email ids contain @. Use str_detect() to detect @.

str_detect(mockdata$email, pattern = "@")

Count @


Use str_count() to count the number of times @ appears in the email ids.

str_count(mockdata$email, pattern = "@")
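
The two format checks can be combined as in the following minimal sketch. The emails vector is hypothetical and is not taken from the case study data.

```r
library(stringr)

# Hypothetical email ids: one valid, one without @, one with two @.
emails <- c("alice@example.org", "bob.example.com", "carol@@example.net")

has_at <- str_detect(emails, pattern = "@")  # TRUE if the string contains @
n_at   <- str_count(emails, pattern = "@")   # number of @ in each string
valid  <- n_at == 1                          # exactly one @

data.frame(emails, has_at, n_at, valid)
```

Checking n_at == 1 covers both conditions at once, since a count of one implies that @ is present.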

Concatenate


We can use str_c() to concatenate strings. Let us add the string email id: before each email id in the data set.

str_c("email id:", mockdata$email)
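
By default str_c() joins its arguments without a separator. The sketch below, using a hypothetical emails vector, shows the sep and collapse arguments as one way to control the result.

```r
library(stringr)

# Hypothetical email ids, not taken from the case study data.
emails <- c("alice@example.org", "bob@example.com")

# sep is placed between the pieces of each element
labelled <- str_c("email id:", emails, sep = " ")

# collapse joins all elements into a single string
joined <- str_c(emails, collapse = "; ")

labelled
joined
```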

Split


To split a string into parts around a particular pattern, we use str_split(). Let us split the domain name and extension from the domain column in the data. The domain name and extension are separated by ., so we will split the domain column on it. Since . is a special character in regular expressions, we escape it with two backslashes.

str_split(mockdata$domain, pattern = "\\.")
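
As a rough illustration of splitting on a literal dot, the sketch below uses a hypothetical domains vector. It shows the escaped pattern, the fixed() alternative, and simplify = TRUE, which returns a matrix instead of a list.

```r
library(stringr)

# Hypothetical domain values, not taken from the case study data.
domains <- c("example.org", "mail.example.com")

# "\\." is the regular expression for a literal dot
parts <- str_split(domains, pattern = "\\.")

# fixed(".") matches the dot literally, so no escaping is needed
parts_fixed <- str_split(domains, pattern = fixed("."))

# simplify = TRUE returns a character matrix; shorter rows are padded with ""
parts_matrix <- str_split(domains, pattern = "\\.", simplify = TRUE)

parts
parts_matrix
```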

Truncate

We can truncate a string using str_trunc(). By default, the end of the string is truncated (side = "right"), but we can truncate the beginning or the center of the string as well.

str_trunc(mockdata$email, width = 10)
str_trunc(mockdata$email, width = 10, side = "left")
str_trunc(mockdata$email, width = 10, side = "center")
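
To see what each side does, here is a small sketch on a single hypothetical email id. Note that width is the total width of the result, including the "..." ellipsis.

```r
library(stringr)

# Hypothetical email id (22 characters), not taken from the case study data.
email <- "alexandria@example.org"

right  <- str_trunc(email, width = 10)                   # keeps the start
left   <- str_trunc(email, width = 10, side = "left")    # keeps the end
center <- str_trunc(email, width = 10, side = "center")  # keeps both ends

c(right = right, left = left, center = center)
```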

Sort


Strings can be sorted using str_sort(). Let us quickly sort the emails in both ascending and descending orders.

str_sort(mockdata$email)

str_sort(mockdata$email, decreasing = TRUE)

Case


The case of a string can be changed to upper, lower or title case as shown below.

str_to_upper(mockdata$full_name)
str_to_lower(mockdata$full_name)
str_to_title(mockdata$full_name)

Replace


Parts of a string can be replaced using str_replace(). In the address column of the data set, let us replace:

  • Street with ST
  • Road with RD

str_replace(mockdata$address, "Street", "ST")
str_replace(mockdata$address, "Road", "RD")
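
str_replace() changes only the first match in each string. If both replacements should be applied in one pass, str_replace_all() with a named vector is one option, sketched here on a hypothetical address vector.

```r
library(stringr)

# Hypothetical addresses, not taken from the case study data.
address <- c("12 Main Street", "7 Hill Road", "3 Street Road Street")

# str_replace() replaces only the first occurrence in each string
first_only <- str_replace(address, "Street", "ST")

# A named vector maps each pattern to its replacement; all matches are replaced
both <- str_replace_all(address, c("Street" = "ST", "Road" = "RD"))

first_only
both
```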

Extract


We can extract parts of the string that match a particular pattern using str_extract().

str_extract(mockdata$email, pattern = "org")
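
The pattern can also be a regular expression rather than a literal string. The sketch below, on a hypothetical emails vector, extracts a literal match and then the trailing extension. Strings without a match give NA.

```r
library(stringr)

# Hypothetical email ids; the last one has no extension.
emails <- c("alice@example.org", "bob@example.com", "no-at-sign")

# Literal pattern: returns the matched text, or NA where there is no match
literal <- str_extract(emails, pattern = "org")

# Regular expression: a dot followed by lowercase letters at the end of the string
extension <- str_extract(emails, pattern = "\\.[a-z]+$")

literal
extension
```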

Match


Before we extract, we need to know whether the string contains text that matches our pattern. Use str_match() to check: it returns a character matrix containing the matched text, with NA for strings in which the pattern is absent.

str_match(mockdata$email, pattern = "org")
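
str_match() is most useful when the pattern contains capture groups, since each group gets its own column. The following sketch assumes a hypothetical emails vector and a simple illustrative pattern, not a full email validator.

```r
library(stringr)

# Hypothetical email ids; the last one does not match the pattern.
emails <- c("alice@example.org", "bob@mail.example.com", "no-at-sign")

# Group 1: text between @ and the last dot; group 2: the extension
m <- str_match(emails, pattern = "@(.+)\\.([a-z]+)$")

# Column 1 is the full match, columns 2 and 3 are the capture groups
m
```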

Index


If we are dealing with a character vector and know that the pattern we are looking at is present in the vector, we might want to know the index of the strings in which it is present. Use str_which() to identify the index of the strings that match our pattern.

str_which(mockdata$email, pattern = "org")
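
As a small sketch on a hypothetical emails vector, str_which() gives the same result as which(str_detect()), and str_subset() returns the matching strings themselves.

```r
library(stringr)

# Hypothetical email ids, not taken from the case study data.
emails <- c("alice@example.org", "bob@example.com", "carol@example.org")

idx     <- str_which(emails, pattern = "org")   # positions of matching strings
matched <- str_subset(emails, pattern = "org")  # the matching strings

idx
matched
```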

Locate


Another objective might be to locate the position of the pattern we are looking for in the string. For example, if we want to know the position of @ in the email ids, we can use str_locate().

str_locate(mockdata$email, pattern = "@")
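
str_locate() returns an integer matrix with start and end columns for the first match in each string. The sketch below uses a hypothetical emails vector and feeds the position of @ into str_sub() to keep everything after it.

```r
library(stringr)

# Hypothetical email ids; the last one contains no @.
emails <- c("alice@example.org", "bob@example.com", "no-at-sign")

# One row per string; rows are NA where the pattern is not found
pos <- str_locate(emails, pattern = "@")
pos

# Everything after the @, using the located start position
domain_part <- str_sub(emails, start = pos[, "start"] + 1)
domain_part
```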

Length


The length of the string can be computed using str_length(). Let us ensure that the length of the strings in the password column is 16.

str_length(mockdata$passwords)
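
To turn the lengths into an actual check, one option is to compare them against the expected value. This is a minimal sketch on a hypothetical passwords vector.

```r
library(stringr)

# Hypothetical passwords: only the first has 16 characters.
passwords <- c("abcdefgh12345678", "short", "abcdefgh1234567890")

lens <- str_length(passwords)

all(lens == 16)    # TRUE only if every password has length 16
which(lens != 16)  # positions of the passwords that fail the check
```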

Extract


We can extract parts of a string by specifying the starting and ending position using str_sub(). Let us extract the currency type from the currency column.

str_sub(mockdata$currency, start = 1, end = 1)
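
Positions in str_sub() can also be negative, in which case they count from the end of the string. The sketch below assumes a hypothetical format for the currency values, with the symbol in the first position.

```r
library(stringr)

# Hypothetical currency values; the real column format may differ.
currency <- c("$45.10", "$120.00")

symbol <- str_sub(currency, start = 1, end = 1)    # first character
amount <- str_sub(currency, start = 2)             # second character onwards
cents  <- str_sub(currency, start = -2, end = -1)  # last two characters

symbol
amount
cents
```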

Word


One final function that we will look at before the case study is word(). It extracts word(s) from sentences. We do not have any sentences in the data set, but let us use it to extract the first and last name from the full_name column.

# first name
word(mockdata$full_name, 1)
# last name (negative positions count from the end)
word(mockdata$full_name, -1)
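
For names with more than two words, the position arguments matter. Here is a short sketch on a hypothetical full_name vector, showing a single position, a negative position, and a range.

```r
library(stringr)

# Hypothetical names, not taken from the case study data.
full_name <- c("Jane Doe", "John Ronald Smith")

first     <- word(full_name, 1)                   # first word
last      <- word(full_name, -1)                  # last word, counted from the end
first_two <- word(full_name, start = 1, end = 2)  # words 1 to 2

first
last
first_two
```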