In this video we focus on how to extract data from a string variable in R using str_sub(), substr(), separate_longer_delim(), separate_longer_position(), separate_wider_delim(), and separate_wider_position(). We also discuss other aspects of string variables: how to get the length of a string and how to change its case to upper, lower, title, or sentence case. Then we move on to removing white spaces from a string variable and removing a specific character from a string.
00:00 Intro to video
0:49 str_sub VS substr
3:12 separate_longer_delim
5:27 separate_longer_position
6:41 separate_wider_delim
8:25 separate_wider_position
9:31 nchar VS str_length
10:30 Change Case of String
11:15 Remove white spaces
13:20 Remove specific string
Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel
0:02
In the past couple of videos, we have been talking about how to work with string variables in R
0:08
We are going to continue the string variable chapter in this video
0:13
So in this video, we are going to cover how to extract data from a string variable
0:18
We are going to discuss extracting string, working with string length, changing case, removing
0:24
white spaces, whether they are trailing, leading, or within the text,
0:30
and removing a specific string from a text. So let's load the tidyverse library
0:41
We're going to load the tidyverse library. If you haven't installed it, you need to install it first and then load it
0:47
So firstly, let's look at these two functions, str_sub versus substr
0:54
substr is the base R function, whereas str_sub, as we know, comes from the stringr
0:59
library, which is part of the tidyverse package. So if we were to use str_sub to extract certain text, let's say we have the zip code
1:10
ZIP40009, and we know that a zip code has five digits
1:16
So we just want to extract the numerical part of this zip code
1:20
What we do is use str_sub and then specify the string that we want to work with, the
1:26
starting position and the ending position. So we want to start from the character number of
1:35
I mean the character at the fourth position. So this is the first, second, third and fourth position
1:40
and we want to go till the eighth position. If you press Ctrl+Enter,
1:44
we get the numerical part of this zip code. Similarly, if we have, let's just say
1:50
something different: sometimes we work with files, such as a CSV file, and we want to remove
1:56
the ending part of the file name, whether that is a CSV file or any other file,
2:03
but normally the extension is three characters long. So what we can do is, with str_sub,
2:08
we have this minus sign; it can take negative values, and what this negative value would mean is that we
2:15
start from the first position of the string, that is this one
2:20
and extract all the characters up to the fifth position from the end of the string
2:27
So that should be till C. So if you press Control Enter
2:31
it should give me the name of that file.
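A minimal sketch of the two str_sub() calls described above. The input strings are assumptions for illustration (the exact values on screen are not recoverable from the transcript), and stringr is loaded on its own here, although library(tidyverse) attaches it as well.

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical input: "ZIP" followed by a five-digit zip code
zip <- "ZIP40009"
zip_digits <- str_sub(zip, 4, 8)       # characters 4 through 8
zip_digits                             # "40009"

# Hypothetical file name; a negative end position counts from the end
file_name <- "sales_data.csv"
base_name <- str_sub(file_name, 1, -5) # stops at the fifth character from the end
base_name                              # "sales_data"
```

Position -1 is the last character, so -5 is the character just before ".csv".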
2:36
Now, we want to compare this with substr, which is the base R function,
2:41
but do remember that in substr, the syntax is exactly the same,
2:46
but one thing that we cannot do is we cannot have negative integers
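As a quick check of that limitation, here is a small base R sketch. The file name is a hypothetical example, and the last line shows one possible base R workaround that is not part of the video.

```r
file_name <- "sales_data.csv"  # hypothetical file name

first_five <- substr(file_name, 1, 5)   # positive positions work as in str_sub()
first_five                              # "sales"

neg_stop <- substr(file_name, 1, -5)    # a negative stop is not counted from the end
neg_stop                                # ""

# One base R alternative: compute the end position from the string length
base_name <- substr(file_name, 1, nchar(file_name) - 4)
base_name                               # "sales_data"
```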
2:52
If we give negative integers, it would return an empty string. So that is the difference between substr and str_sub. So str_sub is somewhat
3:01
superior to substr. substr is the base R function, and str_sub is the function
3:10
from the stringr library. There are certain functions from the tidyr package, and there are four
3:18
of these functions which I'm going to discuss. They all start with the word separate,
3:24
and then there is this middle word, which is longer, and then whether you want to separate the text based on a delimiter or a specific position. So this function would separate the text into longer format, that is, convert the text into rows
3:44
So it splits into rows based on a delimiter. And this one would do the same as this one,
3:50
but split the text based on certain positions. So it splits into rows based on position
3:56
The other ones, which are the wider functions, would split the text into
4:02
columns. So let's start with the first one, which is separate_longer_delim
4:08
So let's say we have this data frame over here, and this data frame has just one column
4:13
and it contains certain text. We have different numbers and different letters; we can have separate
4:20
letters, but they are all separated by a comma. So let's call it an example data frame
4:27
And if I show you the data frame, we have these three rows. What we want is this:
4:32
we want to take all these characters, which are separated by commas,
4:38
and we want to convert them into rows. So what we want to do is separate this text into a longer format, because we want them in rows, using a delimiter, which is a comma
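That step can be sketched as follows. The data frame contents and the column name a are assumptions made up for illustration, and the native pipe |> (R 4.1 or later) stands in for whichever pipe the video uses.

```r
library(tidyr)  # separate_longer_delim() is available from tidyr 1.3.0

# Hypothetical data: one column of comma-separated values, three rows
example_df <- data.frame(a = c("1,2,3", "x,y", "4,z,5"))

# Split column `a` on the comma; each piece becomes its own row
long_df <- example_df |>
  separate_longer_delim(a, delim = ",")

long_df  # eight rows: 1, 2, 3, x, y, 4, z, 5
```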
4:49
So we take the data frame, the example data frame, and then we input it into this function
4:57
where the first argument of this function would be the column that we want
5:02
to delimit, or the string that we want to delimit, and the second argument would be the delimiter itself,
5:08
which is how we separate these characters, and that is using a comma. So if I press Ctrl+Enter,
5:17
we can see that it has separated the text using the comma delimiter and converted it into rows. We could
5:24
have saved them in a data frame. Let's work with the second function, which is separate_longer_position,
5:30
which separates based on position. Let's say we have this data frame where we have different digits or
5:38
characters, but they aren't separated using a comma. Rather, they are separated, I mean they can be
5:45
separated based on a certain position, but there is no specific delimiter, whether space, comma, tab, or dot.
5:53
So they can be separated based on position but not based on a delimiter. If I can have the example,
5:59
this would be the example2 data frame. We have these strings, and let's take this example2 data
6:05
frame and then we pass that into this separate_longer_position function. We want it to take
6:17
column A and separate it using a width of one. This is the width of the text that it would separate,
6:25
and you get to see that it has been converted into longer format, or rows
6:31
So longer means that we want it to be converted into longer format, which is similar to the pivot_longer and pivot_wider concept
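A minimal sketch of the position-based split just described. The strings and the column name a are hypothetical; only the width of one comes from the video.

```r
library(tidyr)  # separate_longer_position() is available from tidyr 1.3.0

# Hypothetical data: characters with no delimiter between them
example2_df <- data.frame(a = c("123", "xy", "4z5"))

# Cut column `a` into pieces of width 1; each piece becomes its own row
long2_df <- example2_df |>
  separate_longer_position(a, width = 1)

long2_df  # eight rows: 1, 2, 3, x, y, 4, z, 5
```

A larger width, for example width = 2, would cut each string into two-character pieces instead.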
6:40
Let's work with the other two functions which would separate them into columns
6:45
So we work with separate_wider_delim. We have this data frame which contains dates of birth
6:52
It has just one column, that is, the date of birth of different individuals
6:57
We have these dates of birth, and they are separated by a dash
7:02
Now we want to have a separate column for days, for months, and for years. So we want to have three columns that would contain the day, the month, and the year extracted from this date. So let's take this data frame
7:23
Let's first create this data frame. Let me show you the data frame. We have these dates
7:29
Let's take this data frame and then pipe it into the separate_wider_delim function;
7:36
we take the column DOB and we delimit it using a dash
7:43
Now these are the same parameters that we had in the separate_longer case
7:50
But the new argument that we have over here is the names
7:55
That is the names of the columns that would be generated. Because in the wider case, in separate_wider, we would have columns rather than
8:06
this text being separated into rows. So we want it to have day, month, and year columns
8:13
And let's generate this. And you can see we have three columns: day, month, and year
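A sketch of that call, assuming a column named dob and made-up dates (the actual values in the video are not in the transcript). The names argument supplies the three new column names.

```r
library(tidyr)  # separate_wider_delim() is available from tidyr 1.3.0

# Hypothetical dates of birth, separated by a dash
dob_df <- data.frame(dob = c("4-11-1990", "23-6-1985", "15-12-2001"))

# Split `dob` on the dash into three new columns
wide_df <- dob_df |>
  separate_wider_delim(dob, delim = "-", names = c("day", "month", "year"))

wide_df  # columns day, month, year, all stored as character
```

The new columns are text, so a follow-up conversion such as as.integer() would be needed before doing arithmetic on them.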
8:18
And the text has been separated using this hyphen sign. Next, let's say we have this kind of date
8:27
These are the same dates. But now they are not separated using dash
8:32
Rather, they are separated based on specific position. So each date of birth is of the same size
8:39
So compared to the earlier one, where we didn't have 04,
8:43
in this case we have 04. So all the days are exactly two digits,
8:51
the months are exactly two digits, and the year is exactly four digits
8:56
So let's generate this data frame. This data frame looks like this
9:01
We want to create three columns again. But this time, what we are going to do is we are going
9:06
to separate them based on position. So it would take two arguments
9:10
First is the column and the second one is the widths. And we would create a vector of widths
9:17
The day column would have a length of two digits, month column would have length of two digits
9:23
and the year column would have a length of four digits. If I press Ctrl+Enter, we get these separate columns
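The width-based version can be sketched like this, again with a hypothetical dob column and made-up dates. The widths vector is named, so the names become the new column names.

```r
library(tidyr)  # separate_wider_position() is available from tidyr 1.3.0

# Hypothetical fixed-width dates: 2-digit day, 2-digit month, 4-digit year
dob2_df <- data.frame(dob = c("04111990", "23061985", "15122001"))

# Cut `dob` by position using a named vector of widths
wide2_df <- dob2_df |>
  separate_wider_position(dob, widths = c(day = 2, month = 2, year = 4))

wide2_df  # columns day, month, year, all stored as character
```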
9:30
Let's move to our next topic. Now, this isn't particularly related to
9:36
extracting strings, but these come in handy when we work with string variables
9:44
So let's say we have this string and we want to know the length of the string, the number of
9:51
characters that are there in this string. We use the str_length function
9:55
Now this str_length function would give us 8, and that means we have eight characters within
10:02
this string. Now this str_length function comes from the stringr library, and it is somewhat easy to remember because we know we want to get the length of the string
10:15
There is this base R function, which is known as nchar, and that would give us the exact same output
10:24
For an ordinary string like this one, these functions are effectively identical. The only thing is that they have different names
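A short sketch comparing the two functions. The eight-character string is an assumption chosen to match the result of 8 mentioned above. The last two lines show one edge case where the functions differ, which goes beyond what the video covers.

```r
library(stringr)  # also attached by library(tidyverse)

zip <- "ZIP40009"            # hypothetical eight-character string

len_stringr <- str_length(zip)
len_base    <- nchar(zip)
len_stringr                  # 8
len_base                     # 8

# Edge case: a logical NA is treated differently by the two functions
nchar(NA)                    # 2, because NA is coerced to the text "NA"
str_length(NA)               # NA
```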
10:30
Then we have different functions to change the case of a string. We can convert it into upper case, lower case, title case, or sentence case
10:40
They all come from the stringr package. That's what we know when we see str_
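The four case helpers can be sketched as follows, using a made-up sentence as input. Note that str_to_sentence() needs stringr 1.4.0 or later.

```r
library(stringr)  # also attached by library(tidyverse)

txt <- "the data hall channel"      # hypothetical lower-case text

upper_txt    <- str_to_upper(txt)
lower_txt    <- str_to_lower("THE DATA HALL")
title_txt    <- str_to_title(txt)
sentence_txt <- str_to_sentence(txt)

upper_txt     # "THE DATA HALL CHANNEL"
lower_txt     # "the data hall"
title_txt     # "The Data Hall Channel"
sentence_txt  # "The data hall channel"
```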
10:46
So str_to_upper converts to upper case. We have these characters. They are in lower case and we got them converted into upper case, or into lower case using str_to_lower. We can
11:01
convert this into title case, where the first letter of each word would
11:07
be capitalized, or sentence case, where just the first letter of the first
11:14
word would be capitalized. Okay, let's now work with white spaces and how to remove these white spaces from text. Let's say we have a text over here
11:22
where we have white spaces at the start of the text, and then we have multiple white spaces in the middle
11:29
of the text and at the end of the text. Now this str_trim function, what it would do is it would
11:36
only eliminate these starting and ending spaces, even if there are multiple. If we press Ctrl+Enter,
11:43
we can see that the starting and ending spaces have been removed,
11:47
but we didn't remove the middle spaces that were there. Now this is somewhat similar to the trimws function,
11:56
but trimws is somewhat difficult to use, and it comes from base R,
12:02
whereas str_trim comes from the stringr package. So we'll stick to str_trim
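A minimal sketch of trimming, using a made-up string with extra spaces. It assumes the base R function being compared is trimws(), and the side argument shown at the end is an extra not demonstrated in the video.

```r
library(stringr)  # also attached by library(tidyverse)

txt <- "   extra   spaces   here   "   # hypothetical text

trimmed      <- str_trim(txt)          # removes leading and trailing spaces only
trimmed_base <- trimws(txt)            # base R counterpart, same result here
left_only    <- str_trim(txt, side = "left")

trimmed       # "extra   spaces   here"
trimmed_base  # "extra   spaces   here"
left_only     # "extra   spaces   here   "
```

The runs of spaces inside the text are left untouched by both functions.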
12:10
Now, the workaround for these multiple within-text spaces is that we can either remove them by using the str_replace_all function
12:22
And within this, what we need to do is first we need to specify the text and then we need to specify
12:28
the pattern that would identify these spaces. And what we want is to replace all these spaces with a single space
12:40
So if I press Ctrl+Enter, it would collapse all the multiple spaces that are within the text, but obviously it would still have the starting
12:51
spaces. So a better function to work with is the str_squish function, and it would remove,
12:58
it works like, it's quite an easy way of working; you do not have to worry about
13:03
the regex, which is what we will cover in our next video. It would remove the starting
13:08
spaces, the leading and trailing spaces, and multiple spaces that are there within the text
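The two approaches can be compared in a short sketch. The text is made up, and the pattern "\\s+" (one or more whitespace characters) is an assumption, since the transcript does not state the exact regular expression used.

```r
library(stringr)  # also attached by library(tidyverse)

txt <- "   extra   spaces   here   "   # hypothetical text

# Replace every run of whitespace with a single space
collapsed <- str_replace_all(txt, "\\s+", " ")
collapsed   # " extra spaces here " (one leading and one trailing space remain)

# str_squish() trims both ends and collapses inner runs, with no regex needed
squished <- str_squish(txt)
squished    # "extra spaces here"
```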
13:15
We can obviously pipe it into other functions if you want to
13:21
Lastly, we are going to look at how to remove a specific string from a text or replace
13:27
it with another text. So there is this str_remove or str_remove_all function
13:33
They work identically, except that str_remove would only remove the first occurrence of
13:39
that specific character. So let's say we have over here certain text
13:43
And we want to replace, we want to remove A from this text
13:49
So what this would do: str_remove would only remove the first occurrence of that specific text,
13:56
whereas str_remove_all would remove all the occurrences of that text
14:01
Similarly, we have str_replace, where it would replace this text with this specific text
14:09
So again, it would remove only the, replace only the first occurrence
14:13
but if we use str_replace_all, it would replace all the occurrences of that specific text
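A minimal sketch of the four functions side by side. The word "banana" and the replacement letter "o" are made-up values for illustration; the video only says that the letter a is removed from some text.

```r
library(stringr)  # also attached by library(tidyverse)

txt <- "banana"   # hypothetical text containing several "a" characters

removed_first <- str_remove(txt, "a")            # first occurrence only
removed_all   <- str_remove_all(txt, "a")        # every occurrence
replaced_first <- str_replace(txt, "a", "o")     # first occurrence only
replaced_all   <- str_replace_all(txt, "a", "o") # every occurrence

removed_first   # "bnana"
removed_all     # "bnn"
replaced_first  # "bonana"
replaced_all    # "bonono"
```

The pattern argument is interpreted as a regular expression, so special characters such as "." would need escaping or wrapping in fixed().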
14:20
So I hope this video was useful. Do subscribe to this channel, stay tuned and do hit the like
14:26
button. Thanks for watching this video


