Combining String Variable in R | Working With String Part 2
5K views
May 16, 2024
In this video we discuss how to combine or concatenate strings in R. We use the stringr package, which is part of the tidyverse. 0:00 Intro to video 0:50 String concatenation 3:37 Missing value in string 6:44 str_glue function 7:34 str_flatten function Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel. We have started a series on different aspects
0:05
of string variables or strings in R. So it is working with strings in R. In our previous
0:12
video we discussed how to create a string variable. In this video we are going
0:16
to look at how to combine different strings, and in the next video we are going to talk
0:21
about how to extract data from a string variable. As discussed in the previous video, we are working
0:26
with the tidyverse package. Let's load the tidyverse. In this video we are going
0:31
to look at how to concatenate string variables, how to handle missing
0:36
values in a string variable, and the str_glue and str_flatten
0:42
functions. So let's start with string
0:47
concatenation. Let's say we have these two strings. We want hello and Adam to be combined.
0:55
Obviously they are two different strings. So what we do is we use str_c, which
1:00
stands for string concatenation. So when I press control enter you can see that we have
1:07
hello Adam printed over here and that is what the string concatenation has done. It has
1:14
combined these two different strings. Now let's say we have over here hi
1:19
and then a vector of two strings, Adam and
1:27
Rose. We know that when we combine two vectors of different
1:33
sizes, the shorter one is repeated; here the single hi is reused for each name. So what this would do is print hi
1:38
with Adam and hi with Rose. So this is how it would look. But normally we are working
1:45
with data frames. So let's say we have this names data frame. We are creating a data frame
1:51
over here. We have two columns in that data frame. The first one is the first name and the second
1:56
one is the last name. We have three different people in it. Let's
2:02
press control enter. We have this object called names, and what we want to do is
2:08
create a full name that would contain the first name and the last name. So what we do
2:14
is we take the names object, mutate, and then create a new column called full name. This
2:20
column would contain the combined first name and last name, and
2:25
we know that we can use the string concatenation function to combine these two columns. So
2:31
let's press control enter and let's look at the names object and we can see that we have
2:37
the full name, but there is no space between the two parts. What we can simply
2:41
do is add a space in quotes over here as its own argument, separated by commas. Now if you press control
2:49
enter and recreate that column, you would have a space between the names. Now do remember that string
2:56
concatenation does much the same job as the base R functions paste and
3:02
paste0. So str_c is essentially the stringr counterpart of paste0. There are some
3:07
differences with string concatenation, and as we move forward it will become evident
3:12
what those differences are. So let's do this with paste0. We have exactly the same command
3:18
as over here, except I'm creating a new column called full name two and I'm using the paste0
3:24
function. The rest of the command is exactly the same. So you can see that the
3:29
output is exactly the same. So here string concatenation gives the same result as
3:35
paste0 in base R. But let's look at how missing values are treated in both
3:42
these functions. So we have over here the same data frame, but instead of having
3:48
the third last name, the third person has a missing last name. We do not know what
3:55
his last name is, or there might not be any last name, but whatever the case is we do not
4:00
have that data. So let's create this object names_m that would contain the
4:06
missing value. And now if we use the exact same command and concatenate using string
4:13
concatenation and using the paste0 function, what you would see is that string concatenation
4:21
does respect the missing value. It does not print a combined name: the whole
4:28
result is missing if any piece is missing. So with string concatenation we see this result, but
4:36
with the paste0 function (we didn't have the space in it, but we
4:43
can add the space just to make them comparable) what you can see is that it has combined
4:53
NA with the first name. That is not what we would have expected. So this is where string
4:59
concatenation might be helpful. There is one
5:07
way of dealing with this missing value in this specific case. Let's say we
5:12
want to create a column greeting that would contain
5:17
Hi and the last name. Let's print this, and if we look at it we know that
5:26
string concatenation would not print anything, not even the Hi, when one of the objects
5:34
that we are combining is missing. One way of working with this is to use this specific
5:39
function. I'm not sure how to pronounce it, but what it does is take the first value that is not
5:45
missing. We know that string concatenation prints a missing value if any input is
5:51
missing. So instead of this NA we just want a Hi salutation.
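The missing-value behaviour described above can be sketched as follows. This is a minimal illustration, not the exact code from the video: the first names and column names are placeholders, and the function that "takes the first non-missing value" is assumed here to be dplyr's coalesce(), which matches that description.

```r
library(dplyr)
library(stringr)

# Placeholder data: the third person has no recorded last name.
# Only the last names Curry and Paul are mentioned in the video.
names_m <- tibble(
  first_name = c("Stephen", "Chris", "Adam"),
  last_name  = c("Curry", "Paul", NA)
)

result <- names_m %>%
  mutate(
    # str_c() returns NA as soon as any piece is NA
    full_name  = str_c(first_name, " ", last_name),
    # paste0() turns NA into the literal text "NA"
    full_name2 = paste0(first_name, " ", last_name),
    # coalesce() (assumed) keeps the first non-missing value,
    # so a missing greeting falls back to a plain "Hi"
    greeting   = coalesce(str_c("Hi ", last_name), "Hi")
  )

print(result)
```

In this sketch the third row has NA in full_name, the text "Adam NA" in full_name2, and "Hi" in greeting.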
5:58
So let's just say we are sending an email. If we know the last name we would send
6:02
it as Hi Curry or Hi Paul, but if we do not know the last name we would just send the email
6:08
with a plain Hi greeting. So what we are doing is this: if the first object,
6:14
the concatenated greeting, gives us a non-missing value, then
6:22
this function takes the value from that object. Otherwise, if
6:27
that object is a missing value, then it falls back to Hi. So what it
6:32
does is return the first non-missing value. Let me reopen that, and we can see that, as
6:39
we expected, we get Hi when there is a missing last name. Next let's move to the string
6:46
glue function, str_glue. Now string glue does the same thing as string concatenation: it combines
6:51
different objects or different strings. But the issue with string concatenation is that we
6:57
need to put every literal string within inverted commas, and it gets somewhat
7:02
tricky to work with. The way around is to use the string glue function, and now you
7:08
can see we have just written the text as a single string. And what about
7:16
the variable name or the object name or the column name? We would just enclose it in
7:22
curly brackets. That gets rid of the extra inverted commas. It would print
7:29
the exact same thing, but we now have easier code to work with. Let's move to the last function
7:36
of this video, which is string flatten. Let's say we have this data frame where we have
7:41
two students from a BS program and two students from an MS (master's) program.
7:48
What we want to do is print a list of all the names that are there in
7:54
the BS program and all the names that are there in the MS program. So what we do is
7:59
we take the students data frame. Let me create this data frame. Let me show you the data
8:04
frame. This is how the data frame looks. We want to group the data. Take the
8:10
data frame, group it by program, and now we want to summarize. When we summarize
8:17
we cannot use the string glue function; we have to use the string flatten function.
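The grouping and summarizing steps described here can be sketched as below. This is a rough illustration under stated assumptions: the column names and the two MS student names are made up, and only Mike and Edwards come from the video.

```r
library(dplyr)
library(stringr)

# Placeholder data: Mike and Edwards appear in the video,
# the MS names and the column names are hypothetical.
students <- tibble(
  name    = c("Mike", "Edwards", "Sara", "John"),
  program = c("BS", "BS", "MS", "MS")
)

roster <- students %>%
  group_by(program) %>%
  # str_flatten() collapses each group's names into one string,
  # separated here by a comma and a space
  summarize(names = str_flatten(name, collapse = ", "))

print(roster)
```

str_flatten() returns a single string per group, which is the shape summarize() expects; str_c() and str_glue() instead return one string per row.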
8:23
It is designed to work with the summarize function, because it returns a single string per group. What we do is take the names
8:28
and separate them with a comma and a space. So let me press control enter and let
8:33
me show you the result. So we have summarized. We have the BS program and these
8:39
are the two students that are there. The names Mike and Edwards are
8:45
separated by a comma and a space. That also happens for the MS program. So I hope that was
8:52
useful. Do subscribe to this channel, hit the bell icon, and thanks for watching this video.
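As a recap of the str_c and str_glue sections, the steps might look like the sketch below. The data frame contents and the object name people are placeholders chosen for illustration, not the values used in the video.

```r
library(dplyr)
library(stringr)

# Two single strings combined into one
hello <- str_c("Hello ", "Adam")

# A single string combined with a vector: "Hi " is reused for each name
greetings <- str_c("Hi ", c("Adam", "Rose"))

# Placeholder data frame with first and last names
people <- tibble(
  first_name = c("Stephen", "Chris", "Adam"),
  last_name  = c("Curry", "Paul", "Smith")
)

people <- people %>%
  mutate(
    # stringr version, with an explicit space between the parts
    full_name  = str_c(first_name, " ", last_name),
    # base R version, same result when nothing is missing
    full_name2 = paste0(first_name, " ", last_name),
    # str_glue(): one quoted string, column names in curly brackets
    greeting   = str_glue("Hi {first_name} {last_name}")
  )

print(hello)
print(greetings)
print(people)
```

With no missing values, full_name and full_name2 are identical, and str_glue() produces the same kind of combined text with fewer quotation marks.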