In this video we work with factor variables in R. We start by defining a factor variable using the factor() and as.factor() functions from base R. Then we discuss the fct() function from the forcats package. We move on to dropping unused levels using fct_drop() and modifying the order of factor levels using fct_relevel(), fct_infreq(), fct_rev() and fct_reorder(). Lastly, we discuss modifying factor levels using fct_recode() and fct_collapse().
00:00 Install packages
1:52 Defining factors using factor()
6:26 as.factor()
7:48 fct() from forcats
10:18 Dropping unused levels
12:14 Manually Modifying factors order
13:49 Order based on frequency
15:44 Order based on values of another variable
17:19 Modifying factors levels (fct_recode)
18:36 Factor collapse (fct_collapse)
Website: thedatahall.com
0:00
Welcome to The Data Hall YouTube channel. In this video we are going to work with factor
0:04
variables: how we define factor levels and recode the factor levels in R. So what
0:12
we are going to discuss is how we define factors, how we drop unused factor levels, changing
0:18
the order of the factor levels, or changing the levels themselves. So let us get started. First
0:24
let's look at the package that we are going to use. We are going to use the forcats package
0:28
that is part of the tidyverse. Whether you load the forcats package or the
0:34
tidyverse package, you end up with forcats loaded either way. But usually I make a practice of loading
0:40
the tidyverse so that I can load all the packages that come with the tidyverse, like ggplot2,
0:45
readr, stringr, etc. So you need to install these libraries. I have
0:53
already installed them so I am not going to install it. I am just going to load the tidyverse
0:58
library. Once that is loaded you can see that it loads dplyr, forcats, ggplot2,
1:05
lubridate, etc., for working with different types of data. So let's move on and let me first
1:13
show you how the functions in the forcats package are named. As with the stringr package
1:21
that we discussed in our previous videos, the function names start with a common prefix, here fct_. So the stringr package, or
1:26
rather the functions in the stringr package, start with str_, but in forcats all the functions start
1:34
with fct_. That way, if we know that a specific function is from the forcats package
1:41
but we cannot remember its name, we can simply type fct_ and search for that function
1:48
within the list, right? OK, so let's move forward and define our data frame. Let's say
1:55
we have this data that contains the names of different students,
2:02
or different individuals, their level of education, and the marks that they have obtained, or the
2:07
percentage of marks they have obtained in their last degree. So let me load this data
2:13
and let me show you what this data looks like. We have name, education (the level of education),
2:20
and their marks in last degree. So if I check the structure of this data we can see that
2:27
marks is a numerical variable, its data type is num (numeric), and name
2:35
is a character variable, its data type is chr. But education is basically a categorical variable, and
2:40
it is by default stored as a string, that is, a character variable, and the issue with this
2:49
is that if you are working with factors and there is some order in that factor then it
2:55
might not give us the meaning that we would want to derive from
3:00
it. So let me arrange this data let's take this data and sort it based on education and
3:07
if I show you this data again, it is sorted on the basis of education. So shouldn't
3:12
the primary be first and then we should have the secondary level of education and then
3:18
BS and MS? But what it did is sort it alphabetically, right? So B comes
3:25
before M, and so on and so forth. So it sorted it in alphabetical order,
3:32
as it would do with any character data. Now what we need to do is we need to define the
3:37
factor variables. So first I'm going to discuss how we define a factor variable using base
3:43
R, not the forcats package but base R, and what the issue is with the base R definition
3:51
of factor variables, and then I'm going to look into the advantage of forcats and why
3:57
defining categorical variables using forcats is more advantageous.
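To see the alphabetical sorting described above, here is a minimal sketch. The data frame is a hypothetical stand-in for the one used in the video: the column names follow the narration, but the names, marks, row order, and exact level spellings are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for the video's data; all values are invented.
data <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
  marks = c(70, 88, 82, 75, 65, 92, 80),
  stringsAsFactors = FALSE
)

# education is stored as a character column (chr), not as a factor
str(data)

# Sorting a character column is alphabetical: BS, MS, Primary, Secondary
sorted <- data %>% arrange(education)
sorted
```

Primary and Secondary end up after BS and MS, which is the ordering problem that the rest of the video addresses.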
4:03
So let's start with this. We know that there are four categories, but if we wanted
4:09
to look at the unique values of those categories, we can use the unique() function,
4:14
specify the column and the data frame, and we get these levels. Now let's first
4:21
define the education levels in a character vector. We are defining them in this order, and this will
4:28
become the order of this categorical variable once we assign this
4:33
education levels vector to our categorical variable. So let's define this over here. We have just
4:39
defined the levels right. Now what we are going to do is we are going to assign these
4:44
levels to our education variable. So we take the data frame called data and then we mutate a variable
4:53
education, which is already there, so mutate() modifies that variable. We want it to be a factor
4:59
variable that takes its values from the education column itself, and its levels
5:05
would come from the education levels that we have just defined and we want it to be
5:11
saved into a new data frame called students. If I execute this and show you the data,
5:17
nothing seems to change, everything looks the same. But let's look at the structure of this data frame:
5:22
you can now see that education has been defined as a categorical, or factor, variable,
5:28
and if you wanted to look at the levels of this education variable we can see that these
5:33
are the levels. Now each category, each categorical
5:39
level, does have a numerical value, as you can see over here;
5:45
these are the numerical codes that get assigned to these categories. If
5:50
you want to look at these numerical values, we can print them. Here
5:56
the first three are bachelor's (BS), so these have a value of 3, then we have MS, which has
6:02
a value of 4, and so on and so forth. So you can see that primary has a value of 1, then
6:08
secondary has a value of 2. So now there is an order, and if we arrange this data frame
6:14
on the basis of education, we can see that now the students data is arranged
6:22
based on the level of education and not based on the alphabetical order.
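The steps just described (define the levels, convert the column with factor(), inspect the levels and their integer codes, then sort) can be sketched as follows. The data values are hypothetical; only the column names and the four level names are taken from the narration, and their exact spelling is assumed.

```r
library(dplyr)

# Hypothetical stand-in for the video's data; all values are invented.
data <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
  marks = c(70, 88, 82, 75, 65, 92, 80),
  stringsAsFactors = FALSE
)

# The order of this vector becomes the order of the factor levels
education_levels <- c("Primary", "Secondary", "BS", "MS")

students <- data %>%
  mutate(education = factor(education, levels = education_levels))

levels(students$education)      # "Primary" "Secondary" "BS" "MS"
as.integer(students$education)  # integer codes: Primary = 1, ..., MS = 4

# arrange() now follows the level order instead of the alphabet
by_level <- students %>% arrange(education)
by_level
```

The printed data frame looks the same as before the conversion; the difference only shows up in str(), levels(), and the sort order.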
6:29
Now this was when we had some order in mind and we defined the levels first
6:35
and then assigned them. We could have simply used the as.factor() function. Let's take
6:41
the students data mutate make a new column which we already have but let's modify it
6:48
and, using as.factor(), convert this variable into a factor variable, and let's save it into
6:56
students factor. Now in this specific case we didn't define any levels beforehand and
7:05
now, although it has been converted into a factor variable, the order is not what
7:13
we would have wanted. So let me arrange this new data frame based
7:18
on education. Let me show it to you, and you can see that in this case it
7:25
has been ordered by level, but usually it wouldn't be. That is because we took the data from the students data
7:32
frame, where education was already a factor. If we were to take it from the original data source and do it once again, you would
7:40
see that the order is not according to the level of education but it is according to
7:47
the alphabetical order. Right. So how do we define factors using forcats, and what
7:55
is the advantage of using forcats? So let me change this factor. Let me change
8:02
it, I mean, let's have a wrong spelling in the definition of these factor levels,
8:11
and let me execute this again and execute this part of the command once again. Now what
8:18
happens is, because the correctly spelled secondary is not in the levels
8:25
that we have defined, R drops that value. We had
8:33
secondary in our data frame, but because it is not in the levels that we have defined,
8:40
what R does is replace it with a missing value. This NA stands for the missing value,
8:46
and it didn't give us any warning, and that is something that you wouldn't want to happen:
8:53
if you have defined the levels in a wrong way, then you would want R to give you some
8:59
error. Now let's see what happens if we use the forcats package. So let's take
9:06
this data and then mutate this education variable, now using the fct() function, which
9:15
comes from the forcats package: take education and assign factor levels to it based on the
9:21
levels coming from the education level that we have defined over here. They are still
9:25
wrong, but let's save it into a new data frame and let me show you this data frame. Now I
9:34
execute. OK. This data frame has not been defined. It has not been created. This is
9:40
the old data frame that I created, and fct() gave me an error, and the error is saying that the
9:45
missing level is secondary. Now let me correct this and re-execute these
9:55
commands. Now, when executed, the new data frame
10:04
has been created and it looks perfectly fine. So this is the main advantage: when you
10:10
define some levels in a wrong manner, fct() throws an error.
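The three behaviours described above can be compared side by side in a small sketch: alphabetical levels from as.factor(), a silent NA from factor(), and an error from fct(). The vector values and the particular misspelling ("Secondry") are hypothetical; fct() is available in forcats 1.0.0 and later.

```r
library(forcats)

# Hypothetical education values standing in for the column in the video
education <- c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS")

# as.factor() on a character vector sorts the levels alphabetically
alphabetical_levels <- levels(as.factor(education))
alphabetical_levels

# Deliberately misspelled level; the exact typo in the video is not known
bad_levels <- c("Primary", "Secondry", "BS", "MS")

# Base R: the value with no matching level silently becomes NA
base_result <- factor(education, levels = bad_levels)
sum(is.na(base_result))  # 1

# forcats: fct() raises an error when a value is missing from the levels.
# tryCatch() is used here only so that the script keeps running.
fct_result <- tryCatch(
  fct(education, levels = bad_levels),
  error = function(e) conditionMessage(e)
)
fct_result
```

With the spelling corrected, fct(education, levels = c("Primary", "Secondary", "BS", "MS")) returns a factor with the levels in the given order.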
10:18
Let's see how we drop an unused level. Let's say we take this students
10:23
data frame and we just want to filter the individuals who have the MS level of education.
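A minimal sketch of this filter, together with a check of the levels before and after fct_drop(), could look like this. The data values are hypothetical, and the data frame name masters is an assumption based on the narration.

```r
library(dplyr)
library(forcats)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# Keep only the MS rows; the factor still carries all four levels
masters <- students %>% filter(education == "MS")
levels_before <- levels(masters$education)
levels_before

# fct_drop() removes every level that no longer occurs in the data
masters <- masters %>% mutate(education = fct_drop(education))
levels(masters$education)  # "MS"
```

There is no need to name the levels to drop: fct_drop() works out which ones are unused.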
10:29
So let's filter this and save it into this masters data frame. Now this masters data frame contains
10:36
just two individuals. But if we look at the structure of this masters data frame, we
10:44
still have four levels, right? All the primary, secondary, etc. levels are there, and if we look at the
10:49
levels of this masters data frame in particular, we do still have these levels, because we had
10:55
defined these levels in the students data frame, and although we have filtered it, these
11:01
levels just stick with this data frame and that gets messy because when we look at the
11:07
levels using this levels function we might get a wrong impression that there are four
11:13
levels in use in this specific variable. So what we want to do is we want
11:18
to drop all the unused levels. We don't need to know which ones are unused; we just want to drop all the
11:24
unused factor levels. So we take this masters data frame and then mutate this education variable,
11:31
and we use the fct_drop() function to drop any unused factor levels from
11:39
this data frame. And if I show you the levels now, we just have MS. So let's move
11:47
forward. We want to count the number of observations in each category. We use the count()
11:53
function: take the data frame, count the education column, and sort
12:00
it on the basis of the frequency. So we have three individuals from BS level of education
12:08
two from master's level of education, and one each from primary and secondary. So let's
12:14
move to another part of this video, which is how we modify the order of the levels of this factor
12:21
variable. So there are different ways of modifying the order. First, we can assign the order manually.
12:28
We can change the order manually. I mean we are not creating the order; the
12:33
order was already created in the students data frame and we want to change it. So we can change
12:37
it manually or we can change it based on the frequency of occurrence. So we want the highest
12:44
frequency to come first and the lowest frequency last. Or we can change
12:49
it on the basis of the value of another variable. So let's start with the manual reordering.
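Manual reordering with fct_relevel() might look like the sketch below. The data values and the name students_manual are hypothetical; the target order (MS, BS, Primary, Secondary) is the one used in the video.

```r
library(dplyr)
library(forcats)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# List the levels in the order they should appear
students_manual <- students %>%
  mutate(education = fct_relevel(education, "MS", "BS", "Primary", "Secondary"))

levels(students_manual$education)  # "MS" "BS" "Primary" "Secondary"

# Sorting now follows the manually assigned order
students_manual %>% arrange(education)
```

Levels that are not listed in the call keep their existing relative order after the listed ones, so naming all four is only needed when the whole order changes.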
12:56
We take the students data frame mutate the education variable and now we are going to
13:02
use fct_relevel(), which rearranges the levels on the basis of the order that we give
13:11
it. So we take the education column and assign this order. So what we want is we want the
13:18
MS to appear first, followed by the BS degree. In this case, for the college level
13:25
degrees we want it to be in descending order but in the non-college level degrees we want
13:30
it, for some reason, to be in ascending order. And let's
13:36
arrange it so I can show you the data frame. You can see that now we have MS BS and then
13:44
we have primary and secondary the exact same order that we have assigned. Now this was
13:50
manual reordering. Now let's say we want to reorder on the basis
13:56
of the frequency of occurrence of those levels. So let's create a plot using ggplot.
14:03
We take the students data frame on the x axis we want the education variable and we
14:09
want a bar chart. If I create the bar chart, there is no meaningful order in it. Although the number
14:16
of individuals in our data set is highest for the BS level of education, we
14:24
would want the chart to give us more meaning. We want this visualization to be in some order.
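One way to give the bars such an order is sketched below, using the frequency count mentioned earlier together with fct_infreq() and fct_rev() from forcats. The data values and object names are hypothetical, and the plots are only constructed here, not customized.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# Frequency table, most common level first
freq <- students %>% count(education, sort = TRUE)
freq

# Reorder only while plotting: bars from most to least frequent
p <- ggplot(students, aes(x = fct_infreq(education))) +
  geom_bar()

# fct_rev() flips that order: least frequent first
p_rev <- ggplot(students, aes(x = fct_rev(fct_infreq(education)))) +
  geom_bar()

# Or store the frequency-based order in the data frame itself
students_freq <- students %>% mutate(education = fct_infreq(education))
levels(students_freq$education)
```

Printing p or p_rev draws the chart. Wrapping the variable inside aes() leaves the data frame untouched, while the mutate() version changes the stored levels.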
14:30
So one thing that we can do is, before making this bar chart, we can use fct_infreq(),
14:37
which orders this education variable on the basis of the frequency of
14:44
occurrence. And this is what it looks like. Now if you want to reverse the order,
14:49
we can use fct_rev(), and that reverses the order. Now in this case we didn't
14:57
change the order within the data frame rather we change the order while we are plotting
15:02
the data but we can change it within the data frame. So let's take the students data mutate
15:09
the education variable, use fct_infreq(), and change
15:15
the levels on the basis of the frequency. If I store that and check the levels now,
15:23
we can see that the highest number of individuals are from BS then MS then the primary and secondary
15:30
have one individual each. So that's how we order
15:39
the levels on the basis of the frequency of occurrence of those categories. If you want
15:45
to order on the basis of values coming from some other variable let's say we have this
15:50
marks variable. Let's take the students data, group it by education, and summarize
15:57
the marks: we want to calculate the mean marks for each level of education. So we
16:04
take all the students and, let's say, we have created this summarized data,
16:11
so we have the levels of education and their mean marks. Now we want to plot
16:16
this. We want to create a scatter plot where we take the summarized students data: on the x axis
16:25
we want the marks, on the y axis we want the level of education. Let's create this one. We have
16:32
this scatter plot, but we cannot easily make sense of it, right? We would
16:38
want to reorder these levels on the basis of the mean value. So let's use the
16:49
same command, but now I have used fct_reorder() on the education variable
16:55
on the basis of marks. If I execute this, now you can see it is giving us a meaningful order.
17:03
If we want, we can have the reverse order by just adding this minus sign.
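The reordering by another variable described above can be sketched with fct_reorder(). The data values and the name students_summary are hypothetical, and placing the minus sign in front of marks is an assumption about where it appears in the video.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# Mean marks for each level of education
students_summary <- students %>%
  group_by(education) %>%
  summarise(marks = mean(marks))

# Order the levels by mean marks (ascending) while plotting
p <- ggplot(students_summary, aes(x = marks, y = fct_reorder(education, marks))) +
  geom_point()

# A minus sign on the ordering variable reverses the order
p_desc <- ggplot(students_summary, aes(x = marks, y = fct_reorder(education, -marks))) +
  geom_point()

# The same orders, inspected outside the plot
asc_levels <- levels(fct_reorder(students_summary$education, students_summary$marks))
desc_levels <- levels(fct_reorder(students_summary$education, -students_summary$marks))
asc_levels
desc_levels
```

fct_reorder() also has a .desc argument that could be used instead of the minus sign.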
17:09
Obviously, as we did with fct_infreq(), we can store this order of the levels within
17:18
our data frame. Let's move forward. So far we have been modifying the order, but
17:24
what if we want to modify the factor levels themselves, to recode them? There are two functions
17:32
that we are going to look into. The first one is fct_recode() and the second one is
17:37
fct_collapse(). What fct_recode() does is recode the levels
17:45
of the categorical variable. So let's take the students data frame mutate create this
17:50
education variable, which exists already, so this modifies its levels. We use fct_recode() on
17:58
the level of education, that is, on the education column. This is
18:05
the new label that we want to assign, and this is the old label. So
18:09
what we want is to convert all the primary and secondary into non-college
18:15
education levels and we want BS and MS to be college. So then we would have just two
18:20
categories and let's save this into this new data frame. So let me show you this data frame
18:28
instead of BS, MS and so on, we now have college and non-college levels of education.
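The recoding just described can be sketched as below. In fct_recode() each argument is written as new label = old label. The data values, the name students_recoded, and the exact label spellings ("College", "Non-college") are assumptions.

```r
library(dplyr)
library(forcats)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# new label = old label; repeating a new label merges the old levels
students_recoded <- students %>%
  mutate(education = fct_recode(education,
    "Non-college" = "Primary",
    "Non-college" = "Secondary",
    "College" = "BS",
    "College" = "MS"
  ))

levels(students_recoded$education)
table(students_recoded$education)
```

Giving each old level a distinct new label instead would rename the levels without merging any of them.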
18:36
What if we wanted to collapse levels? In this case we are actually collapsing, but
18:42
fct_recode() can be used for other relabeling as well; it doesn't mean that
18:47
we just have to collapse them. It could have been, say, P, and then we could have given the next one
18:54
S, and this could have been B and this could have been M. In that case we wouldn't
19:02
be collapsing them. So let me recode it. Let me show you the data frame,
19:10
and you can see that we have recoded the labels. But what if we wanted to
19:16
collapse levels? For that the better function is fct_collapse(). What we do is take the
19:21
education variable. This would be the new category, and it would be assigned
19:27
to these multiple old categories. This again would be a new category assigned to
19:32
these old multiple categories. So in essence what we are doing is we are collapsing the
19:40
levels and also creating new categories. So this is how it would
19:50
look like. So I hope this video was useful. Do subscribe to this channel. Do hit the bell
19:55
icon and thanks for watching this video.
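For reference, the fct_collapse() step from the last part of the video can be sketched as follows. Each new category is assigned a vector of the old levels it replaces. The data values, the name students_collapsed, and the label spellings are assumptions.

```r
library(dplyr)
library(forcats)

# Hypothetical students data with education already defined as a factor
students <- data.frame(
  name = c("Ali", "Sara", "John", "Maria", "Ahmed", "Lina", "Omar"),
  education = factor(
    c("BS", "MS", "Primary", "BS", "Secondary", "MS", "BS"),
    levels = c("Primary", "Secondary", "BS", "MS")
  ),
  marks = c(70, 88, 82, 75, 65, 92, 80)
)

# new category = vector of old levels to merge into it
students_collapsed <- students %>%
  mutate(education = fct_collapse(education,
    "Non-college" = c("Primary", "Secondary"),
    "College" = c("BS", "MS")
  ))

levels(students_collapsed$education)
table(students_collapsed$education)
```

Compared with fct_recode(), each new label is written once, which keeps the call shorter when many old levels map to the same category.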
#Computer Science
#Mathematics
#Statistics


