0:00
Welcome to the data how YouTube channel. In this video we are going to work with factor
0:04
variables: how do we define factor levels and recode the factor levels in R. What
0:12
we are going to discuss is how we define factors, how we drop unused factor levels, and how we change
0:18
the order of the levels or change the levels themselves. So let us get started. First,
0:24
let's look at the package that we are going to use. We are going to use the forcats package,
0:28
which is part of the tidyverse. Whether you load the forcats package or the
0:34
tidyverse package, forcats gets loaded either way. But usually I make a practice of loading
0:40
the tidyverse so that I can load all the packages that come with the tidyverse, like ggplot2,
0:45
readr, stringr, etc. You need to install these libraries. I have
0:53
already installed them, so I am not going to install them. I am just going to load the tidyverse
0:58
library. Once that is loaded, you can see that it loads dplyr, forcats, ggplot2,
1:05
lubridate, etc. for working with different types of data. So let's move on, and let me first
1:13
show you how the functions in the forcats package are named. It is like the stringr package
1:21
that we discussed in our previous videos, where
1:26
the functions start with str_, but in forcats most of the functions start
1:34
with fct_. That way, if we know that a specific function is from the forcats package
1:41
and we cannot remember its name, we can simply type fct_ and search for that function
1:48
within the list. OK, so let's move forward and let's define our data frame.
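A minimal sketch of the search described above: once forcats is attached, base R's ls() with a pattern lists the exported functions that carry the fct_ prefix. The variable name fct_functions is only illustrative and does not come from the video.

```r
library(forcats)  # also attached by library(tidyverse)

# List the exported forcats functions whose names start with "fct_"
fct_functions <- ls("package:forcats", pattern = "^fct_")
head(fct_functions)

# Check that a few of the functions covered in this video are in the list
c("fct_drop", "fct_relevel", "fct_infreq", "fct_recode") %in% fct_functions
```

In an interactive session, typing fct_ and pressing Tab gives the same list through autocompletion.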
1:55
Let's say we have this data that contains the names of different individuals, their level of education,
2:02
and the marks that they have obtained, or rather the
2:07
percentage of marks they obtained in their last degree. So let me load this data
2:13
and let me show you what this data looks like. We have names, education (the level of education),
2:20
and their marks in the last degree. If I check the structure of this data, we can see that
2:27
marks is a numerical variable; its data type is num. Name
2:35
is character. But education is basically a categorical variable, and
2:40
it is by default stored as a string, a character variable. The issue with this
2:49
is that if you are working with categories and there is some order in them, then a character column
2:55
might not give us the meaning that we would want to derive from
3:00
it. So let me arrange this data: let's take this data and sort it based on education.
3:07
If I show you this data again, it is sorted on the basis of education. But shouldn't
3:12
primary be first, and then the secondary level of education, and then
3:18
BS and MS? What it did is sort it alphabetically. B comes
3:25
before M, and so on and so forth. So it sorted the column alphabetically,
3:32
as it would do with any character data. Now what we need to do is define the
3:37
factor variable. First I'm going to discuss how we define a factor variable using base
3:43
R, not the forcats package, and what the issue is with the base R definition
3:51
of factor variables. Then I'm going to look into the advantage of forcats and why
3:57
defining categorical variables using forcats is more advantageous.
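The alphabetical-sorting problem described above can be reproduced with a small sketch. The video does not show the actual values, so the names and marks here are invented; only the four education categories and their counts (three BS, two MS, one primary, one secondary) follow the transcript.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)

str(data)  # education is stored as chr, not as a factor

# Sorting a character column is alphabetical, so BS and MS
# end up before Primary and Secondary
sorted <- data %>% arrange(education)
sorted$education
# "BS" "BS" "BS" "MS" "MS" "Primary" "Secondary"
```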
4:03
So let's start with this. We know that there are four categories, but if we wanted
4:09
to look at the unique values of those categories, we can use the unique function,
4:14
specify the data frame and the specific column, and we get these levels. Now let's first
4:21
define the education levels in a character vector. We are defining them in this order, and this will
4:28
become the order of the categorical variable once we assign this
4:33
education level to our categorical variable. So let's define this over here. We have just
4:39
defined the levels. Now what we are going to do is assign these
4:44
levels to our education variable. We take the data data frame and then we mutate a variable,
4:53
education, which is already there, so this modifies that variable. We want it to be a factor
4:59
variable that takes its values from the education column itself, and its levels
5:05
come from the education levels that we have just defined. We want it to be
5:11
saved into a new data frame called students. If I execute this and show you the data,
5:17
nothing seems to happen; everything looks similar. But let's look at the structure of this data frame:
5:22
you can now see that education has been defined as a categorical, or factor, variable,
5:28
and if you want to look at the levels of this education variable, we can see that these
5:33
are the levels. Now, each category,
5:39
each of these levels, has a numerical value, as you can see over here.
5:45
These are the numerical codes that are assigned to these categories. If
5:50
you want to look at these numerical values, we can display them.
5:56
The first three are bachelors, so these have a value of 3; then we have MS, which has
6:02
a value of 4, and so on and so forth. You can see that primary has a value of 1 and
6:08
secondary has a value of 2. So now there is an order, and if we arrange this data frame
6:14
on the basis of education, we can see that the students data is now arranged
6:22
based on the level of education and not based on alphabetical order.
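The base R steps above can be sketched as follows. The data values are invented, the vector name education_level is a guess at what the speaker calls "education level", and as.integer() is an assumption about how the numeric codes are displayed in the video.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)

unique(data$education)  # the four categories present in the column

# The order of this vector becomes the order of the factor levels
education_level <- c("Primary", "Secondary", "BS", "MS")

students <- data %>%
  mutate(education = factor(education, levels = education_level))

levels(students$education)      # "Primary" "Secondary" "BS" "MS"
as.integer(students$education)  # 3 3 3 4 1 4 2

# arrange() now follows the level order instead of the alphabet
arranged <- students %>% arrange(education)
arranged$education
```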
6:29
Now, this was when we had some order: we defined the levels first
6:35
and then we assigned them. We could have simply used the as.factor function. Let's take
6:41
the students data, mutate, make a new column (which we already have, so this modifies it),
6:48
and using as.factor convert this variable into a factor variable, and let's save it into
6:56
students factor. Now, in this specific case we didn't define any levels beforehand, and
7:05
although the variable has been converted into a factor, the order would not be what
7:13
we would have wanted. So let me arrange this new data frame based
7:18
on education. Let me show it to you. You can see that in this case the order
7:25
has been kept, but usually it wouldn't be. It was kept only because we took the data from the students data
7:32
frame, where education was already a factor. If we were to take it from the main data source and do it once again, you would
7:40
see that the order is not according to the level of education but according to
7:47
alphabetical order. Right. So how do we define factors using forcats, and what
7:55
is the advantage of using forcats? Let me change this factor. Let me change
8:02
the, I mean, let's put a wrong spelling in the definition of these factor levels,
8:11
and let me execute this again, and execute this part of the command once again. Now what
8:18
happens is that, because we do not have secondary in the levels that we have defined,
8:25
R omits it. We had
8:33
secondary in our data frame, but because it is not in the levels that we have defined,
8:40
what R does is replace it with a missing value. This NA stands for the missing value,
8:46
and R didn't give us any warning. That is something that you wouldn't want to happen:
8:53
if you have defined a level in a wrong way, then you would want R to give you an
8:59
error. Now let's see what happens if we use the forcats package. So let's take
9:06
this data and then mutate this education variable, now using the fct function, which
9:15
comes from the forcats package: take the education column and turn it into a factor based on the
9:21
levels coming from the education level that we have defined over here. They are still
9:25
wrong, but let's save it into a new data frame, and let me show you this data frame. Now I
9:34
execute. OK, this data frame has not been defined; it has not been created. This is
9:40
the old data frame that I created. It gave me an error, and the error is saying that the
9:45
missing level is secondary. Now let me correct this and re-execute these
9:55
commands. Now, when executed, the new data frame
10:04
has been created, and it looks perfectly fine. So this is the main advantage: when you
10:10
define some levels in a wrong manner, this throws an error.
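A sketch contrasting the three behaviours discussed above, under stated assumptions: the data values are invented, the misspelling "Secondry" is only an example of a wrong level, and fct() requires forcats 1.0.0 or later. tryCatch() is used here only so the script keeps running after the error.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)

# as.factor() on raw character data: levels come out alphabetically
alphabetical <- levels(as.factor(data$education))
alphabetical  # "BS" "MS" "Primary" "Secondary"

# Levels with a deliberate misspelling of "Secondary"
wrong_levels <- c("Primary", "Secondry", "BS", "MS")

# Base R: the unmatched value silently becomes NA, with no warning
base_result <- factor(data$education, levels = wrong_levels)
sum(is.na(base_result))  # 1

# forcats: fct() raises an error instead; capture its message
fct_result <- tryCatch(
  fct(data$education, levels = wrong_levels),
  error = function(e) conditionMessage(e)
)
fct_result  # error text naming the missing level
```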
10:18
Now let's see how we drop an unused level. Let's say we take this students
10:23
data frame and we just want to filter the individuals who have the MS level of education.
10:29
So let's filter this and save it into this master's data frame. Now this master's data frame contains
10:36
just two individuals. But if we look at the structure of this master's data frame, we
10:44
still have four levels. All the primary, secondary, etc. levels are there, and if we look at the
10:49
levels of this master's data frame in particular, we do still have these levels, because we
10:55
defined these levels in the students data frame. Although we have filtered it, these
11:01
levels just stick with this data frame. That gets messy, because when we look at the
11:07
levels using the levels function, we might get the wrong impression that there are four
11:13
categories present in this specific variable. So what we want to do is
11:18
drop all the unused levels. We may not know which ones are unused, but we want to drop all the
11:24
unused levels. So we take this master's data frame and then mutate this education variable,
11:31
and we use the fct_drop function to drop any unused factor levels from
11:39
this data frame. And if I show you the levels now, we just have MS.
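The filter-then-drop step can be sketched like this. The data values are invented, and the data frame name masters stands in for what the speaker calls the "master's data frame".

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# Keep only the MS rows; the factor still remembers all four levels
masters <- students %>% filter(education == "MS")
levels_before <- levels(masters$education)
levels_before  # "Primary" "Secondary" "BS" "MS"

# fct_drop() removes every level that no longer occurs in the data
masters <- masters %>% mutate(education = fct_drop(education))
levels(masters$education)  # "MS"
```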
11:47
Let's move forward. We want to count the number of observations in each category. We use the count
11:53
function: take the data frame, count the education column, and sort
12:00
it on the basis of the frequency. So we have three individuals from the BS level of education,
12:08
two from the master's level of education, and one each from primary and secondary.
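The counting step might look like the sketch below, using dplyr's count() with sort = TRUE. The data values are invented so that the counts match the ones read out in the video.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# One row per education level, most frequent level first
counts <- students %>% count(education, sort = TRUE)
counts
```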
12:14
Let's move to another part of this video, which is how we modify the order of this factor
12:21
variable. There are different ways of modifying the order. First, we can assign the order manually.
12:28
We can change the order manually. I mean, we are not creating the order; the
12:33
order was already created in the students data frame and we want to change it. So we can change
12:37
it manually, or we can change it based on the frequency of occurrence, so that the highest
12:44
frequency comes first, then the next highest, and so on and so forth. Or we can change
12:49
it on the basis of the value of another variable. So let's start with the manual reordering.
12:56
We take the students data frame, mutate the education variable, and now we are going to
13:02
use fct_relevel, which reorders the levels on the basis of the order that we assign
13:11
to it. So we take the education column and assign this order. What we want is
13:18
MS to appear first, followed by the BS degree. So for the college-level
13:25
degrees we want descending order, but for the non-college-level degrees,
13:30
for some reason, we want ascending order. And let's
13:36
arrange it so I can show you the data frame. You can see that now we have MS, BS, and then
13:44
primary and secondary, the exact same order that we assigned.
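A minimal sketch of the manual reordering with fct_relevel(), assuming the same invented data as before. The level spellings are assumptions about how the values appear in the speaker's data.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# Move the levels into a hand-picked order
releveled <- students %>%
  mutate(education = fct_relevel(education, "MS", "BS", "Primary", "Secondary"))

levels(releveled$education)  # "MS" "BS" "Primary" "Secondary"

# arrange() follows the new level order
releveled %>% arrange(education)
```

Listing only some levels, for example fct_relevel(education, "MS"), moves just those to the front and leaves the rest in their existing order.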
13:50
That was manual reordering. Let's see how to reorder on the basis
13:56
of the frequency of occurrence of those levels. So let's create a plot using ggplot.
14:03
We take the students data frame; on the x axis we want the education variable, and we
14:09
want a bar chart. If I create the bar chart, the bars are not in order of frequency. Although
14:16
most of the individuals in our data set are from the BS level of education, you know, we
14:24
would want the chart to give us more meaning. We want this visualization to be in some order.
14:30
So one thing that we can do is, before making this bar chart, use
14:37
fct_infreq, which orders this education variable on the basis of the frequency of
14:44
occurrence. And this is how it would look. Now if you wanted to reverse the order,
14:49
we can use fct_rev, and that reverses the order. Now, in this case we didn't
14:57
change the order within the data frame; rather, we changed the order while plotting
15:02
the data. But we can change it within the data frame. So let's take the students data, mutate
15:09
the education variable, use fct_infreq, and change
15:15
the levels on the basis of the frequency. If I store that and check the levels now,
15:23
we can see that the highest number of individuals are from BS, then MS, and then primary and secondary
15:30
have one each. So that's how we order
15:39
the levels on the basis of the frequency of occurrence of those categories.
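The frequency-based ordering above can be sketched as follows, both at plot time and inside the data frame. The data values and the object names p_freq, p_rev and students_freq are invented for illustration.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# Reorder only while plotting: most frequent level first
p_freq <- ggplot(students, aes(x = fct_infreq(education))) +
  geom_bar()

# fct_rev() flips that order: least frequent level first
p_rev <- ggplot(students, aes(x = fct_rev(fct_infreq(education)))) +
  geom_bar()

# Store the frequency order in the data frame itself
students_freq <- students %>%
  mutate(education = fct_infreq(education))
levels(students_freq$education)  # BS first, then MS

# print(p_freq) or print(p_rev) draws the charts
```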
15:45
If you want to order on the basis of values coming from some other variable, let's say we have this
15:50
marks variable. Let's take the students data, group it by education, and summarize
15:57
the marks: we want to calculate the mean marks for each level of education. So we
16:04
take all the students, and let's say we have created this students summarized data,
16:11
so we have the levels of education and their mean marks. Now we want to plot
16:16
this. We want to create a scatter plot where we take the students summarized data; on the x axis
16:25
we want the marks, and on the y axis we want the level of education. Let's create this one. We have
16:32
this scatter plot, but we cannot make much sense of it, right? We would
16:38
want to reorder these levels on the basis of the mean value. So let's use the
16:49
same command, but now what I have done is use fct_reorder on the education variable
16:55
on the basis of marks. If I execute this now, you can see it is giving us a meaning.
17:03
We can have the reverse order by just putting a minus sign along with it.
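A sketch of reordering by another variable with fct_reorder(). The marks are invented, so the resulting order is specific to this made-up data; students_summarized is a guess at the name used in the video, and placing the minus sign on marks is an assumption about where the speaker puts it.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# Mean marks per education level
students_summarized <- students %>%
  group_by(education) %>%
  summarise(marks = mean(marks))

# Scatter plot with the levels ordered by mean marks
p_marks <- ggplot(students_summarized,
                  aes(x = marks, y = fct_reorder(education, marks))) +
  geom_point()

# The same ordering stored as levels; -marks gives the reverse order
ascending  <- levels(fct_reorder(students_summarized$education,
                                 students_summarized$marks))
descending <- levels(fct_reorder(students_summarized$education,
                                 -students_summarized$marks))
ascending   # "Secondary" "Primary" "BS" "MS"
descending  # "MS" "BS" "Primary" "Secondary"
```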
17:09
And obviously, as we did with fct_infreq, we can build this order of levels into
17:18
our data frame. Let's move forward. In this case we were modifying the order, but
17:24
what if we want to modify the factor levels themselves, to recode them? There are two functions
17:32
that we are going to look into. The first one is fct_recode and the second one is
17:37
fct_collapse. What fct_recode does is recode the levels
17:45
of the categorical variable. So let's take the students data frame and mutate this
17:50
education variable, which exists already, so this modifies the levels: fct_recode
17:58
the levels of the education column. This is
18:05
the new label that we want to assign, and this is the old label.
18:09
What we want is to convert primary and secondary into a non-college
18:15
education level, and we want BS and MS to be college. Then we would have just two
18:20
categories. Let's save this into this new data frame. So let me show you this data frame:
18:28
instead of BS, MS and so on, we now have college and non-college levels of education.
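The recoding just described can be sketched with fct_recode(), where each argument reads new label = "old label". The data values, the label spellings "Non-college" and "College", and the name students_recoded are assumptions for illustration.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# New label on the left, old label on the right; repeating a new
# label merges the old levels into one
students_recoded <- students %>%
  mutate(education = fct_recode(education,
    "Non-college" = "Primary",
    "Non-college" = "Secondary",
    "College"     = "BS",
    "College"     = "MS"
  ))

levels(students_recoded$education)  # two levels remain
table(students_recoded$education)
```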
18:36
What if we wanted to collapse? In this case we are actually collapsing, but
18:42
fct_recode can be used on individual categories; it doesn't mean that
18:47
we have to collapse them. This could have been P, and then we could have given this one
18:54
S, and this could have been B and this could have been M. In that case we wouldn't
19:02
be collapsing them. So let me recode it. Let me show you the data frame,
19:10
and you can see that we have recoded the labels.
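A sketch of the one-to-one renaming mentioned above, where fct_recode() changes labels without merging anything. The data values and the name students_short are invented; the short labels P, S, B and M follow the transcript.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# Each old level gets its own new label, so nothing is collapsed
students_short <- students %>%
  mutate(education = fct_recode(education,
    P = "Primary", S = "Secondary", B = "BS", M = "MS"
  ))

levels(students_short$education)  # "P" "S" "B" "M"
```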
19:16
But what if we wanted to collapse? For that, the better function is fct_collapse. What we do is take the
19:21
education variable. This is the new category, and it is assigned
19:27
to these multiple old categories. This again is a new category, assigned to
19:32
these multiple old categories. So in essence what we are doing is collapsing the
19:40
existing categories and also creating new categories. This is how it would
19:50
look. So I hope this video was useful. Do subscribe to this channel. Do hit the bell
19:55
icon, and thanks for watching this video.
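For reference, the fct_collapse() step from the last section can be sketched as below: each new category is given a vector of the old levels it should absorb. The data values, the label spellings and the name students_collapsed are assumptions for illustration.

```r
library(tidyverse)

# Hypothetical data: names and marks are invented for illustration
data <- tibble(
  name      = c("Ali", "Sara", "Omar", "Hina", "Bilal", "Zara", "Imran"),
  education = c("BS", "BS", "BS", "MS", "Primary", "MS", "Secondary"),
  marks     = c(72, 65, 80, 85, 68, 78, 60)
)
students <- data %>%
  mutate(education = factor(education,
                            levels = c("Primary", "Secondary", "BS", "MS")))

# New category on the left, the old levels it replaces on the right
students_collapsed <- students %>%
  mutate(education = fct_collapse(education,
    "Non-college" = c("Primary", "Secondary"),
    "College"     = c("BS", "MS")
  ))

levels(students_collapsed$education)  # two levels remain
table(students_collapsed$education)
```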