Count Number of Observations by Group (Category) in R
5K views
May 16, 2024
In this video we discuss how to count the number of observations in a data frame. We also discuss how to count the observations in each category of a categorical variable, or the observations that fulfill a condition. Lastly, we discuss how to create (mutate) a variable in a data frame containing the number of observations. Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel.
0:02
In this video we are going to talk about how to count the number of observations in R.
0:09
There are a few things we are going to discuss. First, how do we count the number of observations in a data frame?
0:17
Second, how many observations do we have for each category of a categorical variable?
0:24
And third, how do we make a tabulation, or frequency table, out of a categorical variable?
0:33
So let's load the tidyverse package. If you haven't installed it, you need to install it first using the install.packages
0:44
function: write the name of the package within the parentheses and that will
0:50
install it. Remember to put the package name in quotation marks. I'm not going to install the package, I'm just going to load it using the library function,
1:04
because I need the diamonds data set for this tutorial.
1:08
So I'm going to load the diamonds data set using the data function, and now we have the data
1:15
set over here. This data contains 53,940 observations and 10 variables.
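The setup described so far might be typed roughly as follows. This is a sketch, not the exact code from the video; it assumes the tidyverse is already installed (the install line is commented out) and relies on the fact that `diamonds` ships with ggplot2, which `library(tidyverse)` attaches.

```r
# install.packages("tidyverse")  # run once if needed; the package name goes in quotes
library(tidyverse)               # attaches dplyr, ggplot2 and the other core packages

data(diamonds)                   # load the diamonds data set into the environment
head(diamonds)                   # peek at the first few rows
```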
1:23
It contains different characteristics of diamonds. We have carat, then cut, which says whether a diamond is Ideal, Premium, Good and so on, so it is categorical. Then again
1:33
we have color, which is a categorical variable. We also have clarity, depth, price, etc.
1:41
So if you wanted to know the number of rows this data frame contains, you would
1:49
use the nrow function and specify the data frame. We can also check the number of columns with ncol, or we can use the dim function to get
2:00
both the number of rows and the number of columns. Now this is not what we are interested in in this video, because we can already see the
2:08
number of observations and the number of variables in our environment.
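As a quick illustration of the three base R functions just mentioned, here is a minimal sketch. It loads ggplot2 only to make the `diamonds` data available.

```r
library(ggplot2)  # provides the diamonds data set

nrow(diamonds)    # number of rows (observations): 53940
ncol(diamonds)    # number of columns (variables): 10
dim(diamonds)     # both at once: rows first, then columns
```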
2:15
What we are mainly interested in is the number of observations in each category.
2:20
For example, we have this color variable. We want to know how many diamonds we have in our data set for each category
2:33
of the color variable: how many observations have the color E, how many observations
2:40
have the color I, and so on. For that, if we are going to use base R, we would use the table function, and within
2:52
table we would specify the column that we want. So we are going to access the diamonds data and, using the dollar sign, we are going
3:02
to specify the column. This is how we specify a column within a data frame.
3:11
So we want the table function to give us the number of observations for each
3:19
category of the color variable. If I press Ctrl+Enter now, we get the number of observations.
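The base R call described here can be sketched like this. The object name `color_counts` is only for illustration and does not come from the video.

```r
library(ggplot2)  # provides the diamonds data set

# One-way frequency table: one count per category of color
color_counts <- table(diamonds$color)
color_counts

# Pull out a single category by name
color_counts[["E"]]  # 9797, the figure quoted later in the video
```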
3:25
So this is just for a single categorical variable. Now let's say we also want to
3:33
know the counts for each color and, let's just say, for each category of the cut variable
3:45
or the clarity variable. So we want to know how many observations there are for each combination of the two variables.
3:53
So now we pass two categorical variables to table. Let's use cut together with color, and that will give us a better understanding.
4:03
We have these five categories within the cut variable over here, right, and for each
4:13
cut category we want to know the number of observations for each color category.
4:19
So it is a two-dimensional tabulation. Previously we looked at a one-dimensional table, and this is a two-dimensional table.
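A two-dimensional table of this kind might look like the sketch below. The argument order (cut first, then color) is an assumption based on the description, and `cut_by_color` is a hypothetical name.

```r
library(ggplot2)  # provides the diamonds data set

# Two-way table: rows are cut categories, columns are color categories
cut_by_color <- table(diamonds$cut, diamonds$color)
cut_by_color

dim(cut_by_color)  # 5 cut categories by 7 color categories
```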
4:32
We can keep on increasing the dimensions. Let's add one more variable
4:39
and use clarity over here. Now, because we have three dimensions, what it does
4:47
is first give us the same two-dimensional table for the first
4:56
category of clarity, which is I1, then for the second category of clarity, and so on
5:02
and so forth. But anyway, I hope we do not need three levels of categorization.
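To make the three-dimensional case concrete, here is a hedged sketch. The name `three_way` is hypothetical, and the slicing line is an addition for illustration rather than something shown in the video.

```r
library(ggplot2)  # provides the diamonds data set

# Three-way table: printed as one cut-by-color table per clarity category
three_way <- table(diamonds$cut, diamonds$color, diamonds$clarity)

dim(three_way)        # 5 cuts, 7 colors, 8 clarity categories
three_way[, , "I1"]   # the two-dimensional slice for the first clarity category
```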
5:10
Let's use the tidyverse package, and if you are going to use the tidyverse package, do load it.
5:15
I have already loaded it, so I don't need to load it again,
5:19
but anyway I executed the command. Within the tidyverse there is a useful function called count (it comes from the dplyr package).
5:29
count performs the same task as table, but it is a somewhat more tidy way of working
5:36
with data, and that will become clear in a while. So we use the count function, and within count the first argument
5:48
is the data set that we want to use. After the comma we specify the categorical variable whose categories we want to
5:55
count. If I press Ctrl+Enter, you can see that we have the color categories and we
6:03
know the number of observations in each. You can also see there is an underline at the thousands
6:09
digit, right. Now these are not sorted by frequency, and within the table function we didn't have any argument
6:18
to sort by frequency. With count, we can set sort equal to TRUE; by default it is set equal to FALSE.
6:29
What it does is sort the result by the number of observations within each category, largest first.
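The two count calls described above can be sketched as follows. This version loads dplyr and ggplot2 directly (both are attached by `library(tidyverse)`), and the object names `by_color` and `by_freq` are only for illustration.

```r
library(dplyr)    # provides count()
library(ggplot2)  # provides the diamonds data set

# Default: one row per color, in the order of the categories
by_color <- count(diamonds, color)
by_color

# sort = TRUE: the same counts, largest first
by_freq <- count(diamonds, color, sort = TRUE)
by_freq
```

Unlike table, the result is a data frame (a tibble) with a column named `n`, so it can be filtered, joined, or plotted like any other data.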
6:34
So previously it wasn't sorted by the number of observations but rather by the
6:43
order of the category names. But now, using the sort argument, it is sorted by the number
6:54
of observations that we have within each category. Now if you want a two-dimensional table, you just add the
7:04
second variable after a comma. So now we want the number of observations for each color and, within each color, for each
7:13
clarity. If I press Ctrl+Enter, what I get now is what makes count different from the table
7:19
function. With the table function we would get a table, and tables are fine, but they are
7:28
not a tidy way of working with things. Once we move to the next section it will
7:34
become clear why count is a better way of working with tabulations.
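A sketch of the two-variable count, with `color_clarity` as a hypothetical object name. Instead of a wide cross-table, the result has one row per combination of color and clarity.

```r
library(dplyr)    # provides count()
library(ggplot2)  # provides the diamonds data set

# One row per (color, clarity) combination, with the count in column n
color_clarity <- count(diamonds, color, clarity)
head(color_clarity)
```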
7:40
So it tells us that within the D color we have the I1 clarity category, and there are
7:49
42 observations at the intersection of these two categories. If you want
7:57
to add more variables, it keeps adding columns, and you get the
8:03
number of observations for each combination. Now, one more reason for working the tidy way is this: let's say you want to add a column
8:15
to the data that holds the number of observations. What we can do is take the diamonds data and pipe it. This sign is the pipe operator: it passes
8:30
whatever is on the left side of the pipe operator to the function on the right side.
8:35
So we take the diamonds data and group it by color. This is the same thing as over here, but instead of just printing the result in the console, we
8:46
want to create a column. So we then pipe it into the mutate function, which creates
8:53
a new column. We want to name this new column n_color, and what it will
9:02
contain is the number of observations. When I execute this, we store the result back in our diamonds data,
9:10
and now we have this new column. We can see that within the E color category we have
9:18
9,797 observations, and that value repeats wherever we have the same category.
9:28
If you want to check whether that was done correctly, we can use the
9:33
count function again, and we can see that for the E category of color we had 9,797 observations.
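The group-and-mutate step could be written along these lines. This is a sketch under a few assumptions: the pipe shown is `%>%`, the count inside mutate is computed with dplyr's `n()`, and the trailing `ungroup()` is an addition here so that later steps are not silently grouped.

```r
library(dplyr)    # provides %>%, group_by(), mutate(), n(), count()
library(ggplot2)  # provides the diamonds data set

# Add a column holding the number of observations in each row's color category
diamonds <- diamonds %>%
  group_by(color) %>%
  mutate(n_color = n()) %>%   # n() returns the size of the current group
  ungroup()                   # assumption: drop the grouping afterwards

# Check against count(): the n column should match n_color for each color
count(diamonds, color)
```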
9:43
So we can add a column, and obviously you can also do this for each color and each cut.
9:51
Let's give the new column another name, and if we execute this we get the number of observations
10:02
within each color and cut combination. We can also count based on a condition.
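The per-color-and-cut column described just above might look like this. The column name `n_color_cut` and the object name `with_counts` are hypothetical, since the transcript does not say what name was used, and `n()` is again assumed as the counting function.

```r
library(dplyr)    # provides %>%, group_by(), mutate(), n()
library(ggplot2)  # provides the diamonds data set

# Group by both variables so n() counts rows per (color, cut) combination
with_counts <- diamonds %>%
  group_by(color, cut) %>%
  mutate(n_color_cut = n()) %>%
  ungroup()

head(with_counts)
```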
10:10
So let's take the diamonds data and pipe it into the filter function, again from the tidyverse,
10:18
and what we want is to count the rows where the cut is Ideal and the color is E. Then we
10:25
pipe this into the count function, and this gives us a single number, telling us
10:31
that within the E color category and Ideal cut we have 3,903 observations.
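Counting rows that meet a condition, as described here, can be sketched as below. The object name `ideal_e` is only for illustration; conditions separated by a comma inside filter are combined with a logical AND.

```r
library(dplyr)    # provides %>%, filter(), count()
library(ggplot2)  # provides the diamonds data set

# Keep only Ideal-cut diamonds of color E, then count the remaining rows
ideal_e <- diamonds %>%
  filter(cut == "Ideal", color == "E") %>%
  count()

ideal_e  # a one-row tibble whose n column holds the count (3903)
```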
10:39
So I hope this video was useful. If that is the case, please subscribe to this channel
10:45
and do hit the like button.