Delete Duplicate Observations in R
May 16, 2024
This video explains the following topics:
- Identify duplicate observations
- How to drop (delete) duplicate observations
- How to tag duplicate observations
- Generate a report of duplicate observations
Website: thedatahall.com
As an Amazon Associate, I earn from qualifying purchases.
0:00
Welcome to the Data Hall YouTube channel
0:02
In today's video, we are going to talk about how to identify and delete duplicate observations in R
0:10
So we are going to talk about: how do we identify duplicate observations? How do we delete or drop these duplicate observations and keep only a single unique value for them?
0:24
How do we tag duplicate observations? That is, we can create a column where we can
0:30
assign a value giving the number of duplicates that we have in our data set for that specific observation
0:38
And we can generate a report of duplicate observations. So let's get started. We have a sample
0:45
data frame over here. Let's say this data frame contains two columns, that is, ID and name
0:53
The ID column contains the registration number, and the name contains obviously the name assigned to that specific registration number
1:03
the name of the student. So we are creating this data frame. Okay, let's press control enter
1:10
And we have this students data frame over here. We have six entries, with their registration and name
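The exact code isn't visible in the transcript, but a data frame like the one described (six rows, with an exact duplicate of one row) might be created along these lines; the specific IDs and names here are assumptions:

    # Sample data: six students, with row 5 an exact duplicate of row 2
    # (the individual IDs and names are assumed for illustration)
    students <- data.frame(
      ID   = c(101, 102, 103, 104, 102, 105),
      name = c("Alice", "Bob", "Charlie", "David", "Bob", "Eve")
    )
    students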
1:17
And as we can see in row 2 and row 5, we have an exact
1:23
duplicate observation. In all other cases, neither the name nor the ID variable contains
1:31
any duplicate observations. So how do we delete, or first identify, these duplicate
1:37
observations? Let's start with the duplicated function. So what this duplicated function would do is it would identify the duplicate entries, right? So if we
1:52
use the function, pass the data frame into it, press Control+Enter, and it would give us a Boolean
2:00
vector that would either be FALSE or TRUE, depending on whether that specific row is a duplicate
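As a sketch of the call just described, using the assumed students data frame from above:

    # One logical value per row; TRUE marks a row that duplicates an earlier one
    duplicated(students)
    #> [1] FALSE FALSE FALSE FALSE  TRUE FALSE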
2:07
So we know that this Bob in the second-last row is a duplicate of the second row
2:16
So it gives us TRUE for that observation, for that specific row, because that is duplicate data in our data frame
2:26
But it isn't quite descriptive. We do not know which specific observation is a duplicate
2:35
We just know that there are certain duplicates in there if there is a true value
2:41
So we'll come back to that in a while. But before that, if we want to know how many duplicates we have,
2:49
we can wrap this duplicated function within the sum function, and that would give us the number of duplicates we have
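A sketch of that, with the assumed data above:

    # TRUE counts as 1, so summing the logical vector counts the duplicate rows
    sum(duplicated(students))
    #> [1] 1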
2:58
Similarly, if we wanted to know the exact location of that duplicate, so we can again wrap this
3:04
duplicated function within this which function and that would give us the row number, row index
3:12
at which that duplicate observation exists. Okay, so that is how we get the number and the location of duplicate observations
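Again as a sketch with the assumed data:

    # Row index (position) of the duplicate observation
    which(duplicated(students))
    #> [1] 5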
3:24
But what if we wanted to get the exact value? Till now we haven't identified it; I mean, looking at the console we cannot say which specific student,
3:38
which specific registration number is a duplicate. So what we can do is we can take this student's data frame
3:46
And within that, again, we can write this duplicated(students) call. And if you are aware of this:
3:55
So whenever we have a data frame, what we can do is, within the square brackets, the first value we specify represents the row number
4:08
So if we specify five over here, comma nothing, that would give us this whole row
4:15
And similarly, if we were to specify a one after the comma, that would give us that whole column
4:22
But the column is not what we are interested in; the duplicate row is what we are interested in
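In bracket-indexing terms, with the assumed students data frame:

    students[5, ]    # the whole fifth row: the row index goes before the comma
    students[, 1]    # the whole first column: the column index goes after the comma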
4:29
So one way of doing this would be to get the row number
4:34
and then put that row number inside the square brackets of this data frame, and then we can get the row that is a duplicate
4:44
But there is a fancier way of doing it. And that is, again within the square brackets, instead of writing 5,
4:52
what we could have done is replace that five with this duplicated function
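A sketch of that call:

    # Subset the data frame to just the rows flagged as duplicates
    students[duplicated(students), ]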
4:59
And that would give us the exact same result. If we were to check the unique observations
5:09
then instead of duplicated alone, we should add a prefix of not (the ! operator). And that would give us those rows
5:22
that are not duplicates, right? So what we have done is we have basically deleted
5:29
that duplicate row, and if we were to assign this to a data frame, we can create a new data frame
5:40
that would now have five observations that are unique.
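A sketch of keeping the non-duplicate rows and assigning the result (the new data frame name here is an assumption):

    # Keep the rows that are NOT duplicates of an earlier row
    unique_students <- students[!duplicated(students), ]
    unique_students    # five unique observations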
5:48
But there is an easier way of deleting these duplicate observations, and that is what we are going to discuss now. We can use the unique function
5:54
So unique(students), with the name of the data frame inside. And what this would do is it would just get us the unique observations
6:04
The same thing as we did with the duplicated and not duplicated syntax, but now it is quite
6:12
easy to work with. The result is exactly the same for both of these ways of writing it
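As a one-line sketch:

    # Same result as students[!duplicated(students), ], in one short call
    unique(students)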
6:20
So then there is one more function, which is called distinct, that comes from the tidyverse package
6:29
So if you haven't installed the tidyverse package, you would first have to install it
6:33
I have already installed it. So I would just load that package into my environment using library and the name of the package
6:43
Now, for the distinct function, the syntax is again the same
6:52
We write the name of the function and specify the data frame, and it gives us the unique observations. This is exactly the same as the unique function
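For reference, loading the package and making the call described (install.packages only needs to run once):

    # install.packages("tidyverse")   # run once if the package is not installed
    library(tidyverse)

    # distinct() from dplyr: same result as unique() on the whole data frame
    distinct(students)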
7:05
But the interesting thing with this distinct function is that, let's just say if we were to assign a specific column
7:16
and we wanted it to search for duplicate observations in that specific column or columns,
7:25
then unique or duplicated would not work so easily. Even if we were to make them work,
7:34
it would be quite a lot of code, right? So let me explain this using this new dataset
7:44
Let's call it all_students. And what it contains is: we have this program column over here
7:53
So either a student is from a bachelor's program or a master's program. And within each program, we can have the same registration numbers
8:02
But this Alice is unique, does have a unique identifier, in the sense that she has this unique
8:13
ID within BS program. But if you were to look in this whole data set, then this ID is not unique because
8:21
David and Alice have the same ID. So what I mean to say is that now we have two columns that jointly identify a single observation
8:33
So in terms of a database, we have a compound primary key, right
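The exact values aren't fully shown in the transcript; a data frame matching the description (IDs repeat across programs, and one ID is repeated within the BS program) might look like this, with the specific names and IDs assumed:

    # program + ID together form the compound key
    all_students <- data.frame(
      program = c("BS", "BS", "BS", "MS", "MS", "MS"),
      ID      = c(101, 102, 102, 101, 103, 104),
      name    = c("Alice", "Bob", "Ben", "David", "Eve", "Frank")
    )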
8:42
Okay. So if you were to use this distinct function, and if you were to assign
8:55
the all_students data frame to it: it gives us the unique values; rather, it gives us all the rows that it has
9:08
But although we know that this specific registration is repeated twice within the BS program,
9:17
distinct looks across all the columns, and in that sense
9:23
there isn't a row where the whole row is a duplicate, right? What if we wanted it to just look into the program column
9:35
and identify the unique observations in it? And it did. We just have two unique observations in the program column
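A sketch of that call:

    # De-duplicate on program only; the other columns are dropped
    distinct(all_students, program)
    # -> two rows: BS and MS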
9:47
Now, it does drop all the other columns, right? What if we wanted it to have all the columns
9:53
then we would use the .keep_all = TRUE parameter. And then it would give us all the columns
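As a sketch:

    # Keep every column while de-duplicating on program
    distinct(all_students, program, .keep_all = TRUE)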
10:04
Now, what it had done is it had looked at the program column
10:08
and just kept the first observation from the BS program. But we know that program does not uniquely identify an observation
10:20
Rather, program and ID jointly identify a student. So now we get these five observations, but we had six observations in the data. And the reason for it to drop one observation is because this one ID is repeated twice
10:44
And now in terms of program and ID, these two are duplicate
10:51
So it would just keep a single observation, a single row. Right
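A sketch of that call:

    # De-duplicate on the compound key program + ID, keeping the other columns
    distinct(all_students, program, ID, .keep_all = TRUE)
    # Six rows in, five rows out: the ID repeated within BS is kept only once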
10:58
So one way of writing this is instead of having this data frame
11:04
and obviously this goes for the whole tidyverse writing style, instead of this data frame being inside the function call,
11:18
what we can do is we can get it out and pipe that into the distinct function
11:23
and that would give us the exact same results, as in the sketch below. Okay, let's move forward
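The same call in pipe style:

    # Pipe the data frame into distinct() instead of passing it as the first argument
    all_students %>%
      distinct(program, ID, .keep_all = TRUE)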
11:30
How do we report duplicate observations? Let me start again with the students data,
11:37
and I'm going to use this code where I'm going to filter
11:41
and then group them by ID and name, and then summarize the number
11:47
of counts, right? So it gives me that, you know, this
11:55
specific row, that is, this ID and Bob, has two observations
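The code isn't fully readable in the video; one common way to build the report described (group by ID and name, count, and keep only the groups that occur more than once) is:

    # Report of duplicate observations: one line per duplicated ID/name pair
    students %>%
      group_by(ID, name) %>%
      summarise(n = n(), .groups = "drop") %>%
      filter(n > 1)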
12:05
And obviously we just want the number of values for each duplicate
12:11
So let me just change this data and it might make more sense
12:18
right? If we have, let's just say, this last one is also Bob, and we have,
12:32
let's say, this Alice, so now we have two entries for Alice and three entries for Bob. And now how would our
12:42
output look if we were to run it again? It tells us that there are two entries for Alice,
12:48
for the same ID and for the same name. And there are three entries for this specific row
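With the data modified along those lines (the exact values are assumed here), the same report would show counts of 2 and 3:

    # Modified sample data: Alice appears twice, Bob three times (values assumed)
    students <- data.frame(
      ID   = c(101, 102, 103, 101, 102, 102),
      name = c("Alice", "Bob", "Charlie", "Alice", "Bob", "Bob")
    )

    students %>%
      group_by(ID, name) %>%
      summarise(n = n(), .groups = "drop") %>%
      filter(n > 1)
    # -> Alice has 2 entries, Bob has 3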
12:56
How do we tag these observations? We can use this code. We can go into mutate and then generate a running count of the observations that we have
13:06
And it would create a data frame. We are going to call it DF
13:10
Let me show you that data frame. We have a specific new column that contains whether
13:18
that entry is unique or not, or rather, that contains whether that entry is the first or a
13:25
duplicate of the previous ones. So this is the first entry, right. Again, for Bob, we know Bob
13:33
is duplicated, but this one is the first entry, and then Alice is repeated, so it would have a two
13:38
over here. Similarly, Bob is repeated a second and a third time. So now we have tagged these observations,
13:45
whether they are duplicates or not. So anything that is greater than one in that row-number column would be a duplicate
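A sketch of the tagging step just described, assuming group_by() with row_number() inside mutate() (the column name is an assumption):

    # Tag each row with its occurrence number within its ID/name group
    df <- students %>%
      group_by(ID, name) %>%
      mutate(occurrence = row_number()) %>%
      ungroup()
    df
    # Rows with occurrence > 1 are duplicates of an earlier row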
13:54
So I hope that was useful. Stay tuned to this channel. Do subscribe and do hit the bell icon