Standardization and Normalization in R
May 16, 2024
Standardization is the process of transforming a variable so that it has a mean of zero and a standard deviation of one. There are different methods of standardization, such as the z-score and mean centering. Lastly, we discuss normalization.
0:00 Intro to topic
0:34 Z-score
2:03 Normalization
3:40 Mean centering
Website: thedatahall.com
0:00
Welcome to the DataHall YouTube channel. In this video we are going to talk about
0:03
how to standardize a variable in R. Now, standardization is a method that we
0:08
use to put different variables on the same scale. So if you want to compare variables that have different
0:15
scales, what we do is standardize these variables. Now there are different ways of
0:21
standardization: one is the z-score, then we have mean centering, and then we have a different method called normalization. These are
0:29
what we are going to look into, and how we perform them in R. So let's
0:34
start with the z-score. Let's create this data. What the z-score does is subtract the mean value of a
0:42
variable from each of its values and divide the result by its standard deviation, so that way we have
0:49
a variable with a mean of zero and a standard deviation of one. For that we have a function called
0:57
scale. Let's create this data set:
1:03
we have this vector called data, and we want to scale it and save it into
1:09
an object called standard. What we do is use scale, and if I show you the mean of this
1:17
vector, it is zero, and if I show you the standard deviation or variance,
1:24
it is equal to one. We can do the same on a data frame. Let's say we have the mtcars data frame
1:32
that contains different variables related to cars: their mileage, their gear ratio, etc.,
1:41
and let's say we want to standardize the mpg variable, which stands for
1:46
mileage. Again we use the scale function, and we specify the column that we want to standardize
1:53
and the name of the new column to be created. If I execute this,
1:58
we get this column in our data frame. Okay, so let's move to the second part, which is called
2:06
normalization. Now normalization is somewhat different from calculating a z-score. A z-score would have a standard deviation of one
2:16
and a mean of zero, but in normalization we scale a variable to a
2:21
fixed range, typically from zero to one. For that we would have to install the tidyverse package,
2:31
because we do some data wrangling with the tidyverse package. I have already installed it; if you haven't, you would use this command first
2:38
and then load the library using library and the name of the package that we want to load. Now
2:45
let's take the mtcars data and use mutate to create a new column called normal mpg, which is equal to
2:56
each value of mpg minus the minimum value of the mpg column, all divided by the range, which is the maximum of mpg minus
3:05
the minimum of mpg. If I do that, we get a column called normal mpg, and that
3:16
is now normalized. We can create a box plot of our mpg column, and we can also create a box plot
3:27
of our standardized or normalized variable. That would show us whether there are outliers. Rescaling does not
3:34
remove outliers, but it puts variables on a comparable scale, which is one of the benefits of standardization
3:39
or normalization. The last technique is called mean centering and in mean centering what we do is
3:45
we only subtract the mean of the variable from each value and do not divide by the standard deviation, so that way the standard
3:55
deviation remains the same as that of the original variable, but its mean
3:59
is now zero. For that we take the displacement variable minus the mean of
4:05
this variable, and I am setting this parameter equal to TRUE so that it removes missing values.
4:12
I have pressed Ctrl+Enter, and I can show you the mean
4:17
and standard deviation of the original variable and also of the newly created variable. This is the mean of our original
4:27
variable, and this is the mean of the mean-centered variable; its value is
4:32
equal to zero. The standard deviation is 123, and it remains the same when we do mean centering, but when we do
4:42
standardization the mean would be zero and the standard deviation would be equal to one. So I hope this video was useful. Do
4:49
subscribe to this channel and do hit the bell icon. Thanks for
4:53
watching this video.
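
The code typed on screen is not part of this transcript, so the following is only a minimal sketch of the three techniques described above, not the exact code from the video. It assumes the built-in mtcars data set and uses base R instead of tidyverse's mutate, so it runs without installing anything. The object name cars and the column names mpg_z, normal_mpg and disp_centered are illustrative choices.

```r
# Sketch only: base R on a copy of the built-in mtcars data set.
# The object and column names below are illustrative, not from the video.
cars <- mtcars

# 1. Z-score: subtract the mean, then divide by the standard deviation.
#    scale() returns a one-column matrix, so as.numeric() flattens it.
cars$mpg_z <- as.numeric(scale(cars$mpg))

# 2. Normalization (min-max): rescale the values to the range 0 to 1.
mpg_min <- min(cars$mpg, na.rm = TRUE)
mpg_max <- max(cars$mpg, na.rm = TRUE)
cars$normal_mpg <- (cars$mpg - mpg_min) / (mpg_max - mpg_min)

# 3. Mean centering: subtract the mean only, so the spread is unchanged.
cars$disp_centered <- cars$disp - mean(cars$disp, na.rm = TRUE)

# Checks: the z-score has mean 0 and sd 1, the normalized column spans
# 0 to 1, and the centered column has mean 0 with the original sd.
print(c(mean = round(mean(cars$mpg_z), 10), sd = sd(cars$mpg_z)))
print(range(cars$normal_mpg))
print(c(mean = round(mean(cars$disp_centered), 10),
        sd_centered = sd(cars$disp_centered),
        sd_original = sd(cars$disp)))

# Side-by-side box plots: the shape is the same, only the scale differs.
boxplot(cars$mpg, cars$normal_mpg, names = c("mpg", "normal_mpg"))
```

With tidyverse loaded, the same normalization could be written inside mutate, as described in the video; the arithmetic is identical.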