Standardization and Normalization in Stata
140K views
May 16, 2024
Standardization is the process of transforming a variable so that it has a mean of zero and a standard deviation of one. There are different methods, e.g. z-scores and mean centering. Lastly, we discuss normalization and a command called center, which is an easy way to standardize or normalize a variable.
00:00 Intro to topic
0:44 Z-score
4:17 Mean centering
5:21 Normalization
Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel. In this video we are going to talk about how you
0:04
standardize or normalize a variable in Stata. There are different ways of standardizing or
0:09
normalizing a variable: for example, we can use z-scores, mean centering, or normalization,
0:16
and in Stata we have the center command, which is a useful and easy way of doing all of this.
0:23
We will first discuss how to do this manually and walk you through the process,
0:30
which will help us learn the idea behind these methods, and lastly we'll discuss
0:36
the center command, which is easy and intuitive to use. So let's load the auto data with sysuse and let's
0:44
start with how we standardize a variable. Standardization means that we subtract the mean of
0:52
that variable from each of its values and divide the result by its standard deviation. That way
0:59
we get a variable that has a mean of zero and a standard deviation
1:07
equal to one. We standardize so that we can bring different variables
1:12
onto a similar scale, which makes them easier to compare. So let's summarize and
1:19
look at the price variable. This is the auto data set, which contains
1:24
different variables for different cars: we have their price, their mileage,
1:30
their weight, their length, and so on. Now there is this variable, the price of these cars, and we can
1:36
see that it takes different values. Let's say we wanted to standardize it.
1:43
What we would do is generate a new variable, let's call it price new, and that would
1:48
be equal to the price variable, so this takes each value of the price variable,
1:55
minus the mean value, which we got from the summarize output, divided by the standard deviation of the price
2:02
variable, which we also got from there. If I execute this and show you the data view,
2:09
we can see that we got this new variable called price new, and this is a standardized variable.
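A minimal sketch of the manual calculation just described, assuming the auto dataset that ships with Stata. The name price_new is an assumption based on how the variable is spoken in the video, and r(mean) and r(sd) stand in for the numbers that are typed by hand on screen:

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* summarize leaves the mean and standard deviation in r(mean) and r(sd)
summarize price

* z-score: subtract the mean, then divide by the standard deviation
* (price_new is an assumed name; the video types the two numbers by hand)
generate price_new = (price - r(mean)) / r(sd)

* Check: the mean should be about 0 and the standard deviation about 1
summarize price_new
```

Typing the displayed mean and standard deviation instead of r(mean) and r(sd) also works, but the result is then only as precise as the digits that were copied.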
2:15
As we discussed, it should have a mean of zero and a standard deviation of one, and this is
2:22
a mean of zero. What we did was the manual method, just to give you an idea of the
2:29
process behind standardization, or the z-score. Now let's use an easier method to
2:36
standardize a variable in Stata. We have this command called egen, so let's create a
2:42
new variable, let's call it price new one, so we have a different variable than the previous one, and let's
2:50
use the std function. The std function calculates the z-score of a variable, and we want
2:56
Stata to calculate the z-score of the price variable. If I execute this, the variable
3:02
is created, and it contains the same values as we had when we did the manual method.
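The egen step can be sketched as follows, again on the auto dataset. The names price_new1 and price_new are assumptions for the spoken "price new one" and "price new", and the manual version is rebuilt here only so that the block runs on its own:

```stata
sysuse auto, clear

* egen's std() function returns the z-score of its argument
* (price_new1 is an assumed name for the spoken "price new one")
egen price_new1 = std(price)

* Rebuild the manual version so the two can be compared side by side
summarize price
generate price_new = (price - r(mean)) / r(sd)

* The two columns should agree up to rounding
list price price_new price_new1 in 1/5
```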
3:08
Now there might be small differences, because in the manual method we didn't have many decimal
3:15
places of precision, but in this case we do. So let's summarize the
3:21
price new variable and the price new one variable that we created, and you can see that both
3:28
of these variables have a mean of zero and a standard deviation of one. Let's move to how
3:36
we standardize using menus. You can also do this process from the menus:
3:42
you click on Data, then Create or change data, then there is this option called
3:49
Create new variable (extended). If we go over there, you write the name of the new variable
3:56
in this text box and scroll way down to where you see this standardized
4:03
values option. You can change the mean value or the standard deviation value if you want to,
4:08
then you write the name of the variable that you want to change and press OK, and that
4:13
creates the standardized variable. Let's move to the second part, which is called mean centering.
4:20
In mean centering, what we do is simply subtract the mean from each value of the variable. So we have
4:28
this price variable, and we know its mean value is about 6165. Let's create price center, which is mean
4:37
centered, and let's take each value of the price variable and subtract the mean. Instead of
4:43
writing this value manually over here, we are using the scalars. Remember, we can list the scalars
4:51
using the return list command, and that gives us the name of the scalar that
4:56
stores the mean value. Okay, so let's execute this and look at the mean value. What mean
5:07
centering does is that it does not change the standard deviation; it simply shifts
5:12
the mean to zero. Subtracting the mean from each of the values of the price variable
5:18
converts it into a zero-mean variable. Let's move to the third part, which is called
5:25
normalization. What normalization does is scale the variable to a fixed
5:32
range, typically from zero to one. Normalization is typically used when
5:37
the scale of a variable is not known and when the variable has a non-uniform distribution.
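One common way to rescale a variable to the zero-to-one range is min-max scaling: subtract the minimum and divide by the range. A minimal sketch on the mpg variable of the auto dataset, where mpg1 is an assumed variable name and r(min) and r(max) are the results that summarize stores:

```stata
sysuse auto, clear

* summarize leaves the minimum and maximum in r(min) and r(max)
summarize mpg

* Min-max scaling: (value - minimum) / (maximum - minimum)
* (mpg1 is an assumed name for the new variable)
generate mpg1 = (mpg - r(min)) / (r(max) - r(min))

* Check: the minimum should now be 0 and the maximum 1
summarize mpg1
```

Running return list right after summarize shows the names of all the stored results that can be used this way.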
5:43
This helps bring variables onto the same scale, and again makes them easier for us to compare.
5:50
What we do in normalization is take the original variable minus the minimum value and
5:57
divide it by the range, that is, the difference between the maximum and the minimum value. So let's
6:03
do this process for the mpg variable. Let's execute summarize: we have the mean, the minimum, and the
6:12
maximum value. We do not need the mean in this case. So let's create mpg1, and that
6:20
would be mpg minus the minimum value, divided by the range, that is, maximum minus minimum. This is
6:29
how we can write the command by hand, but I would rather use the scalar method that I
6:36
discussed previously. Both of these give us the same values, that is, the minimum is now
6:45
zero and the maximum is one. Last is how we use the center command. The center command is the easy way to
6:52
standardize or normalize a variable. We just type the name of the command and the names of
6:58
the variables that we want to standardize. What it does is create new variables
7:05
and add a prefix to them. By default it does mean
7:10
centering of the variable, so let's compare the variable that we generated with the
7:15
variable generated by the center command. If we compare them, you see that we
7:21
get the exact same values. We can also do standardization, and we can also choose the prefix for each
7:30
of the variables that are created. By default it added a prefix of c underscore, but we can have
7:36
a different prefix. So let's standardize the price variable and compare
7:41
the variable that we generated using the center command, std underscore price, with the variable that we
7:52
generated previously using the manual method, and you can see that they produce almost the same
7:57
result. We can also change the name of the variable that is generated. Let's say we
8:03
want to generate a variable, but instead of having c underscore price, let's give it a new
8:09
name of our choosing. Now the nice thing is that we can use bysort foreign. What it
8:16
does is mean centering of the price variable, but instead of using the mean of
8:24
all the values in the price variable, it does it within each category of the foreign variable.
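The center steps described above can be sketched as follows. This assumes the user-written center command from SSC and its standardize, prefix() and generate() options; the names price_center, price_c and the g_ prefix are hypothetical and chosen only for illustration:

```stata
* center is a user-written command; install it from SSC if it is missing
capture which center
if _rc ssc install center

sysuse auto, clear

* Manual mean centering for comparison (price_center is an assumed name)
summarize price
generate price_center = price - r(mean)

* Default behaviour: mean centering, new variable gets the c_ prefix
center price

* Standardize instead, with a custom prefix (creates std_price)
center price, standardize prefix(std_)

* Choose the full name of the new variable (price_c is an assumed name)
center price, generate(price_c)

* Mean centering within each category of foreign (g_ is an assumed prefix)
bysort foreign: center price, prefix(g_)

* Compare the manual and center-generated variables
summarize price_center c_price std_price price_c g_price
```

Because center refuses to overwrite an existing variable unless told to, each call here writes to a different name.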
8:31
So there are a lot of different options that you can learn for the center command. You can look
8:37
at the help file, look at the different options that are there, play with them, and learn the
8:44
center command. This is quite an easy way to standardize or normalize a variable. I hope
8:50
this video was useful. Do subscribe to this channel and do hit the bell icon. Thanks for watching this video.