How to Check Normality of a Variable in R
May 16, 2024
In this video we discuss two ways to check the normality of a variable: visualization (histogram, density plot, Q-Q plot) and statistical tests (Shapiro-Wilk, Jarque-Bera, and the skewness-kurtosis test). Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel. In this video, we are going to talk about
0:04
how to check the normality of a series. We are going to use two types of methods.
0:13
The first is visual: checking by eye whether the data or a series is normally distributed or not.
0:20
So we're going to use a histogram, a density plot, and also a Q-Q plot. Next, we are going
0:27
to perform certain normality tests. We have the Shapiro-Wilk test, the skewness-kurtosis test,
0:34
and then the Jarque-Bera test, which we are also going to look at in this video.
0:39
So let's load some data. This data set comes preloaded with R. We are going to use the iris data set.
0:48
It contains different species of iris flowers and their sepal and petal lengths and widths.
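A minimal sketch of this loading step. The transcript does not show the exact commands typed, so the calls below are an assumption; they rely only on the built-in iris data frame and its standard column names Sepal.Length and Sepal.Width.

```r
# iris ships with base R (datasets package), so no download is needed
data(iris)

# Inspect the structure: 150 rows, four numeric measurements plus Species
str(iris)

# Preview the two columns used in the rest of the walkthrough
head(iris[, c("Sepal.Length", "Sepal.Width")])
```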
0:55
If we look at this data, we have the sepal length and sepal width variables that we are going to work
1:01
with. First we're going to make a histogram of sepal width. We use the hist function and we specify the
1:09
data frame and, within that data frame, which column we are going to use.
1:14
That gives the function a series of observations. The main argument is just the name or the heading of the
1:23
histogram that would be created. If I execute this, we get the heading, which is coming from
1:29
the main parameter, and it looks roughly like a normal distribution, although
1:39
the data is somewhat skewed towards the right. But it looks like a bell-shaped curve, so it is close to a
1:46
normal distribution. Let's also make this histogram for sepal length. Sepal length
1:53
does not look quite normally distributed; it looks more like bimodal data.
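The two histogram calls described above might look like the sketch below. The headings passed to main are assumptions, since the transcript does not spell out the titles used on screen.

```r
# Histogram of sepal width; `main` sets the plot heading
h_width <- hist(iris$Sepal.Width, main = "Histogram of Sepal Width")

# Same call for sepal length, to compare the two shapes
h_length <- hist(iris$Sepal.Length, main = "Histogram of Sepal Length")
```

hist also returns the bin counts invisibly, which is why the result can be stored and inspected later (for example h_width$counts).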
2:07
Let's next look at the density plot. To create a density plot, we use the plot function and within
2:13
that we use the density function; the rest remains the same. We specify the column that
2:19
we want to use and then the heading that we would assign to the plot. We do it for sepal
2:25
width first. From the histogram we know that sepal width had a roughly normal distribution, and this
2:31
is what the density plot tells us too, although it is somewhat skewed to the right. If we draw the
2:37
density plot for sepal length, we can see that it doesn't look normal, and
2:45
it doesn't look unimodal. It looks flatter and wider than a bell curve, right. Next we are going to do a
2:55
Q-Q plot. Before explaining what a Q-Q plot is, let's first execute it. We are going to use the
3:02
qqnorm function with the name of the column that we are going to plot, and then we are going to add
3:08
a diagonal reference line. The idea of a Q-Q plot is this: if the observations fall on this
3:16
reference line, the data is normal, and the closer they are to the line, the more normal the distribution
3:22
is. Let's compare it with the sepal length Q-Q plot. The points in the sepal length Q-Q plot
3:30
stray further from this reference line, so it is somewhat less normally distributed.
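A sketch of the density and Q-Q plots discussed above. The plot titles are assumptions, and so is the use of qqline for the reference line: the transcript only says a line is drawn. Note that qqline passes through the first and third quartiles of the data, so it is a diagonal guide rather than a strict 45-degree line.

```r
# Density plots: plot() draws the kernel density estimate returned by density()
d_width <- density(iris$Sepal.Width)
plot(d_width, main = "Density of Sepal Width")
plot(density(iris$Sepal.Length), main = "Density of Sepal Length")

# Q-Q plots: sample quantiles against theoretical normal quantiles
qq_width <- qqnorm(iris$Sepal.Width, main = "Q-Q Plot of Sepal Width")
qqline(iris$Sepal.Width)    # reference line through the 1st and 3rd quartiles

qqnorm(iris$Sepal.Length, main = "Q-Q Plot of Sepal Length")
qqline(iris$Sepal.Length)
```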
3:39
Now this graphical presentation gives us some idea of what the distribution looks like, but
3:47
there are also statistical tests. So we are going to perform certain statistical tests,
3:52
and the first one is the Shapiro-Wilk test. In R it is shapiro.test:
3:57
we call it and specify the column name. Then we look at the p-value. The null hypothesis of this test
4:05
is that the data is normally distributed. So if the p-value is not less
4:12
than 0.05, then we do not have enough evidence to reject the null hypothesis. In this case
4:19
the p-value is greater than 0.05, so we cannot reject normality and we treat sepal width as normally
4:28
distributed. Let's do that for sepal length. We know from our graphical visualization that
4:34
sepal length wasn't very normally distributed, and this is what the p-value tells us as well.
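The Shapiro-Wilk step can be sketched as follows. The variable names sw_width, sw_length, and alpha are illustrative and not taken from the video; shapiro.test itself is part of base R's stats package.

```r
# Null hypothesis: the sample comes from a normal distribution
sw_width  <- shapiro.test(iris$Sepal.Width)
sw_length <- shapiro.test(iris$Sepal.Length)

sw_width     # prints the W statistic and the p-value
sw_length

# Decision rule at the 5% level
alpha <- 0.05
sw_width$p.value  < alpha   # FALSE: fail to reject normality
sw_length$p.value < alpha   # TRUE: reject normality
```

Failing to reject the null is weaker than proving normality; it only says the sample gives no strong evidence against it.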
4:40
It is less than 0.05, so we reject the null hypothesis in favor of the alternative hypothesis,
4:46
which means that the data is not normally distributed. We are also going to use
4:53
skewness and kurtosis, but for that we are going to load the moments package. I have already installed
5:00
it, but if you haven't installed it you would use install.packages and then you are going to
5:05
load the moments library. Within this moments library we have the skewness and kurtosis functions.
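A sketch of this step. The helpers skew_manual and kurt_manual are hypothetical names added here to show the moment-based formulas that, as far as I understand, the moments functions compute; the kurtosis is not excess kurtosis, which is why a normal distribution gives 3 rather than 0. The moments calls are guarded so the block still runs when the package is missing.

```r
# install.packages("moments")   # run once if the package is not installed

# Hypothetical helpers: moment-based skewness and (non-excess) kurtosis
skew_manual <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}
kurt_manual <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

skew_manual(iris$Sepal.Width)    # 0 would mean a perfectly symmetric sample
kurt_manual(iris$Sepal.Width)    # normal data gives a value close to 3
kurt_manual(iris$Sepal.Length)

# Same quantities from the moments package, if it is available
if (requireNamespace("moments", quietly = TRUE)) {
  library(moments)
  print(skewness(iris$Sepal.Width))
  print(kurtosis(iris$Sepal.Width))
  print(kurtosis(iris$Sepal.Length))
}
```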
5:12
We're going to check the skewness first. For data to be normally
5:19
distributed, the skewness should be zero. In this case the skewness of sepal width is
5:26
somewhat greater than zero, right. Similarly, the kurtosis should be equal to three.
5:31
For sepal width it is somewhat greater than three,
5:37
whereas for sepal length the kurtosis is further off from that of a normal distribution. I mean,
5:47
you won't get exactly three, but the closer the kurtosis is to three and the
5:53
closer the skewness is to zero, the more normal the distribution is. Then there is the widely
6:01
used Jarque-Bera test. We use jarque.test and specify the column name. Again we look at the p-value.
6:09
For sepal width we can treat it as a normal distribution, because the p-value is greater
6:15
than 0.05. Let's do that for sepal length. For sepal length, according to Jarque-Bera, it is also consistent with a
6:22
normal distribution, but from all the other evidence we concluded that it is not a normal distribution.
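A sketch of the Jarque-Bera step. The helper jb_manual is a hypothetical name; it computes the textbook statistic n/6 * (S^2 + (K - 3)^2 / 4) with an asymptotic chi-square p-value on 2 degrees of freedom, which is my assumption about what jarque.test reports. The jarque.test calls are guarded in case the moments package is not installed.

```r
# Hypothetical helper: textbook Jarque-Bera statistic and asymptotic p-value
jb_manual <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  s <- mean(d^3) / mean(d^2)^(3 / 2)   # skewness
  k <- mean(d^4) / mean(d^2)^2         # kurtosis (3 under normality)
  jb <- n / 6 * (s^2 + (k - 3)^2 / 4)
  c(statistic = jb, p.value = pchisq(jb, df = 2, lower.tail = FALSE))
}

jb_manual(iris$Sepal.Width)
jb_manual(iris$Sepal.Length)

# The packaged test, if moments is available
if (requireNamespace("moments", quietly = TRUE)) {
  print(moments::jarque.test(iris$Sepal.Width))
  print(moments::jarque.test(iris$Sepal.Length))
}
```

Because the statistic only uses skewness and kurtosis, it can miss departures such as bimodality that leave those two moments near their normal values.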
6:28
Do remember that these tests are sensitive to the number of observations, right, so you'd have
6:36
to combine several tests and visualizations to get an idea. From all the other tests we
6:42
know that sepal length is a non-normal distribution, but sepal width seems to be a normal distribution.
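On the point about sample size, here is a rough illustration that is not from the video. It builds two deterministic "samples" with the same mildly non-normal shape (evenly spaced quantiles of a t distribution with 10 degrees of freedom, an assumed example) and differs only in the number of observations. One would expect the larger sample to give the smaller Shapiro-Wilk p-value.

```r
# Same underlying shape, two different sample sizes
small_sample <- qt(ppoints(30),   df = 10)
large_sample <- qt(ppoints(5000), df = 10)   # shapiro.test accepts at most 5000 values

p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value

# Compare the two p-values side by side
c(n30 = p_small, n5000 = p_large)
```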
6:48
So I hope that was useful. Stay tuned to the channel, do subscribe, and do hit the bell icon.